| layout | page |
|---|---|
| title | Data |
| subtitle |
This Twitter dataset contains tens of millions of tweets related to COVID-19, collected starting on February 6, 2020.
Each gzipped file contains data from a single day. Each line of the file contains a single JSON record. You should read the file one line at a time and parse that JSON line to obtain information about one tweet.
Each tweet record will have the following fields:
- `tweet_id`: An integer value. The ID of the tweet. You will use this ID to download the tweet from Twitter.
- `user_id`: An integer value. The user ID of the author of this tweet. If this is a retweet, this is the user who retweeted the tweet.
- `date`: A string value. The date the tweet was posted in the standard Twitter date format, e.g. "Wed Feb 12 04:59:55 +0000 2020".
- `keywords`: A list. Contains the COVID-19 related keywords that we used to identify this tweet.
- `location`: A dictionary. The location of this tweet. If a location is known, it can include `country`, `state`, and `city`.
An example tweet record:
{"tweet_id":1243171774055014401,"user_id":852581594674208768,"date":"Thu Mar 26 13:43:42 +0000 2020","keywords":["covid"],"location":{"country":"United States","state":"Maryland","city":"Baltimore"}}
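The reading procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' own tooling: the function names `parse_lines` and `iter_tweet_records` are hypothetical, and the code assumes the files are UTF-8 encoded and that `location` may be empty when no location is known.

```python
import gzip
import json
from typing import Iterable, Iterator


def parse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_tweet_records(path) -> Iterator[dict]:
    """Stream records from one gzipped daily file, one line at a time.

    Assumption: the file is UTF-8 text once decompressed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_lines(handle)


def describe(record: dict) -> str:
    """Build a short summary; location keys are optional, so use .get()."""
    location = record.get("location") or {}
    place = ", ".join(
        location[key] for key in ("city", "state", "country") if key in location
    )
    return f"{record['tweet_id']} ({place or 'unknown location'})"
```

Reading lazily with a generator keeps memory use flat even for large daily files. A typical loop would be `for record in iter_tweet_records(path): print(describe(record))`, where `path` points at one of the daily files.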
To obtain the original tweets, use the Twitter Hydrator, which takes the tweet_id and downloads the corresponding tweet (if it is available).
We occasionally have missing data due to downloading issues. Missing data shows up as gaps in the tweet dates within a file.
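One way to look for such gaps is to parse the `date` strings and flag unusually long intervals between consecutive tweets. The sketch below is only an illustration: the helper names and the one-hour threshold are assumptions, not part of the dataset documentation, and the format string assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Matches strings such as "Wed Feb 12 04:59:55 +0000 2020".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_twitter_date(value: str) -> datetime:
    """Convert a Twitter-style date string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


def find_gaps(
    dates: Iterable[str],
    threshold: timedelta = timedelta(hours=1),  # assumed cutoff, tune as needed
) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs where consecutive tweets are further apart
    than the threshold."""
    timestamps = sorted(parse_twitter_date(d) for d in dates)
    return [
        (earlier, later)
        for earlier, later in zip(timestamps, timestamps[1:])
        if later - earlier > threshold
    ]
```

Sorting first means the records do not need to be in chronological order within the file. A suitable threshold depends on how many tweets a typical day contains.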
We use the Twitter public keyword streaming API to download all tweets containing COVID-19 related keywords. The keywords included in this collection are:

```
coronavirus
wuhan
2019ncov
sars
mers
2019-ncov
wuflu
COVID-19
COVID19
COVID
covid-19
covid19
covid
SARS2
SARSCOV19
```

We also include tweets that contain these keywords as hashtags, e.g. #covid19.
We create the dataset using the following process.
- We match (case-insensitive) every downloaded tweet against the above keywords, including if they appear as hashtags.
- We infer the location of the tweet using Carmen, a geolocation toolkit. Carmen provides three levels of information: country, state and city. If the tweet has a `place` or `coordinates` field, Carmen returns this information. Otherwise, Carmen infers the location from the profile field.
@misc{huang_xiaolei_2020_3735015,
author = {Huang, Xiaolei and Jamison, Amelia and Broniatowski, David and Quinn, Sandra and Dredze, Mark},
title = {Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations},
month = {Mar},
year = {2020},
note = {http://twitterdata.covid19dataresources.org/index},
publisher = {Zenodo},
doi = {10.5281/zenodo.3735015},
url = {https://doi.org/10.5281/zenodo.3735015}
}