News APIs and Datasets
Ben Steel edited this page Dec 15, 2021 · 10 revisions
Sources for news
- Free and open web crawler dataset
- Index of URLs available
- Only a representative sample of pages is included due to cost constraints
- Crawler from Common Crawl focused on news crawling
- Updates more frequently for breaking news
- Uses Google News sitemap and RSS feeds, so only newly published articles are added (unless issue 41 is resolved)
- Started in 2016, so only news since 2016 is included
- Unknown site coverage, no index available
- More information on this dataset can be found here
- Corpus containing news articles from 1994 to 2011
- English-language sources from a variety of countries: Agence France-Presse, Associated Press, Central News Agency of Taiwan, Los Angeles Times/Washington Post, Washington Post/Bloomberg, New York Times, Xinhua News Agency
- Collected mostly from newswire services
- Longitudinal financial news dataset
- Subject to a takedown notice on GitHub; still available on request from the author
- Initial paper here
- Pulled from news archives directly
- Latitudinal political news dataset
- Presented political news classifier
- Full source available on request from me
- Pulled from Common Crawl
- Limited to 100 free requests a day
- Limited to 1000 free requests a month
- Can only search articles up to a month old*
- 100 requests a day*
* with free tier
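
The free-tier limits listed above (100 requests a day, 1000 requests a month) are easy to exhaust during a bulk pull, so it can help to count requests on the client side before sending them. The following is a minimal sketch, not part of any of the APIs above: `RequestBudget` and its methods are hypothetical names, and it assumes quotas reset on calendar day or month boundaries, which a given provider may not do.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RequestBudget:
    """Client-side counter for a free-tier request quota (hypothetical helper)."""

    limit: int
    period: str  # "day" or "month"
    used: int = 0
    _window: Optional[tuple] = field(default=None, repr=False)

    def _current_window(self, today: date) -> tuple:
        # Assumption: the quota resets on calendar boundaries, not a rolling window.
        if self.period == "day":
            return (today.year, today.month, today.day)
        if self.period == "month":
            return (today.year, today.month)
        raise ValueError(f"unsupported period: {self.period!r}")

    def try_spend(self, today: Optional[date] = None) -> bool:
        """Record one request if quota remains; return False when exhausted."""
        today = today or date.today()
        window = self._current_window(today)
        if window != self._window:
            # A new day or month has started, so reset the counter.
            self._window = window
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

    def remaining(self) -> int:
        """Requests left in the window that was last seen by try_spend."""
        return self.limit - self.used
```

A caller would check `try_spend()` before each request and skip or defer the request when it returns `False`, for example with `RequestBudget(limit=100, period="day")` for a daily cap or `RequestBudget(limit=1000, period="month")` for a monthly one. The counter lives in memory only, so a long-running collection job would need to persist it between runs.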