News APIs and Datasets

Ben Steel edited this page Dec 15, 2021 · 10 revisions

Sources for news

Datasets

  • Free and open web crawler dataset
  • Index of URLs available
  • Only a representative sample of pages is included due to cost constraints
  • Crawler from Common Crawl focused on news crawling
  • Updates more frequently for breaking news
  • Uses Google News sitemaps and RSS feeds, so only newly published articles are added unless issue 41 is resolved
  • Started in 2016, so only news since 2016 is included
  • Unknown site coverage, no index available
  • More information on this dataset can be found here
  • Corpus containing news articles from 1994 to 2011
  • English-language sources from a variety of countries: Agence France-Presse, Associated Press, Central News Agency of Taiwan, Los Angeles Times/Washington Post, Washington Post/Bloomberg, New York Times, Xinhua News Agency
  • Collected mostly from newswire services
  • Longitudinal financial news dataset
  • Subject to a takedown notice on GitHub; still available on request from the author
  • Initial paper here
  • Pulled from news archives directly
  • Latitudinal political news dataset
  • Presented political news classifier
  • Full source available on request from me
  • Pulled from Common Crawl

Search Engines

  • Limited to 100 free requests a day
  • Limited to 1000 free requests a month
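
Because the free tiers above are capped per day or per month, a crawler that uses them usually needs to count its own requests. The following is a minimal sketch of such a counter, assuming the caps reset on calendar-day and calendar-month boundaries (the providers' actual reset rules may differ). The names `QuotaTracker` and `try_consume` are hypothetical and do not belong to any of the listed services.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass
class QuotaTracker:
    """Counts requests against an optional daily and/or monthly cap.

    Hypothetical helper: assumes quotas reset at calendar boundaries.
    """
    daily_limit: Optional[int] = None
    monthly_limit: Optional[int] = None
    _day_counts: Dict[str, int] = field(default_factory=dict)
    _month_counts: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def try_consume(self, today: date) -> bool:
        """Record one request if both caps allow it; return False otherwise."""
        day_key = today.isoformat()
        month_key = (today.year, today.month)
        used_today = self._day_counts.get(day_key, 0)
        used_month = self._month_counts.get(month_key, 0)

        if self.daily_limit is not None and used_today >= self.daily_limit:
            return False
        if self.monthly_limit is not None and used_month >= self.monthly_limit:
            return False

        # Only count the request once we know it is allowed.
        self._day_counts[day_key] = used_today + 1
        self._month_counts[month_key] = used_month + 1
        return True


# Example: one tracker per service, matching the limits listed above.
daily_capped = QuotaTracker(daily_limit=100)
monthly_capped = QuotaTracker(monthly_limit=1000)

if daily_capped.try_consume(date.today()):
    pass  # safe to send one request to the daily-capped service
```

Passing the date in explicitly, rather than reading the clock inside the method, keeps the counter easy to test and lets a long-running job replay past usage from its logs.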

News APIs

  • Can only search articles up to a month old*
  • 100 requests a day*

* with free tier