fleet-ai/paper-crawler
# Paper Crawler

Collects scientific papers with verified accessible data for RL training environments.

## What It Does

  1. Searches OpenAlex for papers mentioning data repositories (Zenodo, GitHub, OSF)
  2. Verifies data actually exists - not just code repos
  3. Saves papers with confirmed downloadable data
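Step 1 boils down to spotting and classifying repository links in a paper's metadata. A minimal sketch of that classification, assuming regex-based matching (the helper name, patterns, and extension of this logic are illustrative, not the actual `crawler.py` implementation):

```python
import re

# Patterns for the three repository hosts the crawler targets.
# These regexes are an assumption for illustration, not the crawler's own.
REPO_PATTERNS = {
    "zenodo": re.compile(r"zenodo\.org/records?/(\d+)"),
    "github": re.compile(r"github\.com/([\w.-]+)/([\w.-]+)"),
    "osf": re.compile(r"osf\.io/(\w+)"),
}

def classify_repo_link(url: str):
    """Return (source, matched_link) for the first repository pattern that matches."""
    for source, pattern in REPO_PATTERNS.items():
        m = pattern.search(url)
        if m:
            return source, m.group(0)
    return None  # not a recognized data-repository link

print(classify_repo_link("https://zenodo.org/records/1234567"))
# → ('zenodo', 'zenodo.org/records/1234567')
```

In the real pipeline these candidate links would come from OpenAlex search results; any link that matches then goes on to step 2 (verification) rather than being trusted as-is.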

## Output

```
output/papers.json  # 983 papers with verified data
output/papers.csv   # Simplified format
```

Each paper record includes a title, DOI, category, verified data link, source, and date.
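The fields above suggest a flat JSON record per paper. A sketch of what one entry in `output/papers.json` might look like (field names from this README; all values are hypothetical examples, not real output):

```python
import json

# Illustrative record shape; values are made up for demonstration.
paper = {
    "title": "Example study with an open dataset",
    "doi": "10.0000/example.2024",
    "category": "physics",
    "data_link": "https://zenodo.org/records/0000000",
    "source": "zenodo",
    "date": "2024-01-15",
}

# papers.json would hold a list of such records.
print(json.dumps([paper], indent=2))
```

The CSV output would carry the same fields, one row per paper.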

## Usage

```
python3 crawler.py
```

## Files

| File | Purpose |
| --- | --- |
| `crawler.py` | Main crawler |
| `data_detector.py` | Verifies repos contain actual data files |
| `storage.py` | Deduplication + resume |
| `config.py` | API key, date cutoff, patterns |

## Data Verification

The key innovation: checks that repos contain data, not just code.

- **GitHub:** fetches the file tree and looks for data files (`.csv`, `.h5`, `.npy`, etc.)
- **Zenodo:** checks that the record has downloadable data files
- **OSF:** verifies the project is public and contains data
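The GitHub check above reduces to scanning the paths in a repo's file tree for data-file extensions. A minimal sketch of that test, assuming the tree paths have already been fetched (e.g. via GitHub's Git Trees API); the function name and extension set are illustrative, not the actual `data_detector.py` code:

```python
import os

# Extensions treated as "data" for illustration; `.csv`, `.h5`, and `.npy`
# come from this README, the rest are assumed additions.
DATA_EXTENSIONS = {".csv", ".h5", ".npy", ".parquet", ".mat", ".fits"}

def tree_has_data_files(paths):
    """Return True if any path in a repo's file tree has a data-file extension."""
    return any(os.path.splitext(p)[1].lower() in DATA_EXTENSIONS for p in paths)

print(tree_has_data_files(["src/main.py", "results/run1.csv"]))  # True
print(tree_has_data_files(["src/main.py", "README.md"]))         # False
```

A pure code repository (only `.py`, `.md`, config files) fails this check and is rejected, which is what distinguishes this crawler from one that merely confirms a link resolves.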

## Stats

| Source | Papers |
| --- | ---: |
| Zenodo | 925 |
| GitHub | 34 |
| OSF | 14 |
| Other | 10 |
| **Total** | **983** |
