Paper Crawler

Collects scientific papers whose data has been verified as publicly accessible, for use in RL training environments.

What It Does

1. Searches OpenAlex for papers that mention data repositories (Zenodo, GitHub, OSF).
2. Verifies that the linked repository actually contains data, not just code.
3. Saves only the papers with confirmed downloadable data.
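
The first step depends on recognizing repository mentions in paper text. The sketch below shows one way this could work with regular expressions. The names `REPO_PATTERNS` and `find_repo_links` are hypothetical, and the patterns are illustrative assumptions; the actual patterns live in `config.py` and may differ.

```python
import re

# Hypothetical patterns for illustration; the real ones are in config.py.
REPO_PATTERNS = {
    "Zenodo": re.compile(
        r"https?://(?:www\.)?zenodo\.org/records?/\d+|10\.5281/zenodo\.\d+",
        re.IGNORECASE,
    ),
    "GitHub": re.compile(
        r"https?://(?:www\.)?github\.com/[\w.-]+/[\w.-]+", re.IGNORECASE
    ),
    "OSF": re.compile(r"https?://(?:www\.)?osf\.io/[a-z0-9]{5}\b", re.IGNORECASE),
}


def find_repo_links(text: str) -> dict[str, list[str]]:
    """Return repository mentions found in `text`, grouped by source."""
    found: dict[str, list[str]] = {}
    for source, pattern in REPO_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            # Strip a trailing sentence period, then drop duplicates
            # while keeping first-seen order.
            cleaned = (m.rstrip(".") for m in matches)
            found[source] = list(dict.fromkeys(cleaned))
    return found


if __name__ == "__main__":
    sample = "Data: https://zenodo.org/records/1234567, code: https://github.com/alice/proj."
    print(find_repo_links(sample))
```

A match here only means a paper mentions a repository. Whether that repository holds data is a separate check, described under Data Verification.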

Output

output/papers.json  # 983 papers with verified data
output/papers.csv   # Simplified format

Each paper has: title, DOI, category, verified data link, source, date.
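
As a rough illustration of the record layout, the sketch below models one paper as a dataclass and writes the simplified CSV form. The class name `PaperRecord`, the function `records_to_csv`, and the exact field names (`data_url` in particular) are assumptions; the keys actually used in `papers.json` may be named differently. The sample values are placeholders, not real entries.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PaperRecord:
    # Field names are assumed from the list above, not taken from the code.
    title: str
    doi: str
    category: str
    data_url: str  # the verified data link
    source: str    # e.g. "Zenodo", "GitHub", "OSF"
    date: str


def records_to_csv(records: list[PaperRecord]) -> str:
    """Serialize records to CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(PaperRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = PaperRecord(
        "Example title", "10.1234/abc", "biology",
        "https://zenodo.org/records/1", "Zenodo", "2024-01-01",
    )
    print(records_to_csv([example]))
```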

Usage

python3 crawler.py

Files

| File | Purpose |
| --- | --- |
| `crawler.py` | Main crawler |
| `data_detector.py` | Verifies repos contain actual data files |
| `storage.py` | Deduplication + resume |
| `config.py` | API key, date cutoff, patterns |
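
To make "deduplication + resume" concrete, here is a minimal sketch of how such a store could behave. It is not the contents of `storage.py`: the class `PaperStore`, its methods, and the choice of a lowercased DOI as the deduplication key are all assumptions.

```python
import json
from pathlib import Path


class PaperStore:
    """Hypothetical JSON-backed store keyed by normalized DOI."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.papers: dict[str, dict] = {}
        # Resume: reload anything saved by a previous run.
        if self.path.exists():
            for paper in json.loads(self.path.read_text(encoding="utf-8")):
                self.papers[self._key(paper)] = paper

    @staticmethod
    def _key(paper: dict) -> str:
        # Assumption: DOIs are compared case-insensitively.
        return paper["doi"].strip().lower()

    def add(self, paper: dict) -> bool:
        """Add a paper; return False if its DOI is already stored."""
        key = self._key(paper)
        if key in self.papers:
            return False
        self.papers[key] = paper
        return True

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(
            json.dumps(list(self.papers.values()), indent=2), encoding="utf-8"
        )
```

With this design, an interrupted crawl can be restarted: papers already on disk are loaded first, so `add` rejects them and only new results are appended.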

Data Verification

The key innovation is checking that each repository contains actual data files, not just code.

GitHub: Fetches the repository file tree and looks for data file extensions (.csv, .h5, .npy, etc.).
Zenodo: Checks that the record has downloadable data files.
OSF: Verifies that the project is public and contains data.
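
The GitHub check can be sketched as a filter over a file tree. The example assumes the input is the parsed JSON from GitHub's REST "Get a tree" endpoint (requested with `recursive=1`), which returns a `tree` array of entries with `path` and `type` fields. The function names and the extension set are hypothetical: only `.csv`, `.h5`, and `.npy` come from this README, and the rest are illustrative guesses at what "etc." might cover.

```python
from pathlib import PurePosixPath

# .csv, .h5, .npy are named in the README; the others are assumptions.
DATA_EXTENSIONS = {".csv", ".h5", ".npy", ".hdf5", ".npz", ".parquet", ".tsv"}


def data_files_in_tree(tree_response: dict) -> list[str]:
    """Return paths of files whose extension looks like a data format."""
    paths = []
    for entry in tree_response.get("tree", []):
        # "blob" entries are files; "tree" entries are directories.
        if entry.get("type") != "blob":
            continue
        suffix = PurePosixPath(entry["path"]).suffix.lower()
        if suffix in DATA_EXTENSIONS:
            paths.append(entry["path"])
    return paths


def has_data(tree_response: dict) -> bool:
    """True if the repository tree contains at least one data file."""
    return bool(data_files_in_tree(tree_response))
```

An extension check is a heuristic. It can miss data stored in archives or unusual formats, and it can accept small fixture files that are not research data.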

Stats

| Source | Papers |
| --- | --- |
| Zenodo | 925 |
| GitHub | 34 |
| OSF | 14 |
| Other | 10 |
| Total | 983 |
