Collects scientific papers with verified accessible data for RL training environments.
- Searches OpenAlex for papers mentioning data repositories (Zenodo, GitHub, OSF)
- Verifies data actually exists - not just code repos
- Saves papers with confirmed downloadable data
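
The search step in the list above could be sketched as follows. This is a minimal illustration, not the crawler's actual query logic: the function name `build_search_url` is hypothetical, and the use of the `search`, `per-page`, and `cursor` parameters of the OpenAlex `/works` endpoint is an assumption about how such a query might be built.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def build_search_url(term, cursor="*", per_page=200):
    """Build an OpenAlex works query URL for one repository keyword.

    Assumption: the crawler searches by keyword (e.g. "zenodo.org") and
    walks results with cursor paging; the real queries may differ.
    """
    params = {"search": term, "per-page": per_page, "cursor": cursor}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"

# One query URL per repository keyword mentioned in the README.
urls = [build_search_url(t) for t in ("zenodo.org", "github.com", "osf.io")]
```

Each response page would then supply the cursor value for the next request until no further cursor is returned.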

Output files:

- `output/papers.json`: 983 papers with verified data
- `output/papers.csv`: simplified format

Each paper has: title, DOI, category, verified data link, source, date.

To run the crawler: `python3 crawler.py`

| File | Purpose |
|---|---|
| crawler.py | Main crawler |
| data_detector.py | Verifies repos contain actual data files |
| storage.py | Deduplication + resume |
| config.py | API key, date cutoff, patterns |

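
The deduplication and resume behavior attributed to `storage.py` might look roughly like the sketch below. It is not the repository's code: the class name `PaperStore`, the `doi` field name, and the choice of a normalized DOI as the deduplication key are all assumptions made for illustration.

```python
import json
import os

class PaperStore:
    """Hypothetical store: dedupes by normalized DOI, reloads on restart."""

    def __init__(self, path):
        self.path = path
        self.papers = {}
        # Resume: reload anything saved by a previous run.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for paper in json.load(f):
                    self.papers[self._key(paper["doi"])] = paper

    @staticmethod
    def _key(doi):
        # Normalize so "https://doi.org/10.x/Y" and "10.x/y" collide.
        doi = doi.strip().lower()
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if doi.startswith(prefix):
                doi = doi[len(prefix):]
        return doi

    def add(self, paper):
        """Return True if the paper is new, False if already stored."""
        key = self._key(paper["doi"])
        if key in self.papers:
            return False
        self.papers[key] = paper
        return True

    def save(self):
        # Write to a temp file first so an interrupted run cannot
        # leave a half-written JSON file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(list(self.papers.values()), f, indent=2)
        os.replace(tmp, self.path)
```

Saving after each batch and reloading in the constructor lets a restarted crawl skip papers it has already recorded.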
The key innovation: the crawler checks that repos contain data files, not just code.
- GitHub: Fetches file tree, looks for `.csv`, `.h5`, `.npy`, etc.
- Zenodo: Checks record has downloadable data files
- OSF: Verifies project is public with data
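
The GitHub check described above comes down to scanning a repository's file paths for data extensions. The sketch below shows one way to do that on a list of paths, such as the `path` entries of a GitHub git tree listing. The function names are hypothetical, and only `.csv`, `.h5`, and `.npy` come from this README; the other extensions are illustrative guesses.

```python
from pathlib import PurePosixPath

# Only .csv, .h5 and .npy are named in the README; the rest are assumed.
DATA_EXTENSIONS = {".csv", ".h5", ".npy", ".hdf5", ".parquet"}

def find_data_files(paths):
    """Return the paths whose extension marks them as data files."""
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() in DATA_EXTENSIONS
    ]

def has_data(paths):
    """True if at least one path looks like a data file."""
    return bool(find_data_files(paths))

# Example file tree: a code-only repo would return an empty list.
tree = ["src/main.py", "data/train.CSV", "README.md", "weights/emb.npy"]
print(find_data_files(tree))
```

Matching on the lowercased suffix keeps the check cheap, at the cost of missing data stored under unusual extensions or inside archives.
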
| Source | Papers |
|---|---|
| Zenodo | 925 |
| GitHub | 34 |
| OSF | 14 |
| Other | 10 |
| Total | 983 |
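
A per-source tally like the table above could be recomputed from `output/papers.json`. The sketch below assumes the file is a JSON array of objects with a `source` field; that layout and the function name `count_by_source` are assumptions, and the sample records are placeholders rather than real entries.

```python
import json
from collections import Counter

def count_by_source(papers_json_text):
    """Count papers per source; records without a source count as 'Other'."""
    papers = json.loads(papers_json_text)
    return Counter(p.get("source", "Other") for p in papers)

# Placeholder records standing in for the contents of output/papers.json.
sample = json.dumps([
    {"title": "A", "source": "Zenodo"},
    {"title": "B", "source": "Zenodo"},
    {"title": "C", "source": "GitHub"},
    {"title": "D"},
])
counts = count_by_source(sample)
for source, n in counts.most_common():
    print(f"| {source} | {n} |")
```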