A distributed, microservice-based web crawler for fetching open-access research papers from PubMed, arXiv, bioRxiv, and medRxiv. Includes a modern web interface for easy paper collection.
- Multi-source scraping from PubMed, arXiv, bioRxiv, and medRxiv
- Web-based user interface with real-time progress tracking
- Intelligent paper classification (Research, Review, Clinical Trial, etc.)
- Advanced filtering by year, country, paper type, and language
- Multiple export formats: CSV, JSON, Parquet, TXT
- RESTful API with async background jobs
- Command-line interface for scripting
```bash
# Clone the repository
git clone https://github.com/yourusername/PubMed-Scraper.git
cd PubMed-Scraper

# Install dependencies
pip install -r requirements.txt

# Install Flask for web interface
pip install flask
```

The easiest way to use PubMed Scraper is through the web interface:

```bash
python web_app.py
```

Open http://localhost:5000 in your browser. The interface allows you to:
- Enter search queries
- Select data sources (PubMed, arXiv, bioRxiv, or all)
- Filter by country (USA, India, China, Japan)
- Set maximum number of papers
- Download results as JSON or CSV
```python
import asyncio

from src.crawlers import CrawlerFactory, FilterParams

async def main():
    # Configure filters
    filters = FilterParams(
        year_start=2020,
        year_end=2024,
        max_results=100,
    )

    # Scrape from PubMed
    crawler = CrawlerFactory.get("pubmed")
    papers = []
    async with crawler:
        async for paper in crawler.crawl("cancer biomarkers", filters):
            papers.append(paper)

    print(f"Found {len(papers)} papers")

asyncio.run(main())
```

```bash
# Basic search
python -m src.cli scrape "cancer biomarkers" --max 100

# Multi-source with filters
python -m src.cli scrape "machine learning" \
    --source pubmed \
    --source arxiv \
    --from 2020 \
    --to 2024 \
    --format csv
```

```
PubMed-Scraper/
|-- src/
|   |-- crawlers/
|   |   |-- base/
|   |   |-- pubmed/
|   |   |-- arxiv/
|   |   +-- biorxiv/
|   |-- processors/
|   |-- export/
|   |-- gateway/
|   +-- shared/
|-- templates/
|-- static/
|-- web_app.py
|-- requirements.txt
+-- tests/
```
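The per-source packages under `src/crawlers/` suggest a factory pattern: each source contributes a crawler class that `CrawlerFactory` resolves by name. As a hedged sketch of that registration pattern (the class names, `source` keyword, and `get` helper here are illustrative, not the project's actual code):

```python
from abc import ABC, abstractmethod

class BaseCrawler(ABC):
    """Illustrative base class; subclasses self-register under a source name."""
    registry: dict = {}

    def __init_subclass__(cls, source=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if source is not None:
            BaseCrawler.registry[source] = cls

    @abstractmethod
    async def crawl(self, query, filters):
        ...

class PubMedCrawler(BaseCrawler, source="pubmed"):
    async def crawl(self, query, filters):
        # Real implementation would hit the PubMed API; this is a stub.
        yield {"title": f"stub result for {query!r}"}

def get(name):
    """Look up and instantiate a crawler by source name, as a factory might."""
    return BaseCrawler.registry[name]()

print(type(get("pubmed")).__name__)  # PubMedCrawler
```

The appeal of this layout is that adding a new source (e.g. medRxiv) is just a new package whose crawler class registers itself, with no changes to the factory.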
| Type | Description |
|---|---|
| research_article | Original research paper |
| review | Literature review |
| systematic_review | Systematic review |
| meta_analysis | Meta-analysis |
| clinical_trial | Clinical trial |
| randomized_controlled_trial | RCT |
| case_report | Case report |
| preprint | Preprint (not peer-reviewed) |
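The README doesn't show how raw metadata maps to these labels; as an illustrative sketch only (the keyword rules and function name are assumptions, not the project's actual classifier), one approach is to match publication-type strings against priority-ordered rules, with more specific labels checked first:

```python
# Illustrative only: priority-ordered keyword rules mapping raw
# publication-type strings to the labels above.
RULES = [
    ("randomized controlled trial", "randomized_controlled_trial"),
    ("clinical trial", "clinical_trial"),
    ("meta-analysis", "meta_analysis"),
    ("systematic review", "systematic_review"),
    ("review", "review"),
    ("case report", "case_report"),
    ("preprint", "preprint"),
]

def classify(pub_types):
    """Return the first matching label, defaulting to research_article."""
    joined = " ".join(t.lower() for t in pub_types)
    for keyword, label in RULES:
        if keyword in joined:
            return label
    return "research_article"

print(classify(["Journal Article", "Meta-Analysis"]))  # meta_analysis
```

Ordering matters: "randomized controlled trial" must be checked before the broader "clinical trial", and "systematic review" before "review".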
| Format | Extension | Best For |
|---|---|---|
| CSV | .csv | Excel, spreadsheets |
| JSON | .json | APIs, full data structure |
| Parquet | .parquet | Big data analytics |
| TXT | .txt | Human reading |
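The trade-off between the two most common formats can be seen with the standard library alone (the column set below is illustrative, not the exporter's real schema): JSON preserves the full nested structure, while CSV flattens records into spreadsheet-friendly rows.

```python
import csv
import io
import json

papers = [
    {"title": "Example paper", "year": 2023, "source": "pubmed"},
    {"title": "Another paper", "year": 2024, "source": "arxiv"},
]

# JSON keeps the full data structure.
json_out = json.dumps(papers, indent=2)

# CSV flattens to one row per paper.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["title", "year", "source"])
writer.writeheader()
writer.writerows(papers)
csv_out = csv_buf.getvalue()

print(csv_out.splitlines()[0])  # title,year,source
```

Parquet export additionally requires pyarrow or pandas, which is why it targets analytics pipelines rather than quick inspection.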
Create a `.env` file from `.env.example`:
```
# PubMed API (optional, for higher rate limits)
PUBMED_API_KEY=your-api-key
NCBI_EMAIL=your-email@example.com
```

```bash
# Start all services
docker-compose up -d

# Scale workers
docker-compose up -d --scale worker=4

# Stop services
docker-compose down
```

```bash
# Run all tests
pytest

# With coverage
pytest --cov=src --cov-report=html
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web interface |
| `/api/scrape` | POST | Start scraping job |
| `/api/status/<job_id>` | GET | Get job status |
| `/api/papers/<job_id>` | GET | Get scraped papers |
| `/api/download/<job_id>/<format>` | GET | Download results |
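A typical client flow is to POST to `/api/scrape`, then poll `/api/status/<job_id>` until the job finishes. A hedged sketch with the standard library (the JSON field names `query`, `sources`, `max_results`, and `job_id` are assumptions; check the gateway code for the real schema):

```python
import json
import urllib.request

BASE = "http://localhost:5000"

def scrape_request(query, sources=("pubmed",), max_results=100):
    """Build the POST /api/scrape request (body field names are assumed)."""
    body = json.dumps({
        "query": query,
        "sources": list(sources),
        "max_results": max_results,
    }).encode()
    return urllib.request.Request(
        f"{BASE}/api/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def status_url(job_id):
    """URL for polling GET /api/status/<job_id>."""
    return f"{BASE}/api/status/{job_id}"

# Build (but don't send) a request, then show where polling would happen.
req = scrape_request("cancer biomarkers", sources=("pubmed", "arxiv"))
print(req.full_url, status_url("1234"))
```

Once the status endpoint reports completion, fetch the results from `/api/papers/<job_id>` or grab a file via `/api/download/<job_id>/<format>`.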
- Python 3.10+
- Flask (for web interface)
- httpx (for async HTTP)
- See requirements.txt for full list
LGPL-3.0 - see LICENSE for details.