A distributed, microservice-based web crawler for fetching open-access research papers from PubMed, arXiv, bioRxiv, and medRxiv. Includes a modern web interface for easy paper collection.
- Multi-source scraping from PubMed, arXiv, bioRxiv, and medRxiv
- Web-based user interface with real-time progress tracking
- Intelligent paper classification (Research, Review, Clinical Trial, etc.)
- Advanced filtering by year, country, paper type, and language
- Multiple export formats: CSV, JSON, Parquet, TXT
- RESTful API with async background jobs
- Command-line interface for scripting
```bash
# Clone the repository
git clone https://github.com/yourusername/PubMed-Scraper.git
cd PubMed-Scraper

# Install dependencies
pip install -r requirements.txt

# Install Flask for web interface
pip install flask
```

The easiest way to use PubMed Scraper is through the web interface:

```bash
python web_app.py
```

Open http://localhost:5000 in your browser. The interface allows you to:
- Enter search queries
- Select data sources (PubMed, arXiv, bioRxiv, or all)
- Filter by country (USA, India, China, Japan)
- Set maximum number of papers
- Download results as JSON or CSV
```python
import asyncio

from src.crawlers import CrawlerFactory, FilterParams

async def main():
    # Configure filters
    filters = FilterParams(
        year_start=2020,
        year_end=2024,
        max_results=100,
    )

    # Scrape from PubMed
    crawler = CrawlerFactory.get("pubmed")
    papers = []
    async with crawler:
        async for paper in crawler.crawl("cancer biomarkers", filters):
            papers.append(paper)

    print(f"Found {len(papers)} papers")

asyncio.run(main())
```

```bash
# Basic search
python -m src.cli scrape "cancer biomarkers" --max 100

# Multi-source with filters
python -m src.cli scrape "machine learning" \
    --source pubmed \
    --source arxiv \
    --from 2020 \
    --to 2024 \
    --format csv
```

```
PubMed-Scraper/
|-- src/
|   |-- crawlers/
|   |   |-- base/
|   |   |-- pubmed/
|   |   |-- arxiv/
|   |   +-- biorxiv/
|   |-- processors/
|   |-- export/
|   |-- gateway/
|   +-- shared/
|-- templates/
|-- static/
|-- web_app.py
|-- requirements.txt
+-- tests/
```
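The per-source packages under `src/crawlers/` suggest a factory pattern: each source contributes a crawler class that `CrawlerFactory` resolves by name. As a hedged sketch of that registration pattern (the class names, `source` keyword, and `get` helper here are illustrative, not the project's actual code):

```python
from abc import ABC, abstractmethod

class BaseCrawler(ABC):
    """Illustrative base class; subclasses self-register under a source name."""
    registry: dict = {}

    def __init_subclass__(cls, source=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if source is not None:
            BaseCrawler.registry[source] = cls

    @abstractmethod
    async def crawl(self, query, filters):
        ...

class PubMedCrawler(BaseCrawler, source="pubmed"):
    async def crawl(self, query, filters):
        # Real implementation would hit the PubMed API; this is a stub.
        yield {"title": f"stub result for {query!r}"}

def get(name):
    """Look up and instantiate a crawler by source name, as a factory might."""
    return BaseCrawler.registry[name]()

print(type(get("pubmed")).__name__)  # PubMedCrawler
```

The appeal of this layout is that adding a new source (e.g. medRxiv) is just a new package whose crawler class registers itself, with no changes to the factory.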
| Type | Description |
|---|---|
| research_article | Original research paper |
| review | Literature review |
| systematic_review | Systematic review |
| meta_analysis | Meta-analysis |
| clinical_trial | Clinical trial |
| randomized_controlled_trial | RCT |
| case_report | Case report |
| preprint | Preprint (not peer-reviewed) |
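The README doesn't show how raw metadata maps to these labels; as an illustrative sketch only (the keyword rules and function name are assumptions, not the project's actual classifier), one approach is to match publication-type strings against priority-ordered rules, with more specific labels checked first:

```python
# Illustrative only: priority-ordered keyword rules mapping raw
# publication-type strings to the labels above.
RULES = [
    ("randomized controlled trial", "randomized_controlled_trial"),
    ("clinical trial", "clinical_trial"),
    ("meta-analysis", "meta_analysis"),
    ("systematic review", "systematic_review"),
    ("review", "review"),
    ("case report", "case_report"),
    ("preprint", "preprint"),
]

def classify(pub_types):
    """Return the first matching label, defaulting to research_article."""
    joined = " ".join(t.lower() for t in pub_types)
    for keyword, label in RULES:
        if keyword in joined:
            return label
    return "research_article"

print(classify(["Journal Article", "Meta-Analysis"]))  # meta_analysis
```

Ordering matters: "randomized controlled trial" must be checked before the broader "clinical trial", and "systematic review" before "review".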
| Format | Extension | Best For |
|---|---|---|
| CSV | .csv | Excel, spreadsheets |
| JSON | .json | APIs, full data structure |
| Parquet | .parquet | Big data analytics |
| TXT | .txt | Human reading |
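The trade-off between the two most common formats can be seen with the standard library alone (the column set below is illustrative, not the exporter's real schema): JSON preserves the full nested structure, while CSV flattens records into spreadsheet-friendly rows.

```python
import csv
import io
import json

papers = [
    {"title": "Example paper", "year": 2023, "source": "pubmed"},
    {"title": "Another paper", "year": 2024, "source": "arxiv"},
]

# JSON keeps the full data structure.
json_out = json.dumps(papers, indent=2)

# CSV flattens to one row per paper.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["title", "year", "source"])
writer.writeheader()
writer.writerows(papers)
csv_out = csv_buf.getvalue()

print(csv_out.splitlines()[0])  # title,year,source
```

Parquet export additionally requires pyarrow or pandas, which is why it targets analytics pipelines rather than quick inspection.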
Create a `.env` file from `.env.example`:
```
# PubMed API (optional, for higher rate limits)
PUBMED_API_KEY=your-api-key
NCBI_EMAIL=your-email@example.com
```

```bash
# Start all services
docker-compose up -d

# Scale workers
docker-compose up -d --scale worker=4

# Stop services
docker-compose down
```

```bash
# Run all tests
pytest

# With coverage
pytest --cov=src --cov-report=html
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web interface |
| `/api/scrape` | POST | Start scraping job |
| `/api/status/<job_id>` | GET | Get job status |
| `/api/papers/<job_id>` | GET | Get scraped papers |
| `/api/download/<job_id>/<format>` | GET | Download results |
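A typical client flow is to POST to `/api/scrape`, then poll `/api/status/<job_id>` until the job finishes. A hedged sketch with the standard library (the JSON field names `query`, `sources`, `max_results`, and `job_id` are assumptions; check the gateway code for the real schema):

```python
import json
import urllib.request

BASE = "http://localhost:5000"

def scrape_request(query, sources=("pubmed",), max_results=100):
    """Build the POST /api/scrape request (body field names are assumed)."""
    body = json.dumps({
        "query": query,
        "sources": list(sources),
        "max_results": max_results,
    }).encode()
    return urllib.request.Request(
        f"{BASE}/api/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def status_url(job_id):
    """URL for polling GET /api/status/<job_id>."""
    return f"{BASE}/api/status/{job_id}"

# Build (but don't send) a request, then show where polling would happen.
req = scrape_request("cancer biomarkers", sources=("pubmed", "arxiv"))
print(req.full_url, status_url("1234"))
```

Once the status endpoint reports completion, fetch the results from `/api/papers/<job_id>` or grab a file via `/api/download/<job_id>/<format>`.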
- Python 3.10+
- Flask (for web interface)
- httpx (for async HTTP)
- See requirements.txt for full list
LGPL-3.0 - see LICENSE for details.