Automated extraction of viral sequence features, mutations, and epitopes from PubMed literature using LLMs to accelerate outbreak response and genomic analysis
This is a project from the NIAID BRC AI Codeathon 2025, taking place November 12-14, 2025 at Argonne National Laboratory.
Event Website: https://niaid-brc-codeathons.github.io/
Project Details: https://niaid-brc-codeathons.github.io/projects/pubmed-miner/
The NIAID Bioinformatics Resource Centers (BRCs) invite researchers, data scientists, and developers to a three-day AI Codeathon focused on improving Findability, Accessibility, Interoperability, and Reusability (FAIR-ness) of BRC data and tools using artificial intelligence (AI) and large language models (LLMs).
Indresh Singh (Team Leader)
A Streamlit app to search PubMed for review articles, fetch PMC full text (when available), and run an LLM-powered extractor to mine mutation/protein findings with grounded snippets. Supports multiple LLM backends: Gemini, OpenAI, Anthropic, Groq, and custom endpoints.
Option A: Local Python. For developers who want to run the app directly with Python and have full control over the environment. Requirements:
- Python 3.11+ (tested with Python 3.11.8)
- Operating System: Windows, macOS, or Linux
- Memory: 4GB+ RAM recommended
- Internet: required for API calls and PMC fetching

Option B: Docker. For users who want a consistent, isolated environment that works the same on any system. Requirements:
- Docker 20.10+ and Docker Compose 2.0+
- Git for cloning the repository
- Memory: 4GB+ RAM recommended
- Internet: required for API calls and PMC fetching
Check your Python version first:

```bash
python --version
# or
python3 --version
```

Windows:
- Download from python.org
- Choose Python 3.11+ (latest stable)
- Check "Add Python to PATH" during installation
macOS:
```bash
# Using Homebrew (recommended)
brew install python@3.11
# Or download from python.org
```

Linux (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install python3.11 python3.11-venv python3.11-pip
```

Linux (CentOS/RHEL):

```bash
sudo yum install python3.11 python3.11-venv python3.11-pip
```

Clone the repository:

```bash
git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner
```

Create a virtual environment:

```bash
# Using Python 3.11+ (replace with your Python version)
python3.11 -m venv .venv
# or
python -m venv .venv
```

Activate it.

Windows (PowerShell):

```powershell
.\.venv\Scripts\Activate.ps1
```

Windows (Command Prompt):

```bat
.\.venv\Scripts\activate.bat
```

macOS/Linux:

```bash
source .venv/bin/activate
```

Verify activation:

```bash
# You should see (.venv) in your prompt
which python
# Should show: /path/to/Pubmed_Miner/.venv/bin/python
```

To leave the environment later, run `deactivate`.

Install dependencies:

```bash
# Install core dependencies
pip install -r requirements.txt
# Optional: Install additional ML libraries if needed (not required)
```

Get an API key for at least one LLM backend:

Option 1: Gemini (Recommended)
- Go to Google AI Studio
- Sign in with your Google account
- Click "Get API key" → "Create API key"
- Copy the generated key
Option 2: Groq
- Go to Groq Console
- Sign in or create an account
- Create a new API key
- Copy the generated key
Option 3: OpenAI
- Go to OpenAI API Keys
- Sign in with your OpenAI account
- Click "Create new secret key"
- Copy the generated key
Option 4: Anthropic (Claude)
- Go to Anthropic Console
- Sign in with your Anthropic account
- Create a new API key
- Copy the generated key
NCBI API Key (required):
- Go to NCBI API Key Registration
- Sign in to your NCBI account (create one if needed)
- Go to "API Key Management" → "Create API Key"
- Copy the generated key
Why is an NCBI API key required?
- Without it: 3 requests/second, 10 requests/minute
- With it: 10 requests/second, 50 requests/minute
- Result: Faster PMC fetching, fewer rate limit errors
- Required for: Reliable operation and better performance
Option 1: Enter in the Web UI (Recommended)
- Start the app and enter your API keys in the sidebar
- Changes take effect immediately
- Easy to switch between different keys
Option 2: Create a .env file in the project root:
```bash
# Required: NCBI API key for PubMed search
NCBI_API_KEY="your_ncbi_api_key_here"

# LLM: Choose ONE backend (or provide multiple and choose in the UI)
GEMINI_API_KEY="your_gemini_api_key_here"
# OR
# GROQ_API_KEY="your_groq_api_key_here"
# OR
# OPENAI_API_KEY="your_openai_api_key_here"
# OR
# ANTHROPIC_API_KEY="your_anthropic_api_key_here"

# Optional: Rate limiting (adjust if you hit quotas)
GEMINI_RPM=10
GEMINI_TPM=180000
PAPER_PAUSE_SEC=3.0

# Optional: Contact info for NCBI
CONTACT_EMAIL="your_email@example.com"

# Optional: Custom LLM endpoint (for hackathon/proxy setups)
# CUSTOM_LLM_URL="http://localhost:8080/v1/completions"
# CUSTOM_LLM_MODEL="gpt4o"
# CUSTOM_LLM_TIMEOUT=120
```

Note:
- API keys entered in the web UI take priority over the `.env` file
- The app falls back to `.env` if no key is entered in the UI
- You can export these as environment variables instead of using `.env` if you prefer
```bash
# Method 1: Using the root shim (recommended)
streamlit run app.py

# Method 2: Direct execution
streamlit run app/app.py
```

Open the URL that Streamlit prints (usually http://localhost:8501).
The app should display:
- ✅ PubMed Review Miner title
- ✅ LLM Settings sidebar
- ✅ Query input area
- ✅ Date range selector
```bash
git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner
```

Create a .env file with your API keys:

```bash
# Copy the example and edit
cp env.example .env
# Edit .env with your API keys
```

Required API Keys:
- NCBI_API_KEY: Get from NCBI API Key Registration
- LLM API Key (choose one or provide multiple and select in UI):
- GEMINI_API_KEY: Get from Google AI Studio
- GROQ_API_KEY: Get from Groq Console
- OPENAI_API_KEY: Get from OpenAI API Keys
- ANTHROPIC_API_KEY: Get from Anthropic Console
- CUSTOM_LLM_URL: For custom/hackathon endpoints
Note: All environment variables from .env are automatically loaded. The Docker setup supports all LLM backends.
```bash
# Build and start the application
docker-compose up --build

# Or run in background
docker-compose up -d --build
```

- Open your browser to http://localhost:8501
- The app will be running in a containerized environment
```bash
# Stop containers
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

Manual Docker commands:

```bash
# Build the image
docker build -t pubmed-miner .

# Run the container
docker run -p 8501:8501 --env-file .env pubmed-miner

# View logs
docker-compose logs -f

# Rebuild after changes
docker-compose up --build --force-recreate

# Clean up
docker-compose down -v --rmi all
```

Benefits of the Docker setup:
- ✅ Consistent environment across all systems
- ✅ No Python version conflicts
- ✅ Easy deployment and scaling
- ✅ Isolated dependencies
- ✅ Production-ready setup
- ✅ All LLM backends supported (Gemini, OpenAI, Anthropic, Groq, Custom)
- ✅ Health checks enabled for monitoring
```
Pubmed_Miner/
├── app/                   # Streamlit UI module
│   ├── __init__.py        # Package initialization
│   └── app.py             # Main Streamlit application
├── app.py                 # Root shim (redirects to app/app.py)
├── llm/                   # LLM backends module
│   ├── __init__.py        # Package initialization
│   ├── gemini.py          # Google Gemini integration
│   ├── groq.py            # Groq API integration
│   ├── openai.py          # OpenAI API integration
│   ├── anthropic.py       # Anthropic API integration
│   ├── custom.py          # Custom LLM endpoint
│   ├── prompts.py         # Prompt templates
│   ├── utils.py           # Shared utilities
│   └── unified.py         # Unified LLM interface
├── pipeline/              # Batch processing module
│   ├── __init__.py        # Package initialization
│   ├── batch_analyze.py   # Batch fetch + LLM analysis
│   └── csv_export.py      # CSV export utilities
├── services/              # External API services
│   ├── __init__.py        # Package initialization
│   ├── pmc.py             # PMC full-text fetching
│   └── pubmed.py          # PubMed search & metadata
├── venv/                  # Virtual environment (excluded from git)
├── .env                   # Environment variables (excluded from git)
├── .gitignore             # Git ignore patterns
├── README.md              # This file
└── requirements.txt       # Python dependencies
```
app/ - User Interface
- `app.py`: Streamlit web interface with search, selection, and results display
- Handles user interactions and data visualization
llm/ - Language Model Integration
- `gemini.py`: Google Gemini API integration
- `groq.py`: Groq API integration (Llama models)
- `openai.py`: OpenAI API integration (GPT-4o)
- `anthropic.py`: Anthropic API integration (Claude)
- `custom.py`: Custom/backend-agnostic LLM endpoint
- `prompts.py`: Prompt templates and pattern descriptions
- `utils.py`: Shared utilities for all backends
- All backends provide identical APIs: `run_on_paper()`, `clean_and_ground()`
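Since every backend module exports the same two functions, a unified dispatcher only needs to pick a callable by name. A minimal sketch, with stub functions standing in for the real backend modules:

```python
# Sketch of backend-agnostic dispatch: each backend module exposes the same API,
# so the unified layer only needs to select one by name. The stubs below stand
# in for the real llm.gemini / llm.groq modules.
def _gemini_run_on_paper(text: str) -> dict:
    return {"backend": "gemini", "findings": []}

def _groq_run_on_paper(text: str) -> dict:
    return {"backend": "groq", "findings": []}

BACKENDS = {"gemini": _gemini_run_on_paper, "groq": _groq_run_on_paper}

def run_on_paper(text: str, backend: str = "gemini") -> dict:
    try:
        return BACKENDS[backend](text)
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None

print(run_on_paper("sample full text", backend="groq")["backend"])
```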
pipeline/ - Batch Processing
- `batch_analyze.py`: Orchestrates PMC fetching + LLM analysis
- Functions: `fetch_all_fulltexts()`, `analyze_texts()`, `flatten_to_rows()`
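A plausible shape for the flattening step, assuming results keyed by PMID (the data layout and helper below are hypothetical; the real `flatten_to_rows()` may differ):

```python
import csv, io

def flatten_to_rows(results: dict[str, list[dict]]) -> list[dict]:
    """Flatten {pmid: [finding, ...]} into one CSV row per finding."""
    rows = []
    for pmid, findings in results.items():
        for finding in findings:
            rows.append({"pmid": pmid, **finding})
    return rows

results = {"12345": [{"mutation": "A226V", "protein": "E1"}]}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pmid", "mutation", "protein"])
writer.writeheader()
writer.writerows(flatten_to_rows(results))
print(buf.getvalue().strip())
```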
services/ - External APIs
- `pmc.py`: PMC full-text retrieval (JATS XML + HTML fallback)
- `pubmed.py`: PubMed search, metadata, and date filtering
- Functions: `esearch_reviews()`, `esummary()`, `get_pmc_fulltext_with_meta()`
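For orientation, `esearch_reviews()` presumably issues requests along these lines. The parameters shown are standard NCBI E-utilities esearch parameters, but the helper and its signature are assumptions, not the module's real code:

```python
from urllib.parse import urlencode

def build_esearch_params(query: str, mindate: str, maxdate: str, retmax: int = 100) -> dict:
    """Restrict a PubMed query to review articles within a publication-date range."""
    return {
        "db": "pubmed",
        "term": f"({query}) AND Review[Publication Type]",
        "datetype": "pdat",       # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }

params = build_esearch_params("Dengue[Title] AND protein", "2020/01/01", "2025/01/01")
print("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params))
```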
```python
# UI imports
from app.app import main

# LLM backends (all provide identical APIs)
from llm.gemini import run_on_paper, clean_and_ground
from llm.groq import run_on_paper, clean_and_ground
from llm.openai import run_on_paper, clean_and_ground
from llm.anthropic import run_on_paper, clean_and_ground
from llm.custom import run_on_paper, clean_and_ground
from llm.unified import run_on_paper, clean_and_ground  # Unified interface

# Services
from services.pmc import get_pmc_fulltext_with_meta, get_last_fetch_source
from services.pubmed import esearch_reviews, esummary

# Pipeline
from pipeline.batch_analyze import fetch_all_fulltexts, analyze_texts
from pipeline.csv_export import flatten_to_rows
```

When the app starts, you can try a query like:
```
((Dengue[Title]) AND (protein)) AND ((active site[Text Word]) OR (mutation[Text Word]))
```
This will search for Dengue-related protein review literature mentioning an active site or mutations.
1. Search PubMed – Enter your query and press search.
2. Select PMIDs – Choose papers to include in analysis.
3. Fetch Full Text – The app tries to pull PMC full text (if available).
4. Run Extraction – Use the LLM-based pipeline to extract mutations, proteins, and structural features with grounded quotes.
5. Review & Export – Inspect results in the UI; download CSV/JSON as needed.
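Under the hood, these steps map roughly onto the documented module functions. A self-contained sketch with stub bodies in place of the real network and LLM calls (the stubs and sample data are illustrative only):

```python
# End-to-end flow sketch; each stub mirrors a documented function.
def esearch_reviews(query):                  # services.pubmed
    return ["12345", "67890"]

def fetch_all_fulltexts(pmids):              # pipeline.batch_analyze
    return {p: f"full text of {p}" for p in pmids}

def analyze_texts(texts):                    # pipeline.batch_analyze (LLM extraction)
    return {p: [{"mutation": "A226V", "quote": t[:20]}] for p, t in texts.items()}

def flatten_to_rows(results):                # pipeline.csv_export
    return [{"pmid": p, **f} for p, fs in results.items() for f in fs]

pmids = esearch_reviews("Dengue[Title] AND protein")
rows = flatten_to_rows(analyze_texts(fetch_all_fulltexts(pmids)))
print(len(rows), rows[0]["pmid"])
```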
Edit the extraction prompt in the web UI to customize what features are extracted.
- Expand "📝 Edit Extraction Prompt" in the web UI
- Edit the `PATTERN RECOGNITION GUIDE` to add/modify pattern descriptions
- Edit the `INSTRUCTIONS` to change extraction priorities
- Click "💾 Save Changes" and test on a sample paper
- PATTERN RECOGNITION GUIDE: Mutation formats, protein patterns, residue patterns, domain patterns
- INSTRUCTIONS: Extraction priorities, coverage requirements, filtering criteria
- SYSTEM / INSTRUCTION: AI extractor role and identity
- DEFINITIONS: Feature types and definitions
Locked (for safety): JSON schema, output format, few-shot examples
- Add all mutation notation styles you encounter (`A226V`, `p.Ala226Val`, spelled-out mutations)
- Include concrete examples in pattern descriptions
- Test incrementally with small changes
- Use "🔄 Reset to Default" if needed
To add arrow notation (226A→V), update the mutation patterns:
```
**Mutation Patterns:**
• Standard: A226V, K128E
• Arrow notation: 226A→V, 128K→E   ← Add this
• HGVS: p.Ala226Val
```
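The extractor itself is prompt-based (no regex pre-scanning), but a quick regex check is a handy way to verify which notation styles your pattern guide should cover. A rough, non-exhaustive sketch:

```python
import re

# Rough patterns for the three notation styles listed above (not exhaustive)
MUTATION_RE = re.compile(
    r"\b[A-Z]\d{1,4}[A-Z]\b"                       # standard: A226V, K128E
    r"|\bp\.[A-Z][a-z]{2}\d{1,4}[A-Z][a-z]{2}\b"   # HGVS: p.Ala226Val
    r"|\b\d{1,4}[A-Z]→[A-Z]\b"                     # arrow: 226A→V
)

text = "The A226V substitution (also written p.Ala226Val or 226A→V) enhances fitness."
print(MUTATION_RE.findall(text))
```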
GEMINI_API_KEY not set or GROQ_API_KEY not set
- Add your API key to the `.env` file or export it as an environment variable
- Restart the app after adding the key
NCBI_API_KEY not set
- NCBI API key is required for reliable operation
- Get your key from NCBI API Key Registration
- Add it to your `.env` file: `NCBI_API_KEY="your_key_here"`
- Restart the app after adding the key
Rate limit/quota errors
- Lower `GEMINI_RPM` and/or `GEMINI_TPM` in `.env`
- Increase `PAPER_PAUSE_SEC` for slower processing
- Ensure you have an NCBI API key; it's required for reliable operation (3→10 req/sec, 10→50 req/min)
Some PMIDs show no PMC text
- The paper may be embargoed or not deposited in PMC
- The app will still process available items
Import errors or missing modules
- Ensure you're using Python 3.11+
- Check that the virtual environment is activated: `which python` should show `.venv/bin/python`
- Reinstall dependencies: `pip install -r requirements.txt`
- Try `pip install --upgrade pip`, then `pip install -r requirements.txt`
Blank page or app won't start
- Try `streamlit run app/app.py` directly
- Check the console for error messages
- Ensure all dependencies are installed
Docker-specific issues
- Container won't start: check logs with `docker-compose logs -f`
- Health check fails: ensure the app is running (wait 40s for startup)
- Port already in use: change the port mapping in `docker-compose.yml` (e.g., `"8502:8501"`)
- API keys not working: verify the `.env` file is mounted and variables are set correctly
- Build fails: clear the Docker cache with `docker-compose build --no-cache`
- Python: 3.11+ (tested with 3.11.8)
- RAM: 4GB+ recommended
- Storage: 2GB+ free space
- OS: Windows 10+, macOS 10.15+, or Linux (Ubuntu 20.04+)
- The app focuses on Review articles and handles date filters and pagination.
- Fetching relies on PMC availability; for non-PMC papers, full text may not be obtainable.
- The extractor uses prompt-based pattern recognition (no regex pre-scanning).
- Multiple LLM backends are supported; choose based on your needs (speed, cost, accuracy).
License: MIT.