PubMed Miner: AI-Powered Sequence Feature Extraction from Literature

Automated extraction of viral sequence features, mutations, and epitopes from PubMed literature using LLMs to accelerate outbreak response and genomic analysis

About This Project

This is a project from the NIAID BRC AI Codeathon 2025, taking place November 12-14, 2025 at Argonne National Laboratory.

Event Website: https://niaid-brc-codeathons.github.io/

Project Details: https://niaid-brc-codeathons.github.io/projects/pubmed-miner/

Codeathon Goals

The NIAID Bioinformatics Resource Centers (BRCs) invite researchers, data scientists, and developers to a three-day AI Codeathon focused on improving Findability, Accessibility, Interoperability, and Reusability (FAIR-ness) of BRC data and tools using artificial intelligence (AI) and large language models (LLMs).

Team

Indresh Singh (Team Leader)

Getting Started

PubMed Miner

A Streamlit app to search PubMed for review articles, fetch PMC full text (when available), and run an LLM-powered extractor to mine mutation/protein findings with grounded snippets. Supports multiple LLM backends: Gemini, OpenAI, Anthropic, Groq, and custom endpoints.


🚀 Choose Your Setup Method

Option 1: Traditional Python Setup

For developers who want to run the app directly with Python and have full control over the environment.

Option 2: Docker Setup (Recommended)

For users who want a consistent, isolated environment that works the same on any system.


📋 Prerequisites

For Python Setup

  • Python 3.11+ (tested with Python 3.11.8)
  • Operating System: Windows, macOS, or Linux
  • Memory: 4GB+ RAM recommended
  • Internet: Required for API calls and PMC fetching

For Docker Setup

  • Docker 20.10+ and Docker Compose 2.0+
  • Git for cloning the repository
  • Memory: 4GB+ RAM recommended
  • Internet: Required for API calls and PMC fetching

🐍 Option 1: Traditional Python Setup

Step 1: Python Version Setup

Check your Python version:

python --version
# or
python3 --version

If you need to install Python 3.11+:

Windows:

  1. Download from python.org
  2. Choose Python 3.11+ (latest stable)
  3. Check "Add Python to PATH" during installation

macOS:

# Using Homebrew (recommended)
brew install python@3.11

# Or download from python.org

Linux (Ubuntu/Debian):

sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip

Linux (RHEL/CentOS Stream):

sudo dnf install python3.11 python3.11-pip

Step 2: Clone the repository

git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner

Step 3: Create & activate virtual environment

Create virtual environment:

# Using Python 3.11+ (replace with your Python version)
python3.11 -m venv .venv
# or
python -m venv .venv

Activate virtual environment:

Windows (PowerShell):

.\.venv\Scripts\Activate.ps1

Windows (Command Prompt):

.\.venv\Scripts\activate.bat

macOS/Linux:

source .venv/bin/activate

Verify activation:

# You should see (.venv) in your prompt
which python
# Should show: /path/to/Pubmed_Miner/.venv/bin/python

Deactivate when done:

deactivate

Step 4: Install dependencies

# Install core dependencies
pip install -r requirements.txt

# Optional: Install additional ML libraries if needed (not required)

Step 5: Configure API keys

Required: Get an LLM API Key (choose one)

Option 1: Gemini (Recommended)

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click "Get API key" → "Create API key"
  4. Copy the generated key

Option 2: Groq

  1. Go to Groq Console
  2. Sign in or create an account
  3. Create a new API key
  4. Copy the generated key

Option 3: OpenAI

  1. Go to OpenAI API Keys
  2. Sign in with your OpenAI account
  3. Click "Create new secret key"
  4. Copy the generated key

Option 4: Anthropic (Claude)

  1. Go to Anthropic Console
  2. Sign in with your Anthropic account
  3. Create a new API key
  4. Copy the generated key

Required: Get an NCBI API Key

  1. Go to NCBI API Key Registration
  2. Sign in to your NCBI account (create one if needed)
  3. Go to "API Key Management" → "Create API Key"
  4. Copy the generated key

Why is an NCBI API key required?

  • Without it: 3 requests/second, 10 requests/minute
  • With it: 10 requests/second, 50 requests/minute
  • Result: Faster PMC fetching, fewer rate limit errors
  • Required for: Reliable operation and better performance
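The rate limits above are why the app paces its NCBI requests. A minimal sketch of that idea (not the app's actual implementation) is a pacer that spaces calls to stay under a requests-per-second budget; the injectable clock and sleep are there purely so the behavior is easy to test:

```python
import time

class RequestPacer:
    """Space out API calls to respect a requests-per-second limit.

    Illustrative only: with an NCBI key you could pace at 10 req/s,
    without one at 3 req/s. The app's internals may differ.
    """

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        """Block just long enough to keep under the rate limit."""
        now = self.clock()
        if self._last is not None:
            elapsed = now - self._last
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self._last = self.clock()

# With a key: RequestPacer(10).wait() before each request; without: RequestPacer(3)
```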

Configure API Keys

Option 1: Enter in the Web UI (Recommended)

  • Start the app and enter your API keys in the sidebar
  • Changes take effect immediately
  • Easy to switch between different keys

Option 2: Create a .env file

Create a .env file in the project root:

# Required: NCBI API key for PubMed search
NCBI_API_KEY="your_ncbi_api_key_here"

# LLM: Choose ONE backend (or provide multiple and choose in UI)
GEMINI_API_KEY="your_gemini_api_key_here"
# OR
# GROQ_API_KEY="your_groq_api_key_here"
# OR
# OPENAI_API_KEY="your_openai_api_key_here"
# OR
# ANTHROPIC_API_KEY="your_anthropic_api_key_here"

# Optional: Rate limiting (adjust if you hit quotas)
GEMINI_RPM=10
GEMINI_TPM=180000
PAPER_PAUSE_SEC=3.0

# Optional: Contact info for NCBI
CONTACT_EMAIL="your_email@example.com"

# Optional: Custom LLM endpoint (for hackathon/proxy setups)
# CUSTOM_LLM_URL="http://localhost:8080/v1/completions"
# CUSTOM_LLM_MODEL="gpt4o"
# CUSTOM_LLM_TIMEOUT=120

Note:

  • API keys entered in the web UI take priority over .env file
  • The app will use .env as a fallback if no key is entered in the UI
  • You can export these as environment variables instead of using .env if you prefer
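The precedence rule above (UI key first, environment second) boils down to a small fallback check. A hypothetical sketch, assuming .env has already been loaded into the process environment at startup; the app's actual code may differ:

```python
import os

def resolve_api_key(ui_value, env_var):
    """Return the key typed in the web UI if present; otherwise fall back
    to the environment variable populated from .env (or an export).

    Illustrative sketch of the documented precedence, not the app's code.
    """
    ui_value = (ui_value or "").strip()
    if ui_value:
        return ui_value
    return os.environ.get(env_var, "")

# A key entered in the sidebar wins over GEMINI_API_KEY from .env
```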

Step 6: Run the app

# Method 1: Using the root shim (recommended)
streamlit run app.py

# Method 2: Direct execution
streamlit run app/app.py

Open the URL that Streamlit prints (usually http://localhost:8501).

Step 7: Verify installation

The app should display:

  • ✅ PubMed Review Miner title
  • ✅ LLM Settings sidebar
  • ✅ Query input area
  • ✅ Date range selector

🐳 Option 2: Docker Setup (Recommended)

Step 1: Clone the repository

git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner

Step 2: Configure environment

Create a .env file with your API keys:

# Copy the example and edit
cp env.example .env
# Edit .env with your API keys

Required API Keys: NCBI_API_KEY plus at least one LLM key (GEMINI_API_KEY, GROQ_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY). See Step 5 of the Python setup for how to obtain them.

Note: All environment variables from .env are automatically loaded. The Docker setup supports all LLM backends.

Step 3: Run with Docker Compose

# Build and start the application
docker-compose up --build

# Or run in background
docker-compose up -d --build

Step 4: Access the application

  • Open your browser to: http://localhost:8501
  • The app will be running in a containerized environment

Step 5: Stop the application

# Stop containers
docker-compose down

# Stop and remove volumes
docker-compose down -v

Docker Commands Reference

# Build the image
docker build -t pubmed-miner .

# Run the container
docker run -p 8501:8501 --env-file .env pubmed-miner

# View logs
docker-compose logs -f

# Rebuild after changes
docker-compose up --build --force-recreate

# Clean up
docker-compose down -v --rmi all

Docker Benefits

  • Consistent environment across all systems
  • No Python version conflicts
  • Easy deployment and scaling
  • Isolated dependencies
  • Production-ready setup
  • All LLM backends supported (Gemini, OpenAI, Anthropic, Groq, Custom)
  • Health checks enabled for monitoring

Project Structure

Pubmed_Miner/
├── app/                          # Streamlit UI module
│   ├── __init__.py              # Package initialization
│   └── app.py                   # Main Streamlit application
├── app.py                       # Root shim (redirects to app/app.py)
├── llm/                         # LLM backends module
│   ├── __init__.py              # Package initialization
│   ├── gemini.py                # Google Gemini integration
│   ├── groq.py                  # Groq API integration
│   ├── openai.py                # OpenAI API integration
│   ├── anthropic.py             # Anthropic API integration
│   ├── custom.py                # Custom LLM endpoint
│   ├── prompts.py               # Prompt templates
│   ├── utils.py                 # Shared utilities
│   └── unified.py               # Unified LLM interface
├── pipeline/                    # Batch processing module
│   ├── __init__.py              # Package initialization
│   ├── batch_analyze.py         # Batch fetch + LLM analysis
│   └── csv_export.py            # CSV export utilities
├── services/                    # External API services
│   ├── __init__.py              # Package initialization
│   ├── pmc.py                   # PMC full-text fetching
│   └── pubmed.py                # PubMed search & metadata
├── .venv/                       # Virtual environment (excluded from git)
├── .env                         # Environment variables (excluded from git)
├── .gitignore                   # Git ignore patterns
├── README.md                    # This file
└── requirements.txt             # Python dependencies

Module Responsibilities

app/ - User Interface

  • app.py: Streamlit web interface with search, selection, and results display
  • Handles user interactions and data visualization

llm/ - Language Model Integration

  • gemini.py: Google Gemini API integration
  • groq.py: Groq API integration (Llama models)
  • openai.py: OpenAI API integration (GPT-4o)
  • anthropic.py: Anthropic API integration (Claude)
  • custom.py: Custom/backend-agnostic LLM endpoint
  • prompts.py: Prompt templates and pattern descriptions
  • utils.py: Shared utilities for all backends
  • All backends provide identical APIs: run_on_paper(), clean_and_ground()

pipeline/ - Batch Processing

  • batch_analyze.py: Orchestrates PMC fetching + LLM analysis
  • Functions: fetch_all_fulltexts(), analyze_texts(), flatten_to_rows()

services/ - External APIs

  • pmc.py: PMC full-text retrieval (JATS XML + HTML fallback)
  • pubmed.py: PubMed search, metadata, and date filtering
  • Functions: esearch_reviews(), esummary(), get_pmc_fulltext_with_meta()
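Under the hood, esearch_reviews() presumably calls NCBI's documented E-utilities esearch endpoint. The endpoint URL and parameter names below are NCBI's public interface; how the project wires them together is an assumption:

```python
# NCBI E-utilities esearch endpoint (public, documented)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_params(term, api_key=None, mindate=None, maxdate=None, retmax=100):
    """Assemble request parameters for a PubMed esearch call.

    Sketch of the kind of request the services layer likely makes;
    parameter names follow NCBI's E-utilities documentation.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": str(retmax),
    }
    if api_key:
        params["api_key"] = api_key   # raises the rate limit from 3 to 10 req/s
    if mindate and maxdate:
        params["datetype"] = "pdat"   # filter on publication date
        params["mindate"] = mindate   # e.g. "2020/01/01"
        params["maxdate"] = maxdate
    return params
```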

Import Examples

# UI imports
from app.app import main

# LLM backends (all provide identical APIs)
from llm.gemini import run_on_paper, clean_and_ground
from llm.groq import run_on_paper, clean_and_ground
from llm.openai import run_on_paper, clean_and_ground
from llm.anthropic import run_on_paper, clean_and_ground
from llm.custom import run_on_paper, clean_and_ground
from llm.unified import run_on_paper, clean_and_ground  # Unified interface

# Services
from services.pmc import get_pmc_fulltext_with_meta, get_last_fetch_source
from services.pubmed import esearch_reviews, esummary

# Pipeline
from pipeline.batch_analyze import fetch_all_fulltexts, analyze_texts
from pipeline.csv_export import flatten_to_rows

Sample PubMed Query

When the app starts, you can try a query like:

((Dengue[Title]) AND (protein)) AND ((active site[Text Word]) OR (mutation[Text Word]))

This will search for Dengue-related protein review literature mentioning an active site or mutations.
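Because the app focuses on review articles, a query like the one above is presumably combined with PubMed's Review publication-type filter. The helper below is a sketch using standard PubMed field syntax; the exact filter string the app uses is an assumption:

```python
def review_query(base_query):
    """Wrap a user query with the Review publication-type filter,
    using standard PubMed field-tag syntax."""
    return f"({base_query}) AND (Review[Publication Type])"
```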


How to Use

  1. Search PubMed – Enter your query and press search.
  2. Select PMIDs – Choose papers to include in analysis.
  3. Fetch Full Text – The app tries to pull PMC full text (if available).
  4. Run Extraction – Use the LLM-based pipeline to extract mutations, proteins, and structural features with grounded quotes.
  5. Review & Export – Inspect results in the UI; download CSV/JSON as needed.
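The final export step turns each paper's nested LLM output into flat rows. The field names below are illustrative; the app's actual flatten_to_rows() schema may differ:

```python
def flatten_findings(pmid, findings):
    """Flatten one paper's extracted findings into flat dicts for CSV export.

    Hypothetical schema (pmid / mutation / protein / snippet) chosen for
    illustration; not necessarily the app's real column set.
    """
    rows = []
    for f in findings:
        rows.append({
            "pmid": pmid,
            "mutation": f.get("mutation", ""),
            "protein": f.get("protein", ""),
            "snippet": f.get("snippet", ""),  # grounded quote from the paper
        })
    return rows
```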

📝 Customizing the Extraction Prompt

Edit the extraction prompt in the web UI to customize what features are extracted.

Quick Start

  1. Expand "📝 Edit Extraction Prompt" in the web UI
  2. Edit the PATTERN RECOGNITION GUIDE to add/modify pattern descriptions
  3. Edit the INSTRUCTIONS to change extraction priorities
  4. Click "💾 Save Changes" and test on a sample paper

What You Can Edit

  • PATTERN RECOGNITION GUIDE: Mutation formats, protein patterns, residue patterns, domain patterns
  • INSTRUCTIONS: Extraction priorities, coverage requirements, filtering criteria
  • SYSTEM / INSTRUCTION: AI extractor role and identity
  • DEFINITIONS: Feature types and definitions

Locked (for safety): JSON schema, output format, few-shot examples

Tips

  • Add all mutation notation styles you encounter (A226V, p.Ala226Val, spelled mutations)
  • Include concrete examples in pattern descriptions
  • Test incrementally with small changes
  • Use "🔄 Reset to Default" if needed

Example

To add arrow notation (226A→V), update the mutation patterns:

**Mutation Patterns:**
• Standard: A226V, K128E
• Arrow notation: 226A→V, 128K→E  ← Add this
• HGVS: p.Ala226Val
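If you do add arrow notation to the prompt, you may also want to normalize it in exported results. This is a hypothetical post-processing helper (not part of the app, which does no regex pre-scanning) that rewrites arrow notation like 226A→V into the standard A226V form:

```python
import re

# Matches arrow notation: position, reference residue, "→" or "->", new residue
ARROW = re.compile(r"^(\d+)([A-Z])(?:→|->)([A-Z])$")

def normalize_arrow(mutation):
    """Convert arrow notation (226A→V) to standard form (A226V).

    Other notations (A226V, p.Ala226Val, ...) pass through unchanged.
    """
    m = ARROW.match(mutation)
    if not m:
        return mutation
    pos, ref, alt = m.groups()
    return f"{ref}{pos}{alt}"
```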

Troubleshooting

Common Issues

GEMINI_API_KEY not set or GROQ_API_KEY not set

  • Add your API key to .env file or export as environment variable
  • Restart the app after adding the key

NCBI_API_KEY not set

  • NCBI API key is required for reliable operation
  • Get your key from NCBI API Key Registration
  • Add it to your .env file: NCBI_API_KEY="your_key_here"
  • Restart the app after adding the key

Rate limit/quota errors

  • Lower GEMINI_RPM and/or GEMINI_TPM in .env
  • Increase PAPER_PAUSE_SEC for slower processing
  • Ensure you have an NCBI API key - it's required for reliable operation (3→10 req/sec, 10→50 req/min)

Some PMIDs show no PMC text

  • The paper may be embargoed or not deposited in PMC
  • The app will still process available items

Import errors or missing modules

  • Ensure you're using Python 3.11+
  • Check virtual environment is activated: which python should show .venv/bin/python
  • Reinstall dependencies: pip install -r requirements.txt
  • Try: pip install --upgrade pip then pip install -r requirements.txt

Blank page or app won't start

  • Try: streamlit run app/app.py directly
  • Check console for error messages
  • Ensure all dependencies are installed

Docker-specific issues

  • Container won't start: Check logs with docker-compose logs -f
  • Health check fails: Ensure the app is running (wait 40s for startup)
  • Port already in use: Change port mapping in docker-compose.yml (e.g., "8502:8501")
  • API keys not working: Verify .env file is mounted and variables are set correctly
  • Build fails: Clear Docker cache: docker-compose build --no-cache

System Requirements

  • Python: 3.11+ (tested with 3.11.8)
  • RAM: 4GB+ recommended
  • Storage: 2GB+ free space
  • OS: Windows 10+, macOS 10.15+, or Linux (Ubuntu 20.04+)

Notes

  • The app focuses on Review articles and handles date filters and pagination.
  • Fetching relies on PMC availability; for non-PMC papers, full text may not be obtainable.
  • The extractor uses prompt-based pattern recognition (no regex pre-scanning).
  • Multiple LLM backends are supported; choose based on your needs (speed, cost, accuracy).

License

MIT.
