Automated extraction of viral sequence features, mutations, and epitopes from PubMed literature using LLMs to accelerate outbreak response and genomic analysis
This is a project from the NIAID BRC AI Codeathon 2025, taking place November 12-14, 2025 at Argonne National Laboratory.
Event Website: https://niaid-brc-codeathons.github.io/
Project Details: https://niaid-brc-codeathons.github.io/projects/pubmed-miner/
The NIAID Bioinformatics Resource Centers (BRCs) invite researchers, data scientists, and developers to a three-day AI Codeathon focused on improving Findability, Accessibility, Interoperability, and Reusability (FAIR-ness) of BRC data and tools using artificial intelligence (AI) and large language models (LLMs).
Indresh Singh (Team Leader)
A Streamlit app to search PubMed for review articles, fetch PMC full text (when available), and run an LLM-powered extractor to mine mutation/protein findings with grounded snippets. Supports multiple LLM backends: Gemini, OpenAI, Anthropic, Groq, and custom endpoints.
Option A: Local Python. For developers who want to run the app directly with Python and have full control over the environment. Requirements:
- Python 3.11+ (tested with Python 3.11.8)
- Operating System: Windows, macOS, or Linux
- Memory: 4GB+ RAM recommended
- Internet: required for API calls and PMC fetching

Option B: Docker. For users who want a consistent, isolated environment that works the same on any system. Requirements:
- Docker 20.10+ and Docker Compose 2.0+
- Git for cloning the repository
- Memory: 4GB+ RAM recommended
- Internet: required for API calls and PMC fetching
Check your Python version first:

```bash
python --version
# or
python3 --version
```

Windows:
- Download from python.org
- Choose Python 3.11+ (latest stable)
- Check "Add Python to PATH" during installation
macOS:
```bash
# Using Homebrew (recommended)
brew install python@3.11
# Or download from python.org
```

Linux (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install python3.11 python3.11-venv python3.11-pip
```

Linux (CentOS/RHEL):

```bash
sudo yum install python3.11 python3.11-venv python3.11-pip
```

Clone the repository:

```bash
git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner
```

Create a virtual environment:

```bash
# Using Python 3.11+ (replace with your Python version)
python3.11 -m venv .venv
# or
python -m venv .venv
```

Activate it.

Windows (PowerShell):

```powershell
.\.venv\Scripts\Activate.ps1
```

Windows (Command Prompt):

```bat
.\.venv\Scripts\activate.bat
```

macOS/Linux:

```bash
source .venv/bin/activate
```

Verify activation:

```bash
# You should see (.venv) in your prompt
which python
# Should show: /path/to/Pubmed_Miner/.venv/bin/python
```

To leave the environment later, run `deactivate`.

Install dependencies:

```bash
# Install core dependencies
pip install -r requirements.txt
# Optional: Install additional ML libraries if needed (not required)
```

Get an API key for at least one LLM backend:

Option 1: Gemini (Recommended)
- Go to Google AI Studio
- Sign in with your Google account
- Click "Get API key" → "Create API key"
- Copy the generated key
Option 2: Groq
- Go to Groq Console
- Sign in or create an account
- Create a new API key
- Copy the generated key
Option 3: OpenAI
- Go to OpenAI API Keys
- Sign in with your OpenAI account
- Click "Create new secret key"
- Copy the generated key
Option 4: Anthropic (Claude)
- Go to Anthropic Console
- Sign in with your Anthropic account
- Create a new API key
- Copy the generated key
NCBI API Key (required):
- Go to NCBI API Key Registration
- Sign in to your NCBI account (create one if needed)
- Go to "API Key Management" → "Create API Key"
- Copy the generated key
Why is an NCBI API key required?
- Without it: 3 requests/second, 10 requests/minute
- With it: 10 requests/second, 50 requests/minute
- Result: Faster PMC fetching, fewer rate limit errors
- Required for: Reliable operation and better performance
Option 1: Enter in the Web UI (Recommended)
- Start the app and enter your API keys in the sidebar
- Changes take effect immediately
- Easy to switch between different keys
Option 2: Create a .env file in the project root:
```bash
# Required: NCBI API key for PubMed search
NCBI_API_KEY="your_ncbi_api_key_here"

# LLM: Choose ONE backend (or provide multiple and choose in the UI)
GEMINI_API_KEY="your_gemini_api_key_here"
# OR
# GROQ_API_KEY="your_groq_api_key_here"
# OR
# OPENAI_API_KEY="your_openai_api_key_here"
# OR
# ANTHROPIC_API_KEY="your_anthropic_api_key_here"

# Optional: Rate limiting (adjust if you hit quotas)
GEMINI_RPM=10
GEMINI_TPM=180000
PAPER_PAUSE_SEC=3.0

# Optional: Contact info for NCBI
CONTACT_EMAIL="your_email@example.com"

# Optional: Custom LLM endpoint (for hackathon/proxy setups)
# CUSTOM_LLM_URL="http://localhost:8080/v1/completions"
# CUSTOM_LLM_MODEL="gpt4o"
# CUSTOM_LLM_TIMEOUT=120
```

Note:
- API keys entered in the web UI take priority over the `.env` file
- The app falls back to `.env` if no key is entered in the UI
- You can export these as environment variables instead of using `.env` if you prefer
```bash
# Method 1: Using the root shim (recommended)
streamlit run app.py

# Method 2: Direct execution
streamlit run app/app.py
```

Open the URL that Streamlit prints (usually http://localhost:8501).
The app should display:
- ✅ PubMed Review Miner title
- ✅ LLM Settings sidebar
- ✅ Query input area
- ✅ Date range selector
```bash
git clone https://github.com/Shubhs0411/Pubmed_Miner.git
cd Pubmed_Miner
```

Create a .env file with your API keys:

```bash
# Copy the example and edit
cp env.example .env
# Edit .env with your API keys
```

Required API Keys:
- NCBI_API_KEY: Get from NCBI API Key Registration
- LLM API Key (choose one or provide multiple and select in UI):
- GEMINI_API_KEY: Get from Google AI Studio
- GROQ_API_KEY: Get from Groq Console
- OPENAI_API_KEY: Get from OpenAI API Keys
- ANTHROPIC_API_KEY: Get from Anthropic Console
- CUSTOM_LLM_URL: For custom/hackathon endpoints
Note: All environment variables from .env are automatically loaded. The Docker setup supports all LLM backends.
```bash
# Build and start the application
docker-compose up --build

# Or run in background
docker-compose up -d --build
```

- Open your browser to http://localhost:8501
- The app will be running in a containerized environment
```bash
# Stop containers
docker-compose down

# Stop and remove volumes
docker-compose down -v
```

Manual Docker commands:

```bash
# Build the image
docker build -t pubmed-miner .

# Run the container
docker run -p 8501:8501 --env-file .env pubmed-miner

# View logs
docker-compose logs -f

# Rebuild after changes
docker-compose up --build --force-recreate

# Clean up
docker-compose down -v --rmi all
```

Benefits of the Docker setup:
- ✅ Consistent environment across all systems
- ✅ No Python version conflicts
- ✅ Easy deployment and scaling
- ✅ Isolated dependencies
- ✅ Production-ready setup
- ✅ All LLM backends supported (Gemini, OpenAI, Anthropic, Groq, Custom)
- ✅ Health checks enabled for monitoring
```
Pubmed_Miner/
├── app/                   # Streamlit UI module
│   ├── __init__.py        # Package initialization
│   └── app.py             # Main Streamlit application
├── app.py                 # Root shim (redirects to app/app.py)
├── llm/                   # LLM backends module
│   ├── __init__.py        # Package initialization
│   ├── gemini.py          # Google Gemini integration
│   ├── groq.py            # Groq API integration
│   ├── openai.py          # OpenAI API integration
│   ├── anthropic.py       # Anthropic API integration
│   ├── custom.py          # Custom LLM endpoint
│   ├── prompts.py         # Prompt templates
│   ├── utils.py           # Shared utilities
│   └── unified.py         # Unified LLM interface
├── pipeline/              # Batch processing module
│   ├── __init__.py        # Package initialization
│   ├── batch_analyze.py   # Batch fetch + LLM analysis
│   └── csv_export.py      # CSV export utilities
├── services/              # External API services
│   ├── __init__.py        # Package initialization
│   ├── pmc.py             # PMC full-text fetching
│   └── pubmed.py          # PubMed search & metadata
├── venv/                  # Virtual environment (excluded from git)
├── .env                   # Environment variables (excluded from git)
├── .gitignore             # Git ignore patterns
├── README.md              # This file
└── requirements.txt       # Python dependencies
```
app/ - User Interface
- `app.py`: Streamlit web interface with search, selection, and results display
- Handles user interactions and data visualization
llm/ - Language Model Integration
- `gemini.py`: Google Gemini API integration
- `groq.py`: Groq API integration (Llama models)
- `openai.py`: OpenAI API integration (GPT-4o)
- `anthropic.py`: Anthropic API integration (Claude)
- `custom.py`: Custom/backend-agnostic LLM endpoint
- `prompts.py`: Prompt templates and pattern descriptions
- `utils.py`: Shared utilities for all backends
- All backends provide identical APIs: `run_on_paper()`, `clean_and_ground()`
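Since every backend module exports the same two functions, a unified dispatcher only needs to pick a callable by name. A minimal sketch, with stub functions standing in for the real backend modules:

```python
# Sketch of backend-agnostic dispatch: each backend module exposes the same API,
# so the unified layer only needs to select one by name. The stubs below stand
# in for the real llm.gemini / llm.groq modules.
def _gemini_run_on_paper(text: str) -> dict:
    return {"backend": "gemini", "findings": []}

def _groq_run_on_paper(text: str) -> dict:
    return {"backend": "groq", "findings": []}

BACKENDS = {"gemini": _gemini_run_on_paper, "groq": _groq_run_on_paper}

def run_on_paper(text: str, backend: str = "gemini") -> dict:
    try:
        return BACKENDS[backend](text)
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None

print(run_on_paper("sample full text", backend="groq")["backend"])
```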
pipeline/ - Batch Processing
- `batch_analyze.py`: Orchestrates PMC fetching + LLM analysis
- Functions: `fetch_all_fulltexts()`, `analyze_texts()`, `flatten_to_rows()`
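A plausible shape for the flattening step, assuming results keyed by PMID (the data layout and helper below are hypothetical; the real `flatten_to_rows()` may differ):

```python
import csv, io

def flatten_to_rows(results: dict[str, list[dict]]) -> list[dict]:
    """Flatten {pmid: [finding, ...]} into one CSV row per finding."""
    rows = []
    for pmid, findings in results.items():
        for finding in findings:
            rows.append({"pmid": pmid, **finding})
    return rows

results = {"12345": [{"mutation": "A226V", "protein": "E1"}]}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pmid", "mutation", "protein"])
writer.writeheader()
writer.writerows(flatten_to_rows(results))
print(buf.getvalue().strip())
```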
services/ - External APIs
- `pmc.py`: PMC full-text retrieval (JATS XML + HTML fallback)
- `pubmed.py`: PubMed search, metadata, and date filtering
- Functions: `esearch_reviews()`, `esummary()`, `get_pmc_fulltext_with_meta()`
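For orientation, `esearch_reviews()` presumably issues requests along these lines. The parameters shown are standard NCBI E-utilities esearch parameters, but the helper and its signature are assumptions, not the module's real code:

```python
from urllib.parse import urlencode

def build_esearch_params(query: str, mindate: str, maxdate: str, retmax: int = 100) -> dict:
    """Restrict a PubMed query to review articles within a publication-date range."""
    return {
        "db": "pubmed",
        "term": f"({query}) AND Review[Publication Type]",
        "datetype": "pdat",       # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }

params = build_esearch_params("Dengue[Title] AND protein", "2020/01/01", "2025/01/01")
print("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params))
```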
```python
# UI imports
from app.app import main

# LLM backends (all provide identical APIs)
from llm.gemini import run_on_paper, clean_and_ground
from llm.groq import run_on_paper, clean_and_ground
from llm.openai import run_on_paper, clean_and_ground
from llm.anthropic import run_on_paper, clean_and_ground
from llm.custom import run_on_paper, clean_and_ground
from llm.unified import run_on_paper, clean_and_ground  # Unified interface

# Services
from services.pmc import get_pmc_fulltext_with_meta, get_last_fetch_source
from services.pubmed import esearch_reviews, esummary

# Pipeline
from pipeline.batch_analyze import fetch_all_fulltexts, analyze_texts
from pipeline.csv_export import flatten_to_rows
```

When the app starts, you can try a query like:
```
((Dengue[Title]) AND (protein)) AND ((active site[Text Word]) OR (mutation[Text Word]))
```
This will search for Dengue-related protein review literature mentioning an active site or mutations.
1. Search PubMed – Enter your query and press search.
2. Select PMIDs – Choose papers to include in analysis.
3. Fetch Full Text – The app tries to pull PMC full text (if available).
4. Run Extraction – Use the LLM-based pipeline to extract mutations, proteins, and structural features with grounded quotes.
5. Review & Export – Inspect results in the UI; download CSV/JSON as needed.
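Under the hood, these steps map roughly onto the documented module functions. A self-contained sketch with stub bodies in place of the real network and LLM calls (the stubs and sample data are illustrative only):

```python
# End-to-end flow sketch; each stub mirrors a documented function.
def esearch_reviews(query):                  # services.pubmed
    return ["12345", "67890"]

def fetch_all_fulltexts(pmids):              # pipeline.batch_analyze
    return {p: f"full text of {p}" for p in pmids}

def analyze_texts(texts):                    # pipeline.batch_analyze (LLM extraction)
    return {p: [{"mutation": "A226V", "quote": t[:20]}] for p, t in texts.items()}

def flatten_to_rows(results):                # pipeline.csv_export
    return [{"pmid": p, **f} for p, fs in results.items() for f in fs]

pmids = esearch_reviews("Dengue[Title] AND protein")
rows = flatten_to_rows(analyze_texts(fetch_all_fulltexts(pmids)))
print(len(rows), rows[0]["pmid"])
```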
Edit the extraction prompt in the web UI to customize what features are extracted.
- Expand "📝 Edit Extraction Prompt" in the web UI
- Edit the `PATTERN RECOGNITION GUIDE` to add/modify pattern descriptions
- Edit the `INSTRUCTIONS` to change extraction priorities
- Click "💾 Save Changes" and test on a sample paper
- PATTERN RECOGNITION GUIDE: Mutation formats, protein patterns, residue patterns, domain patterns
- INSTRUCTIONS: Extraction priorities, coverage requirements, filtering criteria
- SYSTEM / INSTRUCTION: AI extractor role and identity
- DEFINITIONS: Feature types and definitions
Locked (for safety): JSON schema, output format, few-shot examples
- Add all mutation notation styles you encounter (`A226V`, `p.Ala226Val`, spelled-out mutations)
- Include concrete examples in pattern descriptions
- Test incrementally with small changes
- Use "🔄 Reset to Default" if needed
To add arrow notation (226A→V), update the mutation patterns:
```
**Mutation Patterns:**
• Standard: A226V, K128E
• Arrow notation: 226A→V, 128K→E   ← Add this
• HGVS: p.Ala226Val
```
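The extractor itself is prompt-based (no regex pre-scanning), but a quick regex check is a handy way to verify which notation styles your pattern guide should cover. A rough, non-exhaustive sketch:

```python
import re

# Rough patterns for the three notation styles listed above (not exhaustive)
MUTATION_RE = re.compile(
    r"\b[A-Z]\d{1,4}[A-Z]\b"                       # standard: A226V, K128E
    r"|\bp\.[A-Z][a-z]{2}\d{1,4}[A-Z][a-z]{2}\b"   # HGVS: p.Ala226Val
    r"|\b\d{1,4}[A-Z]→[A-Z]\b"                     # arrow: 226A→V
)

text = "The A226V substitution (also written p.Ala226Val or 226A→V) enhances fitness."
print(MUTATION_RE.findall(text))
```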
GEMINI_API_KEY not set or GROQ_API_KEY not set
- Add your API key to the `.env` file or export it as an environment variable
- Restart the app after adding the key
NCBI_API_KEY not set
- NCBI API key is required for reliable operation
- Get your key from NCBI API Key Registration
- Add it to your `.env` file: `NCBI_API_KEY="your_key_here"`
- Restart the app after adding the key
Rate limit/quota errors
- Lower `GEMINI_RPM` and/or `GEMINI_TPM` in `.env`
- Increase `PAPER_PAUSE_SEC` for slower processing
- Ensure you have an NCBI API key; it's required for reliable operation (3→10 req/sec, 10→50 req/min)
Some PMIDs show no PMC text
- The paper may be embargoed or not deposited in PMC
- The app will still process available items
Import errors or missing modules
- Ensure you're using Python 3.11+
- Check that the virtual environment is activated: `which python` should show `.venv/bin/python`
- Reinstall dependencies: `pip install -r requirements.txt`
- Try `pip install --upgrade pip`, then `pip install -r requirements.txt`
Blank page or app won't start
- Try `streamlit run app/app.py` directly
- Check the console for error messages
- Ensure all dependencies are installed
Docker-specific issues
- Container won't start: check logs with `docker-compose logs -f`
- Health check fails: ensure the app is running (wait 40s for startup)
- Port already in use: change the port mapping in `docker-compose.yml` (e.g., `"8502:8501"`)
- API keys not working: verify the `.env` file is mounted and variables are set correctly
- Build fails: clear the Docker cache with `docker-compose build --no-cache`
- Python: 3.11+ (tested with 3.11.8)
- RAM: 4GB+ recommended
- Storage: 2GB+ free space
- OS: Windows 10+, macOS 10.15+, or Linux (Ubuntu 20.04+)
- The app focuses on Review articles and handles date filters and pagination.
- Fetching relies on PMC availability; for non-PMC papers, full text may not be obtainable.
- The extractor uses prompt-based pattern recognition (no regex pre-scanning).
- Multiple LLM backends are supported; choose based on your needs (speed, cost, accuracy).
License: MIT.