Open Targets GitHub Issue Scenario Miner

A Python package for mining GitHub issues from the opentargets/issues repository and extracting test scenarios in formats suitable for the platform-test package in ot-ui-apps.

Features

Two-Pass Extraction Pipeline
- Pass 1 (Regex): Fast, free extraction using compiled regex patterns
- Pass 2 (LLM): Optional Claude API enrichment for context-aware mapping and gap filling
Extensible Architecture
- Pluggable extractor interface for easy addition of new extraction strategies
- Multiple output writers (CSV, JSON, JSONL)
- Clear separation of concerns: loaders, extractors, writers
Comprehensive Entity Detection
- Drug identifiers (CHEMBL IDs)
- Gene identifiers (Ensembl gene IDs with inference from gene symbols)
- Disease identifiers (EFO/MONDO ontology IDs)
- Variants (pharmacogenomic and molecular QTL)
- Study identifiers (GWAS, molecular QTL)
- Credible set hashes

Installation

From source

git clone https://github.com/opentargets/ot-github-issue-miner.git
cd ot-github-issue-miner
pip install -e .

Install dependencies

pip install -r requirements.txt

Optional development dependencies

pip install -e ".[dev]"

Quick Start

Command Line Usage

Regex-only mode (no API cost)

python -m ot_miner.cli

With GitHub token (higher rate limit)

GITHUB_TOKEN=ghp_xxx ANTHROPIC_API_KEY=sk-ant-xxx python -m ot_miner.cli

Python API

from ot_miner import ScenarioMiner, Config

# Load configuration from environment
config = Config.from_env()

# Create miner and run
miner = ScenarioMiner(config)
mappings = miner.run()

# Access results
for mapping in mappings:
    print(f"Scenario: {mapping.scenario_name}")
    print(f"  Drug: {mapping.drug_id}")
    print(f"  Target: {mapping.target_id}")

Run Locally

Set up your environment:

export ANTHROPIC_API_KEY='your-anthropic-api-key'
export GITHUB_TOKEN='your-github-pat'  # Optional but recommended

Run the test script:

# Mine issues from last 7 days with LLM (default)
python test_llm_local.py

# Mine issues from last 30 days
python test_llm_local.py --days 30

# Mine issues from last 90 days (as used in testing)
python test_llm_local.py --days 90

# Save to custom directory with verbose output
python test_llm_local.py --days 14 --output-dir my-test --verbose

Check results: Results are saved to test-results/ (or your custom output directory):
- mined-scenarios.csv - CSV format
- mined-scenarios.json - JSON format

Local Testing

python test_llm_local.py --days 7

Uses regex + LLM (slower, costs API credits, smarter)
Reads issue body AND comments for better context
Infers IDs from gene/disease/drug names
Has OpenTargets GraphQL API access for real-time entity verification

What the LLM Sees

The LLM now receives:

Issue title
Issue body (first 3000 chars)
Comments (first 10 comments, 300 chars each)
OpenTargets GraphQL API access (can query in real-time)

This allows it to:

Find mentions in comments that regex missed
Query the OpenTargets API to verify gene/disease/drug IDs
Look up entities by name (e.g., "BRAF" → API query → ENSG00000157764)
Verify IDs are valid before returning them
Get accurate mappings instead of guessing

OpenTargets API Integration

The LLM can make GraphQL queries like:

query { 
  search(queryString:"BRAF", entityNames:["target"]) { 
    hits { id name } 
  } 
}

This ensures all returned IDs are:

Verified - Checked against the live OpenTargets database
Accurate - No guessing or incorrect mappings
Current - Uses latest data from OpenTargets platform

Configuration

Configuration is loaded from environment variables:

# GitHub settings
export GITHUB_OWNER=opentargets
export GITHUB_REPO=issues
export GITHUB_TOKEN=ghp_xxx  # Optional, increases rate limit

# LLM settings
export ANTHROPIC_API_KEY=sk-ant-xxx  # Required for LLM pass
export LLM_MODEL=claude-haiku-4-5
export LLM_BATCH_SIZE=5
export LLM_DELAY_MS=500

# Output settings
export OUTPUT_DIR=/path/to/output
export CSV_FILENAME=mined-scenarios.csv
export JSON_FILENAME=mined-scenarios.json

# Logging
export VERBOSE=true

Or use Python API:

from ot_miner import Config
from pathlib import Path

config = Config(
    github_token="ghp_xxx",
    anthropic_api_key="sk-ant-xxx",
    output_dir=Path("/tmp/output"),
    llm_batch_size=10,
    verbose=True,
)

Architecture

Core Components

ot_miner/
├── models.py           # Data models (ScenarioMapping, GitHubIssue)
├── config.py           # Configuration management
├── utils.py            # Utility functions, regex patterns, knowledge bases
├── cli.py              # Command-line entry point
├── miner.py            # Main orchestration (ScenarioMiner)
├── loaders/
│   └── __init__.py     # GitHub API loader and issue filtering
├── extractors/
│   ├── base.py         # Abstract base extractor interface
│   ├── regex.py        # Regex-based extraction (Pass 1)
│   ├── llm.py          # LLM-based enrichment (Pass 2)
│   └── __init__.py     # Extractor exports
└── writers/
    └── __init__.py     # Output writers (CSV, JSON, JSONL)

Extraction Pipeline

GitHub Issues
    ↓
[GitHubLoader] - Fetch all issues with pagination
    ↓
[RegexExtractor] - Pass 1: Fast pattern matching
    ↓
[LLMExtractor] - Pass 2: Context-aware enrichment via LangChain
    ↓
[Writers] - Output to CSV, JSON, etc.
    ↓
Test Scenarios (ready for Google Sheets import)

LLM via LangChain

The LLM extraction (Pass 2) uses LangChain to efficiently interface with Claude. LangChain handles:

Message formatting and API communication
JSON response parsing with Pydantic validation
Retry logic and error handling
Token counting for cost estimation

Simply provide your Anthropic API key and LLMExtractor handles the rest:

from ot_miner import ScenarioMiner, Config

config = Config.from_env()  # Requires ANTHROPIC_API_KEY
miner = ScenarioMiner(config)
mappings = miner.run()  # Two-pass extraction with LLM enrichment

Data Model

ScenarioMapping

Each extracted scenario is a ScenarioMapping dataclass with:

Field	Column	Description
`scenario_name`	A	Human-readable label derived from issue
`drug_id`	B	CHEMBL drug ID
`variant_id`	C	Main variant page ID
`variant_pgx`	D	Pharmacogenetics variant
`variant_molqtl`	E	Molecular QTL variant
`target_id`	F	Single Ensembl gene ID
`target_ids`	G	Comma-separated Ensembl IDs
`aotf_diseases`	H	Diseases for AOTF list
`disease_id`	I	EFO/MONDO disease ID
`aotf_genes`	J	Gene symbols for AOTF list
`disease_search`	K	Free-text disease search
`disease_alt`	L	Alternative disease IDs
`gwas_study`	M	GWAS study ID
`qtl_study`	N	Molecular QTL study ID
`credible_set_l2g`	O	L2G credible set hash
`credible_set_gwas`	P	GWAS credible set hash
`credible_set_qtl`	Q	QTL credible set hash
`source_url`	R	GitHub issue URL

Importing to Google Sheets

Run the miner to generate mined-scenarios.csv
In Google Sheets: File → Import → Upload
Select the CSV file
Choose "Append to current sheet"
Set separator to "Comma"
Click "Import data"

Extensibility

Custom Extractor

from ot_miner.extractors import BaseExtractor
from ot_miner.models import GitHubIssue, ScenarioMapping, ExtractionResult
from typing import List, Optional

class CustomExtractor(BaseExtractor):
    def extract(self, issue: GitHubIssue) -> Optional[ScenarioMapping]:
        # Your extraction logic here
        pass
    
    def extract_batch(self, issues: List[GitHubIssue]) -> List[ExtractionResult]:
        # Batch processing logic
        pass

Custom Writer

from ot_miner.writers import BaseWriter
from ot_miner.models import ScenarioMapping
from pathlib import Path
from typing import List

class CustomWriter(BaseWriter):
    def write(self, mappings: List[ScenarioMapping], path: Path) -> None:
        # Your output logic here
        pass

# Use in MultiWriter
from ot_miner.writers import MultiWriter

writers = MultiWriter()
writers.add_writer("custom", CustomWriter())

Troubleshooting

LLM not enriching

Ensure ANTHROPIC_API_KEY is set:

echo $ANTHROPIC_API_KEY  # Should print your API key

GitHub rate limiting

Use a GitHub token to increase rate limits:

export GITHUB_TOKEN=ghp_xxx

Large issue bodies

The LLM pipeline truncates issue bodies to 1500 characters to keep prompts manageable. Adjust in ot_miner/extractors/llm.py if needed.

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

License

Apache License 2.0 - see LICENSE file for details

Acknowledgments

Built for the Open Targets Platform
Uses Claude API from Anthropic
GitHub API integration via PyGithub

Citation

If you use this tool in your research, please cite:

@software{opentargets_issue_miner_2026,
  title={Open Targets GitHub Issue Scenario Miner},
  author={Open Targets},
  year={2026},
  url={https://github.com/opentargets/ot-github-issue-miner}
}

Support

For issues, questions, or suggestions:

Open an GitHub Issue
Check existing documentation
Review architecture documentation (if available)

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github/workflows		.github/workflows
ot_miner		ot_miner
results		results
test-results		test-results
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
DEVELOPMENT.md		DEVELOPMENT.md
LICENSE		LICENSE
README.md		README.md
TEST_LLM.md		TEST_LLM.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py
test_llm_local.py		test_llm_local.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Open Targets GitHub Issue Scenario Miner

Features

Installation

From source

Install dependencies

Optional development dependencies

Quick Start

Command Line Usage

Regex-only mode (no API cost)

With GitHub token (higher rate limit)

Python API

Run Locally

Local Testing

What the LLM Sees

OpenTargets API Integration

Configuration

Architecture

Core Components

Extraction Pipeline

LLM via LangChain

Data Model

ScenarioMapping

Importing to Google Sheets

Extensibility

Custom Extractor

Custom Writer

Troubleshooting

LLM not enriching

GitHub rate limiting

Large issue bodies

Contributing

License

Acknowledgments

Citation

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages