A Python package for mining GitHub issues from the opentargets/issues repository and extracting test scenarios in formats suitable for the platform-test package in ot-ui-apps.
**Two-Pass Extraction Pipeline**
- Pass 1 (Regex): Fast, free extraction using compiled regex patterns
- Pass 2 (LLM): Optional Claude API enrichment for context-aware mapping and gap filling

**Extensible Architecture**
- Pluggable extractor interface for easy addition of new extraction strategies
- Multiple output writers (CSV, JSON, JSONL)
- Clear separation of concerns: loaders, extractors, writers

**Comprehensive Entity Detection**
- Drug identifiers (CHEMBL IDs)
- Gene identifiers (Ensembl gene IDs, with inference from gene symbols)
- Disease identifiers (EFO/MONDO ontology IDs)
- Variants (pharmacogenomic and molecular QTL)
- Study identifiers (GWAS, molecular QTL)
- Credible set hashes
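The identifier classes above can be matched with simple patterns. A minimal sketch, illustrative only: the package's actual compiled patterns live in `ot_miner/utils.py` and may be stricter or broader.

```python
import re

# Illustrative patterns only; the real patterns in ot_miner/utils.py may differ.
PATTERNS = {
    "drug": re.compile(r"\bCHEMBL\d+\b"),
    "target": re.compile(r"\bENSG\d{11}\b"),
    "disease": re.compile(r"\b(?:EFO|MONDO)_\d+\b"),
}

def find_entities(text: str) -> dict:
    """Return every identifier match in `text`, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

hits = find_entities("Maps CHEMBL25 to ENSG00000157764 for disease EFO_0000305")
# hits["drug"] == ["CHEMBL25"], hits["target"] == ["ENSG00000157764"]
```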
```bash
git clone https://github.com/opentargets/ot-github-issue-miner.git
cd ot-github-issue-miner
pip install -e .
```

Install runtime dependencies directly:

```bash
pip install -r requirements.txt
```

Install with development extras:

```bash
pip install -e ".[dev]"
```

Run the CLI:

```bash
python -m ot_miner.cli
```

Or pass credentials inline:

```bash
GITHUB_TOKEN=ghp_xxx ANTHROPIC_API_KEY=sk-ant-xxx python -m ot_miner.cli
```

Use the Python API:

```python
from ot_miner import ScenarioMiner, Config

# Load configuration from environment
config = Config.from_env()

# Create miner and run
miner = ScenarioMiner(config)
mappings = miner.run()

# Access results
for mapping in mappings:
    print(f"Scenario: {mapping.scenario_name}")
    print(f"  Drug: {mapping.drug_id}")
    print(f"  Target: {mapping.target_id}")
```
- Set up your environment:

  ```bash
  export ANTHROPIC_API_KEY='your-anthropic-api-key'
  export GITHUB_TOKEN='your-github-pat'  # Optional but recommended
  ```

- Run the test script:

  ```bash
  # Mine issues from last 7 days with LLM (default)
  python test_llm_local.py

  # Mine issues from last 30 days
  python test_llm_local.py --days 30

  # Mine issues from last 90 days (as used in testing)
  python test_llm_local.py --days 90

  # Save to custom directory with verbose output
  python test_llm_local.py --days 14 --output-dir my-test --verbose
  ```

- Check results: results are saved to `test-results/` (or your custom output directory):
  - `mined-scenarios.csv` - CSV format
  - `mined-scenarios.json` - JSON format
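To sanity-check the JSON output, a short stdlib snippet works. This assumes `mined-scenarios.json` holds a JSON array of mapping objects, which is an assumption about the exact serialization:

```python
import json
from pathlib import Path

def summarize(path: str) -> int:
    """Print the first few mined scenarios and return the total count."""
    mappings = json.loads(Path(path).read_text())
    for m in mappings[:3]:
        print(m.get("scenario_name"), m.get("drug_id"))
    return len(mappings)

# Example: summarize("test-results/mined-scenarios.json")
```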
`python test_llm_local.py --days 7` - Uses regex + LLM (slower, costs API credits, smarter):
- Reads issue body AND comments for better context
- Infers IDs from gene/disease/drug names
- Has OpenTargets GraphQL API access for real-time entity verification
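The symbol-to-ID inference mentioned above can be as simple as a lookup table. A sketch with a hypothetical three-entry table; the knowledge base the package actually ships in `ot_miner/utils.py` is presumably larger:

```python
from typing import Optional

# Hypothetical mini knowledge base, illustrative only.
GENE_SYMBOL_TO_ENSEMBL = {
    "BRAF": "ENSG00000157764",
    "TP53": "ENSG00000141510",
    "EGFR": "ENSG00000146648",
}

def infer_target_id(symbol: str) -> Optional[str]:
    """Map a gene symbol to its Ensembl gene ID, if known locally."""
    return GENE_SYMBOL_TO_ENSEMBL.get(symbol.strip().upper())
```

Symbols the table does not know fall through to the LLM pass, which can resolve them against the live API instead.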
The LLM now receives:
- Issue title
- Issue body (first 3000 chars)
- Comments (first 10 comments, 300 chars each)
- OpenTargets GraphQL API access (can query in real-time)
This allows it to:
- Find mentions in comments that regex missed
- Query the OpenTargets API to verify gene/disease/drug IDs
- Look up entities by name (e.g., "BRAF" → API query → ENSG00000157764)
- Verify IDs are valid before returning them
- Get accurate mappings instead of guessing
The LLM can make GraphQL queries like:

```graphql
query {
  search(queryString: "BRAF", entityNames: ["target"]) {
    hits { id name }
  }
}
```

This ensures all returned IDs are:
- **Verified** - checked against the live OpenTargets database
- **Accurate** - no guessing or incorrect mappings
- **Current** - uses the latest data from the OpenTargets platform
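The same query can be issued from Python. The sketch below only builds the request payload; executing it needs network access, and the endpoint URL is an assumption about the current location of the public Open Targets Platform API:

```python
import json

# Assumed public GraphQL endpoint for the Open Targets Platform.
OT_GRAPHQL_URL = "https://api.platform.opentargets.org/api/v4/graphql"

SEARCH_QUERY = """
query ($q: String!) {
  search(queryString: $q, entityNames: ["target"]) {
    hits { id name }
  }
}
"""

def build_search_payload(symbol: str) -> dict:
    """Build the JSON body for a target search, e.g. symbol='BRAF'."""
    return {"query": SEARCH_QUERY, "variables": {"q": symbol}}

payload = build_search_payload("BRAF")
# To execute: requests.post(OT_GRAPHQL_URL, json=payload).json()
print(json.dumps(payload["variables"]))
```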
Configuration is loaded from environment variables:

```bash
# GitHub settings
export GITHUB_OWNER=opentargets
export GITHUB_REPO=issues
export GITHUB_TOKEN=ghp_xxx          # Optional, increases rate limit

# LLM settings
export ANTHROPIC_API_KEY=sk-ant-xxx  # Required for LLM pass
export LLM_MODEL=claude-haiku-4-5
export LLM_BATCH_SIZE=5
export LLM_DELAY_MS=500

# Output settings
export OUTPUT_DIR=/path/to/output
export CSV_FILENAME=mined-scenarios.csv
export JSON_FILENAME=mined-scenarios.json

# Logging
export VERBOSE=true
```

Or use the Python API:
```python
from ot_miner import Config
from pathlib import Path

config = Config(
    github_token="ghp_xxx",
    anthropic_api_key="sk-ant-xxx",
    output_dir=Path("/tmp/output"),
    llm_batch_size=10,
    verbose=True,
)
```

Project layout:

```
ot_miner/
├── models.py          # Data models (ScenarioMapping, GitHubIssue)
├── config.py          # Configuration management
├── utils.py           # Utility functions, regex patterns, knowledge bases
├── cli.py             # Command-line entry point
├── miner.py           # Main orchestration (ScenarioMiner)
├── loaders/
│   └── __init__.py    # GitHub API loader and issue filtering
├── extractors/
│   ├── base.py        # Abstract base extractor interface
│   ├── regex.py       # Regex-based extraction (Pass 1)
│   ├── llm.py         # LLM-based enrichment (Pass 2)
│   └── __init__.py    # Extractor exports
└── writers/
    └── __init__.py    # Output writers (CSV, JSON, JSONL)
```
```
GitHub Issues
      ↓
[GitHubLoader]    - Fetch all issues with pagination
      ↓
[RegexExtractor]  - Pass 1: Fast pattern matching
      ↓
[LLMExtractor]    - Pass 2: Context-aware enrichment via LangChain
      ↓
[Writers]         - Output to CSV, JSON, etc.
      ↓
Test Scenarios (ready for Google Sheets import)
```
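The pipeline above reduces to a short orchestration loop. A simplified sketch of how `ScenarioMiner.run()` might chain the stages; the real control flow in `ot_miner/miner.py` may differ:

```python
from typing import Callable, List, Optional

def run_pipeline(
    load: Callable[[], List[str]],
    pass1: Callable[[str], Optional[dict]],
    pass2: Callable[[str, Optional[dict]], Optional[dict]],
    write: Callable[[List[dict]], None],
) -> List[dict]:
    """Load issues, run the regex pass then the LLM pass, write the results."""
    mappings = []
    for issue in load():
        draft = pass1(issue)            # Pass 1: cheap regex extraction
        enriched = pass2(issue, draft)  # Pass 2: LLM fills the gaps
        if enriched is not None:
            mappings.append(enriched)
    write(mappings)
    return mappings
```

Each stage is just a callable here, which mirrors the pluggable loader/extractor/writer split described earlier.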
The LLM extraction (Pass 2) uses LangChain to efficiently interface with Claude. LangChain handles:
- Message formatting and API communication
- JSON response parsing with Pydantic validation
- Retry logic and error handling
- Token counting for cost estimation
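The validation step can be illustrated without calling the API. The package validates with Pydantic via LangChain; the sketch below shows the same idea with stdlib dataclasses so it stays dependency-free, using a trimmed-down schema whose field names are borrowed from the output table (the real schema in `ot_miner/extractors/llm.py` may differ):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class MappingReply:
    """Trimmed-down reply schema; the package uses Pydantic models instead."""
    scenario_name: str
    drug_id: Optional[str] = None
    target_id: Optional[str] = None

def parse_reply(raw: str) -> MappingReply:
    """Parse and lightly validate a raw JSON reply from the model."""
    data = json.loads(raw)
    if "scenario_name" not in data:
        raise ValueError("reply missing required field 'scenario_name'")
    allowed = {"scenario_name", "drug_id", "target_id"}
    return MappingReply(**{k: v for k, v in data.items() if k in allowed})
```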
Simply provide your Anthropic API key and LLMExtractor handles the rest:
```python
from ot_miner import ScenarioMiner, Config

config = Config.from_env()  # Requires ANTHROPIC_API_KEY
miner = ScenarioMiner(config)
mappings = miner.run()      # Two-pass extraction with LLM enrichment
```

Each extracted scenario is a `ScenarioMapping` dataclass with:
| Field | Column | Description |
|---|---|---|
| `scenario_name` | A | Human-readable label derived from issue |
| `drug_id` | B | CHEMBL drug ID |
| `variant_id` | C | Main variant page ID |
| `variant_pgx` | D | Pharmacogenetics variant |
| `variant_molqtl` | E | Molecular QTL variant |
| `target_id` | F | Single Ensembl gene ID |
| `target_ids` | G | Comma-separated Ensembl IDs |
| `aotf_diseases` | H | Diseases for AOTF list |
| `disease_id` | I | EFO/MONDO disease ID |
| `aotf_genes` | J | Gene symbols for AOTF list |
| `disease_search` | K | Free-text disease search |
| `disease_alt` | L | Alternative disease IDs |
| `gwas_study` | M | GWAS study ID |
| `qtl_study` | N | Molecular QTL study ID |
| `credible_set_l2g` | O | L2G credible set hash |
| `credible_set_gwas` | P | GWAS credible set hash |
| `credible_set_qtl` | Q | QTL credible set hash |
| `source_url` | R | GitHub issue URL |
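The table maps one-to-one onto a dataclass plus a fixed CSV column order. A condensed sketch of the idea for the first three columns only; the real `ScenarioMapping` in `ot_miner/models.py` carries all 18 fields:

```python
from dataclasses import dataclass, astuple

@dataclass
class ScenarioMappingDemo:
    """First three columns (A..C) of the full 18-field ScenarioMapping."""
    scenario_name: str = ""  # Column A
    drug_id: str = ""        # Column B
    variant_id: str = ""     # Column C

# Field declaration order doubles as the CSV column order.
row = astuple(ScenarioMappingDemo("BRAF melanoma", "CHEMBL25"))
# row == ("BRAF melanoma", "CHEMBL25", "")
```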
- Run the miner to generate `mined-scenarios.csv`
- In Google Sheets: File → Import → Upload
- Select the CSV file
- Choose "Append to current sheet"
- Set separator to "Comma"
- Click "Import data"
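Before importing, a header sanity check catches a truncated or mis-delimited CSV. A small stdlib sketch; the expected column names are taken from the field table above:

```python
import csv
from pathlib import Path

# First few expected columns, per the field table; extend as needed.
EXPECTED_PREFIX = ["scenario_name", "drug_id", "variant_id"]

def check_csv_header(path: str) -> bool:
    """Return True if the CSV's header row starts with the expected columns."""
    with Path(path).open(newline="") as fh:
        header = next(csv.reader(fh), [])
    return header[: len(EXPECTED_PREFIX)] == EXPECTED_PREFIX
```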
To add a custom extractor, subclass `BaseExtractor`:

```python
from ot_miner.extractors import BaseExtractor
from ot_miner.models import GitHubIssue, ScenarioMapping, ExtractionResult
from typing import List, Optional

class CustomExtractor(BaseExtractor):
    def extract(self, issue: GitHubIssue) -> Optional[ScenarioMapping]:
        # Your extraction logic here
        pass

    def extract_batch(self, issues: List[GitHubIssue]) -> List[ExtractionResult]:
        # Batch processing logic
        pass
```

To add a custom writer, subclass `BaseWriter`:

```python
from ot_miner.writers import BaseWriter
from ot_miner.models import ScenarioMapping
from pathlib import Path
from typing import List

class CustomWriter(BaseWriter):
    def write(self, mappings: List[ScenarioMapping], path: Path) -> None:
        # Your output logic here
        pass

# Use in MultiWriter
from ot_miner.writers import MultiWriter

writers = MultiWriter()
writers.add_writer("custom", CustomWriter())
```

Ensure `ANTHROPIC_API_KEY` is set:

```bash
echo $ANTHROPIC_API_KEY  # Should print your API key
```

Use a GitHub token to increase rate limits:

```bash
export GITHUB_TOKEN=ghp_xxx
```

The LLM pipeline truncates issue bodies to 1500 characters to keep prompts manageable. Adjust in `ot_miner/extractors/llm.py` if needed.
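As a concrete instance of the writer interface, here is a JSONL writer sketch. It assumes `ScenarioMapping` is a dataclass (as `models.py` suggests); to keep the sketch self-contained and runnable, a stand-in base class replaces `ot_miner.writers.BaseWriter`:

```python
import dataclasses
import json
from pathlib import Path
from typing import List

class BaseWriter:
    """Stand-in for ot_miner.writers.BaseWriter, so the sketch is self-contained."""
    def write(self, mappings: List, path: Path) -> None:
        raise NotImplementedError

class JsonlWriter(BaseWriter):
    """Write one JSON object per line, one line per mapping dataclass."""
    def write(self, mappings: List, path: Path) -> None:
        with path.open("w") as fh:
            for m in mappings:
                fh.write(json.dumps(dataclasses.asdict(m)) + "\n")
```

Registered via `writers.add_writer("jsonl", JsonlWriter())`, it would run alongside the built-in CSV and JSON writers.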
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Apache License 2.0 - see LICENSE file for details
- Built for the Open Targets Platform
- Uses Claude API from Anthropic
- GitHub API integration via PyGithub
If you use this tool in your research, please cite:

```bibtex
@software{opentargets_issue_miner_2026,
  title={Open Targets GitHub Issue Scenario Miner},
  author={Open Targets},
  year={2026},
  url={https://github.com/opentargets/ot-github-issue-miner}
}
```

For issues, questions, or suggestions:
- Open a GitHub Issue
- Check existing documentation
- Review architecture documentation (if available)