
Data: config-driven ClinVarDataLoader and reference-validated windows #2

Open
cropsgg wants to merge 1 commit into pr/01-config-reference-assets from pr/02-clinvar-data-loader

cropsgg (Owner) commented Apr 15, 2026

Summary

Refactors ClinVarDataLoader to take AppSettings: config-driven ClinVar fetch, reference-validated variant windows, and YAML-driven synthetic data (pathogenic templates, benign synonymous/intronic, VUS).

Base branch: pr/01-config-reference-assets (stacked PR — retarget to main after #1 merges, or merge in order).

What changed

  • ClinVar parsing — Simple cDNA SNV HGVS patterns; optional RefSeq-in-title gate; reference base must match FASTA before a row is kept.
  • hgvs_linear_overrides — Explicit 0-based indices for HGVS expressions (e.g. intronic) where automatic linearization does not match the bundled reference layout.
  • Safety — Avoids labeling wild-type windows as pathogenic when a variant cannot be applied.
  • Init checks — Optional rejection of ambiguous bases in the reference; validates synthetic pathogenic positions (and optional require_ref_base) against FASTA.
  • Tests — tests/test_data_loader.py uses load_settings() with a temporary cache directory and keeps the leakage/stratification checks.
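
For reviewers, the parsing, override, and safety bullets above can be sketched together as follows. This is a minimal illustration, not the code in this PR: the function name variant_window, the regex, the cds_start and flank parameters, and the idea of keying overrides by the matched "c." expression are all assumptions made for the example.

```python
import re
from typing import Dict, Optional

# Assumed pattern (not the PR's): a cDNA position, an optional intronic
# offset such as +2 or -5, then REF>ALT.
SNV = re.compile(r"c\.(\d+)([+-]\d+)?([ACGT])>([ACGT])")


def variant_window(
    reference: str,
    hgvs: str,
    cds_start: int = 0,
    flank: int = 3,
    overrides: Optional[Dict[str, int]] = None,
) -> Optional[str]:
    """Return the mutated window, or None when the variant cannot be applied."""
    match = SNV.search(hgvs)
    if match is None:
        return None
    pos, offset, ref, alt = match.groups()
    key = match.group(0)
    if overrides and key in overrides:
        index = overrides[key]  # explicit 0-based index wins
    elif offset is None:
        index = cds_start + int(pos) - 1  # 1-based cDNA to 0-based linear
    else:
        return None  # intronic position with no override: skip the row
    # Reference gate: the stated REF base must be what the FASTA holds.
    if not 0 <= index < len(reference) or reference[index] != ref:
        return None
    start = max(0, index - flank)
    end = min(len(reference), index + flank + 1)
    return reference[start:index] + alt + reference[index + 1:end]
```

Returning None rather than the unmodified window is the point of the safety bullet: a caller that drops None rows can never attach a pathogenic label to a wild-type sequence.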

Why this PR exists

Training data quality is the ceiling for biological plausibility. This ties labels to actual alleles on the configured reference and makes ClinVar + synthetic generation explicit in YAML.
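
One way to picture the init checks that enforce this tie to the reference is the sketch below. It assumes synthetic pathogenic sites arrive as (0-based position, optional required base) pairs; check_reference and its parameters are hypothetical names for illustration, and only require_ref_base is a term taken from this PR.

```python
from typing import Iterable, List, Optional, Tuple


def check_reference(
    reference: str,
    synthetic_sites: Iterable[Tuple[int, Optional[str]]],
    reject_ambiguous: bool = True,
) -> List[str]:
    """Collect every problem found instead of stopping at the first one."""
    problems: List[str] = []
    if reject_ambiguous:
        # Anything outside A/C/G/T (for example N) counts as ambiguous here.
        bad = sorted(set(reference.upper()) - set("ACGT"))
        if bad:
            problems.append(f"ambiguous bases in reference: {''.join(bad)}")
    for position, require_ref_base in synthetic_sites:
        if not 0 <= position < len(reference):
            problems.append(f"position {position} is outside the reference")
        elif require_ref_base and reference[position].upper() != require_ref_base.upper():
            problems.append(
                f"position {position}: expected {require_ref_base}, "
                f"found {reference[position]}"
            )
    return problems
```

A loader could raise at construction time if the returned list is non-empty, so a mismatched YAML template fails fast rather than producing mislabeled rows.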

Dependencies

How to verify

pip install -r requirements.txt pytest
pytest tests/test_data_loader.py -q
