diff --git a/docs/advanced/debugging-results.md b/docs/advanced/debugging-results.md index f7b835d..256439b 100644 --- a/docs/advanced/debugging-results.md +++ b/docs/advanced/debugging-results.md @@ -54,6 +54,9 @@ To inspect a site with high N_FAIL, query with `--format json` to see all fields afquery query --db ./db/ --locus chr1:12345678 --format json ``` +!!! tip "Identify failing samples" + Use `afquery variant-info --db ./db/ --locus chr1:12345678` to see exactly which samples have FAIL status and their metadata (technology, phenotype codes). This helps determine if failures cluster in a specific technology or sample subset. See [Variant Info](../guides/variant-info.md). + If N_FAIL is consistently high across many sites, check the variant calling pipeline and FILTER field settings in your VCFs. --- diff --git a/docs/advanced/filter-pass-tracking.md b/docs/advanced/filter-pass-tracking.md index 45875b3..cc18c61 100644 --- a/docs/advanced/filter-pass-tracking.md +++ b/docs/advanced/filter-pass-tracking.md @@ -69,6 +69,16 @@ for r in results: `N_FAIL` is always an `int` (default `0`). +### Identifying specific FAIL samples + +To see which individual samples have FAIL status at a position, use `variant-info`: + +```bash +afquery variant-info --db ./db/ --locus chr1:925952 +``` + +Each carrier row shows its `filter` column as `PASS` or `FAIL`, along with sample metadata (technology, phenotype codes). This helps pinpoint whether failures cluster in a specific technology or sample group. See [Variant Info](../guides/variant-info.md) for full options. + --- ## VCF Annotation @@ -85,19 +95,6 @@ afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf --- -## Schema Version Compatibility - -The `fail_bitmap` and `N_FAIL` tracking requires schema version 2.0 or later. Databases created with older versions of AFQuery do not contain `fail_bitmap` data. - -| Database schema | N_FAIL behavior | -|----------------|-----------------| -| ≥ 2.0 | `N_FAIL` is always an integer (0 or more) | -| < 2.0 (legacy) | `N_FAIL` is `None` in Python API results | - -To upgrade a legacy database, rebuild it with `afquery create-db`. There is no in-place migration. - ---- - ## PASS-Only Enforcement AF reflects the quality-filtered allele frequency — the frequency of the alt allele among high-quality calls. This is appropriate for most clinical and research use cases. PASS-only ingestion is always enforced. diff --git a/docs/faq.md b/docs/faq.md index 2f554c0..bca7372 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -209,7 +209,7 @@ Without stratification, rare variant interpretation in mixed cohorts can be misl AFQuery is purpose-built for fast subcohort AF computation and is not a general-purpose genomic database: - **Not a joint genotyper**: AFQuery does not perform joint genotyping. Input VCFs should be individually called before ingestion. -- **Not a variant database**: AFQuery stores only genotype-level summaries (bitmaps). Individual sample genotypes cannot be retrieved from the database. +- **Not a genotype store**: AFQuery stores genotype summaries as bitmaps, not raw FORMAT fields. Use `variant-info` to list carriers and their genotype class (het/hom) at a specific position; for full per-sample VCF fields (GQ, DP, AD, etc.), consult the source VCFs. - **No statistical genetics**: AFQuery does not compute Hardy-Weinberg equilibrium, population stratification, or other statistical genetics metrics. - **Batch queries**: The `--from-file` batch mode supports variants across multiple chromosomes in a single call. Point queries (`--locus`) and region queries (`--region`) target a single position or range; for multi-position multi-chromosome lookups, use `--from-file`. - **Cohort size limit**: Performance at >100K samples has not been validated. Memory requirements for the build phase scale with cohort size. diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index 56f813a..cb9b1bd 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -90,7 +90,19 @@ See [Sample Filtering](../guides/sample-filtering.md) for the full include/exclu --- -## 5. Query a Region +## 5. Inspect Carriers (optional) + +See which samples carry the variant you just queried: + +```bash +afquery variant-info --db ./my_db/ --locus chr1:925952 +``` + +This lists each carrier with their sex, technology, phenotype codes, genotype (het/hom), and FILTER status. See [Variant Info](../guides/variant-info.md) for details. + +--- + +## 6. Query a Region ```bash afquery query \ @@ -100,7 +112,7 @@ afquery query \ --- -## 6. Annotate a VCF +## 7. Annotate a VCF Given a VCF with variants you want to annotate: @@ -130,5 +142,6 @@ The output VCF gains INFO fields (see [Annotate a VCF](../guides/annotate-vcf.md - [Key Concepts](concepts.md) — understand how bitmaps, Parquet, and metadata filtering work together - [Sample Filtering](../guides/sample-filtering.md) — full syntax for phenotype, sex, and technology filters +- [Variant Info](../guides/variant-info.md) — list carriers of any variant with metadata - [Annotate a VCF](../guides/annotate-vcf.md) — annotation options, parallelism, and downstream usage - [ACMG Criteria](../use-cases/acmg-use-cases.md) — applying local AF to BA1, PM2, and PS4 diff --git a/docs/getting-started/tutorial.md b/docs/getting-started/tutorial.md index f2c9fcc..92f6431 100644 --- a/docs/getting-started/tutorial.md +++ b/docs/getting-started/tutorial.md @@ -138,7 +138,39 @@ chr1:946000 T>C AC=2 AN=20 AF=0.1000 n_eligible=10 N_HET=2 N_HOM_ALT=0 N_ --- -## 5. Filter by Sex +## 5. Inspect Variant Carriers + +After finding a variant of interest, use `variant-info` to see which specific samples carry it: + +```bash +afquery variant-info --db ./demo_db/ --locus chr1:925952 +``` + +Example output: + +``` +sample_id sample_name sex tech phenotypes genotype filter +--------- ----------- ------ ------ --------------- -------- ------ +0 DEMO_001 female wgs E11.9,I10 het PASS +2 DEMO_003 male wgs E11.9 het PASS +4 DEMO_005 female wes_v1 E11.9,control het PASS +6 DEMO_007 male wes_v2 control het PASS +8 DEMO_009 female wgs E11.9 hom PASS +``` + +Each row is one carrier. The `genotype` column shows `het` (heterozygous), `hom` (homozygous alt), or `alt` (non-ref with FILTER≠PASS). The `filter` column indicates whether the call passed quality filters in the source VCF. + +For machine-readable output, use `--format tsv`: + +```bash +afquery variant-info --db ./demo_db/ --locus chr1:925952 --format tsv > carriers.tsv +``` + +See [Variant Info](../guides/variant-info.md) for full options including allele-specific queries and sample filtering. + +--- + +## 6. Filter by Sex Query only female samples: @@ -168,7 +200,7 @@ AN drops from 20 to 10 in both cases because only 5 samples are eligible. The AF --- -## 6. Filter by Phenotype +## 7. Filter by Phenotype Query samples tagged with `E11.9`: @@ -198,7 +230,7 @@ The `^` prefix means "exclude". Excluding controls removes 4 samples, leaving 6. --- -## 7. Filter by Technology +## 8. Filter by Technology Restrict to WGS samples only: @@ -259,7 +291,7 @@ No variants found for the given filters. --- -## 8. Combine Filters +## 9. Combine Filters All filter dimensions compose with AND: @@ -282,7 +314,7 @@ Only one sample meets all three criteria (DEMO_001: female, wgs, E11.9). With n_ --- -## 9. Annotate a VCF +## 10. Annotate a VCF Use one of the demo VCFs as input: @@ -313,7 +345,7 @@ See [Annotate a VCF](../guides/annotate-vcf.md) for filtering and downstream usa --- -## 10. Bulk Export with Dump +## 11. Bulk Export with Dump Export all variant frequencies to CSV: @@ -350,7 +382,7 @@ This adds columns following the pattern `AC_{sex}_{tech}`, `AN_{sex}_{tech}`, `A --- -## 11. Interpret Results with ACMG Criteria +## 12. Interpret Results with ACMG Criteria With the annotated VCF or query results, you can apply ACMG criteria: diff --git a/docs/guides/variant-info.md b/docs/guides/variant-info.md new file mode 100644 index 0000000..c4cde67 --- /dev/null +++ b/docs/guides/variant-info.md @@ -0,0 +1,163 @@ +# Variant Info + +`afquery variant-info` returns the list of samples carrying a specific variant, together with each sample's metadata: sex, sequencing technology, phenotype codes, genotype (het/hom), and FILTER status (PASS/FAIL). + +This avoids the need to re-query raw VCF files when inspecting individual variant carriers. + +--- + +## Basic usage + +```bash +afquery variant-info --db ./db/ --locus chr1:925952 +``` + +!!! tip + `variant-info` is the natural next step after `query` — once you find a variant of interest, use it to see which specific samples carry it. + +By default all samples are queried and results are printed as an aligned text table: + +``` +sample_id sample_name sex tech phenotypes genotype filter +--------- ----------- ------ --------- ------------ -------- ------ +3 P003 male WGS E11.9,J45 het PASS +17 P017 female WES_kit_A E11.9 hom PASS +42 P042 male WGS I10 alt FAIL +``` + +--- + +## Filtering to a specific allele + +When multiple alleles exist at the same position, use `--ref` and `--alt` to restrict to one: + +```bash +afquery variant-info --db ./db/ --locus chr17:41245466 --ref A --alt T +``` + +!!! note + Without `--ref`/`--alt`, carriers for all alleles at the locus are returned and a warning is emitted if more than one allele is found. Specify both flags to disambiguate at multi-allelic sites. + +--- + +## Sample filters + +`variant-info` accepts the same sample filters as `query`: + +```bash +# Only female carriers +afquery variant-info --db ./db/ --locus chr1:925952 --sex female + +# Only carriers with phenotype E11.9 +afquery variant-info --db ./db/ --locus chr1:925952 --phenotype E11.9 + +# Exclude phenotype I10 +afquery variant-info --db ./db/ --locus chr1:925952 --phenotype ^I10 + +# Restrict to WGS samples +afquery variant-info --db ./db/ --locus chr1:925952 --tech WGS + +# Combine filters +afquery variant-info --db ./db/ --locus chr1:925952 \ + --sex female --phenotype E11.9 --tech WGS,WES_kit_A +``` + +See [Sample Filtering](sample-filtering.md) for the full filter syntax. + +--- + +## Output formats + +### TSV + +Machine-readable tab-separated output, suitable for downstream processing: + +```bash +afquery variant-info --db ./db/ --locus chr1:925952 --format tsv > carriers.tsv +``` + +``` +sample_id sample_name sex tech phenotypes genotype filter +3 P003 male WGS E11.9,J45 het PASS +17 P017 female WES_kit_A E11.9 hom PASS +42 P042 male WGS I10 alt FAIL +``` + +### JSON + +Structured output with variant metadata and a sample list: + +```bash +afquery variant-info --db ./db/ --locus chr1:925952 --format json +``` + +```json +{ + "variant": { + "chrom": "chr1", + "pos": 925952, + "ref": ".", + "alt": "." + }, + "samples": [ + { + "sample_id": 3, + "sample_name": "P003", + "sex": "male", + "tech": "WGS", + "phenotypes": ["E11.9", "J45"], + "genotype": "het", + "filter": "PASS" + }, + { + "sample_id": 42, + "sample_name": "P042", + "sex": "male", + "tech": "WGS", + "phenotypes": ["I10"], + "genotype": "alt", + "filter": "FAIL" + } + ] +} +``` + +When `--ref` and `--alt` are specified, the `variant` block contains the actual alleles. Otherwise, `"."` is used as a placeholder. + +--- + +## Genotype values + +| Value | Meaning | +|---|---| +| `het` | Heterozygous carrier, FILTER=PASS | +| `hom` | Homozygous alt carrier, FILTER=PASS | +| `alt` | Non-ref carrier with FILTER≠PASS (ploidy unknown) | + +--- + +## All options + +| Option | Default | Description | +|---|---|---| +| `--db` | required | Path to database directory | +| `--locus` | required | `CHROM:POS` (e.g. `chr1:925952`) | +| `--ref` | — | Filter to specific reference allele | +| `--alt` | — | Filter to specific alternate allele | +| `--phenotype` | all | Include phenotype (repeatable; `^CODE` excludes) | +| `--sex` | `both` | `male`, `female`, or `both` | +| `--tech` | all | Include technology (repeatable; `^NAME` excludes) | +| `--format` | `text` | `text`, `tsv`, or `json` | +| `--no-warn` | off | Suppress `AfqueryWarning` messages | + +See also [CLI Reference → variant-info](../reference/cli.md#variant-info). + +--- + +## Next Steps + +- [Sample Filtering](sample-filtering.md) — full filter syntax for phenotype, sex, and technology +- [Understanding Output](../getting-started/understanding-output.md) — field definitions and special cases +- [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md) — understanding FAIL genotypes +- [Python API → variant_info](../reference/python-api.md#variant_info) — programmatic access +- [ACMG Criteria](../use-cases/acmg-use-cases.md) — using carrier info for variant classification diff --git a/docs/index.md b/docs/index.md index d206c52..a634a4d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -30,6 +30,7 @@ AFQuery pre-indexes genotypes as [Roaring Bitmaps](https://roaringbitmap.org/) i - **Server-less** — a directory of Parquet files + SQLite. Copy to share, no daemon required. - **Ploidy-aware** — correct AN on chrX PAR/non-PAR, chrY, and chrM. - **Technology-aware AN** — per-position capture BED intersection across WGS, WES kits, and panels. +- **Carrier lookup** — list samples carrying any variant with full metadata (sex, tech, phenotypes, genotype, FILTER status). - **VCF annotation** — add `AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL` INFO fields from any sample subset. - **Audit changelog** — every database operation is recorded for reproducibility. @@ -78,18 +79,22 @@ graph TD E["Classify variants
using ACMG criteria"] F["Compare AF across
groups"] + M["Find carriers of
a variant"] + A -->|First time| G["5-Min Quickstart"] A -->|Build| B A -->|Query| C A -->|Annotate| D A -->|Classify| E A -->|Compare| F + A -->|Carriers| M B --> H["Create a Database"] C --> I["Query Guide"] D --> J["Annotate a VCF"] E --> K["ACMG Criteria"] F --> L["Cohort Stratification"] + M --> N["Variant Info"] click G "getting-started/quickstart/" click H "guides/create-database/" @@ -97,6 +102,7 @@ graph TD click J "guides/annotate-vcf/" click K "use-cases/acmg-use-cases/" click L "use-cases/cohort-stratification/" + click N "guides/variant-info/" style A fill:#e3f2fd style G fill:#e8f5e9 diff --git a/docs/reference/cli.md b/docs/reference/cli.md index 689c723..9ef24ba 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -104,6 +104,30 @@ afquery dump [OPTIONS] --- +## variant-info + +List samples carrying a specific variant, with their metadata. + +``` +afquery variant-info [OPTIONS] +``` + +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `--db` | TEXT | **required** | Path to database directory | +| `--locus` | TEXT | **required** | Single position as `CHROM:POS` (e.g., `chr1:925952`) | +| `--ref` | TEXT | None | Filter to specific reference allele | +| `--alt` | TEXT | None | Filter to specific alternate allele | +| `--phenotype` | TEXT | None | Phenotype filter. Repeatable; comma-separated or multiple flags. Use `^` prefix to exclude. | +| `--sex` | `male`\|`female`\|`both` | `both` | Restrict to specified sex | +| `--tech` | TEXT | None | Technology filter. Repeatable; comma-separated or multiple flags. Use `^` prefix to exclude. | +| `--format` | `text`\|`json`\|`tsv` | `text` | Output format | +| `--no-warn` | flag | False | Suppress `AfqueryWarning` messages | + +Returns one row per carrier sample with genotype (`het`/`hom`/`alt`) and FILTER status (`PASS`/`FAIL`). Use `--ref`/`--alt` to restrict to a specific allele when multiple alleles share the same position. + +--- + ## update-db Add samples, remove samples, update sample metadata, or compact the database. diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md index 8851ad8..28cd6b7 100644 --- a/docs/reference/glossary.md +++ b/docs/reference/glossary.md @@ -20,6 +20,10 @@ See [Roaring Bitmap](#roaring-bitmap). A 1 Mbp (1,000,000 base pair) genomic interval used to partition variant data within each chromosome. Bucket 0 covers positions 0–999,999; bucket 1 covers 1,000,000–1,999,999; and so on. See [Data Model](data-model.md). +## Carrier + +A sample that has at least one copy of the alternate allele at a given position (heterozygous or homozygous alt). Use `variant-info` to list carriers with their metadata. See [Variant Info](../guides/variant-info.md). + ## Capture Index An interval tree (stored as a `.pkl` file) built from a BED file that defines the genomic regions covered by a whole-exome sequencing (WES) technology. Used to determine which WES samples are eligible at a given position. WGS samples do not require a capture index. See [Key Concepts](../getting-started/concepts.md#capture-index-wes). diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md index 404ab4a..bd3033b 100644 --- a/docs/reference/python-api.md +++ b/docs/reference/python-api.md @@ -281,6 +281,60 @@ Export allele frequency data to CSV. If `output` is None, writes to stdout. --- +### variant_info + +```python +db.variant_info( + chrom: str, + pos: int, + ref: str | None = None, + alt: str | None = None, + phenotype: list[str] | None = None, + sex: str = "both", + tech: list[str] | None = None, +) -> list[SampleCarrier] +``` + +Return all samples carrying the variant at the given position, with their metadata. + +**Parameters:** + +| Parameter | Type | Description | +|-----------|------|-------------| +| `chrom` | str | Chromosome name (e.g., `"chr1"`, `"chrX"`) | +| `pos` | int | 1-based genomic position | +| `ref` | str \| None | Filter to specific reference allele | +| `alt` | str \| None | Filter to specific alternate allele | +| `phenotype` | list[str] \| None | Phenotype filter codes. Use `"^CODE"` prefix to exclude. | +| `sex` | str | `"both"` (default), `"male"`, or `"female"` | +| `tech` | list[str] \| None | Technology filter. Use `"^TECH"` prefix to exclude. | + +**Returns:** List of `SampleCarrier` objects sorted by `sample_id`. Empty list if no eligible carrier exists. + +**Example:** + +```python +carriers = db.variant_info("chr1", pos=925952) +for c in carriers: + print(f"{c.sample_name} {c.genotype} {c.tech_name} {c.phenotypes}") + +# With allele and sample filters +carriers = db.variant_info( + chrom="chr17", pos=41245466, ref="A", alt="T", + phenotype=["E11.9"], sex="female", tech=["WGS"], +) +``` + +The `variant_info` function is also available at the top level for one-off use: + +```python +from afquery import variant_info + +carriers = variant_info("./db/", "chr1", 925952, ref="G", alt="A") +``` + +--- + ### add_samples ```python @@ -415,6 +469,34 @@ class VariantKey: --- +## SampleCarrier + +```python +@dataclass +class SampleCarrier: + sample_id: int # 0-based internal ID + sample_name: str # Name from manifest + sex: str # 'male' | 'female' + tech_name: str # Sequencing technology + phenotypes: list[str] # Sorted phenotype codes + genotype: str # 'het' | 'hom' | 'alt' + filter_pass: bool # False = FILTER≠PASS +``` + +Returned by `Database.variant_info()`. Each instance represents one sample carrying the queried variant. + +| Field | Type | Description | +|-------|------|-------------| +| `sample_id` | int | 0-based internal sample ID | +| `sample_name` | str | Sample name from manifest | +| `sex` | str | `"male"` or `"female"` | +| `tech_name` | str | Sequencing technology name | +| `phenotypes` | list[str] | Sorted list of phenotype codes | +| `genotype` | str | `"het"` (heterozygous, PASS), `"hom"` (homozygous alt, PASS), or `"alt"` (non-ref, FILTER≠PASS) | +| `filter_pass` | bool | `True` if FILTER=PASS, `False` otherwise | + +--- + ## SampleFilter ```python @@ -488,4 +570,9 @@ batch_results = db.query_batch_multi(variants) regions = [("chr1", 900000, 1000000), ("chrX", 5000000, 6000000)] region_results = db.query_region_multi(regions) print(f"Found {len(region_results)} variants across {len(regions)} regions") + +# Carrier lookup +carriers = db.variant_info("chr1", pos=925952) +for c in carriers: + print(f"{c.sample_name} {c.genotype} {c.tech_name} PASS={c.filter_pass}") ``` diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 6b93444..af9d921 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -119,16 +119,6 @@ afquery info --db ./db/ --samples | grep SAMP --- -## Version Compatibility - -If you get `"Unsupported schema_version"` errors, upgrade AFQuery: - -```bash -pip install --upgrade afquery -``` - ---- - ## WES Technology Treated as WGS **Symptom:** AN for WES samples is much higher than expected; positions outside the capture panel return results. diff --git a/docs/use-cases/acmg-use-cases.md b/docs/use-cases/acmg-use-cases.md index 02d306c..66132a9 100644 --- a/docs/use-cases/acmg-use-cases.md +++ b/docs/use-cases/acmg-use-cases.md @@ -133,6 +133,9 @@ if cases and controls: print("Strong enrichment — PS4 applicable if case count is sufficient") ``` +!!! tip "Verify carriers" + Use `afquery variant-info` to list the specific samples carrying the variant and confirm their phenotype assignments match expectations. See [Variant Info](../guides/variant-info.md). + !!! warning "Statistical considerations" The original ACMG/AMP framework (Richards et al., 2015) does not specify a minimum odds ratio for PS4. The OR > 5.0 threshold used here is a widely adopted convention, but ClinGen expert panels have proposed different thresholds depending on the gene/disease context (e.g., OR ≥ 6 for strong, OR ≥ 3 for moderate in hearing loss). The enrichment ratio above is a screening tool. For formal PS4 application, compute a Fisher's exact test on the 2×2 table of (carrier/non-carrier) × (case/control) and apply the thresholds relevant to your clinical context. @@ -223,6 +226,7 @@ bcftools filter -i 'AFQUERY_N_FAIL > 0' annotated.vcf | head ## Next Steps +- [Variant Info](../guides/variant-info.md) — list carriers with metadata to verify phenotype assignments - [Clinical Prioritization](../use-cases/clinical-prioritization.md) — full annotation and filtering workflow - [Pseudo-controls](../use-cases/pseudo-controls.md) — case vs. control AF comparison - [Population-Specific AF](../use-cases/population-specific-af.md) — ancestry-stratified queries diff --git a/docs/use-cases/clinical-prioritization.md b/docs/use-cases/clinical-prioritization.md index 3b154b3..6dad977 100644 --- a/docs/use-cases/clinical-prioritization.md +++ b/docs/use-cases/clinical-prioritization.md @@ -114,8 +114,19 @@ A typical clinical pipeline retains ~1,500 rare/novel candidates after cohort AF For detailed ACMG workflows with worked examples and AN threshold guidance, see [ACMG Criteria (BA1/PM2/PS4)](acmg-use-cases.md). +### 6. Inspect carriers of rare candidates + +After identifying rare variants, use `variant-info` to see who carries each one — useful for confirming case/control enrichment or checking sample quality: + +```bash +afquery variant-info --db ./db/ --locus chr2:166845670 --ref G --alt A +``` + +This lists each carrier with their phenotype codes, sequencing technology, genotype, and FILTER status. See [Variant Info](../guides/variant-info.md) for filtering options. + ## Next Steps +- [Variant Info](../guides/variant-info.md) — list carriers of any variant with metadata - [Annotate a VCF](../guides/annotate-vcf.md) — full annotation options - [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md) — using N_FAIL in filtering - [Population-Specific AF](population-specific-af.md) — local vs. gnomAD comparison diff --git a/mkdocs.yml b/mkdocs.yml index 6dfd8ff..d7eb52d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -84,6 +84,7 @@ nav: - Query Allele Frequencies: guides/query.md - Annotate a VCF: guides/annotate-vcf.md - Bulk Export: guides/dump-export.md + - Variant Info: guides/variant-info.md - Sample Filtering: guides/sample-filtering.md - 🏥 Clinical Workflows: - ACMG Criteria (BA1/PM2/PS4): use-cases/acmg-use-cases.md diff --git a/src/afquery/__init__.py b/src/afquery/__init__.py index 137c149..0103038 100644 --- a/src/afquery/__init__.py +++ b/src/afquery/__init__.py @@ -1,5 +1,6 @@ from afquery._version import __version__ from .database import Database -from .models import QueryResult, AfqueryWarning +from .models import QueryResult, SampleCarrier, AfqueryWarning +from .variant_info import variant_info -__all__ = ["Database", "QueryResult", "AfqueryWarning", "__version__"] +__all__ = ["Database", "QueryResult", "SampleCarrier", "AfqueryWarning", "variant_info", "__version__"] diff --git a/src/afquery/cli.py b/src/afquery/cli.py index e13212b..18450a7 100644 --- a/src/afquery/cli.py +++ b/src/afquery/cli.py @@ -109,6 +109,57 @@ def _print_results(results, fmt: str) -> None: ) +def _print_carriers(carriers, variant_key, fmt: str) -> None: + """Print variant_info results in the requested format.""" + if fmt == "json": + out = { + "variant": { + "chrom": variant_key[0], "pos": variant_key[1], + "ref": variant_key[2], "alt": variant_key[3], + }, + "samples": [ + { + "sample_id": c.sample_id, + "sample_name": c.sample_name, + "sex": c.sex, + "tech": c.tech_name, + "phenotypes": c.phenotypes, + "genotype": c.genotype, + "filter": "PASS" if c.filter_pass else "FAIL", + } + for c in carriers + ], + } + click.echo(json.dumps(out, indent=2)) + elif fmt == "tsv": + click.echo("sample_id\tsample_name\tsex\ttech\tphenotypes\tgenotype\tfilter") + for c in carriers: + click.echo( + f"{c.sample_id}\t{c.sample_name}\t{c.sex}\t{c.tech_name}\t" + f"{','.join(c.phenotypes)}\t{c.genotype}\t" + f"{'PASS' if c.filter_pass else 'FAIL'}" + ) + else: # text + if not carriers: + click.echo("No carriers found for the given filters.") + return + headers = ["sample_id", "sample_name", "sex", "tech", "phenotypes", "genotype", "filter"] + rows = [ + [ + str(c.sample_id), c.sample_name, c.sex, c.tech_name, + ",".join(c.phenotypes) if c.phenotypes else ".", + c.genotype, "PASS" if c.filter_pass else "FAIL", + ] + for c in carriers + ] + widths = [max(len(h), max((len(r[i]) for r in rows), default=0)) for i, h in enumerate(headers)] + fmt_row = " ".join(f"{{:<{w}}}" for w in widths) + click.echo(fmt_row.format(*headers)) + click.echo(" ".join("-" * w for w in widths)) + for row in rows: + click.echo(fmt_row.format(*row)) + + @click.group() def cli(): """AFQuery: bitmap-indexed allele frequency engine for local genomic cohorts. @@ -116,7 +167,7 @@ def cli(): Enables fast AC/AN/AF queries on user-defined subcohorts (phenotype, sex, technology) without rescanning VCFs. - Commands: query, annotate, dump, info, version, create-db, update-db, check, benchmark + Commands: query, variant-info, annotate, dump, info, version, create-db, update-db, check, benchmark """ @@ -178,6 +229,47 @@ def query(db, locus, region, from_file, phenotype, sex, ref, alt, tech, fmt, no_ _print_results(results, fmt) +@cli.command("variant-info") +@click.option("--db", required=True, help="Path to database directory.") +@click.option("--locus", required=True, help="Position as CHROM:POS (e.g., chr1:925952).") +@click.option("--ref", default=None, help="Filter to specific reference allele.") +@click.option("--alt", default=None, help="Filter to specific alternate allele.") +@click.option("--phenotype", multiple=True, help="Phenotype to include. Repeatable; comma-separated or multiple flags. Use ^ prefix to exclude. (default: include all samples)") +@click.option("--sex", default="both", type=click.Choice(["male", "female", "both"]), help="Restrict to specified sex. Options: male, female, both. (default: both)") +@click.option("--tech", multiple=True, help="Technology filter. Repeatable; comma-separated or multiple flags. Use ^ prefix to exclude. (default: include all samples)") +@click.option("--format", "fmt", default="text", type=click.Choice(["text", "json", "tsv"]), help="Output format. Options: text, json, tsv. (default: text)") +@click.option("--no-warn", is_flag=True, default=False, help="Suppress AfqueryWarning messages.") +def variant_info_cmd(db, locus, ref, alt, phenotype, sex, tech, fmt, no_warn): + """List samples carrying a specific variant, with their metadata. + + Returns one row per carrier sample with genotype (het/hom/alt) and FILTER + status (PASS/FAIL). Use --ref/--alt to restrict to a specific allele when + multiple alleles share the same position. + + \b + Examples: + afquery variant-info --db ./db --locus chr1:925952 + afquery variant-info --db ./db --locus chr1:925952 --ref A --alt T + afquery variant-info --db ./db --locus chrX:2700000 --sex female --format tsv + """ + if no_warn: + import warnings + from .models import AfqueryWarning + warnings.filterwarnings("ignore", category=AfqueryWarning) + + chrom, pos = _parse_locus(locus) + database = Database(db) + carriers = database.variant_info( + chrom=chrom, pos=pos, + ref=ref, alt=alt, + phenotype=_expand_tokens(phenotype), sex=sex, + tech=_expand_tokens(tech), + ) + + _variant_key = (chrom, pos, ref or ".", alt or ".") + _print_carriers(carriers, _variant_key, fmt) + + @cli.command() @click.option("--db", required=True, help="Path to database directory.") @click.option("--input", "input_vcf", required=True, help="Input VCF file (plain or .gz).") @@ -201,7 +293,7 @@ def annotate(db, input_vcf, output_vcf, phenotype, sex, tech, threads, verbose, AFQUERY_N_HET heterozygous carrier count (per ALT) AFQUERY_N_HOM_ALT homozygous alt count (per ALT) AFQUERY_N_HOM_REF homozygous ref count (per ALT) - AFQUERY_N_FAIL samples with FILTER!=PASS (v2 databases only) + AFQUERY_N_FAIL samples with FILTER!=PASS """ if no_warn: import warnings diff --git a/src/afquery/database.py b/src/afquery/database.py index dbd3e99..12f2c95 100644 --- a/src/afquery/database.py +++ b/src/afquery/database.py @@ -2,7 +2,7 @@ import sqlite3 from pathlib import Path -from .models import QueryParams, QueryResult, SampleFilter +from .models import QueryParams, QueryResult, SampleCarrier, SampleFilter from .query import QueryEngine @@ -86,6 +86,40 @@ def query_region( sf = self._make_filter(phenotype, sex, tech) return self._engine.query_region(chrom, start, end, sf) + def variant_info( + self, + chrom: str, + pos: int, + ref: str | None = None, + alt: str | None = None, + phenotype: list[str] | None = None, + sex: str = "both", + tech: list[str] | None = None, + ) -> list[SampleCarrier]: + """Return all samples carrying a variant, with their metadata. + + Each element includes sample name, sex, technology, phenotype codes, + genotype (``"het"``, ``"hom"``, or ``"alt"`` for FILTER≠PASS samples), + and FILTER status. + + Args: + chrom: Chromosome (e.g. ``"chr1"`` or ``"1"``). + pos: Position, 1-based. + ref: Reference allele filter. If omitted, all alleles at *pos* are returned. + alt: Alternate allele filter. If omitted, all alleles at *pos* are returned. + phenotype: Phenotype filter tokens. Use ``"^CODE"`` prefix to exclude. + sex: ``"both"`` (default), ``"male"``, or ``"female"``. + tech: Technology filter tokens. Use ``"^TECH"`` prefix to exclude. + + Returns: + List of :class:`~afquery.models.SampleCarrier` sorted by + ``sample_id``. Empty list if the variant is absent or no eligible + carrier exists. + """ + sf = self._make_filter(phenotype, sex, tech) + params = QueryParams(chrom=chrom, pos=pos, filter=sf, ref=ref, alt=alt) + return self._engine.variant_info(params) + def query_region_multi( self, regions: list[tuple[str, int, int]], diff --git a/src/afquery/models.py b/src/afquery/models.py index 4b7cf3a..996a5ac 100644 --- a/src/afquery/models.py +++ b/src/afquery/models.py @@ -73,3 +73,15 @@ class QueryResult: N_HOM_ALT: int = 0 N_HOM_REF: int = 0 N_FAIL: int = 0 + + +@dataclass +class SampleCarrier: + """A sample carrying a given variant, with its associated metadata.""" + sample_id: int + sample_name: str + sex: str # 'male' | 'female' + tech_name: str + phenotypes: list[str] + genotype: str # 'het' | 'hom' | 'alt' (FILTER≠PASS, ploidy unknown) + filter_pass: bool # False = FILTER≠PASS diff --git a/src/afquery/query.py b/src/afquery/query.py index 1f1dd76..450dd97 100644 --- a/src/afquery/query.py +++ b/src/afquery/query.py @@ -9,7 +9,7 @@ from .bitmaps import deserialize from .capture import CaptureIndex, load_capture_indices from .constants import normalize_chrom, ALL_CHROMS, CHROM_ORDER -from .models import AfqueryWarning, QueryParams, QueryResult, VariantKey, Sample, Technology, SampleFilter +from .models import AfqueryWarning, QueryParams, QueryResult, SampleCarrier, VariantKey, Sample, Technology, SampleFilter from .ploidy import compute_AN, split_ploidy BATCH_IN_THRESHOLD = 10_000 @@ -43,9 +43,18 @@ def __init__(self, db_path: str): "SELECT tech_id, tech_name, bed_path FROM technologies" ).fetchall() ] + + # Phenotype cache: sample_id → list of phenotype codes + self._sample_phenotypes: dict[int, list[str]] = {} + for row in con.execute( + "SELECT sample_id, phenotype_code FROM sample_phenotype" + ).fetchall(): + self._sample_phenotypes.setdefault(row[0], []).append(row[1]) + con.close() self._samples = samples + self._samples_by_id: dict[int, Sample] = {s.sample_id: s for s in samples} self._tech_map = {t.tech_id: t for t in techs} sex_bms = self._bitmaps.get("sex", {}) self._male_bm = sex_bms.get("male", BitMap()) @@ -549,3 +558,108 @@ def query_region( sample_bm = self._build_sample_bitmap(sf) return self._query_region_inner(chrom, start, end, sample_bm) + + def variant_info(self, params: QueryParams) -> list[SampleCarrier]: + """Return all samples carrying the variant at params.chrom:params.pos. + + Each returned :class:`~afquery.models.SampleCarrier` includes the + sample's name, sex, technology, phenotype codes, genotype (het/hom/alt), + and FILTER status. + + Args: + params: Query parameters (chrom, pos, optional ref/alt, sample filter). + + Returns: + List of :class:`~afquery.models.SampleCarrier`, sorted by + ``sample_id``. Empty if the variant is absent or no eligible + carrier exists. + """ + chrom = normalize_chrom(params.chrom) + pos = params.pos + + if chrom not in self._all_known_chroms: + warnings.warn( + f"Chromosome {chrom!r} has no data in this database. " + f"Available: {sorted(self._all_known_chroms)!r}", + AfqueryWarning, stacklevel=3, + ) + return [] + + sample_bm = self._build_sample_bitmap(params.filter) + parquet_path = self._parquet_path(chrom, pos) + if parquet_path is None: + return [] + + con = duckdb.connect(config={"temp_directory": "/tmp"}) + rows = con.execute( + "SELECT pos, ref, alt, het_bitmap, hom_bitmap, fail_bitmap" + " FROM read_parquet(?) WHERE pos = ?", + [str(parquet_path), pos], + ).fetchall() + con.close() + + if not rows: + return [] + + if params.ref is not None: + rows = [r for r in rows if r[1] == params.ref] + if params.alt is not None: + rows = [r for r in rows if r[2] == params.alt] + if not rows: + return [] + + # Warn if multiple alleles found without explicit ref/alt filter + if len(rows) > 1 and params.ref is None and params.alt is None: + alts = ", ".join(r[2] for r in rows) + warnings.warn( + f"Multiple alleles found at {chrom}:{pos} ({alts}). " + "Use --ref/--alt to restrict to a specific variant.", + AfqueryWarning, stacklevel=3, + ) + + carriers: list[SampleCarrier] = [] + for row in rows: + row_pos, ref, alt, het_bytes, hom_bytes, fail_bytes = row + het_bm = deserialize(bytes(het_bytes)) + hom_bm = deserialize(bytes(hom_bytes)) + fail_bm = deserialize(bytes(fail_bytes)) + het_elig = het_bm & sample_bm + hom_elig = hom_bm & sample_bm + fail_elig = fail_bm & sample_bm + + seen: set[int] = set() + for sid in sorted(hom_elig): + seen.add(sid) + is_fail = sid in fail_elig + carriers.append(self._make_carrier(sid, "alt" if is_fail else "hom", not is_fail)) + for sid in sorted(het_elig): + if sid not in seen: + seen.add(sid) + is_fail = sid in fail_elig + carriers.append(self._make_carrier(sid, "alt" if is_fail else "het", not is_fail)) + for sid in sorted(fail_elig): + if sid not in seen: + seen.add(sid) + carriers.append(self._make_carrier(sid, "alt", False)) + + carriers.sort(key=lambda c: c.sample_id) + return carriers + + def _make_carrier( + self, + sample_id: int, + genotype: str, + filter_pass: bool, + ) -> SampleCarrier: + s = self._samples_by_id[sample_id] + tech_name = self._tech_map[s.tech_id].tech_name + phenotypes = self._sample_phenotypes.get(sample_id, []) + return SampleCarrier( + sample_id=sample_id, + sample_name=s.sample_name, + sex=s.sex, + tech_name=tech_name, + phenotypes=sorted(phenotypes), + genotype=genotype, + filter_pass=filter_pass, + ) diff --git a/src/afquery/variant_info.py b/src/afquery/variant_info.py new file mode 100644 index 0000000..b07fd0a --- /dev/null +++ b/src/afquery/variant_info.py @@ -0,0 +1,49 @@ +"""variant_info — retrieve sample carriers for a specific variant. + +Thin convenience wrapper around :class:`~afquery.database.Database` for +programmatic use without instantiating the class directly. +""" +from .database import Database +from .models import SampleCarrier + + +def variant_info( + db_path: str, + chrom: str, + pos: int, + ref: str | None = None, + alt: str | None = None, + phenotype: list[str] | None = None, + sex: str = "both", + tech: list[str] | None = None, +) -> list[SampleCarrier]: + """Return all samples carrying the given variant, with their metadata. + + Args: + db_path: Path to the AFQuery database directory. + chrom: Chromosome (e.g. ``"chr1"`` or ``"1"``). + pos: Position, 1-based. + ref: Reference allele filter. If omitted, all alleles at *pos* are returned. + alt: Alternate allele filter. If omitted, all alleles at *pos* are returned. + phenotype: Phenotype filter tokens. Use ``"^CODE"`` prefix to exclude. + sex: ``"both"`` (default), ``"male"``, or ``"female"``. + tech: Technology filter tokens. Use ``"^TECH"`` prefix to exclude. + + Returns: + List of :class:`~afquery.models.SampleCarrier` sorted by + ``sample_id``. Empty list if the variant is absent or no eligible + carrier exists. + """ + db = Database(db_path) + return db.variant_info( + chrom=chrom, + pos=pos, + ref=ref, + alt=alt, + phenotype=phenotype, + sex=sex, + tech=tech, + ) + + +__all__ = ["variant_info"] diff --git a/tests/test_dump.py b/tests/test_dump.py index 58a9f67..fc8ac78 100644 --- a/tests/test_dump.py +++ b/tests/test_dump.py @@ -142,7 +142,7 @@ def test_csv_header_base_columns(self, test_db): assert "N_HET" in rows[0] assert "N_HOM_ALT" in rows[0] assert "N_HOM_REF" in rows[0] - assert "N_FAIL" in rows[0] # v2 DB + assert "N_FAIL" in rows[0] def test_ac_positive_filter(self, test_db): """By default all exported rows must have AC > 0.""" @@ -360,7 +360,7 @@ def test_by_sex_and_phenotype_cartesian_labels(self, test_db): assert "AC_female_E11.9" in rows[0] def test_n_fail_in_group_columns(self, test_db): - """v2 DB: N_FAIL_