diff --git a/docs/advanced/debugging-results.md b/docs/advanced/debugging-results.md index 256439b..23f2a04 100644 --- a/docs/advanced/debugging-results.md +++ b/docs/advanced/debugging-results.md @@ -34,13 +34,13 @@ A variant you expect to find is not in the database: | Check | Details | |-------|---------| | Was it in the source VCFs? | AFQuery only stores variants present in ingested VCFs | -| Was it FILTER=PASS? | Default ingestion skips non-PASS variants. Check with `afquery info` for `pass_only_filter` | +| Was the alt allele ever called? | A variant is stored only if at least one sample carries the alt allele or has a failed call there. A site seen only as `0/0` in every sample is genuinely absent. Note that non-PASS calls *are* kept (in `fail_bitmap`) — such a variant is not missing, it just shows `AC=0`. | | Multiallelic sites | AFQuery stores each ALT separately. Query the specific ALT allele, not just position | | Chromosome naming | Ensure consistent `chr` prefix usage | ### 4. Unexpected N_FAIL > 0 -`N_FAIL > 0` means some eligible samples had the alt allele called but with `FILTER≠PASS`. These samples are excluded from AC/AN. This is usually benign (1–2 samples), but a high N_FAIL warrants investigation: +`N_FAIL > 0` means some eligible samples had a call with `FILTER≠PASS` at this position. These samples are excluded from AC, but they remain eligible and still count in AN. This is usually benign (1–2 samples), but a high N_FAIL warrants investigation: | N_FAIL relative to n_eligible | Likely cause | |---|---| diff --git a/docs/advanced/filter-pass-tracking.md b/docs/advanced/filter-pass-tracking.md index cc18c61..11d03ca 100644 --- a/docs/advanced/filter-pass-tracking.md +++ b/docs/advanced/filter-pass-tracking.md @@ -11,9 +11,10 @@ In VCF format, the FILTER column indicates whether a variant call passed quality - `PASS` or `.` (missing) — the variant passed all filters - Any other value (e.g., `LowQual`, `VQSRTrancheSNP99.90to100.00`) — the variant failed one or more filters -AFQuery default behavior: +AFQuery's default behavior: -- **PASS-only**: only `FILTER=PASS` variants are counted in AC/AN. This is always enforced. +- Only `FILTER=PASS` calls contribute to **AC** (and therefore to **AF**). This is always enforced. +- **AN** is not affected by the filter — it counts every eligible sample at the position, failed calls included. --- @@ -21,19 +22,19 @@ AFQuery default behavior: AFQuery stores a third bitmap per variant alongside `het_bitmap` and `hom_bitmap`: -- **`fail_bitmap`** — bit set for each sample that has a non-ref genotype (AC>0) AND `FILTER≠PASS` +- **`fail_bitmap`** — bit set for each sample whose call at this site has `FILTER≠PASS`. In practice this is two cases: a sample that carries the alt allele but whose call failed a filter, and a sample with a missing genotype (`./.`) at a site that itself failed a filter. -This means: +What this means for a sample in `fail_bitmap`: -- A sample in `fail_bitmap` was genotyped with the alt allele but the call failed QC -- Such samples are **not** counted in AC/AN (they don't affect AF) -- Their count is exposed as `N_FAIL` +- Its alt alleles are **not** counted in AC, so it does not raise AF. +- It is still an eligible sample, so it **is** counted in AN. The failed call lowers AF by sitting in the denominator without contributing to the numerator. +- Its count is exposed separately as `N_FAIL`. --- ## Database Creation -The `fail_bitmap` is always written and always tracks PASS-only ingestion: +The `fail_bitmap` is always written, regardless of build options: ```bash afquery create-db --manifest manifest.tsv --output-dir ./db/ --genome-build GRCh38 @@ -52,10 +53,10 @@ afquery query --db ./db/ --locus chr1:925952 ``` ``` -chr1:925952 G>A AC=142 AN=2742 AF=0.0518 n_eligible=1371 N_HET=138 N_HOM_ALT=2 N_HOM_REF=1231 N_FAIL=7 +chr1:925952 G>A AC=142 AN=2742 AF=0.0518 n_eligible=1371 N_HET=138 N_HOM_ALT=2 N_HOM_REF=1224 N_FAIL=7 ``` -`N_FAIL=7` means 7 eligible samples had the alt allele called but with FILTER≠PASS. +`N_FAIL=7` means 7 eligible samples had a call with FILTER≠PASS at this site. They are part of `n_eligible` (and of AN), so the genotype counts still add up: 138 + 2 + 1224 + 7 = 1371. ### Python API @@ -83,21 +84,23 @@ Each carrier row shows its `filter` column as `PASS` or `FAIL`, along with sampl ## VCF Annotation -AFQuery adds an additional INFO field to annotated VCFs: +Among the INFO fields `afquery annotate` writes, one reports the failed-call count: | Field | Type | Description | |-------|------|-------------| -| `AFQUERY_N_FAIL` | Integer | Eligible samples with FILTER≠PASS at this variant | +| `AFQUERY_N_FAIL` | Integer | Eligible samples with `FILTER≠PASS` at this variant | ```bash afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf ``` +See [Understanding Output](../getting-started/understanding-output.md#vcf-annotation-fields) for the full set of `AFQUERY_*` fields. + --- ## PASS-Only Enforcement -AF reflects the quality-filtered allele frequency — the frequency of the alt allele among high-quality calls. This is appropriate for most clinical and research use cases. PASS-only ingestion is always enforced. +PASS-only counting is always enforced and cannot be turned off. It applies to the numerator only: a failed call never adds an alt allele to AC, but the sample stays in the eligible set and so remains in AN. The result is a deliberately conservative AF — a failed carrier weighs on the denominator without lifting the numerator, which is the safe direction for clinical and research use. --- diff --git a/docs/advanced/multi-cohort-strategies.md b/docs/advanced/multi-cohort-strategies.md index b87578a..6ea03fd 100644 --- a/docs/advanced/multi-cohort-strategies.md +++ b/docs/advanced/multi-cohort-strategies.md @@ -61,7 +61,7 @@ All samples in one database, with phenotype codes distinguishing cohorts. ### Manifest Design ```tsv -sample_id vcf_path sex technology phenotype_codes +sample_name vcf_path sex tech_name phenotype_codes CARD_001 /data/card/001.vcf.gz female wgs cardiology,EUR,control CARD_002 /data/card/002.vcf.gz male wgs cardiology,EUR,case_HCM NEURO_001 /data/neuro/001.vcf.gz female wes neurology,AFR,case_epilepsy @@ -99,7 +99,7 @@ afquery query --db ./merged/ --locus chr1:12345678 --ref C --alt T \ |-----------|-------------| | Single database to maintain | Rebuilding requires all VCFs accessible | | Cross-cohort queries via phenotype filters | Phenotype code design must be planned upfront | -| One annotation pass covers all cohorts | Adding a new cohort requires `afquery update` | +| One annotation pass covers all cohorts | Adding a new cohort requires `afquery update-db` | | Flexible ad-hoc stratification | Larger database, longer rebuild time | --- @@ -134,7 +134,11 @@ Maintain both per-cohort and merged databases. afquery annotate --db /databases/institutional/cardiology/ \ --input patient.vcf.gz --output step1.vcf.gz -# Annotate against shared controls (use a different INFO prefix via Python API) +# Annotate against shared controls in a second pass +# Note: the AFQUERY_* INFO fields are overwritten on each pass — if you need +# both sets of frequencies, extract the values from step1.vcf.gz first. +afquery annotate --db /databases/shared/combined_controls/ \ + --input step1.vcf.gz --output step2.vcf.gz ``` --- diff --git a/docs/advanced/performance.md b/docs/advanced/performance.md index 43657cf..ace0553 100644 --- a/docs/advanced/performance.md +++ b/docs/advanced/performance.md @@ -50,7 +50,7 @@ Expected scaling (50K samples, 2,500 buckets, GRCh38): | 32 | ~18 min | ~24× | | 52 | ~13 min | ~38× | -Scaling is near-linear up to ~32 cores; beyond that, I/O and SQLite contention limit further gains. +Scaling is near-linear up to ~32 cores; beyond that, disk I/O contention limits further gains. Total RAM required: `build_threads × build_memory` @@ -80,7 +80,7 @@ afquery create-db ... --threads 32 --build-threads 16 --build-memory 4GB graph TD A["Query Request
chr1:925952"] B["Open DuckDB
connection"] - C["Locate Parquet
bucket_0/data.parquet"] + C["Locate Parquet
bucket_0.parquet"] D["Read rows
matching pos"] E["Deserialize
bitmaps"] F["Bitmap AND
with eligible
sample set"] diff --git a/docs/advanced/ploidy-and-sex-chroms.md b/docs/advanced/ploidy-and-sex-chroms.md index 030693a..987b991 100644 --- a/docs/advanced/ploidy-and-sex-chroms.md +++ b/docs/advanced/ploidy-and-sex-chroms.md @@ -103,12 +103,12 @@ afquery query --db ./db/ --locus chrM:3243 For every query result, the following identity holds: -**N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL = n_eligible** +**N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL + N_NO_COVERAGE = n_eligible** -This can be used to validate results. N_HOM_REF is the number of eligible samples that are homozygous reference (i.e., do not carry the alt allele and passed quality filters). +This can be used to validate results. N_HOM_REF is the number of eligible samples that are homozygous reference (i.e., do not carry the alt allele and passed quality filters). N_NO_COVERAGE is 0 unless a coverage-evidence filter is active — see [Coverage Evidence](coverage-evidence.md). !!! note "Mutual exclusivity" - N_HET, N_HOM_ALT, N_HOM_REF, and N_FAIL are mutually exclusive. A sample with a non-ref allele but FILTER≠PASS is counted in N_FAIL only — it does not appear in N_HET or N_HOM_ALT. Likewise, N_HOM_REF counts only PASS-filtered samples. + N_HET, N_HOM_ALT, N_HOM_REF, N_FAIL, and N_NO_COVERAGE are mutually exclusive. A sample with a non-ref allele but FILTER≠PASS is counted in N_FAIL only — it does not appear in N_HET or N_HOM_ALT. Likewise, N_HOM_REF counts only PASS-filtered samples. ### chrX non-PAR diff --git a/docs/faq.md b/docs/faq.md index 33fd8ed..1d7acff 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -149,7 +149,7 @@ On chrX non-PAR positions, males contribute AN=1 and females contribute AN=2. Wh ## What does N_FAIL mean in query output? -`N_FAIL` is the count of eligible samples that had the alt allele called but with `FILTER≠PASS`. Shown as `N_FAIL=N` in text output. These samples are not counted in AC/AN. See [FILTER=PASS Tracking](advanced/filter-pass-tracking.md). +`N_FAIL` is the count of eligible samples whose call had `FILTER≠PASS` at the position. Shown as `N_FAIL=N` in text output. These samples are excluded from AC, but they stay eligible and still count toward AN. See [FILTER=PASS Tracking](advanced/filter-pass-tracking.md). --- diff --git a/docs/getting-started/concepts.md b/docs/getting-started/concepts.md index 41a324b..fb1d5be 100644 --- a/docs/getting-started/concepts.md +++ b/docs/getting-started/concepts.md @@ -126,13 +126,13 @@ graph TB ### Per-Variant Bitmaps -For each variant row, AFQuery stores three [Roaring Bitmaps](https://roaringbitmap.org/): +For each variant row, AFQuery stores per-sample genotype information as [Roaring Bitmaps](https://roaringbitmap.org/). Three of them drive every query: -- **`het_bitmap`** — bit set for each sample that is heterozygous (GT=0/1 or 1/0) -- **`hom_bitmap`** — bit set for each sample that is homozygous alt (GT=1/1 or GT=1) -- **`fail_bitmap`** bit set for each sample called with FILTER≠PASS +- **`het_bitmap`** — sample is heterozygous (`GT=0/1`) with `FILTER=PASS` +- **`hom_bitmap`** — sample is homozygous alt (`GT=1/1`, or `GT=1` on haploid regions) with `FILTER=PASS` +- **`fail_bitmap`** — the sample's call at this site has `FILTER≠PASS` -Each sample has a stable integer ID (0-indexed). The bit position in the bitmap equals the sample ID. +Each sample has a stable integer ID (0-indexed). The bit position in the bitmap equals the sample ID. Databases built with coverage-quality filters carry two additional bitmaps — see [Data Model](../reference/data-model.md#parquet-schema). ### Parquet Storage @@ -141,8 +141,8 @@ Bitmaps are serialized and stored in Parquet files, partitioned by chromosome an ``` variants/ chr1/ - bucket_0/ ← positions 0–999,999 - bucket_1/ ← positions 1,000,000–1,999,999 + bucket_0.parquet ← positions 0–999,999 + bucket_1.parquet ← positions 1,000,000–1,999,999 ... chr2/ ... diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index cb9b1bd..1a3ace0 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -135,6 +135,7 @@ The output VCF gains INFO fields (see [Annotate a VCF](../guides/annotate-vcf.md | `AFQUERY_N_HOM_ALT` | A (per ALT) | Homozygous alt sample count | | `AFQUERY_N_HOM_REF` | A (per ALT) | Homozygous ref sample count | | `AFQUERY_N_FAIL` | 1 (per site) | Samples with FILTER≠PASS | +| `AFQUERY_N_NO_COVERAGE` | A (per ALT) | Eligible samples lacking coverage evidence (0 unless a coverage-evidence filter is active) | --- diff --git a/docs/getting-started/tutorial.md b/docs/getting-started/tutorial.md index 92f6431..df85551 100644 --- a/docs/getting-started/tutorial.md +++ b/docs/getting-started/tutorial.md @@ -86,7 +86,7 @@ Every query result includes the following fields: | **AC** | Allele count — alt allele copies in eligible samples | | **AN** | Allele number — total alleles examined (adjusted for coverage and ploidy) | | **AF** | Allele frequency — `AC / AN` | -| **n_eligible** | Samples in the filtered set (before coverage or ploidy adjustments). Here, 10 = no filter applied. | +| **n_eligible** | Eligible samples at this position — those passing the metadata filters *and* covered here. Ploidy is applied afterwards, to AN. Here, 10 = no filter applied and every sample is covered. | | **N_HET** | Eligible samples heterozygous for the alt allele | | **N_HOM_ALT** | Eligible samples homozygous for the alt allele | | **N_HOM_REF** | Eligible samples homozygous reference | @@ -94,7 +94,7 @@ Every query result includes the following fields: For `chr1:925952`: AN=20 = 10 diploid samples × 2 alleles; AC=6 = 4 het samples (1 copy each) + 1 hom-alt (2 copies). -**Accounting identity:** `n_eligible = N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL` always holds (10 = 4+1+5+0 ✓). +**Accounting identity:** `n_eligible = N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL + N_NO_COVERAGE` always holds. `N_NO_COVERAGE` is 0 here (it only becomes non-zero with a coverage-evidence filter), so 10 = 4+1+5+0+0 ✓. ### N_FAIL in practice diff --git a/docs/getting-started/understanding-output.md b/docs/getting-started/understanding-output.md index ecf6c7f..e1528b9 100644 --- a/docs/getting-started/understanding-output.md +++ b/docs/getting-started/understanding-output.md @@ -14,8 +14,8 @@ This page explains what each field in AFQuery output means and how to interpret | **N_HET** | int | Number of eligible samples heterozygous for the alt allele (GT=0/1) | | **N_HOM_ALT** | int | Number of eligible samples homozygous for the alt allele (GT=1/1 or GT=1). Includes haploid carriers on sex chromosomes and chrM. See [Ploidy](../advanced/ploidy-and-sex-chroms.md#genotype-counting). | | **N_HOM_REF** | int | Number of eligible samples homozygous reference (GT=0/0 or GT=0) | -| **n_eligible** | int | Number of samples in the eligible set (after sex/phenotype/tech filters) | -| **N_FAIL** | int | Number of eligible samples with a non-ref allele called but FILTER≠PASS at this position. These samples are counted *only* in N_FAIL — not in N_HET, N_HOM_ALT, or N_HOM_REF. | +| **n_eligible** | int | Number of eligible samples — those passing the sex/phenotype/tech filters *and* covered at this position | +| **N_FAIL** | int | Number of eligible samples whose call at this position had FILTER≠PASS. These samples are counted *only* in N_FAIL — not in N_HET, N_HOM_ALT, or N_HOM_REF — but they stay eligible and still count toward AN. | | **N_NO_COVERAGE** | int | Number of eligible samples whose tech lacks coverage evidence at this position. Excluded from `N_HOM_REF` to keep AC/AN conservative. Always `0` unless a coverage-evidence filter is active. See [Coverage Evidence](../advanced/coverage-evidence.md). | diff --git a/docs/guides/annotate-vcf.md b/docs/guides/annotate-vcf.md index 758916f..01b6df5 100644 --- a/docs/guides/annotate-vcf.md +++ b/docs/guides/annotate-vcf.md @@ -26,7 +26,7 @@ afquery annotate \ | `AFQUERY_N_HET` | Integer | A (per ALT) | Heterozygous sample count | | `AFQUERY_N_HOM_ALT` | Integer | A (per ALT) | Homozygous alt sample count | | `AFQUERY_N_HOM_REF` | Integer | A (per ALT) | Homozygous ref sample count | -| `AFQUERY_N_FAIL` | Integer | 1 (per site) | Samples with FILTER≠PASS and alt allele called. Mutually exclusive with N_HET/N_HOM_ALT/N_HOM_REF. | +| `AFQUERY_N_FAIL` | Integer | 1 (per site) | Eligible samples whose call had FILTER≠PASS. Excluded from AC, but still counted in AN. Mutually exclusive with N_HET/N_HOM_ALT/N_HOM_REF. | | `AFQUERY_N_NO_COVERAGE` | Integer | A (per ALT) | Eligible samples whose tech lacks coverage evidence at this position. Excluded from `N_HOM_REF` to keep AC/AN conservative. Always `0` unless a coverage-evidence filter is active. See [Coverage Evidence](../advanced/coverage-evidence.md). | !!! note "Multi-allelic sites" diff --git a/docs/guides/create-database.md b/docs/guides/create-database.md index 9e2c3c4..6e48a08 100644 --- a/docs/guides/create-database.md +++ b/docs/guides/create-database.md @@ -27,8 +27,8 @@ afquery create-db \ ## What Happens -1. **Ingest phase** — Each VCF is parsed with cyvcf2. Genotypes and INFO fields are written to a SQLite temporary database, one row per variant per sample. -2. **Build phase** — DuckDB reads the SQLite data, aggregates per 1-Mbp bucket, and writes Roaring Bitmap Parquet files partitioned by chromosome/bucket. +1. **Ingest phase** — Each VCF is parsed with cyvcf2. Genotypes and quality fields are written to a temporary per-sample Parquet file, one row per variant per sample. +2. **Build phase** — DuckDB reads the temporary Parquet files, aggregates per 1-Mbp bucket, and writes Roaring Bitmap Parquet files partitioned by chromosome and bucket. 3. **Finalize** — `manifest.json` and `metadata.sqlite` are written to the output directory. --- @@ -39,12 +39,10 @@ afquery create-db \ ./db/ ├── manifest.json # Build configuration (genome build, schema version, etc.) ├── metadata.sqlite # Sample/phenotype/technology/changelog metadata -├── variants/ # Hive-partitioned Parquet files +├── variants/ # Parquet files partitioned by chromosome and bucket │ ├── chr1/ -│ │ ├── bucket_0/ # Positions 0–999,999 -│ │ │ └── data.parquet -│ │ ├── bucket_1/ # Positions 1,000,000–1,999,999 -│ │ │ └── data.parquet +│ │ ├── bucket_0.parquet # Positions 0–999,999 +│ │ ├── bucket_1.parquet # Positions 1,000,000–1,999,999 │ │ └── ... │ ├── chr2/ │ └── ... @@ -95,7 +93,7 @@ afquery create-db --manifest manifest.tsv --output-dir ./db/ --genome-build GRCh ## FILTER=PASS Behavior -By default, only variants with `FILTER=PASS` (or no FILTER field) are counted in AC/AN. Variants that fail filters are tracked in `fail_bitmap`. PASS-only ingestion is always enforced — there is currently no CLI option to change this behaviour. +Only `FILTER=PASS` calls (or calls with no FILTER field) contribute to AC, and therefore to AF. AN is not affected — it counts every eligible sample, failed calls included. Calls that fail a filter are tracked in `fail_bitmap` and surfaced as `N_FAIL`. PASS-only counting is always enforced — there is currently no CLI option to change this behaviour. See [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md) for details. diff --git a/docs/guides/sample-filtering.md b/docs/guides/sample-filtering.md index e7657b2..10c32cf 100644 --- a/docs/guides/sample-filtering.md +++ b/docs/guides/sample-filtering.md @@ -1,6 +1,6 @@ # Sample Filtering -AFQuery supports flexible metadata-based selection of samples for AF computation. Filters are available on all query commands (`query`, `annotate`, `dump`). +AFQuery supports flexible metadata-based selection of samples for AF computation. Filters are available on all query commands (`query`, `annotate`, `dump`, `variant-info`). **Phenotype codes are arbitrary string labels** — you define them in your manifest. They can be ICD-10 codes, HPO terms, project tags (`control`, `pilot`), or any strings meaningful to your cohort. The filtering system does not interpret the codes — it matches them exactly as stored. diff --git a/docs/guides/update-database.md b/docs/guides/update-database.md index 7df1dfb..b0b2717 100644 --- a/docs/guides/update-database.md +++ b/docs/guides/update-database.md @@ -1,6 +1,6 @@ # Update a Database -`afquery update-db` supports three operations: adding new samples, removing existing samples, and compacting the database to reclaim space. +`afquery update-db` adds new samples, removes existing samples, updates sample metadata, and compacts the database to reclaim space. ### Update Timeline diff --git a/docs/index.md b/docs/index.md index a634a4d..1ab7a88 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,7 @@ AFQuery pre-indexes genotypes as [Roaring Bitmaps](https://roaringbitmap.org/) i - **Ploidy-aware** — correct AN on chrX PAR/non-PAR, chrY, and chrM. - **Technology-aware AN** — per-position capture BED intersection across WGS, WES kits, and panels. - **Carrier lookup** — list samples carrying any variant with full metadata (sex, tech, phenotypes, genotype, FILTER status). -- **VCF annotation** — add `AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL` INFO fields from any sample subset. +- **VCF annotation** — add `AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL/N_NO_COVERAGE` INFO fields from any sample subset. - **Audit changelog** — every database operation is recorded for reproducibility. --- @@ -43,7 +43,7 @@ graph TD A["🔍 Input VCFs
single-sample"] B["📥 Ingest
cyvcf2 reads VCFs →
per-sample Parquet files emitted"] C["🏗️ Build
DuckDB aggregates per bucket →
Roaring Bitmaps → Parquet"] - D["💾 Database on Disk
variants/chr*/bucket_*/
capture/*.pkl
metadata.sqlite
manifest.json"] + D["💾 Database on Disk
variants/chr*/bucket_*.parquet
capture/*.pkl
metadata.sqlite
manifest.json"] E["⚡ Query Engine
Load bitmap → filter samples →
compute AC/AN/AF
~10-100ms"] F["📊 Annotate"] G["🔄 Update"] diff --git a/docs/reference/data-model.md b/docs/reference/data-model.md index cb8c6f1..de1ce99 100644 --- a/docs/reference/data-model.md +++ b/docs/reference/data-model.md @@ -10,12 +10,10 @@ This page documents the on-disk layout of an AFQuery database, including file fo / ├── manifest.json # Build configuration ├── metadata.sqlite # Sample/phenotype/technology/changelog metadata -├── variants/ # Hive-partitioned Parquet variant data +├── variants/ # Parquet variant data, partitioned by chromosome and bucket │ ├── chr1/ -│ │ ├── bucket_0/ # Positions 0–999,999 -│ │ │ └── data.parquet -│ │ ├── bucket_1/ # Positions 1,000,000–1,999,999 -│ │ │ └── data.parquet +│ │ ├── bucket_0.parquet # Positions 0–999,999 +│ │ ├── bucket_1.parquet # Positions 1,000,000–1,999,999 │ │ └── ... │ ├── chr2/ │ └── ... @@ -33,11 +31,13 @@ Stores the build configuration used during `create-db`. | Field | Type | Description | |-------|------|-------------| | `genome_build` | string | `"GRCh37"` or `"GRCh38"` | -| `schema_version` | string | `"2.0"` | -| `pass_only_filter` | bool | Whether FILTER=PASS was enforced during ingest | +| `version` | string | manifest.json format version | | `db_version` | string | User-specified version label | +| `sample_count` | int | Number of samples at build time | +| `schema_version` | string | `"2.0"`, or `"3.0"` when the database was built with coverage-quality filters | +| `pass_only_filter` | bool | Always `true` — PASS-only counting is always enforced | +| `coverage_filter` | object | The `min_dp` / `min_gq` / `min_qual` / `min_covered` thresholds used at build time (all `0` when unused) | | `created_at` | string | ISO 8601 timestamp | -| `manifest_path` | string | Path to original manifest TSV | --- @@ -102,12 +102,16 @@ Each bucket Parquet file has this schema: | `pos` | `uint32` | 1-based genomic position | | `ref` | `large_utf8` | Reference allele | | `alt` | `large_utf8` | Alternate allele | -| `het_bitmap` | `large_binary` | Serialized Roaring Bitmap of heterozygous sample IDs | -| `hom_bitmap` | `large_binary` | Serialized Roaring Bitmap of homozygous alt sample IDs | -| `fail_bitmap` | `large_binary` | Serialized Roaring Bitmap of FILTER≠PASS sample IDs | +| `het_bitmap` | `large_binary` | Roaring Bitmap of heterozygous, `FILTER=PASS` sample IDs | +| `hom_bitmap` | `large_binary` | Roaring Bitmap of homozygous-alt, `FILTER=PASS` sample IDs | +| `fail_bitmap` | `large_binary` | Roaring Bitmap of sample IDs whose call has `FILTER≠PASS` at this site | +| `filtered_bitmap` | `large_binary` | Roaring Bitmap of WES non-carriers with uncertain coverage | +| `quality_pass_bitmap` | `large_binary` | Roaring Bitmap of carriers meeting the DP/GQ/QUAL thresholds | Rows are sorted by `(pos, alt)` within each bucket. +`filtered_bitmap` and `quality_pass_bitmap` back the [coverage-evidence](../advanced/coverage-evidence.md) feature. The columns are always present, but they are empty unless the database was built with coverage-quality filters (schema version `3.0`). + !!! important "large_utf8 / large_binary" AFQuery uses `large_utf8` and `large_binary` (64-bit offsets) rather than `utf8` / `binary` (32-bit). This is required for compatibility with DuckDB's Parquet reader on large chromosomes. @@ -118,9 +122,10 @@ Rows are sorted by `(pos, alt)` within each bucket. Bitmaps use the [Roaring Bitmap](https://roaringbitmap.org/) format, serialized by pyroaring's portable serialization. - Bit position = sample ID (0-indexed integer) -- `het_bitmap`: bit set iff sample is heterozygous at this variant -- `hom_bitmap`: bit set iff sample is homozygous alt at this variant -- `fail_bitmap`: bit set iff sample has genotype AC>0 AND FILTER≠PASS +- `het_bitmap`: bit set iff the sample is heterozygous with `FILTER=PASS` +- `hom_bitmap`: bit set iff the sample is homozygous alt with `FILTER=PASS` +- `fail_bitmap`: bit set iff the sample's call has `FILTER≠PASS` — either a carrier whose call failed a filter, or a missing genotype (`./.`) at a failed site +- `filtered_bitmap`, `quality_pass_bitmap`: see the Parquet schema above; both are empty unless the database was built with coverage-quality filters To deserialize in Python: ```python @@ -152,11 +157,11 @@ bucket_id = pos // 1_000_000 ``` !!! warning "DuckDB integer arithmetic" - When computing bucket IDs in DuckDB SQL, always use: + When computing bucket IDs in DuckDB SQL, always use the integer-division operator: ```sql - CAST(pos AS BIGINT) / 1000000 + CAST(pos AS BIGINT) // 1000000 ``` - Not `CAST(pos / 1000000 AS BIGINT)` — DuckDB performs float division first and rounds, producing wrong bucket IDs. + Not `CAST(pos / 1000000 AS BIGINT)` — the `/` operator does float division, and casting the rounded result afterwards produces wrong bucket IDs. --- diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md index e9f8547..0db3816 100644 --- a/docs/reference/glossary.md +++ b/docs/reference/glossary.md @@ -34,7 +34,7 @@ A sample that passes all query filters (sex, phenotype, technology) and has cove ## N_FAIL -The count of eligible samples at a position where the genotype call had FILTER≠PASS in the source VCF. These samples are not counted in AC/AN. See [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md). +The count of eligible samples whose call at a position had FILTER≠PASS in the source VCF. These samples are excluded from AC, but they remain eligible and so still count toward AN. See [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md). ## N_NO_COVERAGE @@ -66,7 +66,7 @@ A compressed bitset data structure that efficiently stores sets of integers. AFQ ## Schema Version -The AFQuery database format version stored in `manifest.json`. Each variant stores `het_bitmap`, `hom_bitmap`, and `fail_bitmap`. See [Data Model](data-model.md). +The AFQuery database format version stored in `manifest.json` — `2.0`, or `3.0` when the database was built with coverage-quality filters. See [Data Model](data-model.md). ## Technology diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md index 114ce4b..3937e12 100644 --- a/docs/reference/python-api.md +++ b/docs/reference/python-api.md @@ -453,7 +453,7 @@ class QueryResult: N_HET: int # Heterozygous count N_HOM_ALT: int # Homozygous alt count N_HOM_REF: int # Homozygous ref count - N_FAIL: int # Samples with alt allele called but FILTER≠PASS + N_FAIL: int # Eligible samples whose call had FILTER≠PASS (excluded from AC, kept in AN) N_NO_COVERAGE: int # Eligible samples whose tech lacks evidence (excluded from N_HOM_REF) ``` diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index af9d921..6eee06f 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -77,7 +77,7 @@ AFQuery requires DuckDB to use Parquet for all temporary files. Arrow IPC is not **Cause:** A DuckDB float-division rounding bug was present in older AFQuery versions. When computing bucket IDs, `CAST(pos / 1000000 AS BIGINT)` rounds floats incorrectly (e.g., position 1,500,000 → bucket 2 instead of 1). -**Fix:** Upgrade to the latest AFQuery version. The fix uses `CAST(pos AS BIGINT) / 1000000` for correct integer division. Rebuild the database after upgrading. +**Fix:** Upgrade to the latest AFQuery version. The fix uses `CAST(pos AS BIGINT) // 1000000` — the integer-division operator — for correct bucket IDs. Rebuild the database after upgrading. --- diff --git a/docs/use-cases/acmg-use-cases.md b/docs/use-cases/acmg-use-cases.md index 66132a9..5ebb3b7 100644 --- a/docs/use-cases/acmg-use-cases.md +++ b/docs/use-cases/acmg-use-cases.md @@ -207,7 +207,7 @@ On chrX non-PAR regions, males contribute AN=1 and females AN=2. A cohort with 8 ### High N_FAIL at the Variant Site -`N_FAIL` counts eligible samples that were genotyped with the alt allele but failed quality filters (`FILTER≠PASS`). These samples are excluded from AC/AN, so AF is not directly inflated. However, a high N_FAIL indicates the site has systematic QC problems — and warrants caution before applying ACMG criteria. +`N_FAIL` counts eligible samples whose call at the site failed quality filters (`FILTER≠PASS`). These samples are excluded from AC but stay in AN, so they cannot inflate AF — if anything they make it slightly more conservative. A high N_FAIL still indicates the site has systematic QC problems, and warrants caution before applying ACMG criteria. | N_FAIL / n_eligible | Interpretation | |---|---| diff --git a/docs/use-cases/sex-specific-af.md b/docs/use-cases/sex-specific-af.md index b8a6dfd..b649303 100644 --- a/docs/use-cases/sex-specific-af.md +++ b/docs/use-cases/sex-specific-af.md @@ -34,10 +34,10 @@ afquery query --db ./db/ --locus chrX:153296777 ``` ``` -chrX:153296777 C>T AC=15 AN=2500 AF=0.0060 n_eligible=1500 N_HET=5 N_HOM_ALT=0 N_HOM_REF=1495 N_FAIL=0 +chrX:153296777 C>T AC=15 AN=2500 AF=0.0060 n_eligible=1500 N_HET=5 N_HOM_ALT=10 N_HOM_REF=1485 N_FAIL=0 ``` -AN=2500: 500 males × 1 + 1000 females × 2 = 2500 (mixed ploidy) +AN=2500: 500 males × 1 + 1000 females × 2 = 2500 (mixed ploidy). The 10 hemizygous male carriers land in N_HOM_ALT, not N_HET — on haploid positions there is no heterozygous state. ### 2. Query males only (hemizygous frequency) @@ -46,10 +46,10 @@ afquery query --db ./db/ --locus chrX:153296777 --sex male ``` ``` -chrX:153296777 C>T AC=10 AN=500 AF=0.0200 n_eligible=500 N_HET=10 N_HOM_ALT=0 N_HOM_REF=490 N_FAIL=0 +chrX:153296777 C>T AC=10 AN=500 AF=0.0200 n_eligible=500 N_HET=0 N_HOM_ALT=10 N_HOM_REF=490 N_FAIL=0 ``` -Hemizygous rate: 2% (N_HET here represents hemizygous males, GT=1) +Hemizygous rate: 2%. The carriers appear in N_HOM_ALT, not N_HET — males are haploid at chrX non-PAR, so `GT=1` carriers are counted as homozygous-alt. ### 3. Query females only (carrier frequency) diff --git a/src/afquery/cli.py b/src/afquery/cli.py index 0beea90..dda602e 100644 --- a/src/afquery/cli.py +++ b/src/afquery/cli.py @@ -74,26 +74,23 @@ def _print_results(results, fmt: str) -> None: "ref": r.variant.ref, "alt": r.variant.alt, "AC": r.AC, "AN": r.AN, "AF": r.AF, "n_eligible": r.n_samples_eligible, "N_HET": r.N_HET, "N_HOM_ALT": r.N_HOM_ALT, "N_HOM_REF": r.N_HOM_REF, + "N_FAIL": r.N_FAIL, "N_NO_COVERAGE": r.N_NO_COVERAGE, } - if r.N_FAIL is not None: - entry["N_FAIL"] = r.N_FAIL out.append(entry) click.echo(json.dumps(out, indent=2)) elif fmt == "tsv": - has_fail = any(r.N_FAIL is not None for r in results) - header = "chrom\tpos\tref\talt\tAC\tAN\tAF\tn_eligible\tN_HET\tN_HOM_ALT\tN_HOM_REF" - if has_fail: - header += "\tN_FAIL" + header = ( + "chrom\tpos\tref\talt\tAC\tAN\tAF\tn_eligible\t" + "N_HET\tN_HOM_ALT\tN_HOM_REF\tN_FAIL\tN_NO_COVERAGE" + ) click.echo(header) for r in results: af = f"{r.AF:.6f}" if r.AF is not None else "NA" line = ( f"{r.variant.chrom}\t{r.variant.pos}\t{r.variant.ref}\t" f"{r.variant.alt}\t{r.AC}\t{r.AN}\t{af}\t{r.n_samples_eligible}\t" - f"{r.N_HET}\t{r.N_HOM_ALT}\t{r.N_HOM_REF}" + f"{r.N_HET}\t{r.N_HOM_ALT}\t{r.N_HOM_REF}\t{r.N_FAIL}\t{r.N_NO_COVERAGE}" ) - if has_fail: - line += f"\t{r.N_FAIL if r.N_FAIL is not None else '?'}" click.echo(line) else: # text if not results: @@ -101,11 +98,11 @@ def _print_results(results, fmt: str) -> None: return for r in results: af = f"{r.AF:.4f}" if r.AF is not None else "NA" - fail_str = f" N_FAIL={r.N_FAIL}" if r.N_FAIL is not None else "" click.echo( f"{r.variant.chrom}:{r.variant.pos} {r.variant.ref}>{r.variant.alt} " f"AC={r.AC} AN={r.AN} AF={af} n_eligible={r.n_samples_eligible} " - f"N_HET={r.N_HET} N_HOM_ALT={r.N_HOM_ALT} N_HOM_REF={r.N_HOM_REF}{fail_str}" + f"N_HET={r.N_HET} N_HOM_ALT={r.N_HOM_ALT} N_HOM_REF={r.N_HOM_REF} " + f"N_FAIL={r.N_FAIL} N_NO_COVERAGE={r.N_NO_COVERAGE}" ) @@ -324,7 +321,8 @@ def annotate(db, input_vcf, output_vcf, phenotype, sex, tech, threads, verbose, AFQUERY_N_HET heterozygous carrier count (per ALT) AFQUERY_N_HOM_ALT homozygous alt count (per ALT) AFQUERY_N_HOM_REF homozygous ref count (per ALT) - AFQUERY_N_FAIL samples with FILTER!=PASS + AFQUERY_N_FAIL samples with FILTER!=PASS (per site) + AFQUERY_N_NO_COVERAGE eligible samples lacking coverage evidence (per ALT) """ if no_warn: import warnings diff --git a/tests/test_cli.py b/tests/test_cli.py index b8c4526..4251286 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -435,6 +435,34 @@ def test_query_tsv_with_fail_col(runner, test_db): assert "N_FAIL" in lines[0] +def test_query_output_includes_n_no_coverage_all_formats(runner, test_db): + """N_FAIL and N_NO_COVERAGE appear in text/tsv/json output, and TSV columns line up.""" + base = ["query", "--db", test_db, "--locus", "chr1:1500"] + + text = runner.invoke(cli, base) + assert text.exit_code == 0 + assert "N_FAIL=" in text.output + assert "N_NO_COVERAGE=" in text.output + + tsv = runner.invoke(cli, base + ["--format", "tsv"]) + assert tsv.exit_code == 0 + tsv_lines = tsv.output.strip().splitlines() + assert len(tsv_lines) >= 2 + header = tsv_lines[0].split("\t") + assert "N_FAIL" in header + assert header[-1] == "N_NO_COVERAGE" + for row in tsv_lines[1:]: + assert len(row.split("\t")) == len(header) + + js = runner.invoke(cli, base + ["--format", "json"]) + assert js.exit_code == 0 + data = json.loads(js.output) + assert len(data) >= 1 + for entry in data: + assert isinstance(entry["N_FAIL"], int) + assert isinstance(entry["N_NO_COVERAGE"], int) + + # --- afquery annotate: verbose flag --- def test_annotate_verbose(runner, test_db, tmp_path):