Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/advanced/debugging-results.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,13 +34,13 @@ A variant you expect to find is not in the database:
| Check | Details |
|-------|---------|
| Was it in the source VCFs? | AFQuery only stores variants present in ingested VCFs |
| Was it FILTER=PASS? | Default ingestion skips non-PASS variants. Check with `afquery info` for `pass_only_filter` |
| Was the alt allele ever called? | A variant is stored only if at least one sample carries the alt allele or has a failed call there. A site seen only as `0/0` in every sample is genuinely absent. Note that non-PASS calls *are* kept (in `fail_bitmap`) — such a variant is not missing, it just shows `AC=0`. |
| Multiallelic sites | AFQuery stores each ALT separately. Query the specific ALT allele, not just position |
| Chromosome naming | Ensure consistent `chr` prefix usage |

### 4. Unexpected N_FAIL > 0

`N_FAIL > 0` means some eligible samples had the alt allele called but with `FILTER≠PASS`. These samples are excluded from AC/AN. This is usually benign (1–2 samples), but a high N_FAIL warrants investigation:
`N_FAIL > 0` means some eligible samples had a call with `FILTER≠PASS` at this position. These samples are excluded from AC, but they remain eligible and still count in AN. This is usually benign (1–2 samples), but a high N_FAIL warrants investigation:

| N_FAIL relative to n_eligible | Likely cause |
|---|---|
Expand Down
29 changes: 16 additions & 13 deletions docs/advanced/filter-pass-tracking.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,29 +11,30 @@ In VCF format, the FILTER column indicates whether a variant call passed quality
- `PASS` or `.` (missing) — the variant passed all filters
- Any other value (e.g., `LowQual`, `VQSRTrancheSNP99.90to100.00`) — the variant failed one or more filters

AFQuery default behavior:
AFQuery's default behavior:

- **PASS-only**: only `FILTER=PASS` variants are counted in AC/AN. This is always enforced.
- Only `FILTER=PASS` calls contribute to **AC** (and therefore to **AF**). This is always enforced.
- **AN** is not affected by the filter — it counts every eligible sample at the position, failed calls included.

---

## fail_bitmap

AFQuery stores a third bitmap per variant alongside `het_bitmap` and `hom_bitmap`:

- **`fail_bitmap`** — bit set for each sample that has a non-ref genotype (AC>0) AND `FILTER≠PASS`
- **`fail_bitmap`** — bit set for each sample whose call at this site has `FILTER≠PASS`. In practice this is two cases: a sample that carries the alt allele but whose call failed a filter, and a sample with a missing genotype (`./.`) at a site that itself failed a filter.

This means:
What this means for a sample in `fail_bitmap`:

- A sample in `fail_bitmap` was genotyped with the alt allele but the call failed QC
- Such samples are **not** counted in AC/AN (they don't affect AF)
- Their count is exposed as `N_FAIL`
- Its alt alleles are **not** counted in AC, so it does not raise AF.
- It is still an eligible sample, so it **is** counted in AN. The failed call lowers AF by sitting in the denominator without contributing to the numerator.
- Its count is exposed separately as `N_FAIL`.

---

## Database Creation

The `fail_bitmap` is always written and always tracks PASS-only ingestion:
The `fail_bitmap` is always written, regardless of build options:

```bash
afquery create-db --manifest manifest.tsv --output-dir ./db/ --genome-build GRCh38
Expand All @@ -52,10 +53,10 @@ afquery query --db ./db/ --locus chr1:925952
```

```
chr1:925952 G>A AC=142 AN=2742 AF=0.0518 n_eligible=1371 N_HET=138 N_HOM_ALT=2 N_HOM_REF=1231 N_FAIL=7
chr1:925952 G>A AC=142 AN=2742 AF=0.0518 n_eligible=1371 N_HET=138 N_HOM_ALT=2 N_HOM_REF=1224 N_FAIL=7
```

`N_FAIL=7` means 7 eligible samples had the alt allele called but with FILTER≠PASS.
`N_FAIL=7` means 7 eligible samples had a call with FILTER≠PASS at this site. They are part of `n_eligible` (and of AN), so the genotype counts still add up: 138 + 2 + 1224 + 7 = 1371.

### Python API

Expand Down Expand Up @@ -83,21 +84,23 @@ Each carrier row shows its `filter` column as `PASS` or `FAIL`, along with sampl

## VCF Annotation

AFQuery adds an additional INFO field to annotated VCFs:
Among the INFO fields `afquery annotate` writes, one reports the failed-call count:

| Field | Type | Description |
|-------|------|-------------|
| `AFQUERY_N_FAIL` | Integer | Eligible samples with FILTER≠PASS at this variant |
| `AFQUERY_N_FAIL` | Integer | Eligible samples with `FILTER≠PASS` at this variant |

```bash
afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf
```

See [Understanding Output](../getting-started/understanding-output.md#vcf-annotation-fields) for the full set of `AFQUERY_*` fields.

---

## PASS-Only Enforcement

AF reflects the quality-filtered allele frequency — the frequency of the alt allele among high-quality calls. This is appropriate for most clinical and research use cases. PASS-only ingestion is always enforced.
PASS-only counting is always enforced and cannot be turned off. It applies to the numerator only: a failed call never adds an alt allele to AC, but the sample stays in the eligible set and so remains in AN. The result is a deliberately conservative AF — a failed carrier weighs on the denominator without lifting the numerator, which is the safe direction for clinical and research use.

---

Expand Down
10 changes: 7 additions & 3 deletions docs/advanced/multi-cohort-strategies.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ All samples in one database, with phenotype codes distinguishing cohorts.
### Manifest Design

```tsv
sample_id vcf_path sex technology phenotype_codes
sample_name vcf_path sex tech_name phenotype_codes
CARD_001 /data/card/001.vcf.gz female wgs cardiology,EUR,control
CARD_002 /data/card/002.vcf.gz male wgs cardiology,EUR,case_HCM
NEURO_001 /data/neuro/001.vcf.gz female wes neurology,AFR,case_epilepsy
Expand Down Expand Up @@ -99,7 +99,7 @@ afquery query --db ./merged/ --locus chr1:12345678 --ref C --alt T \
|-----------|-------------|
| Single database to maintain | Rebuilding requires all VCFs accessible |
| Cross-cohort queries via phenotype filters | Phenotype code design must be planned upfront |
| One annotation pass covers all cohorts | Adding a new cohort requires `afquery update` |
| One annotation pass covers all cohorts | Adding a new cohort requires `afquery update-db` |
| Flexible ad-hoc stratification | Larger database, longer rebuild time |

---
Expand Down Expand Up @@ -134,7 +134,11 @@ Maintain both per-cohort and merged databases.
afquery annotate --db /databases/institutional/cardiology/ \
--input patient.vcf.gz --output step1.vcf.gz

# Annotate against shared controls (use a different INFO prefix via Python API)
# Annotate against shared controls in a second pass
# Note: the AFQUERY_* INFO fields are overwritten on each pass — if you need
# both sets of frequencies, extract the values from step1.vcf.gz first.
afquery annotate --db /databases/shared/combined_controls/ \
--input step1.vcf.gz --output step2.vcf.gz
```

---
Expand Down
4 changes: 2 additions & 2 deletions docs/advanced/performance.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ Expected scaling (50K samples, 2,500 buckets, GRCh38):
| 32 | ~18 min | ~24× |
| 52 | ~13 min | ~38× |

Scaling is near-linear up to ~32 cores; beyond that, I/O and SQLite contention limit further gains.
Scaling is near-linear up to ~32 cores; beyond that, disk I/O contention limits further gains.

Total RAM required: `build_threads × build_memory`

Expand Down Expand Up @@ -80,7 +80,7 @@ afquery create-db ... --threads 32 --build-threads 16 --build-memory 4GB
graph TD
A["Query Request<br/>chr1:925952"]
B["Open DuckDB<br/>connection"]
C["Locate Parquet<br/>bucket_0/data.parquet"]
C["Locate Parquet<br/>bucket_0.parquet"]
D["Read rows<br/>matching pos"]
E["Deserialize<br/>bitmaps"]
F["Bitmap AND<br/>with eligible<br/>sample set"]
Expand Down
6 changes: 3 additions & 3 deletions docs/advanced/ploidy-and-sex-chroms.md
Original file line number Diff line number Diff line change
Expand Up @@ -103,12 +103,12 @@ afquery query --db ./db/ --locus chrM:3243

For every query result, the following identity holds:

**N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL = n_eligible**
**N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL + N_NO_COVERAGE = n_eligible**

This can be used to validate results. N_HOM_REF is the number of eligible samples that are homozygous reference (i.e., do not carry the alt allele and passed quality filters).
This can be used to validate results. N_HOM_REF is the number of eligible samples that are homozygous reference (i.e., do not carry the alt allele and passed quality filters). N_NO_COVERAGE is 0 unless a coverage-evidence filter is active — see [Coverage Evidence](coverage-evidence.md).

!!! note "Mutual exclusivity"
N_HET, N_HOM_ALT, N_HOM_REF, and N_FAIL are mutually exclusive. A sample with a non-ref allele but FILTER≠PASS is counted in N_FAIL only — it does not appear in N_HET or N_HOM_ALT. Likewise, N_HOM_REF counts only PASS-filtered samples.
N_HET, N_HOM_ALT, N_HOM_REF, N_FAIL, and N_NO_COVERAGE are mutually exclusive. A sample with a non-ref allele but FILTER≠PASS is counted in N_FAIL only — it does not appear in N_HET or N_HOM_ALT. Likewise, N_HOM_REF counts only PASS-filtered samples.

### chrX non-PAR

Expand Down
2 changes: 1 addition & 1 deletion docs/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,7 +149,7 @@ On chrX non-PAR positions, males contribute AN=1 and females contribute AN=2. Wh

## What does N_FAIL mean in query output?

`N_FAIL` is the count of eligible samples that had the alt allele called but with `FILTER≠PASS`. Shown as `N_FAIL=N` in text output. These samples are not counted in AC/AN. See [FILTER=PASS Tracking](advanced/filter-pass-tracking.md).
`N_FAIL` is the count of eligible samples whose call had `FILTER≠PASS` at the position. Shown as `N_FAIL=N` in text output. These samples are excluded from AC, but they stay eligible and still count toward AN. See [FILTER=PASS Tracking](advanced/filter-pass-tracking.md).

---

Expand Down
14 changes: 7 additions & 7 deletions docs/getting-started/concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,13 +126,13 @@ graph TB

### Per-Variant Bitmaps

For each variant row, AFQuery stores three [Roaring Bitmaps](https://roaringbitmap.org/):
For each variant row, AFQuery stores per-sample genotype information as [Roaring Bitmaps](https://roaringbitmap.org/). Three of them drive every query:

- **`het_bitmap`** — bit set for each sample that is heterozygous (GT=0/1 or 1/0)
- **`hom_bitmap`** — bit set for each sample that is homozygous alt (GT=1/1 or GT=1)
- **`fail_bitmap`** bit set for each sample called with FILTER≠PASS
- **`het_bitmap`** — sample is heterozygous (`GT=0/1`) with `FILTER=PASS`
- **`hom_bitmap`** — sample is homozygous alt (`GT=1/1`, or `GT=1` on haploid regions) with `FILTER=PASS`
- **`fail_bitmap`** — the sample's call at this site has `FILTER≠PASS`

Each sample has a stable integer ID (0-indexed). The bit position in the bitmap equals the sample ID.
Each sample has a stable integer ID (0-indexed). The bit position in the bitmap equals the sample ID. Databases built with coverage-quality filters carry two additional bitmaps — see [Data Model](../reference/data-model.md#parquet-schema).

### Parquet Storage

Expand All @@ -141,8 +141,8 @@ Bitmaps are serialized and stored in Parquet files, partitioned by chromosome an
```
variants/
chr1/
bucket_0/ ← positions 0–999,999
bucket_1/ ← positions 1,000,000–1,999,999
bucket_0.parquet ← positions 0–999,999
bucket_1.parquet ← positions 1,000,000–1,999,999
...
chr2/
...
Expand Down
1 change: 1 addition & 0 deletions docs/getting-started/quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,7 @@ The output VCF gains INFO fields (see [Annotate a VCF](../guides/annotate-vcf.md
| `AFQUERY_N_HOM_ALT` | A (per ALT) | Homozygous alt sample count |
| `AFQUERY_N_HOM_REF` | A (per ALT) | Homozygous ref sample count |
| `AFQUERY_N_FAIL` | 1 (per site) | Samples with FILTER≠PASS |
| `AFQUERY_N_NO_COVERAGE` | A (per ALT) | Eligible samples lacking coverage evidence (0 unless a coverage-evidence filter is active) |

---

Expand Down
4 changes: 2 additions & 2 deletions docs/getting-started/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -86,15 +86,15 @@ Every query result includes the following fields:
| **AC** | Allele count — alt allele copies in eligible samples |
| **AN** | Allele number — total alleles examined (adjusted for coverage and ploidy) |
| **AF** | Allele frequency — `AC / AN` |
| **n_eligible** | Samples in the filtered set (before coverage or ploidy adjustments). Here, 10 = no filter applied. |
| **n_eligible** | Eligible samples at this position — those passing the metadata filters *and* covered here. Ploidy is applied afterwards, to AN. Here, 10 = no filter applied and every sample is covered. |
| **N_HET** | Eligible samples heterozygous for the alt allele |
| **N_HOM_ALT** | Eligible samples homozygous for the alt allele |
| **N_HOM_REF** | Eligible samples homozygous reference |
| **N_FAIL** | Eligible samples with a non-ref allele called but `FILTER≠PASS` in the source VCF |

For `chr1:925952`: AN=20 = 10 diploid samples × 2 alleles; AC=6 = 4 het samples (1 copy each) + 1 hom-alt (2 copies).

**Accounting identity:** `n_eligible = N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL` always holds (10 = 4+1+5+0 ✓).
**Accounting identity:** `n_eligible = N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL + N_NO_COVERAGE` always holds. `N_NO_COVERAGE` is 0 here (it only becomes non-zero with a coverage-evidence filter), so 10 = 4+1+5+0+0 ✓.

### N_FAIL in practice

Expand Down
4 changes: 2 additions & 2 deletions docs/getting-started/understanding-output.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,8 @@ This page explains what each field in AFQuery output means and how to interpret
| **N_HET** | int | Number of eligible samples heterozygous for the alt allele (GT=0/1) |
| **N_HOM_ALT** | int | Number of eligible samples homozygous for the alt allele (GT=1/1 or GT=1). Includes haploid carriers on sex chromosomes and chrM. See [Ploidy](../advanced/ploidy-and-sex-chroms.md#genotype-counting). |
| **N_HOM_REF** | int | Number of eligible samples homozygous reference (GT=0/0 or GT=0) |
| **n_eligible** | int | Number of samples in the eligible set (after sex/phenotype/tech filters) |
| **N_FAIL** | int | Number of eligible samples with a non-ref allele called but FILTER≠PASS at this position. These samples are counted *only* in N_FAIL — not in N_HET, N_HOM_ALT, or N_HOM_REF. |
| **n_eligible** | int | Number of eligible samples — those passing the sex/phenotype/tech filters *and* covered at this position |
| **N_FAIL** | int | Number of eligible samples whose call at this position had FILTER≠PASS. These samples are counted *only* in N_FAIL — not in N_HET, N_HOM_ALT, or N_HOM_REF — but they stay eligible and still count toward AN. |
| **N_NO_COVERAGE** | int | Number of eligible samples whose tech lacks coverage evidence at this position. Excluded from `N_HOM_REF` to keep AC/AN conservative. Always `0` unless a coverage-evidence filter is active. See [Coverage Evidence](../advanced/coverage-evidence.md). |


Expand Down
2 changes: 1 addition & 1 deletion docs/guides/annotate-vcf.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ afquery annotate \
| `AFQUERY_N_HET` | Integer | A (per ALT) | Heterozygous sample count |
| `AFQUERY_N_HOM_ALT` | Integer | A (per ALT) | Homozygous alt sample count |
| `AFQUERY_N_HOM_REF` | Integer | A (per ALT) | Homozygous ref sample count |
| `AFQUERY_N_FAIL` | Integer | 1 (per site) | Samples with FILTER≠PASS and alt allele called. Mutually exclusive with N_HET/N_HOM_ALT/N_HOM_REF. |
| `AFQUERY_N_FAIL` | Integer | 1 (per site) | Eligible samples whose call had FILTER≠PASS. Excluded from AC, but still counted in AN. Mutually exclusive with N_HET/N_HOM_ALT/N_HOM_REF. |
| `AFQUERY_N_NO_COVERAGE` | Integer | A (per ALT) | Eligible samples whose tech lacks coverage evidence at this position. Excluded from `N_HOM_REF` to keep AC/AN conservative. Always `0` unless a coverage-evidence filter is active. See [Coverage Evidence](../advanced/coverage-evidence.md). |

!!! note "Multi-allelic sites"
Expand Down
14 changes: 6 additions & 8 deletions docs/guides/create-database.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,8 +27,8 @@ afquery create-db \

## What Happens

1. **Ingest phase** — Each VCF is parsed with cyvcf2. Genotypes and INFO fields are written to a SQLite temporary database, one row per variant per sample.
2. **Build phase** — DuckDB reads the SQLite data, aggregates per 1-Mbp bucket, and writes Roaring Bitmap Parquet files partitioned by chromosome/bucket.
1. **Ingest phase** — Each VCF is parsed with cyvcf2. Genotypes and quality fields are written to a temporary per-sample Parquet file, one row per variant per sample.
2. **Build phase** — DuckDB reads the temporary Parquet files, aggregates per 1-Mbp bucket, and writes Roaring Bitmap Parquet files partitioned by chromosome and bucket.
3. **Finalize** — `manifest.json` and `metadata.sqlite` are written to the output directory.

---
Expand All @@ -39,12 +39,10 @@ afquery create-db \
./db/
├── manifest.json # Build configuration (genome build, schema version, etc.)
├── metadata.sqlite # Sample/phenotype/technology/changelog metadata
├── variants/ # Hive-partitioned Parquet files
├── variants/ # Parquet files partitioned by chromosome and bucket
│ ├── chr1/
│ │ ├── bucket_0/ # Positions 0–999,999
│ │ │ └── data.parquet
│ │ ├── bucket_1/ # Positions 1,000,000–1,999,999
│ │ │ └── data.parquet
│ │ ├── bucket_0.parquet # Positions 0–999,999
│ │ ├── bucket_1.parquet # Positions 1,000,000–1,999,999
│ │ └── ...
│ ├── chr2/
│ └── ...
Expand Down Expand Up @@ -95,7 +93,7 @@ afquery create-db --manifest manifest.tsv --output-dir ./db/ --genome-build GRCh

## FILTER=PASS Behavior

By default, only variants with `FILTER=PASS` (or no FILTER field) are counted in AC/AN. Variants that fail filters are tracked in `fail_bitmap`. PASS-only ingestion is always enforced — there is currently no CLI option to change this behaviour.
Only `FILTER=PASS` calls (or calls with no FILTER field) contribute to AC, and therefore to AF. AN is not affected — it counts every eligible sample, failed calls included. Calls that fail a filter are tracked in `fail_bitmap` and surfaced as `N_FAIL`. PASS-only counting is always enforced — there is currently no CLI option to change this behaviour.

See [FILTER=PASS Tracking](../advanced/filter-pass-tracking.md) for details.

Expand Down
2 changes: 1 addition & 1 deletion docs/guides/sample-filtering.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Sample Filtering

AFQuery supports flexible metadata-based selection of samples for AF computation. Filters are available on all query commands (`query`, `annotate`, `dump`).
AFQuery supports flexible metadata-based selection of samples for AF computation. Filters are available on all query commands (`query`, `annotate`, `dump`, `variant-info`).

**Phenotype codes are arbitrary string labels** — you define them in your manifest. They can be ICD-10 codes, HPO terms, project tags (`control`, `pilot`), or any strings meaningful to your cohort. The filtering system does not interpret the codes — it matches them exactly as stored.

Expand Down
2 changes: 1 addition & 1 deletion docs/guides/update-database.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Update a Database

`afquery update-db` supports three operations: adding new samples, removing existing samples, and compacting the database to reclaim space.
`afquery update-db` adds new samples, removes existing samples, updates sample metadata, and compacts the database to reclaim space.

### Update Timeline

Expand Down
4 changes: 2 additions & 2 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ AFQuery pre-indexes genotypes as [Roaring Bitmaps](https://roaringbitmap.org/) i
- **Ploidy-aware** — correct AN on chrX PAR/non-PAR, chrY, and chrM.
- **Technology-aware AN** — per-position capture BED intersection across WGS, WES kits, and panels.
- **Carrier lookup** — list samples carrying any variant with full metadata (sex, tech, phenotypes, genotype, FILTER status).
- **VCF annotation** — add `AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL` INFO fields from any sample subset.
- **VCF annotation** — add `AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL/N_NO_COVERAGE` INFO fields from any sample subset.
- **Audit changelog** — every database operation is recorded for reproducibility.

---
Expand All @@ -43,7 +43,7 @@ graph TD
A["🔍 Input VCFs<br/>single-sample"]
B["📥 Ingest<br/>cyvcf2 reads VCFs →<br/>per-sample Parquet files emitted"]
C["🏗️ Build<br/>DuckDB aggregates per bucket →<br/>Roaring Bitmaps → Parquet"]
D["💾 Database on Disk<br/>variants/chr*/bucket_*/<br/>capture/*.pkl<br/>metadata.sqlite<br/>manifest.json"]
D["💾 Database on Disk<br/>variants/chr*/bucket_*.parquet<br/>capture/*.pkl<br/>metadata.sqlite<br/>manifest.json"]
E["⚡ Query Engine<br/>Load bitmap → filter samples →<br/>compute AC/AN/AF<br/>~10-100ms"]
F["📊 Annotate"]
G["🔄 Update"]
Expand Down
Loading
Loading