Merged
56 commits
b40e84b
feat: add performance and bed kit benchmarkings
dlopez-bioinfo Mar 25, 2026
58d2dd9
feat: improve benchmark infrastructure
dlopez-bioinfo Mar 25, 2026
b32bc96
refactor: add shared benchmark config and utilities
dlopez-bioinfo Mar 26, 2026
3011616
refactor: homogenize performance benchmark
dlopez-bioinfo Mar 26, 2026
2580979
refactor: homogenize capture_kit benchmark
dlopez-bioinfo Mar 26, 2026
8586c89
feat: add unified Snakemake pipeline with SLURM profile
dlopez-bioinfo Mar 26, 2026
cb84fa8
docs: update benchmarks README for Snakemake pipeline
dlopez-bioinfo Mar 26, 2026
0c213d3
docs: update sub-benchmark READMEs for Snakemake pipeline
dlopez-bioinfo Mar 26, 2026
8dd2dd1
docs: remove environment-specific details from benchmark READMEs
dlopez-bioinfo Mar 26, 2026
5b51c51
fix: use relative profile path for snakemake invocation from benchmarks/
dlopez-bioinfo Mar 26, 2026
313a944
fix: replace workflow.cores with explicit thread counts in Snakemake …
dlopez-bioinfo Mar 26, 2026
f8a662a
refactor: add CLI args to benchmark scripts for single-config execution
dlopez-bioinfo Mar 26, 2026
52c5973
feat: add aggregation scripts for per-run benchmark results
dlopez-bioinfo Mar 26, 2026
852bdb4
refactor: parallelize Snakemake rules with wildcards (12 → 65 jobs)
dlopez-bioinfo Mar 26, 2026
dc7dc26
chore: update SLURM profile for new parallelized rules
dlopez-bioinfo Mar 26, 2026
28b368c
fix: replace slurm_extra --nodes with nodes resource in SLURM profile
dlopez-bioinfo Mar 26, 2026
252aaab
fix: create log dirs in onstart before rules execute
dlopez-bioinfo Mar 26, 2026
d76534e
feat: use rule names instead of UUIDs for SLURM job names
dlopez-bioinfo Mar 26, 2026
2979fa0
fix: add onstart to all snakefiles to ensure log dirs exist
dlopez-bioinfo Mar 26, 2026
4fcdfda
fix: create log directories in shell commands before redirecting output
dlopez-bioinfo Mar 26, 2026
5bc01b4
fix: repair benchmark pipeline - fix log redirects and DAG dependencies
dlopez-bioinfo Mar 26, 2026
ea73742
fix: complete benchmark pipeline - fix annotate script and plot robus…
dlopez-bioinfo Mar 26, 2026
6878c2b
fix: remove unsupported job-name-format option from SLURM profile
dlopez-bioinfo Mar 26, 2026
0eaeefc
fix: add PATH export to capture_kit shell blocks for conda env access
dlopez-bioinfo Mar 26, 2026
6a15488
fix: add PATH export to all performance rule shell blocks
dlopez-bioinfo Mar 26, 2026
bb40828
fix: update download_1kg rule with proper log handling and PATH setup
dlopez-bioinfo Mar 26, 2026
718b5e7
fix: propagate data_dir override to all shared/config.py paths
dlopez-bioinfo Mar 26, 2026
a2c3184
fix: add download_1kg dep to capture_kit_split_vcf
dlopez-bioinfo Mar 26, 2026
c92be2a
fix: replace hardcoded ONEKG_DIR fallback with explicit error guard
dlopez-bioinfo Mar 26, 2026
0d499b6
fix: make data_dir and threads configurable in benchmarks
dlopez-bioinfo Mar 26, 2026
257cf2b
fix: comment out cluster-specific slurm_partition in SLURM profile
dlopez-bioinfo Mar 26, 2026
cb011a6
fix: clarify available sample selection in 01a_assign_samples.py
dlopez-bioinfo Mar 26, 2026
d1e2f0f
fix: add missing entries to .gitignore and fix cleanup in 03_build.py
dlopez-bioinfo Mar 26, 2026
81eb0d8
feat: pin package versions in benchmarks/envs/benchmark.yaml for repr…
dlopez-bioinfo Mar 26, 2026
b9764d1
fix: quote shell paths in capture_kit Snakefile rules
dlopez-bioinfo Mar 26, 2026
602e5ba
refactor: remove hardcoded chr22 and magic bias thresholds in 03_comp…
dlopez-bioinfo Mar 26, 2026
28c27d1
docs: add SLURM parallelism comments to performance prepare rules
dlopez-bioinfo Mar 26, 2026
4f421c0
fix: use chr-prefixed chrom for CaptureIndex and quote output.vcf path
dlopez-bioinfo Mar 26, 2026
3f8a21e
chore: remove .snakemake from git tracking
dlopez-bioinfo Mar 26, 2026
0caff83
fix: use conda env instead of micromamba
dlopez-bioinfo Mar 26, 2026
f260e15
refactor: parallelize 1KG splitting via Snakemake wildcard rules
dlopez-bioinfo Mar 26, 2026
4a6f8cd
fix: restore performance_all output marker and correct all rule targets
dlopez-bioinfo Mar 26, 2026
e86913c
fix: disable conda management in slurm profile, use existing environm…
dlopez-bioinfo Mar 26, 2026
82da97a
fix: use absolute paths in synthetic manifest vcf_path
dlopez-bioinfo Mar 26, 2026
3826ddf
fix: remove duplicate 'results' in RESULTS_DIR path
dlopez-bioinfo Mar 26, 2026
4d79507
fix: use absolute paths for PERF_RESULTS and PERF_FIGURES
dlopez-bioinfo Mar 26, 2026
394901d
fix: remove invalid 'nodes' resource from slurm profile
dlopez-bioinfo Mar 26, 2026
8fb0404
fix: propagate DATA_DIR to subprocess environment
dlopez-bioinfo Mar 27, 2026
f8cd8bc
fix: use absolute paths in download_1kg.smk
dlopez-bioinfo Mar 27, 2026
5c91c7b
fix: chromosome name mismatch in bcftools concordance
dlopez-bioinfo Mar 27, 2026
40c001d
fix: benchmark pipeline critical issues and documentation
dlopez-bioinfo Mar 27, 2026
cffd45a
feat: add directional error analysis to capture kit benchmark
dlopez-bioinfo Mar 28, 2026
b4d57a7
fix(benchmarking): remove slurm node selection
dlopez-bioinfo Apr 29, 2026
0fedf4d
fix: pin benchmark environment versions and update installation docs
dlopez-bioinfo May 5, 2026
9383e34
test: add coverage for variant_info wrapper function and filters
dlopez-bioinfo May 5, 2026
22deaf1
test: add coverage for variant_info edge cases and warnings
dlopez-bioinfo May 5, 2026
20 changes: 20 additions & 0 deletions benchmarks/.gitignore
@@ -0,0 +1,20 @@
# Benchmark data directories (large, external)
# Users create these as needed via config.py

# pycache
__pycache__/
*.pyc
.pytest_cache/

# benchmarking results
results
.snakemake

# Snakemake log directory
logs/

# Default data directory
.results/

# Local test run logs
smoke_test.log
141 changes: 141 additions & 0 deletions benchmarks/INSTALL.md
@@ -0,0 +1,141 @@
# Installation Guide for AFQuery Benchmarks

This guide covers installing the benchmark dependencies with micromamba: either create the dedicated `afquery_bench` environment from `envs/benchmark.yaml` (Option 1) or install into an existing `snakemake` environment (Option 2).
See [README.md](./README.md) for how to run the benchmarks.

## Option 1: Using conda environment file (Recommended)

The easiest way is to create the pre-configured environment:

```bash
cd benchmarks/
micromamba env create -f envs/benchmark.yaml
micromamba activate afquery_bench
```

Then skip to [Running Benchmarks](#running-benchmarks).

## Option 2: Manual installation

If you prefer to reuse an existing environment, install the dependencies into your `snakemake` micromamba environment as described in the subsections below.

### Prerequisites

You must have:
- Snakemake ≥ 8 installed in a micromamba environment named `snakemake`

If the environment does not exist yet, create it:
```bash
micromamba create -n snakemake -c conda-forge -c bioconda "snakemake>=8"
```

### Install all dependencies at once

```bash
micromamba install -n snakemake \
    -c bioconda -c conda-forge \
    bcftools bedtools matplotlib numpy pandas scipy wget \
  && micromamba run -n snakemake pip install afquery
```

### Or step-by-step installation

#### 1. Bioinformatics tools

```bash
micromamba install -n snakemake \
    -c bioconda -c conda-forge \
    bcftools bedtools wget
```

#### 2. Python scientific libraries

```bash
micromamba install -n snakemake \
    -c conda-forge \
    numpy pandas scipy matplotlib
```

#### 3. AFQuery

```bash
micromamba run -n snakemake pip install afquery
```

## Verification

Test that everything is installed (substitute `snakemake` for `afquery_bench` if you followed Option 2):

```bash
# Test bcftools
micromamba run -n afquery_bench bcftools --version

# Test Python imports
micromamba run -n afquery_bench python -c "
import afquery
import numpy
import pandas
import scipy
import matplotlib
print('All dependencies OK!')
"
```

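Expected output (abridged: `bcftools --version` also prints htslib and license lines, and the reported version should be 1.17 or later):
```
bcftools <version>
All dependencies OK!
```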
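
The same checks can be collected into one script that reports everything missing in a single pass. This is a minimal sketch, not part of the benchmark suite: `find_missing` is a hypothetical helper, and the module and tool names are simply the ones this guide installs.

```python
import importlib.util
import shutil

# Names taken from the installation steps in this guide.
REQUIRED_MODULES = ["afquery", "numpy", "pandas", "scipy", "matplotlib"]
REQUIRED_TOOLS = ["bcftools", "bedtools"]


def find_missing(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return (missing_modules, missing_tools) for the current environment.

    Hypothetical helper: modules are located with find_spec (nothing is
    imported), tools are looked up on PATH with shutil.which.
    """
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = find_missing()
    if mods or tools:
        print("Missing Python modules:", ", ".join(mods) or "none")
        print("Missing command-line tools:", ", ".join(tools) or "none")
    else:
        print("All dependencies OK!")
```

Saved under any file name, it can be run inside the environment with `micromamba run -n afquery_bench python <script>`.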

## Running Benchmarks

Once dependencies are installed and the environment is active:

```bash
cd benchmarks/
micromamba activate afquery_bench
snakemake --cores 52 all
```

Or run without activating the environment:

```bash
micromamba run -n afquery_bench snakemake --cores 52 all
```

See [README.md](./README.md) for more options (smoke test, dry-run, etc.).

## Troubleshooting

### `afquery` import fails

Verify the installation:
```bash
micromamba run -n afquery_bench pip show afquery
```

If not found, reinstall:
```bash
micromamba run -n afquery_bench pip install --upgrade afquery
```

### Missing Python packages

Check installed packages:
```bash
micromamba run -n afquery_bench pip list | grep -E "numpy|pandas|scipy|matplotlib"
```

If any are missing, re-run the install commands from the section above.

### `bcftools` not found

Verify bcftools is installed:
```bash
micromamba run -n afquery_bench which bcftools
```

If not found, install it:
```bash
micromamba install -n afquery_bench -c bioconda -c conda-forge bcftools
```
172 changes: 172 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,172 @@
# AFQuery Benchmarking Suite

This directory contains benchmarking experiments for AFQuery.

## Benchmark Categories

### 1. [Performance Benchmarks](./performance/README.md)

Core performance characterization across 4 experiments:

- **Experiment 1:** Query latency scaling with sample count
- **Experiment 2:** Build time scaling with parallelism
- **Experiment 3:** VCF annotation throughput
- **Experiment 4:** AFQuery vs. bcftools comparison

Uses real 1000 Genomes data (chr22) and synthetic datasets (1K-50K samples).

### 2. [Capture Kit Benchmark](./capture_kit/README.md)

Capture kit mixing impact on allele frequency classification:

- Sample generation with three Agilent SureSelect kits (v5, v6, v7)
- Three mixing scenarios (balanced, skewed, extreme)
- ACMG classification discordance analysis with directional error decomposition (toward-pathogenic vs. toward-benign)
- Coverage overlap metrics

## Prerequisites

- micromamba with `snakemake` environment already set up (with Snakemake >= 8 and SLURM executor plugin)
- Benchmark dependencies installed in the `snakemake` environment (see Environment Setup section)
- `/usr/bin/time` (GNU time) for memory profiling -- should be available on most Linux systems
- ~200 GB disk space for all 1KG data and databases
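
The disk-space and GNU time prerequisites can be checked up front before starting a long run. The following is a minimal sketch under stated assumptions: `check_prerequisites` is a hypothetical helper that is not part of this repository, and the 200 GB threshold is the approximate figure from the list above.

```python
import os
import shutil

REQUIRED_GB = 200  # approximate requirement quoted in the prerequisites


def check_prerequisites(data_dir, required_gb=REQUIRED_GB, time_bin="/usr/bin/time"):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Free space on the filesystem that holds the data directory.
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    if free_gb < required_gb:
        problems.append(
            f"only {free_gb:.0f} GB free in {data_dir}, need about {required_gb} GB"
        )

    # GNU time is used for memory profiling.
    if not (os.path.isfile(time_bin) and os.access(time_bin, os.X_OK)):
        problems.append(f"{time_bin} not found or not executable")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("."):
        print("WARNING:", problem)
```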

## Environment Setup

All benchmark dependencies are installed in the `snakemake` micromamba environment.

### Install dependencies

The benchmark requires:
- External bioinformatics tools: bcftools >= 1.18, bedtools, bgzip/tabix
- Python packages: pandas, numpy, scipy, matplotlib
- AFQuery and its dependencies: pyroaring, pyarrow, duckdb, cyvcf2, pyranges, click, tqdm

Install them in your `snakemake` environment:

```bash
micromamba install -n snakemake \
    -c bioconda -c conda-forge \
    bcftools bedtools matplotlib numpy pandas scipy wget \
  && micromamba run -n snakemake \
    pip install -e <path_to_afquery_repo>
```

### Verify

```bash
micromamba run -n snakemake bcftools --version
micromamba run -n snakemake python -c "import afquery; print('OK')"
```

## Running the Benchmarks

All commands below are run from the `benchmarks/` directory. Activate the `snakemake` environment first:

```bash
micromamba activate snakemake
```

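Then run Snakemake with `--cores N`, where `N` is the number of local CPU cores to use. The examples below use 52; adjust this to your machine, or pass `--cores all` to use every available core: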

```bash
# Run everything (both benchmarks)
snakemake --cores 52 all

# Run individual benchmarks
snakemake --cores 52 performance_all
snakemake --cores 52 capture_kit_all

# Download 1KG data only (prerequisite for both)
snakemake --cores 52 download_1kg

# Dry run (preview what will execute)
snakemake --cores 52 --dry-run all

# Smoke test (fast validation with small parameter scales)
snakemake --cores 52 --config smoke_test=true all
```

### Smoke Test

The `smoke_test=true` flag reduces both pipelines to minimal scales for fast validation:

- **Performance:** 1 synthetic scale (1K), 1 1KG subset (500), 2 thread counts (1, 4), 1 rep
- **Capture kit:** 50 samples, balanced scenario only

This is useful for verifying pipeline logic without waiting for full-scale runs.
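
One way such a flag can switch between parameter sets is sketched below. This is an illustration only, not the code in `performance/config.py` or `performance/config_smoke.py`: `select_params` and the dictionary keys are hypothetical, and the `FULL` values are placeholders. Only the `SMOKE` values come from the list above.

```python
# Placeholder full-scale values for illustration; the real ones are defined
# in performance/config.py and are not reproduced here.
FULL = {
    "synthetic_scales": [1_000, 50_000],
    "onekg_subsets": [500, 2_500],
    "threads": [1, 4, 16],
    "reps": 5,
}

# Smoke-test values as described in the bullet list above.
SMOKE = {
    "synthetic_scales": [1_000],
    "onekg_subsets": [500],
    "threads": [1, 4],
    "reps": 1,
}


def select_params(config):
    """Pick the parameter set based on a `smoke_test` entry in the config mapping.

    Hypothetical helper. The value may be a bool or a string depending on how
    the command-line override was parsed, so both forms are accepted.
    """
    flag = config.get("smoke_test", False)
    is_smoke = str(flag).strip().lower() in {"true", "1", "yes"}
    return SMOKE if is_smoke else FULL


if __name__ == "__main__":
    print(select_params({"smoke_test": "true"}))
```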

### Resuming after a failure

Snakemake uses output files to track completed steps. Re-running any command above will automatically skip steps whose outputs already exist and resume from the first incomplete step.
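
The idea can be modeled in a few lines. This is a deliberately simplified sketch, not Snakemake's implementation: real Snakemake also takes file timestamps and other rerun triggers into account. `pending_steps` and the step names are hypothetical.

```python
from pathlib import Path


def pending_steps(steps):
    """Return the names of steps whose outputs are not all present on disk.

    `steps` is an ordered list of (name, [output paths]). Simplified model:
    a step counts as complete only if every one of its outputs exists.
    """
    return [
        name
        for name, outputs in steps
        if not all(Path(path).exists() for path in outputs)
    ]


if __name__ == "__main__":
    # Hypothetical step names and output paths, for illustration only.
    steps = [
        ("download", ["data/chr22.vcf.gz"]),
        ("build", ["data/db/manifest.json"]),
    ]
    print("Steps still to run:", pending_steps(steps))
```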

## Directory Structure

```
benchmarks/
├── Snakefile                   # Root pipeline (includes both benchmarks)
├── config.yaml                 # Global parameters (data_dir, threads, etc.)
├── envs/
│   └── benchmark.yaml          # conda/micromamba environment spec
├── shared/
│   ├── __init__.py
│   ├── config.py               # Common constants: DATA_DIR, 1KG paths, SEED
│   ├── utils.py                # Common helpers: stats, time_ms, save_figure, WONG_COLORS
│   └── rules/
│       └── download_1kg.smk    # Shared Snakemake rules: download + split 1KG
├── performance/
│   ├── Snakefile               # Performance-specific rules
│   ├── config.py               # Performance parameters (scales, reps, thread counts)
│   ├── config_smoke.py         # Minimal config for quick smoke testing
│   ├── 01_prepare_data.py
│   ├── 02_query_scaling.py
│   ├── 03_build.py
│   ├── 04_annotate.py
│   ├── 05_vs_bcftools.py
│   ├── 06_plot.py
│   ├── collect_prepare.py      # Aggregation scripts
│   ├── collect_query_scaling.py
│   ├── collect_build_perf.py
│   ├── collect_annotate.py
│   └── collect_bcftools.py
└── capture_kit/
    ├── Snakefile               # Capture kit-specific rules
    ├── config.py               # Capture kit parameters (scenarios, ACMG thresholds)
    ├── beds/                   # Agilent SureSelect BED files (chr22, committed)
    │   ├── SureSelect_v5.bed
    │   ├── SureSelect_v6.bed
    │   └── SureSelect_v7.bed
    ├── 01a_assign_samples.py   # Subsample & assign technologies
    ├── 01b_write_manifests.py  # Write AFQuery manifest TSVs
    ├── 02_build_databases.py
    ├── 03_compute_metrics.py
    ├── 04_classify_acmg.py
    └── 05_plot_figures.py
```

## Configuration

Edit `config.yaml` to set the data directory and global parameters:

```yaml
data_dir: "/path/to/bench_data" # needs ~200 GB free
build_memory: "8GB" # DuckDB memory per worker
```

The data directory can also be set via the `AFQUERY_BENCH_DATA` environment variable.
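
How the two sources might be combined is sketched below. This is not the code in `shared/config.py`: `resolve_data_dir` is a hypothetical helper, and letting the environment variable take precedence over `config.yaml` is an assumption made for the example.

```python
import os

ENV_VAR = "AFQUERY_BENCH_DATA"  # environment variable named above


def resolve_data_dir(config, environ=os.environ):
    """Return an absolute data directory path.

    Hypothetical helper. Assumption: the environment variable wins over the
    `data_dir` entry from config.yaml. Raises instead of silently falling
    back to a hardcoded default.
    """
    value = environ.get(ENV_VAR) or config.get("data_dir")
    if not value:
        raise RuntimeError(
            f"No data directory configured: set data_dir in config.yaml or export {ENV_VAR}"
        )
    return os.path.abspath(os.path.expanduser(value))


if __name__ == "__main__":
    # In the pipeline the mapping would come from config.yaml; a literal is used here.
    print(resolve_data_dir({"data_dir": "/path/to/bench_data"}))
```

Returning an absolute path keeps the value valid for scripts and subprocesses that run from a different working directory.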

## Known Limitations

- **Chromosome scope:** All benchmarks use only chromosome 22 (1000 Genomes Phase 3). Chr22 is one of the smallest autosomes (~51 Mb) and coverage patterns may not be representative of genome-wide behavior. The paper's methods section should discuss generalizability.
- **Synthetic data for scaling:** Query scaling experiments (1K-50K samples) use synthetic data. Allele frequency spectra and LD patterns differ from real population data.
- **Capture kits:** The capture kit benchmark uses only Agilent SureSelect kits (v5, v6, v7). Other vendors (Illumina, Roche) and custom panels are not evaluated.

## Reproducibility

All random seeds are fixed in `shared/config.py` (SEED = 42).
Re-running with the same configuration produces identical results within timing noise.
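
The effect of a fixed seed on sample selection can be illustrated with the standard library. This is a minimal sketch, not the repository's code: `pick_samples` and the sample IDs are hypothetical, and only the value `SEED = 42` comes from the text above.

```python
import random

SEED = 42  # value stated above for shared/config.py


def pick_samples(sample_ids, n, seed=SEED):
    """Draw n sample IDs reproducibly.

    Hypothetical helper: uses a local generator so no global random state is
    touched, and sorts the input so the result does not depend on input order.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(sample_ids), n)


if __name__ == "__main__":
    ids = [f"SAMPLE{i:04d}" for i in range(100)]  # made-up IDs
    first = pick_samples(ids, 5)
    second = pick_samples(list(reversed(ids)), 5)
    print(first == second)
```

Timing measurements still vary between runs, so only the selected samples and derived counts are expected to match exactly.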

## References

- Weber LM et al. (2019). Essential guidelines for computational method benchmarking. *Genome Biology* 20:125. DOI: 10.1186/s13059-019-1738-8
- Auton A et al. (2015). A global reference for human genetic variation. *Nature* 526:68-74. DOI: 10.1038/nature15393