feat: add comprehensive benchmarking framework and variant info module by dlopez-bioinfo · Pull Request #30 · babelomics/afquery

dlopez-bioinfo · 2026-05-05T10:37:55Z

Summary

This PR introduces a comprehensive benchmarking framework and new variant information features for AFQuery:

Benchmarking Infrastructure: Complete performance and correctness benchmarking suite with Snakemake pipelines for testing query scaling, database build performance, annotation speed, and comparison with bcftools
Capture Kit Analysis: Dedicated benchmark for ACMG variant classification across multiple WES capture kits (Agilent SureSelect v5/v6/v7)
Variant Info Module: New structured module for variant metadata retrieval and processing
Documentation: Updated CLI reference, guides, and troubleshooting docs
Critical Fixes: Pin benchmark environment versions, fix SLURM node selection, resolve chromosome name mismatches in bcftools concordance

Test plan

Run all existing tests to ensure no regressions
Benchmarking suite requires manual execution with genomic data (1KG download, sample preparation)
Validate variant_info tests with test database

codecov-commenter · 2026-05-05T10:40:38Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

- Auto-detect CPU count for thread defaults (os.cpu_count, nproc) - Add resumable wget downloads with -c flag - Add htslib module to capture kit pipeline - Extend annotate benchmarks to 32 threads

Move common constants (DATA_DIR, 1KG paths, SEED) to shared/config.py and duplicate helpers (stats, time_ms, save_figure, WONG_COLORS) to shared/utils.py. Move download script to shared/scripts/.

Import shared constants from shared/config.py. Replace local _stats, _time_ms, COLORS, _save with imports from shared/utils.py. Add Snakefile with per-rule SLURM resources and resume support.

Import shared constants from shared/config.py instead of performance/config.py. Fix broken capture_kit_bench imports. Replace duplicate style constants and _save with shared/utils.py. Add Snakefile replacing run_pipeline.sh.

Single root Snakefile orchestrates both benchmarks with shared 1KG download step. SLURM profile supports multi-node HPC execution with per-rule resource allocation. Snakemake resume semantics handle partial failures automatically via output file tracking.

…rules workflow.cores requires --cores N which is incompatible with the SLURM executor. Each rule now declares explicit threads matching cpus_per_task in the SLURM profile.

- Fixed 19 shell blocks with cd-before-redirect bug (performance + capture_kit Snakefiles) - Pattern: LOG=$(realpath {log}) before cd to avoid silent 0-byte logs - Fixed DAG dependencies for automatic rule execution - Added data_inventory.json to performance_plot inputs (ensures prepare rules are included) - Added explicit outputs to performance_all and capture_kit_all rules - Changed root rule all to depend on outputs instead of inputs - All logs now capture actual error messages instead of being silently swallowed - Dry-run now shows correct DAG with 21 jobs (was failing completely before) Note: Further work needed to ensure prepare rules (performance_prepare_synth, performance_prepare_1kg) are included in DAG automatically.

…tness - Modified 04_annotate.py to accept --max-subset parameter for smoke_test compatibility - Updated Snakefile to pass max_subset to annotate script via params - Made 06_plot.py robust to missing disk_size fields in inventory data - All performance benchmark rules now execute successfully end-to-end Test results (smoke_test mode): ✅ performance_prepare_synth ✅ performance_prepare_1kg ✅ performance_query_scaling_one ✅ performance_annotate_one (both variants) ✅ performance_annotate_collect ✅ performance_build_one ✅ performance_build_collect ✅ performance_plot ✅ performance_all (complete end-to-end success)

- snakemake 9.17.3 does not support --job-name-format argument - Removed this option from profiles/slurm/config.yaml - Dry-run and SLURM execution now work without unrecognized arguments error

- Added 'set -euo pipefail' and PATH export to all capture_kit rules - Ensures bioinformatics tools (parallel, bedtools, bcftools) are accessible - Tools are installed in snakemake conda environment but weren't in PATH for shell blocks

- Ensures Python and tools from snakemake conda environment are accessible - Added 'set -euo pipefail' for error handling - All performance rules now have proper environment setup

- Added PATH export for conda environment access - Fixed log path handling with realpath and proper quoting - Added set -euo pipefail for error handling

Add _set_data_dir() to shared/config.py to atomically update DATA_DIR and all derived paths (ONEKG_VCF_DIR, ONEKG_MERGED_VCF, ONEKG_PANEL) in one call. Update root Snakefile to use it instead of manually patching individual attributes, which left ONEKG_VCF_DIR and ONEKG_MERGED_VCF frozen at the default .results/ path. Also fix the AFQUERY_BENCH_DATA env var support: the previous code used a hardcoded lustre path as the fallback instead of .results/, making the env var effectively useless.

- Replace hardcoded developer paths in config.yaml (data_dir, threads) - data_dir: set to empty string with instructions for users to configure - threads: set to 4 (reasonable default for local runs; SLURM has per-rule settings) - Add startup guard in Snakefile to catch empty data_dir before DAG construction

Users should uncomment and set slurm_partition for their own cluster, as partition names vary across HPC systems (normal, compute, standard, etc).

Remove misleading [:N_SAMPLES] slice from line 96. The available samples are already a subset from the initial subsample(), so the slice was redundant and confusing. Use available directly when VCF files exist.

…oducibility

…ute_metrics.py - Add ONEKG_CHROM to config.py imports from shared.config - Import ONEKG_CHROM in 03_compute_metrics.py from config - Add chrom parameter to compute_kit_coverage() defaulting to ONEKG_CHROM - Promote bias check magic numbers (0.01, 0.001) to named module constants _AFQUERY_BIAS_THRESHOLD and _NAIVE_POSITIVE_BIAS_THRESHOLD with explanatory comments

Replace single-node GNU parallel split in download_1kg.sh with split_1kg_sample wildcard rule (one SLURM job per sample). - Add checkpoint download_1kg_data for VCF + panel download - Add split_1kg_sample wildcard rule (2504 independent 1-CPU jobs) - Rewrite download_1kg to write manifest.tsv once all splits complete - Remove capture_kit_split_vcf (replaced by split_1kg_sample) - Delete download_1kg.sh - Update SLURM profile and README

…ents

Python scripts in subprocesses import shared.config fresh, which checks AFQUERY_BENCH_DATA at module import time. Without this env var, they fall back to the default Path('.results'), causing them to look in the wrong directory for manifests and databases.

Prevents manifest.tsv from being written with relative VCF paths. Relative paths would be misresolved when subsampled manifests are read from different directories.

The 1KG Phase 3 VCF uses bare chromosome names (22) while AFQuery normalizes to chr-prefixed format (chr22). This caused zero common variants in concordance. - Normalize bcftools output to chr-prefixed format in _compute_concordance - Use bare ONEKG_CHROM for bcftools timing queries - Add fallback figure when no concordance data - Import ONEKG_CHROM in config Result: 1.1M common variants (R²=1.0) with fig5 generated successfully.

Critical fixes: - Fix chromosome name mismatch in bcftools comparison (bare vs chr-prefixed) - Use scenario-specific seeds for independent capture kit technology assignments - Include Database initialization in bcftools timing for fair comparison - Sample 2000 random pairs for concordance scatter plot vs first 100 Design improvements: - Add repetitions for build (BUILD_REPS=3) and annotation (ANNOTATE_REPS=3) benchmarks - Propagate smoke_test configuration to capture_kit benchmark - Guard stats() function for empty timing lists - Fix RESULTS_DIR path separation from raw data Documentation: - Expand performance/README.md with detailed experiment descriptions and interpretation guides - Expand capture_kit/README.md with ground truth setup and clinical impact sections - Update benchmarks/README.md with correct directory structure and known limitations - Add timing methodology section explaining cold vs warm queries Bug fixes: - Prevent race conditions in parallel build repetitions by using unique db_dir per rep - Fix BED file path documentation for capture_kit benchmark

- Add toward_pathogenic and toward_benign metrics in ACMG classification - Add BS1 recall metric for disease-specific evaluation - Add Figure 5 plotting toward-pathogenic misclassification rates - Update documentation with new metric definitions and interpretations

- Pin all conda and pip package versions in envs/benchmark.yaml based on the validated afquery_bench environment (tested locally with smoke test) - Add snakemake==9.20.0 to pip dependencies (conda resolver incompatibility) - Add smoke_test.log to .gitignore (local test artifact) - Update INSTALL.md with corrected instructions (remove hardcoded user paths) These changes ensure reproducibility and avoid compatibility issues when others run the benchmarks. The environment versions were captured from a working installation tested with the smoke test pipeline.

- Add tests using the public variant_info() wrapper function directly - Test variant_info() with phenotype filters - Test variant_info() with technology filters - Improves codecov coverage for variant_info.py and imports

- Test unknown chromosome warning path in variant_info() - Test multiple alleles warning when ref/alt not specified - Test empty results without spurious warnings - Covers warning paths in QueryEngine.variant_info()

dlopez-bioinfo mentioned this pull request May 5, 2026

feat: customizable homozygous reference sample selection #28

Closed

dlopez-bioinfo force-pushed the feat/benchmarks branch from fdcdf71 to 0fedf4d Compare May 5, 2026 11:09

dlopez-bioinfo added 27 commits May 5, 2026 13:10

feat: add performance and bed kit benchmarkings

b40e84b

feat: improve benchmark infrastructure

58d2dd9

- Auto-detect CPU count for thread defaults (os.cpu_count, nproc) - Add resumable wget downloads with -c flag - Add htslib module to capture kit pipeline - Extend annotate benchmarks to 32 threads

refactor: add shared benchmark config and utilities

b32bc96

Move common constants (DATA_DIR, 1KG paths, SEED) to shared/config.py and duplicate helpers (stats, time_ms, save_figure, WONG_COLORS) to shared/utils.py. Move download script to shared/scripts/.

refactor: homogenize performance benchmark

3011616

Import shared constants from shared/config.py. Replace local _stats, _time_ms, COLORS, _save with imports from shared/utils.py. Add Snakefile with per-rule SLURM resources and resume support.

refactor: homogenize capture_kit benchmark

2580979

Import shared constants from shared/config.py instead of performance/config.py. Fix broken capture_kit_bench imports. Replace duplicate style constants and _save with shared/utils.py. Add Snakefile replacing run_pipeline.sh.

docs: update benchmarks README for Snakemake pipeline

cb84fa8

docs: update sub-benchmark READMEs for Snakemake pipeline

0c213d3

docs: remove environment-specific details from benchmark READMEs

8dd2dd1

fix: use relative profile path for snakemake invocation from benchmarks/

5b51c51

fix: replace workflow.cores with explicit thread counts in Snakemake …

313a944

…rules workflow.cores requires --cores N which is incompatible with the SLURM executor. Each rule now declares explicit threads matching cpus_per_task in the SLURM profile.

refactor: add CLI args to benchmark scripts for single-config execution

f8a662a

feat: add aggregation scripts for per-run benchmark results

52c5973

refactor: parallelize Snakemake rules with wildcards (12 → 65 jobs)

852bdb4

chore: update SLURM profile for new parallelized rules

dc7dc26

fix: replace slurm_extra --nodes with nodes resource in SLURM profile

28b368c

fix: create log dirs in onstart before rules execute

252aaab

feat: use rule names instead of UUIDs for SLURM job names

d76534e

fix: add onstart to all snakefiles to ensure log dirs exist

2979fa0

fix: create log directories in shell commands before redirecting output

4fcdfda

fix: remove unsupported job-name-format option from SLURM profile

6878c2b

- snakemake 9.17.3 does not support --job-name-format argument - Removed this option from profiles/slurm/config.yaml - Dry-run and SLURM execution now work without unrecognized arguments error

fix: add PATH export to all performance rule shell blocks

6a15488

- Ensures Python and tools from snakemake conda environment are accessible - Added 'set -euo pipefail' for error handling - All performance rules now have proper environment setup

fix: update download_1kg rule with proper log handling and PATH setup

bb40828

- Added PATH export for conda environment access - Fixed log path handling with realpath and proper quoting - Added set -euo pipefail for error handling

dlopez-bioinfo added 28 commits May 5, 2026 13:10

fix: replace hardcoded ONEKG_DIR fallback with explicit error guard

c92be2a

fix: comment out cluster-specific slurm_partition in SLURM profile

257cf2b

Users should uncomment and set slurm_partition for their own cluster, as partition names vary across HPC systems (normal, compute, standard, etc).

fix: clarify available sample selection in 01a_assign_samples.py

cb011a6

Remove misleading [:N_SAMPLES] slice from line 96. The available samples are already a subset from the initial subsample(), so the slice was redundant and confusing. Use available directly when VCF files exist.

fix: add missing entries to .gitignore and fix cleanup in 03_build.py

d1e2f0f

feat: pin package versions in benchmarks/envs/benchmark.yaml for repr…

81eb0d8

…oducibility

fix: quote shell paths in capture_kit Snakefile rules

b9764d1

docs: add SLURM parallelism comments to performance prepare rules

28c27d1

fix: use chr-prefixed chrom for CaptureIndex and quote output.vcf path

4f421c0

chore: remove .snakemake from git tracking

3f8a21e

fix: use conda env instead of micromamba

0caff83

fix: restore performance_all output marker and correct all rule targets

4a6f8cd

fix: disable conda management in slurm profile, use existing environm…

e86913c

…ents

fix: use absolute paths in synthetic manifest vcf_path

82da97a

fix: remove duplicate 'results' in RESULTS_DIR path

3826ddf

fix: use absolute paths for PERF_RESULTS and PERF_FIGURES

4d79507

fix: remove invalid 'nodes' resource from slurm profile

394901d

fix: use absolute paths in download_1kg.smk

f8cd8bc

Prevents manifest.tsv from being written with relative VCF paths. Relative paths would be misresolved when subsampled manifests are read from different directories.

fix(benchmarking): remove slurm node selection

b4d57a7

test: add coverage for variant_info wrapper function and filters

9383e34

- Add tests using the public variant_info() wrapper function directly - Test variant_info() with phenotype filters - Test variant_info() with technology filters - Improves codecov coverage for variant_info.py and imports

test: add coverage for variant_info edge cases and warnings

22deaf1

- Test unknown chromosome warning path in variant_info() - Test multiple alleles warning when ref/alt not specified - Test empty results without spurious warnings - Covers warning paths in QueryEngine.variant_info()

dlopez-bioinfo merged commit 2c7fd6d into master May 5, 2026
3 checks passed

dlopez-bioinfo deleted the feat/benchmarks branch May 5, 2026 11:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add comprehensive benchmarking framework and variant info module#30

feat: add comprehensive benchmarking framework and variant info module#30
dlopez-bioinfo merged 56 commits into
masterfrom
feat/benchmarks

dlopez-bioinfo commented May 5, 2026

Uh oh!

codecov-commenter commented May 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dlopez-bioinfo commented May 5, 2026

Summary

Test plan

Uh oh!

codecov-commenter commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov-commenter commented May 5, 2026 •

edited

Loading