feat: add comprehensive benchmarking framework and variant info module#30
Merged
Conversation
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
fdcdf71 to
0fedf4d
Compare
- Auto-detect CPU count for thread defaults (os.cpu_count, nproc) - Add resumable wget downloads with -c flag - Add htslib module to capture kit pipeline - Extend annotate benchmarks to 32 threads
Move common constants (DATA_DIR, 1KG paths, SEED) to shared/config.py and duplicate helpers (stats, time_ms, save_figure, WONG_COLORS) to shared/utils.py. Move download script to shared/scripts/.
Import shared constants from shared/config.py. Replace local _stats, _time_ms, COLORS, _save with imports from shared/utils.py. Add Snakefile with per-rule SLURM resources and resume support.
Import shared constants from shared/config.py instead of performance/config.py. Fix broken capture_kit_bench imports. Replace duplicate style constants and _save with shared/utils.py. Add Snakefile replacing run_pipeline.sh.
Single root Snakefile orchestrates both benchmarks with shared 1KG download step. SLURM profile supports multi-node HPC execution with per-rule resource allocation. Snakemake resume semantics handle partial failures automatically via output file tracking.
…rules workflow.cores requires --cores N which is incompatible with the SLURM executor. Each rule now declares explicit threads matching cpus_per_task in the SLURM profile.
- Fixed 19 shell blocks with cd-before-redirect bug (performance + capture_kit Snakefiles)
- Pattern: LOG=$(realpath {log}) before cd to avoid silent 0-byte logs
- Fixed DAG dependencies for automatic rule execution
- Added data_inventory.json to performance_plot inputs (ensures prepare rules are included)
- Added explicit outputs to performance_all and capture_kit_all rules
- Changed root rule all to depend on outputs instead of inputs
- All logs now capture actual error messages instead of being silently swallowed
- Dry-run now shows correct DAG with 21 jobs (was failing completely before)
Note: Further work needed to ensure prepare rules (performance_prepare_synth, performance_prepare_1kg) are included in DAG automatically.
…tness - Modified 04_annotate.py to accept --max-subset parameter for smoke_test compatibility - Updated Snakefile to pass max_subset to annotate script via params - Made 06_plot.py robust to missing disk_size fields in inventory data - All performance benchmark rules now execute successfully end-to-end Test results (smoke_test mode): ✅ performance_prepare_synth ✅ performance_prepare_1kg ✅ performance_query_scaling_one ✅ performance_annotate_one (both variants) ✅ performance_annotate_collect ✅ performance_build_one ✅ performance_build_collect ✅ performance_plot ✅ performance_all (complete end-to-end success)
- snakemake 9.17.3 does not support --job-name-format argument - Removed this option from profiles/slurm/config.yaml - Dry-run and SLURM execution now work without unrecognized arguments error
- Added 'set -euo pipefail' and PATH export to all capture_kit rules - Ensures bioinformatics tools (parallel, bedtools, bcftools) are accessible - Tools are installed in snakemake conda environment but weren't in PATH for shell blocks
- Ensures Python and tools from snakemake conda environment are accessible - Added 'set -euo pipefail' for error handling - All performance rules now have proper environment setup
- Added PATH export for conda environment access - Fixed log path handling with realpath and proper quoting - Added set -euo pipefail for error handling
Add _set_data_dir() to shared/config.py to atomically update DATA_DIR and all derived paths (ONEKG_VCF_DIR, ONEKG_MERGED_VCF, ONEKG_PANEL) in one call. Update root Snakefile to use it instead of manually patching individual attributes, which left ONEKG_VCF_DIR and ONEKG_MERGED_VCF frozen at the default .results/ path. Also fix the AFQUERY_BENCH_DATA env var support: the previous code used a hardcoded lustre path as the fallback instead of .results/, making the env var effectively useless.
- Replace hardcoded developer paths in config.yaml (data_dir, threads) - data_dir: set to empty string with instructions for users to configure - threads: set to 4 (reasonable default for local runs; SLURM has per-rule settings) - Add startup guard in Snakefile to catch empty data_dir before DAG construction
Users should uncomment and set slurm_partition for their own cluster, as partition names vary across HPC systems (normal, compute, standard, etc).
Remove misleading [:N_SAMPLES] slice from line 96. The available samples are already a subset from the initial subsample(), so the slice was redundant and confusing. Use available directly when VCF files exist.
…ute_metrics.py - Add ONEKG_CHROM to config.py imports from shared.config - Import ONEKG_CHROM in 03_compute_metrics.py from config - Add chrom parameter to compute_kit_coverage() defaulting to ONEKG_CHROM - Promote bias check magic numbers (0.01, 0.001) to named module constants _AFQUERY_BIAS_THRESHOLD and _NAIVE_POSITIVE_BIAS_THRESHOLD with explanatory comments
Replace single-node GNU parallel split in download_1kg.sh with split_1kg_sample wildcard rule (one SLURM job per sample). - Add checkpoint download_1kg_data for VCF + panel download - Add split_1kg_sample wildcard rule (2504 independent 1-CPU jobs) - Rewrite download_1kg to write manifest.tsv once all splits complete - Remove capture_kit_split_vcf (replaced by split_1kg_sample) - Delete download_1kg.sh - Update SLURM profile and README
Python scripts in subprocesses import shared.config fresh, which checks
AFQUERY_BENCH_DATA at module import time. Without this env var, they fall
back to the default Path('.results'), causing them to look in the wrong
directory for manifests and databases.
Prevents manifest.tsv from being written with relative VCF paths. Relative paths would be misresolved when subsampled manifests are read from different directories.
The 1KG Phase 3 VCF uses bare chromosome names (22) while AFQuery normalizes to chr-prefixed format (chr22). This caused zero common variants in concordance. - Normalize bcftools output to chr-prefixed format in _compute_concordance - Use bare ONEKG_CHROM for bcftools timing queries - Add fallback figure when no concordance data - Import ONEKG_CHROM in config Result: 1.1M common variants (R²=1.0) with fig5 generated successfully.
Critical fixes: - Fix chromosome name mismatch in bcftools comparison (bare vs chr-prefixed) - Use scenario-specific seeds for independent capture kit technology assignments - Include Database initialization in bcftools timing for fair comparison - Sample 2000 random pairs for concordance scatter plot vs first 100 Design improvements: - Add repetitions for build (BUILD_REPS=3) and annotation (ANNOTATE_REPS=3) benchmarks - Propagate smoke_test configuration to capture_kit benchmark - Guard stats() function for empty timing lists - Fix RESULTS_DIR path separation from raw data Documentation: - Expand performance/README.md with detailed experiment descriptions and interpretation guides - Expand capture_kit/README.md with ground truth setup and clinical impact sections - Update benchmarks/README.md with correct directory structure and known limitations - Add timing methodology section explaining cold vs warm queries Bug fixes: - Prevent race conditions in parallel build repetitions by using unique db_dir per rep - Fix BED file path documentation for capture_kit benchmark
- Add toward_pathogenic and toward_benign metrics in ACMG classification - Add BS1 recall metric for disease-specific evaluation - Add Figure 5 plotting toward-pathogenic misclassification rates - Update documentation with new metric definitions and interpretations
- Pin all conda and pip package versions in envs/benchmark.yaml based on the validated afquery_bench environment (tested locally with smoke test) - Add snakemake==9.20.0 to pip dependencies (conda resolver incompatibility) - Add smoke_test.log to .gitignore (local test artifact) - Update INSTALL.md with corrected instructions (remove hardcoded user paths) These changes ensure reproducibility and avoid compatibility issues when others run the benchmarks. The environment versions were captured from a working installation tested with the smoke test pipeline.
- Add tests using the public variant_info() wrapper function directly - Test variant_info() with phenotype filters - Test variant_info() with technology filters - Improves codecov coverage for variant_info.py and imports
- Test unknown chromosome warning path in variant_info() - Test multiple alleles warning when ref/alt not specified - Test empty results without spurious warnings - Covers warning paths in QueryEngine.variant_info()
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR introduces a comprehensive benchmarking framework and new variant information features for AFQuery:
Test plan