Variant Medium Nextflow Pipeline #3
Pull request overview
This PR introduces a comprehensive Nextflow pipeline for the VariantMedium somatic variant caller, enabling automated execution of SNV and INDEL calling workflows with support for both conda and singularity environments.
Key changes:
- Nextflow DSL2 modules and workflows for variant filtering and calling
- Bash launcher script (variantmedium.sh) for simplified pipeline execution
- Code reorganization from src/ to bin/ to fix relative imports
- Samplesheet-based input handling (CSV/TSV formats)
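As a rough illustration of the samplesheet-based input handling, the following sketch parses CSV or TSV text and checks for required columns. The column names (`sample_name`, `tumor_bam`, `normal_bam`) and the function `read_samplesheet` are hypothetical; the actual schema used by the pipeline is not shown in this review.

```python
import csv
import io

# Hypothetical column names; the real samplesheet schema is not shown in this PR view.
REQUIRED_COLUMNS = ("sample_name", "tumor_bam", "normal_bam")


def read_samplesheet(text, filename):
    """Parse CSV or TSV samplesheet text into a list of row dicts."""
    # Pick the delimiter from the file extension: tab for .tsv, comma otherwise.
    delimiter = "\t" if filename.lower().endswith(".tsv") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    # Fail early if the header lacks any required column.
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {', '.join(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty value for {', '.join(empty)}")
        rows.append({c: row[c].strip() for c in REQUIRED_COLUMNS})
    return rows
```

Choosing the delimiter from the extension keeps the CSV and TSV code paths identical after the header is read, which is one simple way to support both formats.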
Reviewed changes
Copilot reviewed 33 out of 71 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| workflows/variantmedium_call_variants.nf | Implements SNV/INDEL variant calling workflow with DenseNet models |
| workflows/variantmedium_filter_candidates.nf | Orchestrates ExtraTrees-based candidate filtering for SNV/INDEL |
| workflows/variantmedium_stage_data.nf | Stages reference data and ML models |
| workflows/variantmedium_prepare_inputs.nf | Prepares input TSV files from samplesheet |
| variantmedium.sh | Bash launcher script orchestrating all 8 pipeline steps |
| main.nf | Entry point coordinating workflow execution based on execution_step parameter |
| nextflow.config | Configuration with conda/singularity profiles and parameter definitions |
| conf/modules.config | Per-module configuration including publish directories |
| conf/base.config | Process resource labels and defaults |
| subworkflows/data_staging/main.nf | Stages references and models through dedicated modules |
| subworkflows/parse_samplesheet/main.nf | Validates and parses input samplesheet |
| subworkflows/parameter_validation/main.nf | Validates required parameters |
| modules/variantmedium/filter/main.nf | ExtraTrees filtering module |
| modules/variantmedium/call/main.nf | DenseNet variant calling module |
| modules/prepare_inputs/main.nf | Generates TSV files for downstream tools |
| modules/stage_refs/main.nf | Downloads and stages reference data |
| modules/stage_models/main.nf | Downloads and verifies ML model weights |
| bin/prepare_input_files.py | Python script to generate pipeline input TSVs |
| bin/filter_candidates.py | Python script for candidate filtering |
| bin/run_variant_medium.py | Entry point for variant calling |
| bin/src/* | Refactored Python source code with fixed relative imports |
| README.md | Updated with new pipeline launcher documentation |
Comments suppressed due to low confidence (1)
README.md:80
- There's a spelling error in "envirnments". It should be "environments".
reference genome and S07604624 SureSelect Human All Exon V6+UTR from UCSC if you need them.
| VariantMedium is a deep learning-based somatic variant caller for matched tumor-normal short-read sequencing data. It integrates machine learning–based filtering and 3D convolutional neural networks to classify candidate sites as somatic, germline, or non-variant, with high sensitivity and robustness across diverse genomic contexts and sample types.
|
| ## Dependencies
| ## Dependencies (handleled in the module environments)
The modules now have an environment.yml file from which the environment is built on the fly, so we do not need any prior installations.
We don't even need a Nextflow installation?
| - conda >= 4.4 (miniconda >=23.11.0 recommended)
| - CUDA 11.4 (optional for GPU support)
|
| ## Installation
The user would still need to clone the repository, right?
No need to clone the repo either; the pipeline can be run as nextflow run TRON-Bioinformatics/VariantMedium [--options]
| --profile STRING Nextflow profile name (conda, singularity) [default: conda]
| [Parts of the pipeline may not support singularity - Prefer using conda]
| OPTIONAL ARGUMENTS:
| --config PATH Path to custom config file (.conf)
The parameters in the config file are essential for running the pipeline, so this is a required argument.
We need a very simple, minimal usage example for basic users who come with their BAM files and want VCFs with variant calls. Only the minimum necessary steps should be explained, as simply as possible.
All further detail can come in a later section of the README or in separate documentation.
|
| stub:
| """
| touch fake.somatic_snv.VariantMedium.tsv
| reference_dir = 'ref_data' // directory name for reference data
| models_dir = 'models' // directory name for trained models
|
| // call
Actually, some of the parameters below (learning rate, drop rate, possibly epoch) are not updated during the inference stage and are only there for data ops. The others (aug_rate and aug_mixes) might really impact the results and should not be changed (especially aug_rate; aug_mixes is also mostly there for tracking).
Could you add a note that they are not to be changed?
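Beyond a note in the config, one option would be a small guard that rejects user overrides of these values. The sketch below is only an illustration: `check_locked_params` and the `LOCKED_PARAMS` grouping are hypothetical and not part of this PR, and any values passed in are placeholders rather than the pipeline's real defaults.

```python
# Parameter names come from the review comment above; treating exactly these
# two as "locked" is an assumption made for illustration.
LOCKED_PARAMS = ("aug_rate", "aug_mixes")


def check_locked_params(defaults, overrides):
    """Return defaults merged with overrides, rejecting changes to locked values."""
    changed = [
        name
        for name in LOCKED_PARAMS
        if name in overrides and overrides[name] != defaults.get(name)
    ]
    if changed:
        raise ValueError(
            "these parameters must match the trained models and cannot be changed: "
            + ", ".join(changed)
        )
    # Overrides of non-locked parameters are applied on top of the defaults.
    return {**defaults, **overrides}
```

A call such as `check_locked_params(defaults, user_config)` would then fail before inference starts instead of silently producing results with mismatched augmentation settings.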
|
| manifest {
| name = 'TRON-Bioinformatics/variantmedium'
| author = 'Ozlem Muslu, Jonas Ibn-Salem, Shaya Akbarinejad, Luis Kress'
Please add your name as well :)
| nextflowVersion = '>=24.10.3'
| version = VERSION
| doi = DOI
| version = '1.1.0'
Let's update the version to 1.1.1
| #---------------------------------------
|
| TSV_FOLDER="${OUTDIR}/tsv_folder"
| REF_DIR="${OUTDIR}/data_staging/ref_data"
Variables on lines 171-174 should be pulled from the config file.
This behaviour (setting a default and updating it from the config later) is risky: the user might think they don't need to provide these files, even when their BAMs are aligned to a different version of the reference genome.
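To make the concern concrete, here is a minimal sketch of the alternative: read the paths from the user's config only and fail loudly when one is missing, rather than falling back to a default. The function `resolve_required_paths` is hypothetical, and using `reference_dir` and `models_dir` as the required keys is an assumption based on the config hunk above.

```python
import os

# Assumed keys, taken from the config hunk shown earlier in this review;
# the real set of required entries may differ.
REQUIRED_PATHS = ("reference_dir", "models_dir")


def resolve_required_paths(config, required=REQUIRED_PATHS):
    """Return absolute paths for required entries; never substitute defaults."""
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(
            "missing required config entries (no defaults are applied): "
            + ", ".join(missing)
        )
    return {key: os.path.abspath(config[key]) for key in required}
```

With this shape, a user whose BAMs are aligned to a different reference build gets an explicit error asking for the paths, instead of a run against pre-staged data they did not choose.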
| touch fake.all_scores_somatic_indel.tsv
| touch fake.all_scores_germline_indel.tsv
| touch sample.somatic_snv.VariantMedium.tsv
| touch sample.germline_snv.VariantMedium.tsv
Not sure why we have the germline files here. VariantMedium is not designed for germline variant calling, and I keep that output only for experimental reasons.
bash variantmedium.sh --help: updated in README