Dual-SVF: a robust knowledge-guided multimodal method for structural variant filtering in long-read sequencing
Overall architecture of Dual-SVF with confidence-driven collaborative denoising and multi-level knowledge-aware constraints.
Semantic-Syntactic Collaborative Denoising for SV Filtering (Dual-SVF) integrates nucleotide-level sequence semantics and alignment-derived syntactic structures through a knowledge-guided dual-stream architecture. Genomic priors (e.g., mapping quality, GC content, repeat annotations) dynamically regulate cross-modal fusion via a confidence-gated mechanism, while multi-level constraints ensure biologically consistent predictions. The framework achieves robust SV filtering in noisy long-read data.
- **Dual-Modal Architecture**: integrates syntactic (CIGAR) and semantic (bases) features from genomic data
- **Confidence-Driven Denoising**: adaptive confidence estimation with cross-modal correction
- **Multi-Level Knowledge-Aware Constraints**: feature-space contrastive learning and decision-consistency regularization
- **Distributed Training**: native PyTorch DDP support for multi-GPU training
- **Adaptive GPK Dimensions**: automatic handling of varying global-prior-knowledge dimensions
- **State-of-the-Art Performance**: superior filtering accuracy across multiple sequencing platforms
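The confidence-gated fusion described above can be sketched as follows. This is a minimal illustration, not the released implementation: the class name `ConfidenceGatedFusion`, the feature/GPK dimensions, and the residual gating scheme are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    """Hypothetical sketch: global prior knowledge (GPK, e.g. mapping
    quality, GC content) drives a gate that regulates how much each
    modality's features are corrected by the other stream."""

    def __init__(self, feat_dim=256, gpk_dim=8):
        super().__init__()
        # map genomic priors to a per-feature gate in (0, 1)
        self.gate = nn.Sequential(nn.Linear(gpk_dim, feat_dim), nn.Sigmoid())
        self.cross = nn.Linear(feat_dim, feat_dim)

    def forward(self, semantic, syntactic, gpk):
        g = self.gate(gpk)                         # confidence gate from priors
        # adaptive residual correction: each stream receives a gated
        # projection of the other stream's features
        sem = semantic + g * self.cross(syntactic)
        syn = syntactic + (1 - g) * self.cross(semantic)
        return sem, syn

fusion = ConfidenceGatedFusion()
sem, syn = fusion(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 8))
print(sem.shape)  # torch.Size([4, 256])
```

Because the gate depends only on `gpk_dim` in this sketch, swapping in GPK vectors of a different width only requires changing that constructor argument.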
- Python 3.8+
- CUDA 11.8+ (for GPU acceleration)
- 4+ GB VRAM per GPU recommended
- 16+ GB RAM
# Install core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pysam==0.23.3 pandas numpy matplotlib scikit-learn scipy
# Install specialized bioinformatics packages
pip install biopython mappy cuteSV sniffles intervaltree pyfaidx PyVCF3 Truvari Badread
# Install packages from requirements.txt
pip install -r requirements.txt
python generate_images.py \
--txt_file /path/to/HG002_SVs_Tier1_v0.6.PASS.ALL.pos.txt \
--bam_file /path/to/HG002.CLR.70x.bam \
--ref_file /path/to/human_hs37d5.fasta \
--output_dir ./01.sv_images \
--extend_length 500 \
--select_read 70 \
--csv_out sv_statistics.csv
After running `generate_images.py`:

01.sv_images/
├── bases/*.png    # semantic images
└── cigar/*.png    # syntactic images
python generate_gpk.py \
--txt_file /path/to/HG002_SVs_Tier1_v0.6.PASS.ALL.pos.txt \
--bam_file /path/to/HG002.CSS.28x.bam \
--ref_file /path/to/human_hs37d5.fasta \
--output_dir ./01.sv_images \
--extend_length 500 \
--select_read 28 \
--csv_out sv_gpk.csv
After running `generate_gpk.py`:

./
└── sv_gpk.csv    # global prior knowledge
python split_data.py \
./01.sv_images \
./02.split_data
02.split_data/
├── bases/
│   ├── Del_positive/*.png
│   ├── Ins_positive/*.png
│   └── Match_negative/*.png
└── cigar/
    ├── Del_positive/*.png
    ├── Ins_positive/*.png
    └── Match_negative/*.png
python CoDAC-main.py \
--bases_root ./02.split_data/bases \
--cigar_root ./02.split_data/cigar \
--gpk_csv ./01.sv_images/sv_statistics.csv \
--class_dirs Del_positive Ins_positive Match_negative \
--bases_model resnet50 \
--cigar_model mobilenet_v2 \
--train_chrs 1 2 \
--test_chrs 13 \
--use_cdcd \
--use_mac \
--epochs 30 \
--batch_size 64 \
--save_path ./dual_icme.pth \
--lr 1e-4 \
--weight_decay 1e-2 \
--early_stop_patience 5
torchrun --nproc_per_node=4 CoDAC-main.py \
--bases_root ./02.split_data/bases \
--cigar_root ./02.split_data/cigar \
--gpk_csv ./01.sv_images/sv_statistics.csv \
--class_dirs Del_positive Ins_positive Match_negative \
--bases_model resnet50 \
--cigar_model mobilenet_v2 \
--train_chrs 1 2 3 4 5 6 7 8 9 10 11 12 \
--test_chrs 13 14 15 16 17 18 19 20 21 22 X Y \
--use_cdcd \
--use_mac \
--epochs 30 \
--batch_size 64 \
--save_path ./dual_icme.pth \
--lr 1e-4 \
--weight_decay 1e-2 \
--early_stop_patience 5
| Parameter | Description | Default |
|---|---|---|
| `--bases_model` | Backbone for semantic modality | `resnet50` |
| `--cigar_model` | Backbone for syntactic modality | `mobilenet_v2` |
| `--use_cdcd` | Enable collaborative denoising | `True` |
| `--use_mac` | Enable multi-granularity constraints | `True` |
| `--train_chrs` | Chromosomes for training | 1-12 |
| `--test_chrs` | Chromosomes for testing | 13-22, X, Y |
| `--batch_size` | Per-GPU batch size | 64 |
Input:
├── Semantic Modality (BASES) → ResNet50 → Feature Projection
└── Syntactic Modality (CIGAR) → MobileNetV2 → Feature Projection

Core:
├── Confidence-Driven Collaborative Denoising (CDCD)
│   ├── Confidence Map Estimation (CME)
│   ├── Confidence-Gated Cross-Attention (CGCA)
│   └── Adaptive Residual Correction
└── Multi-Granularity Awareness Constraints (MAC)
    ├── Prior-Modulated Contrastive Loss (PMCL)
    ├── Biological Consistency Regularization (BCR)
    └── Learnable Weight Balancing

Output: SV classification (MATCH / DEL / INS)
Experimental evaluation of Dual-SVF is conducted on three benchmark third-generation sequencing datasets from the well-characterized HG002 human genome (GIAB consortium).
The study primarily utilizes the human and yeast reference assemblies:
- Human (hs37d5): https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
- Yeast (R64-1-1): http://ftp.ensembl.org/pub/release-110/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
The benchmark sequencing data provide distinct read properties across platforms. The HG002_SVs_Tier1_v0.6 call set serves as the gold standard for model training and evaluation.
- Reference: GRCh37
- SV Types: INS (insertions), DEL (deletions)
- Size Threshold: ≥ 50 bp
- VCF Link: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
- Syntactic Modality (CIGAR): MobileNetV2 for efficient alignment-feature extraction
- Semantic Modality (BASES): ResNet50 for deep sequence-context understanding

Confidence Map Estimation → Confidence-Gated Cross-Attention → Adaptive Residual Correction
- Prior-Modulated Contrastive Loss (PMCL)
- Biological Consistency Regularization (BCR)
- Adaptive weight balancing with learnable coefficients
