sokolo05/Dual-SVF

Dual-SVF: a robust knowledge-guided multimodal method for structural variant filtering in long-read sequencing

Figure: Overall architecture of Dual-SVF with confidence-driven collaborative denoising and multi-level knowledge-aware constraints.

πŸ“– Overview

Semantic-Syntactic Collaborative Denoising for SV Filtering (Dual-SVF) integrates nucleotide-level sequence semantics and alignment-derived syntactic structures through a knowledge-guided dual-stream architecture. Genomic priors (e.g., mapping quality, GC content, repeat annotations) dynamically regulate cross-modal fusion via a confidence-gated mechanism, while multi-level constraints ensure biologically consistent predictions. The framework achieves robust SV filtering in noisy long-read data.

✨ Key Features

πŸ”€ Dual-Modal Architecture: Integrates syntactic (CIGAR) and semantic (bases) features from genomic data

🎯 Confidence-Driven Denoising: Adaptive confidence estimation with cross-modal correction

πŸ“Š Multi-level knowledge-aware constraints: Feature-space contrastive learning and decision consistency regularization

πŸš€ Distributed Training: Native PyTorch DDP support for multi-GPU training

πŸ”„ Adaptive GPK Dimensions: Intelligent handling of varying global prior knowledge dimensions

πŸ“ˆ State-of-the-art Performance: Superior filtering accuracy across multiple sequencing platforms

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • CUDA 11.8+ (for GPU acceleration)
  • 4+ GB VRAM per GPU recommended
  • 16+ GB RAM

Installation

# Install core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pysam==0.23.3 pandas numpy matplotlib scikit-learn scipy

# Install specialized bioinformatics packages
pip install biopython mappy cuteSV sniffles intervaltree pyfaidx PyVCF3 Truvari Badread

# Install packages from requirements.txt
pip install -r requirements.txt
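As a quick post-installation sanity check, a short stdlib-only snippet (a hypothetical helper, not part of the repository) can report which of the listed packages are importable. Note that some packages import under a different name than they install under (e.g. scikit-learn imports as `sklearn`):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of import names that cannot be resolved."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the core dependencies.
core = ["torch", "pysam", "pandas", "numpy", "sklearn", "scipy"]
print("Missing:", missing_packages(core))
```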

Essential Bioinformatics Dependencies

| Package | Purpose |
| --- | --- |
| pysam | BAM/CRAM file processing |
| biopython | Sequence analysis and manipulation |
| mappy | Sequence alignment and mapping |
| cuteSV | SV detection and calling |
| sniffles | Long-read SV caller |
| intervaltree | Genomic interval handling |
| pyfaidx | FASTA file indexing and access |
| PyVCF3 | VCF file parsing and writing |
| Truvari | SV benchmarking and comparison |
| Badread | Read simulation and quality control |

πŸ“ Data Preparation Pipeline

Generate SV Images

python generate_images.py \
  --txt_file /path/to/HG002_SVs_Tier1_v0.6.PASS.ALL.pos.txt \
  --bam_file /path/to/HG002.CLR.70x.bam \
  --ref_file /path/to/human_hs37d5.fasta \
  --output_dir ./01.sv_images \
  --extend_length 500 \
  --select_read 70 \
  --csv_out sv_statistics.csv

After generate_images.py:
01.sv_images/
β”œβ”€β”€ bases/*.png               # Semantic images
└── cigar/*.png               # Syntactic images
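Since each SV candidate should yield one semantic and one syntactic image, a small stdlib sketch (hypothetical helper, assuming matching file stems under `bases/` and `cigar/`) can verify that the two folders stay in sync:

```python
from pathlib import Path

def unpaired_images(root):
    """Return file stems present in bases/ or cigar/ but not in both."""
    bases = {p.stem for p in Path(root, "bases").glob("*.png")}
    cigar = {p.stem for p in Path(root, "cigar").glob("*.png")}
    return sorted(bases ^ cigar)  # symmetric difference

print(unpaired_images("./01.sv_images"))
```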

Generate GPK

python generate_gpk.py \
  --txt_file /path/to/HG002_SVs_Tier1_v0.6.PASS.ALL.pos.txt \
  --bam_file /path/to/HG002.CSS.28x.bam \
  --ref_file /path/to/human_hs37d5.fasta \
  --output_dir ./01.sv_images \
  --extend_length 500 \
  --select_read 28 \
  --csv_out sv_gpk.csv

After generate_gpk.py:
./
└── sv_gpk.csv               # Global prior knowledge
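The "Adaptive GPK Dimensions" feature implies that prior-knowledge vectors of varying length must be mapped to a fixed width before fusion. A minimal sketch of one common approach (pad-or-truncate; the mechanism actually used by Dual-SVF may differ):

```python
def fit_gpk_dim(vec, target_dim, pad_value=0.0):
    """Pad or truncate a GPK feature vector to a fixed dimensionality."""
    if len(vec) >= target_dim:
        return vec[:target_dim]
    return vec + [pad_value] * (target_dim - len(vec))

# Example: a 3-feature prior vector widened to 5 dimensions.
print(fit_gpk_dim([0.9, 41.2, 0.35], 5))
```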

Organize Data by SV Type

python split_data.py \
  ./01.sv_images \
  ./02.split_data

02.split_data/
β”œβ”€β”€ bases/
β”‚   β”œβ”€β”€ Del_positive/*.png
β”‚   β”œβ”€β”€ Ins_positive/*.png
β”‚   └── Match_negative/*.png
└── cigar/
    β”œβ”€β”€ Del_positive/*.png
    β”œβ”€β”€ Ins_positive/*.png
    └── Match_negative/*.png
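After splitting, it is worth checking per-class counts for balance. A stdlib sketch (hypothetical helper) for one modality:

```python
from pathlib import Path

def class_counts(modality_root,
                 classes=("Del_positive", "Ins_positive", "Match_negative")):
    """Count PNG images per class directory under one modality root."""
    return {c: len(list(Path(modality_root, c).glob("*.png"))) for c in classes}

print(class_counts("./02.split_data/bases"))
```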

πŸ‹οΈ Training

Single GPU Training

python CoDAC-main.py \
  --bases_root ./02.split_data/bases \
  --cigar_root ./02.split_data/cigar \
  --gpk_csv ./01.sv_images/sv_statistics.csv \
  --class_dirs Del_positive Ins_positive Match_negative \
  --bases_model resnet50 \
  --cigar_model mobilenet_v2 \
  --train_chrs 1 2 \
  --test_chrs 13 \
  --use_cdcd \
  --use_mac \
  --epochs 30 \
  --batch_size 64 \
  --save_path ./dual_icme.pth \
  --lr 1e-4 \
  --weight_decay 1e-2 \
  --early_stop_patience 5

Multi-GPU Distributed Training (Recommended)

torchrun --nproc_per_node=4 CoDAC-main.py \
  --bases_root ./02.split_data/bases \
  --cigar_root ./02.split_data/cigar \
  --gpk_csv ./01.sv_images/sv_statistics.csv \
  --class_dirs Del_positive Ins_positive Match_negative \
  --bases_model resnet50 \
  --cigar_model mobilenet_v2 \
  --train_chrs 1 2 3 4 5 6 7 8 9 10 11 12 \
  --test_chrs 13 14 15 16 17 18 19 20 21 22 X Y \
  --use_cdcd \
  --use_mac \
  --epochs 30 \
  --batch_size 64 \
  --save_path ./dual_icme.pth \
  --lr 1e-4 \
  --weight_decay 1e-2 \
  --early_stop_patience 5
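`torchrun` communicates the process layout to each worker through environment variables. A minimal stdlib sketch of how a training script typically discovers its rank (the actual Dual-SVF code may use `torch.distributed` helpers instead):

```python
import os

def ddp_context():
    """Read the process layout torchrun exports for each worker."""
    return {
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

ctx = ddp_context()
if ctx["rank"] == 0:  # only the first process should log and save checkpoints
    print(f"Running with {ctx['world_size']} process(es)")
```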

Key Training Parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| --bases_model | Backbone for semantic modality | resnet50 |
| --cigar_model | Backbone for syntactic modality | mobilenet_v2 |
| --use_cdcd | Enable collaborative denoising | True |
| --use_mac | Enable multi-granularity constraints | True |
| --train_chrs | Chromosomes for training | 1-12 |
| --test_chrs | Chromosomes for testing | 13-22,X,Y |
| --batch_size | Per-GPU batch size | 64 |

πŸ—οΈ Architecture Overview

Input:
β”œβ”€β”€ Semantic Modality (BASES) β†’ ResNet50 β†’ Feature Projection
└── Syntactic Modality (CIGAR) β†’ MobileNetV2 β†’ Feature Projection

Core:
β”œβ”€β”€ Confidence-Driven Collaborative Denoising (CDCD)
β”‚   β”œβ”€β”€ Confidence Map Estimation (CME)
β”‚   β”œβ”€β”€ Confidence-Gated Cross-Attention (CGCA)
β”‚   └── Adaptive Residual Correction
└── Multi-Granularity Awareness Constraints (MAC)
    β”œβ”€β”€ Prior-Modulated Contrastive Loss (PMCL)
    β”œβ”€β”€ Biological Consistency Regularization (BCR)
    └── Learnable Weight Balancing

Output: SV Classification (MATCH/DEL/INS)
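The confidence-gated fusion at the heart of CDCD can be illustrated with a scalar toy example (pure Python; an assumed simplified form in which each modality's feature is corrected by the other, weighted by its own confidence — the real module operates on feature maps via cross-attention):

```python
def confidence_gated_fuse(f_sem, f_syn, c_sem, c_syn):
    """Blend each feature with a cross-modal correction, gated by confidence.

    Low confidence in one modality shifts weight toward the other, so a
    noisy stream is corrected rather than trusted blindly.
    """
    fused_sem = c_sem * f_sem + (1.0 - c_sem) * f_syn
    fused_syn = c_syn * f_syn + (1.0 - c_syn) * f_sem
    return fused_sem, fused_syn

# Semantic stream is trusted (0.9); syntactic stream is uncertain (0.5).
print(confidence_gated_fuse(1.0, 0.0, c_sem=0.9, c_syn=0.5))
```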

πŸ“‚ Datasets

Experimental evaluation of Dual-SVF is conducted on three benchmark third-generation sequencing datasets from the well-characterized HG002 human genome (GIAB consortium).

1. Reference Genomes

The study primarily utilizes the human (hs37d5/GRCh37) and yeast reference assemblies.

2. Sequencing Alignment Data (BAM)

Benchmark sequencing data providing unique read properties across different platforms.

| Platform | Coverage | Alignment (GRCh37) |
| --- | --- | --- |
| PacBio CLR | 69x | https://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/Baylor_NGMLR_bam/GRCh37/HG002_PB_70x_RG_HP10XtrioRTG.bam |
| ONT (Ultralong) | 48x | https://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/HG002_GRCh37_ONT-UL_UCSC_20200508.phased.bam |

3. Benchmark SV Callset (Ground Truth)

The HG002_SVs_Tier1_v0.6 serves as the gold standard for model training and evaluation.

Core Components

1. Dual-Stream Feature Extraction

  • Syntactic Modality (CIGAR)
    MobileNetV2 for efficient alignment feature extraction
  • Semantic Modality (BASES)
    ResNet50 for deep sequence context understanding

2. Confidence-Driven Collaborative Denoising (CDCD)

Confidence Map Estimation β†’ Confidence-Gated Cross-Attention β†’ Adaptive Residual Correction

3. Multi-Granularity Awareness Constraints (MAC)

  • Prior-Modulated Contrastive Loss (PMCL)
  • Biological Consistency Regularization (BCR)
  • Adaptive weight balancing with learnable coefficients
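The "learnable coefficients" for weight balancing are commonly implemented as unconstrained parameters mapped through a positivity-preserving function. A minimal pure-Python sketch of an assumed form (softplus-weighted sum; the repository's exact parameterization may differ):

```python
import math

def softplus(x):
    """Smooth positive mapping used to keep loss weights > 0."""
    return math.log1p(math.exp(x))

def total_loss(l_cls, l_pmcl, l_bcr, raw_alpha=0.0, raw_beta=0.0):
    """Combine the task loss with PMCL and BCR terms under learnable weights.

    raw_alpha / raw_beta stand in for nn.Parameters updated by the optimizer;
    softplus keeps the effective weights positive throughout training.
    """
    return l_cls + softplus(raw_alpha) * l_pmcl + softplus(raw_beta) * l_bcr

print(round(total_loss(1.0, 0.5, 0.2), 4))
```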
