Haplotype-aware structural-variant (SV) phasing and genotyping from long-read data
SvPhaser assigns haplotype-aware genotypes to pre-called structural variants (SVs) using HP-tagged long-read alignments (PacBio HiFi, ONT Q20+, etc.).
It fills a critical gap in long-read SV analysis:
- SV callers (e.g. Sniffles2) discover variants
- SvPhaser phases and genotypes them (`1|0`, `0|1`, `1|1`, or `./.`), with explicit read-level evidence and a quantitative genotype quality (GQ)
SvPhaser is caller-agnostic, deterministic, and designed for large-scale benchmarking and biological interpretation.
- Post-hoc SV phasing from HP-tagged BAM/CRAM (no re-calling required)
- Per-chromosome parallelization (efficient on HPC and multi-core systems)
- SV-type-aware evidence detection (DEL / INS / INV / BND / DUP)
- Deterministic Δ-based decision logic (no HMMs, no sampling)
- Explicit confidence modeling via GQ and reason codes
- CSV-first design for transparent benchmarking and debugging
- VCF-compliant output with rich `SVP_*` INFO annotations
```bash
# Requires Python >= 3.9
pip install svphaser
```

Optional extras:

```bash
pip install "svphaser[plots]"   # plotting utilities
pip install "svphaser[bench]"   # benchmarking helpers
pip install "svphaser[dev]"     # development + linting
```

To install from source:

```bash
git clone https://github.com/SFGLab/SvPhaser.git
cd SvPhaser
pip install -e .
```

SvPhaser requires only two inputs:
- Unphased SV VCF (`.vcf` / `.vcf.gz`)
  - Produced by an SV caller (e.g. Sniffles2)
  - May optionally contain `RNAMES` INFO for precise read support
- HP-tagged BAM/CRAM
  - Long-read alignments with haplotype tags (`HP=1/2`)
  - Generated by an upstream phasing pipeline (e.g. WhatsHap)
⚠️ If the BAM does not contain HP tags, SvPhaser cannot assign haplotypes.
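Before running SvPhaser, it can be worth confirming that the alignments actually carry HP tags. The sketch below is not part of SvPhaser: it assumes the third-party `pysam` library is installed, and `count_hp_tags` and `check_bam` are hypothetical helper names chosen for illustration.

```python
from collections import Counter
from itertools import islice


def count_hp_tags(reads, limit=10_000):
    """Tally HP tag values over the first `limit` reads.

    `reads` is any iterable of objects exposing the has_tag / get_tag
    interface of pysam's AlignedSegment.
    """
    counts = Counter()
    for read in islice(reads, limit):
        if read.has_tag("HP"):
            counts[read.get_tag("HP")] += 1
        else:
            counts["untagged"] += 1
    return counts


def check_bam(path, limit=10_000):
    """Open a BAM with pysam and summarize its HP tags (hypothetical helper)."""
    import pysam  # third-party; imported lazily so count_hp_tags works without it

    with pysam.AlignmentFile(path, "rb") as bam:
        return count_hp_tags(bam, limit)
```

Calling `check_bam("sample.sorted_phased.bam")` returns a counter keyed by haplotype value. If only the `"untagged"` key is present, the file has most likely not been through a haplotagging step.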
```bash
svphaser phase \
  sample_unphased.vcf.gz \
  sample.sorted_phased.bam \
  --out-dir results/ \
  --min-support 10 \
  --min-tagged-support 3 \
  --major-delta 0.60 \
  --equal-delta 0.10 \
  --support-mode hybrid \
  --dynamic-window \
  --tie-to-hom-alt \
  --gq-bins "30:High,10:Moderate" \
  --threads 32
```

For an input `sample.vcf.gz`, SvPhaser produces:
- `sample_phased.csv`: primary analysis artifact
  - Per-SV read support (`hp1`, `hp2`, `nohp`)
  - Derived metrics (`tagged_total`, `support_total`, Δ)
  - Final decisions (`gt`, `gq`, `reason`)
- `sample_phased.vcf(.gz)`: interoperability output
  - `FORMAT/GT`, `FORMAT/GQ`
  - Optional `SVP_*` INFO annotations when `--svp-info` is enabled
The CSV is intended for benchmarking, visualization, and interpretation; the VCF is a downstream-consumable representation.
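As a starting point for working with the CSV, the sketch below counts genotypes and keeps the higher-confidence calls using only the standard library. It assumes the `gt` and `gq` column names listed above. The function name `summarize_phased_csv`, the `min_gq` threshold, and the two example rows are made up for illustration and are not real SvPhaser output.

```python
import csv
import io
from collections import Counter


def summarize_phased_csv(handle, min_gq=30):
    """Count genotypes and collect rows whose GQ is at least `min_gq`.

    `min_gq` is an illustrative threshold, not a SvPhaser default.
    """
    gt_counts = Counter()
    confident = []
    for row in csv.DictReader(handle):
        gt_counts[row["gt"]] += 1
        if float(row["gq"]) >= min_gq:
            confident.append(row)
    return gt_counts, confident


# Tiny made-up table standing in for results/sample_phased.csv
example = io.StringIO(
    "hp1,hp2,nohp,gt,gq,reason\n"
    "12,1,2,1|0,45,example\n"
    "6,5,1,1|1,8,example\n"
)
gt_counts, confident = summarize_phased_csv(example)
print(gt_counts)
print(len(confident))
```

For a real run, replace the `io.StringIO` object with `open("results/sample_phased.csv", newline="")`.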
A full, implementation-faithful description of the algorithm is provided in docs/Methodology.md. It covers:
- evidence collection
- haplotype decision logic
- pseudoalgorithm
- workflow diagram
This document is the authoritative reference for reviewers and users seeking algorithmic clarity.
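To give a rough feel for what a deterministic Δ-based decision could look like, here is a toy sketch. It is not SvPhaser's implementation and every rule in it is an assumption: Δ is taken to be |hp1 − hp2| / (hp1 + hp2), `tagged_total` to be hp1 + hp2, `support_total` to include untagged reads, and the way the thresholds combine is guessed from the parameter names alone. `sketch_genotype` is a hypothetical name. Refer to docs/Methodology.md for the actual logic.

```python
def sketch_genotype(hp1, hp2, nohp=0, *, min_support=10, min_tagged_support=3,
                    major_delta=0.60, equal_delta=0.10, tie_to_hom_alt=True):
    """Toy haplotype call from read counts (assumed rules, see lead-in)."""
    tagged_total = hp1 + hp2              # assumption: HP-tagged supporting reads
    support_total = tagged_total + nohp   # assumption: all supporting reads

    # Too little evidence: leave the genotype uncalled.
    if tagged_total == 0:
        return "./."
    if support_total < min_support or tagged_total < min_tagged_support:
        return "./."

    # Assumed definition of the imbalance between the two haplotypes.
    delta = abs(hp1 - hp2) / tagged_total

    if delta >= major_delta:
        # One haplotype clearly dominates.
        return "1|0" if hp1 > hp2 else "0|1"
    if delta <= equal_delta and tie_to_hom_alt:
        # Near-equal support on both haplotypes.
        return "1|1"
    # Ambiguous middle ground.
    return "./."


print(sketch_genotype(12, 1, 2))
print(sketch_genotype(7, 7))
```

Because the function uses only counts and fixed thresholds, the same inputs always give the same call, which is the property the "no HMMs, no sampling" design refers to.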
```python
from pathlib import Path

from svphaser.phasing.io import phase_vcf

# Keyword arguments mirror the CLI options shown earlier.
phase_vcf(
    Path("sample.vcf.gz"),
    Path("sample.sorted_phased.bam"),
    out_dir=Path("results"),
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    support_mode="hybrid",
    dynamic_window=True,
    tie_to_hom_alt=True,
    gq_bins="30:High,10:Moderate",
    threads=8,
)
```

```text
SvPhaser/
├─ src/svphaser/   # core package
├─ docs/           # methodology & design notes
├─ tests/          # unit + regression tests
├─ notebooks/      # benchmarking & analysis
├─ pyproject.toml
├─ README.md
└─ CHANGELOG.md
```
If SvPhaser contributes to your research, please cite:
```bibtex
@software{svphaser2026,
  author  = {Pranjul Mishra and Sachin Gadakh},
  title   = {SvPhaser: Haplotype-aware phasing of structural variants from long-read data},
  version = {2.1.x},
  year    = {2026},
  url     = {https://github.com/SFGLab/SvPhaser},
  note    = {PyPI: https://pypi.org/project/svphaser/}
}
```

For maximum reproducibility, include the exact git commit hash used.
SvPhaser is released under the MIT License — see LICENSE.
Developed at SFG Lab (BioAI).
- Pranjul Mishra — pranjul.mishra@proton.me
- Sachin Gadakh — s.gadakh@cent.uw.edu.pl
Bug reports and feature requests: please open a GitHub issue.