Variant merging for VCFs deconstructed from a pangenome graph, including vertical merging for overlapping/duplicated variants and horizontal merging for similar structural variations (SVs).
Vertical variant merging
Variant decomposition and normalization are usually used to simplify variants deconstructed from a pangenome graph and to make variants comparable between different callsets. However, these analyses can produce overlapping or even duplicated variant records. The collapse-bubble pipeline can concatenate overlapping variants and merge the genotypes of duplicated records to generate a deduplicated, non-overlapping VCF.
Horizontal SV merging
A pangenome VCF often includes highly similar SVs that differ by as little as 1 base. SV merging is therefore important for removing redundant SVs. The collapse-bubble pipeline uses Truvari's engine to merge SVs. Compared to truvari collapse, it is optimized for pangenome VCFs by using bubbles and haplotypes to improve SV merging.
For more information on how it works, please refer to the documentation.
Note: collapse-bubble is currently tested with VCFs from the minigraph-cactus pipeline, but in principle it should also support VCFs generated by vg deconstruct from phased assembly-based pangenome graphs.
All scripts have been tested with Python 3.10. To install dependencies, run:
pip install -r requirements.txt
or
pip install .
The following tools are not used by collapse-bubble scripts but are required to prepare the input VCF:
- bcftools: The +fill-tags plugin is used. Please make sure BCFTOOLS_PLUGINS is configured.
- vcfwave: Please use vcfwave v1.0.12 or later, as earlier versions may output incorrect genotypes for some multi-allelic variants.
Required input files:
- Multiallelic graph VCF: Generated by vg deconstruct and processed by vcfbub. The ID field of this VCF contains bubble IDs, e.g., <73488<73517. This is the default output VCF of the minigraph-cactus pipeline.
- Reference genome FASTA file: Used for normalization.
Example script to run the pipeline from the default output VCF of minigraph-cactus:
##### 1. Preprocessing #####
# split into biallelic
bcftools norm -m -any mc.vcf.gz -Oz -o mc.biallele.vcf.gz
# annotate VCF and assign unique variant ID
python scripts/annotate_var_id.py \
-i mc.biallele.vcf.gz \
-o mc.biallele.uniq_id.vcf.gz
# drop INFO/AT and decompose by vcfwave
bcftools annotate -x INFO/AT mc.biallele.uniq_id.vcf.gz | \
vcfwave -I 1000 | bgzip -c > mc.biallele.uniq_id.vcfwave.vcf.gz
# normalize variants, update AC/AN/AF, and sort
bcftools norm -f ref.fa mc.biallele.uniq_id.vcfwave.vcf.gz | \
bcftools +fill-tags -- -t AC,AN,AF | \
bcftools sort --max-mem 4G -Oz -o mc.biallele.uniq_id.vcfwave.sort.vcf.gz
##### 2. Merge overlapping variants #####
python scripts/merge_duplicates.py \
-i mc.biallele.uniq_id.vcfwave.sort.vcf.gz \
-o mc.biallele.uniq_id.vcfwave.sort.merge_dup.vcf.gz \
-c repeat \
-t ID
##### 3. SV merging #####
python collapse_bubble.py \
-i mc.biallele.uniq_id.vcfwave.sort.merge_dup.vcf.gz \
-o mc.biallele.uniq_id.vcfwave.sort.merge_dup.merge_sv.vcf.gz \
--map mc.biallele.uniq_id.vcfwave.sort.merge_dup.merge_sv.mapping
The preprocessing step generates a VCF file that meets the following requirements:
- Decomposed by vcfwave.
- Biallelic and sorted.
- All variants have unique IDs.
- Variants from the same bubble have identical INFO/BUBBLE_ID annotation.
- AC, AN, and AF have been updated based on the genotypes.
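Some of these requirements can be checked mechanically before running the pipeline. The following is a minimal sketch, not part of collapse-bubble: the function name check_vcf_records is hypothetical, it reads plain tab-separated VCF lines, and it only checks what is visible in the first eight columns (it cannot tell whether the file was decomposed by vcfwave or whether AC/AN/AF match the genotypes).

```python
def check_vcf_records(lines):
    """Return a list of problems found in VCF data lines; an empty list means all checks passed."""
    problems = []
    seen_ids = set()
    finished_chroms = set()
    current_chrom, last_pos = None, 0
    for n, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, var_id, _ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
        pos = int(pos)
        # Biallelic: exactly one ALT allele per record
        if "," in alt:
            problems.append(f"line {n}: not biallelic (ALT={alt})")
        # Unique IDs
        if var_id == "." or var_id in seen_ids:
            problems.append(f"line {n}: missing or repeated ID ({var_id})")
        seen_ids.add(var_id)
        # Required INFO tags are present (their values are not verified here)
        info_keys = {field.split("=", 1)[0] for field in info.split(";")}
        for tag in ("BUBBLE_ID", "AC", "AN", "AF"):
            if tag not in info_keys:
                problems.append(f"line {n}: INFO/{tag} is missing")
        # Sorted: chromosomes are contiguous and positions never decrease
        if chrom != current_chrom:
            if chrom in finished_chroms:
                problems.append(f"line {n}: {chrom} is not contiguous")
            if current_chrom is not None:
                finished_chroms.add(current_chrom)
            current_chrom = chrom
        elif pos < last_pos:
            problems.append(f"line {n}: position {pos} precedes {last_pos}")
        last_pos = pos
    return problems
```

For a bgzip-compressed file, the lines could be supplied with gzip.open(path, "rt"), since bgzip output is readable as gzip.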
If you already have such a VCF, it can be used directly without preprocessing. To start with the multiallelic graph VCF, the following steps are required:
- Split multiallelic records into biallelic and update AC, AN, AF using bcftools.
- Annotate variants' bubble ID and generate unique variant IDs using annotate_var_id.py.
- Decompose the VCF using vcfwave.
- Left-align and sort using bcftools.
# suppose the name of input VCF is "mc.vcf.gz"
# split into biallelic
bcftools norm -m -any mc.vcf.gz -Oz -o mc.biallele.vcf.gz
# annotate VCF and assign unique variant ID
python scripts/annotate_var_id.py \
-i mc.biallele.vcf.gz \
-o mc.biallele.uniq_id.vcf.gz
# drop INFO/AT (optional, suggested by cactus) and decompose by vcfwave
bcftools annotate -x INFO/AT mc.biallele.uniq_id.vcf.gz | \
vcfwave -I 1000 | bgzip -c > mc.biallele.uniq_id.vcfwave.vcf.gz
# for merge_duplicates.py -c repeat:
# fast normalization, update AC/AN/AF, and sort in one step
bcftools norm -f ref.fa mc.biallele.uniq_id.vcfwave.vcf.gz | \
bcftools +fill-tags -- -t AC,AN,AF | \
bcftools sort --max-mem 4G -Oz -o mc.biallele.uniq_id.vcfwave.sort.vcf.gz
# for merge_duplicates.py -c position:
# normalization, update AC/AN/AF, and stable sort
bcftools norm -f ref.fa mc.biallele.uniq_id.vcfwave.vcf.gz | \
bcftools +fill-tags -Oz -o tmp.vcf.gz -- -t AC,AN,AF
(bcftools view -h tmp.vcf.gz ; bcftools view -H tmp.vcf.gz | sort -s -k1,1d -k2,2n) | bgzip > mc.biallele.uniq_id.vcfwave.sort.vcf.gz
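The second recipe uses a stable sort keyed on CHROM and POS only, so records at the same position keep their input order. The sketch below illustrates that property in Python, whose sorted() is stable. The function name and the tuple layout are assumptions for illustration, and chromosome names are compared as plain strings, which may order them differently from sort -k1,1d in some locales.

```python
def stable_sort_by_position(records):
    """Sort (chrom, pos, ...) tuples by CHROM and POS only, keeping ties in input order."""
    # sorted() is stable, so records with equal (CHROM, POS) keys are never reordered.
    return sorted(records, key=lambda rec: (rec[0], rec[1]))


records = [
    ("chr1", 200, "var5", "T", "C"),
    ("chr1", 100, "var3", "C", "CAA"),
    ("chr1", 100, "var1", "C", "G"),
]
for rec in stable_sort_by_position(records):
    print(*rec, sep="\t")
```

Here var3 stays ahead of var1 even though its ALT allele would sort later, because alleles are not part of the sort key.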
annotate_var_id.py:
This script assigns a unique ID in the format [BUBBLE_ID].[TYPE].[No.] to each variant. The original variant ID (i.e., bubble ID) is stored in INFO/BUBBLE_ID. If the VCF has already been processed by vcfwave, specify the separator between the bubble ID and the suffix with --suffix-sep, e.g., --suffix-sep _.
usage: annotate_var_id.py [-h] -i VCF -o VCF [--suffix-sep SUFFIX_SEP]
Annotate and assign unique variant ID for pangenome VCF
options:
-i VCF, --input VCF Input VCF
-o VCF, --output VCF Output VCF
--suffix-sep SUFFIX_SEP
Separator between bubble ID and suffix, e.g., "_" for vcfwave processed VCF (default: None)
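Downstream scripts sometimes need to split these IDs back into their parts. The following is a minimal sketch that parses IDs of the form [BUBBLE_ID].[TYPE].[No.], optionally followed by a vcfwave-style suffix such as _2 (as in the IDs shown in the SV merging table). The names VariantID and parse_variant_id are hypothetical, and the sketch assumes the type and number fields contain no "." characters.

```python
from typing import NamedTuple, Optional


class VariantID(NamedTuple):
    bubble_id: str
    var_type: str
    number: int
    suffix: Optional[str]  # trailing suffix after the separator, if any


def parse_variant_id(variant_id, suffix_sep="_"):
    """Split '[BUBBLE_ID].[TYPE].[No.]' (plus optional suffix) into its parts."""
    # Split from the right so that the bubble ID is kept whole.
    bubble_id, var_type, tail = variant_id.rsplit(".", 2)
    number, sep, suffix = tail.partition(suffix_sep)
    return VariantID(bubble_id, var_type, int(number), suffix if sep else None)


print(parse_variant_id(">38058649>38058909.COMPLEX.41_2"))
print(parse_variant_id(">123>456.INS.1"))
```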
After variant decomposition and left alignment, the VCF can contain overlapping variants at the same position. For example:
chr1 100 var1 C G 1|0
chr1 100 var2 C G 0|1
chr1 100 var3 C CAA 1|0
chr1 100 var4 C CAA 1|0
In this example:
- var1 and var2 are duplicates, as they share the same POS, REF, and ALT. This is mainly due to variant decomposition.
- var1 and var3/var4 overlap on the first haplotype, as there are three alternative alleles at the same POS. This is due to left alignment.
merge_duplicates.py can clean up duplicated and overlapping variants:
python scripts/merge_duplicates.py \
-i mc.biallele.uniq_id.vcfwave.sort.vcf.gz \
-o mc.biallele.uniq_id.vcfwave.sort.merge_dup.vcf.gz \
-c repeat \
-t ID
- It first concatenates overlapping tandem repeats (specified by -c repeat) using the algorithm described in the documentation. For example, var3 and var4 are concatenated into C CAAAA.
- After concatenating all overlapping variants, it merges duplicates into a single record and also updates the phased genotypes.
- -t ID tracks how the overlapping variants are concatenated (INFO/CONCAT) and how duplicates are merged (INFO/DUP). This information is required for downstream SV merging.
Output:
chr1 100 var1 C G 1|1
chr1 100 chr1:100_0 C CAAAA 1|0
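The two operations in this example can be sketched as follows. This is only an illustration under simplifying assumptions, not the algorithm used by merge_duplicates.py: the function names are hypothetical, genotypes are single-sample phased strings with 0/1 alleles only, and the repeat-motif test described in the documentation is not reproduced.

```python
def concat_insertions(ref, alt_a, alt_b):
    """Concatenate two insertions that share the same anchor REF on one haplotype."""
    if not (alt_a.startswith(ref) and alt_b.startswith(ref)):
        raise ValueError("both ALT alleles must start with the shared REF anchor")
    # Keep the anchor once, then append the second inserted sequence.
    return alt_a + alt_b[len(ref):]


def merge_phased_gt(gt_a, gt_b):
    """Merge the phased genotypes of two duplicated records, haplotype by haplotype."""
    merged = []
    for a, b in zip(gt_a.split("|"), gt_b.split("|")):
        if {a, b} - {"0", "1"}:
            raise ValueError("only 0/1 alleles are supported in this sketch")
        if a == "1" and b == "1":
            # Two ALT alleles on one haplotype overlap; they need concatenation first.
            raise ValueError("records overlap on the same haplotype")
        merged.append("1" if "1" in (a, b) else "0")
    return "|".join(merged)


# var3 + var4 (both C > CAA on the first haplotype) are concatenated
print("C", concat_insertions("C", "CAA", "CAA"))
# var1 (1|0) + var2 (0|1) are duplicates on different haplotypes
print(merge_phased_gt("1|0", "0|1"))
```

Raising on a same-haplotype collision mirrors the order of operations described above: overlapping records are concatenated first, so only true duplicates reach the genotype merge.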
Note: When using merge_duplicates.py -c position, it concatenates any overlapping variants at the same position. This method reconstructs the local haplotypes and significantly increases polymorphism, which may not be suitable for SV merging. Additionally, it requires the input VCF to be sorted by CHROM and POS only (not guaranteed by recent bcftools, see documentation for details). Since merge_duplicates.py -c position is included in the Minigraph-Cactus pipeline when --vcfwave is used, it is suggested to directly use the output VCF from Minigraph-Cactus if you are interested in this feature.
Arguments:
usage: merge_duplicates.py [-h] -i VCF -o VCF [-c {position,repeat,none}] [-m MAX_REPEAT] [-t {ID,AT}] [--merge-mis-as-ref] [--keep-order] [--debug]
Merge duplicated variants in phased VCF
options:
-i VCF, --invcf VCF Input VCF, sorted and phased
-o VCF, --outvcf VCF Output VCF
-c {position,repeat,none}, --concat {position,repeat,none}
Concatenate variants when they have identical "position" (default) or "repeat" motif, "none" to skip
-m MAX_REPEAT, --max-repeat MAX_REPEAT
Maximum size a variant to search for repeat motif (default: None)
-t {ID,AT}, --track {ID,AT}
Track how variants are merged by "ID" or "AT" (default: None)
--merge-mis-as-ref Convert missing to ref when merging missing genotypes with non-missing genotypes
--keep-order keep the order of variants in the input VCF (default: sort by chr, pos, alleles)
--debug Debug mode
To perform SV merging, run collapse_bubble.py:
python collapse_bubble.py \
-i mc.biallele.uniq_id.vcfwave.sort.merge_dup.vcf.gz \
-o mc.biallele.uniq_id.vcfwave.sort.merge_dup.merge_sv.vcf.gz \
--map mc.biallele.uniq_id.vcfwave.sort.merge_dup.merge_sv.mapping
This will generate 3 output files:
1. VCF:
The output VCF (unsorted) merges similar SV records and their genotypes into one representative SV, which is the one with the highest frequency. Moreover, the following INFO fields are added to the merged record:
- INFO/ID_LIST: comma-separated list of SVs merged into this SV
- INFO/TYPE: type of variant (SNP, MNP, INS, DEL, INV, COMPLEX)
- INFO/REFLEN: len(REF)
- INFO/SVLEN: len(ALT) - len(REF)
For example:
#input:
chr1 10039 >123>456.INS.1 A ATTTTTT AC=2;AN=6;AF=0.333;BUBBLE_ID=>123>456 0|1 1|0 0|0
chr1 10039 >123>456.INS.2 A ATTTTTG AC=3;AN=6;AF=0.500;BUBBLE_ID=>123>456 1|0 0|1 1|0
#output
chr1 10039 >123>456.INS.2 A ATTTTTG AC=5;AN=6;AF=0.833;BUBBLE_ID=>123>456;ID_LIST=>123>456.INS.1;TYPE=INS;REFLEN=1;SVLEN=6 1|1 1|1 1|0
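The numbers in this example can be reproduced with a short sketch. It is not the code in collapse_bubble.py: the function names are hypothetical, genotypes are phased 0/1 strings, and it assumes that "conflicting genotypes" means both SVs carry the ALT allele on the same haplotype of a sample.

```python
def has_conflict(rep_gts, other_gts):
    """True if both SVs carry ALT on the same haplotype of any sample (assumed definition)."""
    return any(
        a == "1" and b == "1"
        for rep, other in zip(rep_gts, other_gts)
        for a, b in zip(rep.split("|"), other.split("|"))
    )


def collapse_genotypes(rep_gts, other_gts):
    """Merge per-sample phased genotypes of a collapsed SV into the representative SV."""
    merged = []
    for rep, other in zip(rep_gts, other_gts):
        haps = ["1" if "1" in pair else "0" for pair in zip(rep.split("|"), other.split("|"))]
        merged.append("|".join(haps))
    return merged


def allele_counts(gts):
    """Recompute AC, AN, and AF (rounded to 3 decimals) from phased genotypes."""
    alleles = [a for gt in gts for a in gt.split("|") if a != "."]
    ac, an = alleles.count("1"), len(alleles)
    return ac, an, (round(ac / an, 3) if an else 0.0)


ref, alt = "A", "ATTTTTG"
ins1 = ["0|1", "1|0", "0|0"]  # >123>456.INS.1
ins2 = ["1|0", "0|1", "1|0"]  # >123>456.INS.2, the representative
if not has_conflict(ins2, ins1):
    merged = collapse_genotypes(ins2, ins1)
    print(merged, allele_counts(merged))
print("REFLEN", len(ref), "SVLEN", len(alt) - len(ref))
```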
2. SV merging table:
A TSV file mapping original SVs (Variant_ID) to representative SVs (Collapse_ID). For example:
| Chrom | Position | Bubble_ID | Variant_ID | Collapse_ID | PctSeqSimilarity | PctSizeSimilarity | PctRecOverlap | SizeDiff | StartDistance | EndDistance | TruScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chr22 | 16386947 | >38058649>38058909 | >38058649>38058909.DEL.33 | >38058649>38058909.DEL.34 | 0.997 | 1.000 | 0.990 | 0 | 3 | 3 | 99.5 |
| chr22 | 16386970 | >38058649>38058909 | >38058649>38058909.COMPLEX.41_2 | >38058649>38058909.COMPLEX.40_2 | 0.992 | 1.000 | 1.000 | 0 | 0 | 0 | 99.7 |
| chr22 | 16386973 | >38058649>38058909 | >38058649>38058909.INS.48 | >38058649>38058909.INS.51 | 0.950 | 0.954 | 0.948 | -7 | 0 | 0 | 95.0 |
| chr22 | 16387000 | >38058649>38058909 | >38058649>38058909.INS.56 | >38058649>38058909.INS.55 | 0.996 | 1.000 | 1.000 | 0 | 0 | 0 | 99.9 |
| chr22 | 16387057 | >38058649>38058909 | >38058649>38058909.INS.81 | >38058649>38058909.COMPLEX.78_2 | 1.000 | 1.000 | 1.000 | 0 | 0 | 0 | 100.0 |
In this table, the first row means the SV >38058649>38058909.DEL.33 at chr22:16386947 from bubble >38058649>38058909 is merged into >38058649>38058909.DEL.34. The results of SV comparison performed by Truvari are included in columns from PctSeqSimilarity to TruScore.
Other VCF INFO fields can be included as additional columns by specifying --info.
Note: If --info SVLEN is used, the SVLEN column in the TSV file reports REFLEN for COMPLEX and INV variants, to indicate the actual value used for comparison. In the output VCF, however, INFO/SVLEN is always calculated as len(ALT) - len(REF).
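The SV merging table can be loaded for downstream use with the standard csv module. The sketch below assumes the file is tab-separated with a header row whose column names match the table above (Variant_ID, Collapse_ID); the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_collapse_map(handle):
    """Map each original SV (Variant_ID) to its representative SV (Collapse_ID)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Variant_ID"]: row["Collapse_ID"] for row in reader}


def group_by_representative(collapse_map):
    """Invert the mapping: representative SV -> list of SVs merged into it."""
    groups = defaultdict(list)
    for variant_id, collapse_id in collapse_map.items():
        groups[collapse_id].append(variant_id)
    return dict(groups)


# In-memory stand-in for a mapping file, using two rows from the table above
example = io.StringIO(
    "Chrom\tPosition\tBubble_ID\tVariant_ID\tCollapse_ID\n"
    "chr22\t16386947\t>38058649>38058909\t>38058649>38058909.DEL.33\t>38058649>38058909.DEL.34\n"
    "chr22\t16386973\t>38058649>38058909\t>38058649>38058909.INS.48\t>38058649>38058909.INS.51\n"
)
collapse_map = load_collapse_map(example)
print(group_by_representative(collapse_map))
```

With a real run, the handle would come from opening the PREFIX.collapse.txt file written by --map.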
3. Similar SV pairs with conflicting genotypes
A TSV file listing SVs (Variant_ID) that pass the SV merging thresholds but are not merged into the representative SV (Collapse_ID) due to conflicting genotypes.
Arguments:
usage: collapse_bubble.py [-h] -i VCF -o VCF -m PREFIX [--chr CHR] [--info TAG] [-l 50] [-r 100] [-p 0.9] [-P 0.9] [-O 0.9] [--debug]
Collapse biallelic SVs within the same bubble in VCF
Input / Output arguments:
-i VCF, --invcf VCF Input VCF
-o VCF, --outvcf VCF Output VCF
-m PREFIX, --map PREFIX
Write collapsed and conflicting SV tables to PREFIX.collapse.txt and PREFIX.conflict.txt.
--chr CHR chromosome to work on. Default: all
--info TAG Comma-separated INFO/TAG list to include in the output map. Default: None
Collapse arguments:
-l 50, --min-len 50 Minimum allele length of variants to be included, defined as max(len(alt), len(ref)). Default: 50
-r 100, --refdist 100
Max reference location distance. Default: 100
-p 0.9, --pctseq 0.9 Min percent sequence similarity (REF for DEL, ALT for other SVs). Default: 0.9
-P 0.9, --pctsize 0.9
Min percent size similarity (SVLEN for INS, DEL; REFLEN for INV, COMPLEX). Default: 0.9
-O 0.9, --pctovl 0.9 Min pct reciprocal overlap. Default: 0.9
Other arguments:
--debug Debug mode
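To give a feel for the collapse thresholds, the sketch below implements simple versions of the quantities they refer to. The comparisons themselves are performed by Truvari; the formulas here (size similarity as smaller over larger length, reciprocal overlap as overlap over the longer span) are assumptions about the usual definitions, and all function names are hypothetical.

```python
def allele_length(ref, alt):
    """Allele length as defined for --min-len: max(len(alt), len(ref))."""
    return max(len(ref), len(alt))


def size_similarity(size_a, size_b):
    """Ratio of the smaller to the larger size (assumed definition), in [0, 1]."""
    largest = max(size_a, size_b)
    return min(size_a, size_b) / largest if largest > 0 else 1.0


def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Overlap length divided by the longer of the two spans (assumed definition)."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    longest = max(end_a - start_a, end_b - start_b)
    return max(overlap, 0) / longest if longest > 0 else 0.0


def is_candidate_pair(pos_a, size_a, pos_b, size_b, refdist=100, pctsize=0.9):
    """Apply the --refdist and --pctsize defaults to a pair of SVs."""
    return abs(pos_a - pos_b) <= refdist and size_similarity(size_a, size_b) >= pctsize


print(allele_length("A", "A" + "T" * 60) >= 50)
print(round(size_similarity(145, 152), 3))
print(reciprocal_overlap(100, 200, 110, 210))
print(is_candidate_pair(1000, 145, 1003, 152))
```

Sequence similarity (--pctseq) is left out because it depends on Truvari's alignment-based comparison.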
- Workflow script to run all analyses in the pipeline.