Nextflow-based pipeline for cophasing of GAM reads.
- GAM experiment samples in the BAM format,
- Reference genome in the FASTA format,
- Phased SNPs in the VCF format.
A set of tables of the form {genome_name}.{bin_size}.{hap}.coverage.tsv and {genome_name}.{bin_size}.{hap}.segregation.tsv for different bin sizes with the number of reads per bin per sample. Haplotype hap is either hap1, hap2, or both. The columns are:
chrom: chromosome namestart: start position of the binstop: the stop position of the bin[sample]: The number of reads in the bin forcoverage.tsv, and 1 or 0 forsegregation.tsv, with 1 meaning the bin has been determined to be covered after filtering, 0 the opposite.
Note: For segregation, only autosomes are considered. The filter requires that the chromosomes are named in either the UCSC (chr1-chr22) or Ensembl format (1-22).
- filter out positions without phasing information
- convert VCF to BED format
- find the closest SNP for each read and merge
- bin the genome into fixed sized, non-overlapping windows of desired resolution, eg 50kb, 100kb, 200kb
- for comparison, calculate the coverage of each window using the original unsplit GAM samples,
- calculate coverage files of all split GAM samples for each haplotype,
- combine coverage files of all samples into one coverage table, per resolution
The coverage table describes the number of bases covered by reads in each window; therefore, it is possible that the sum of coverage from both haplotypes is higher than the total number of bases covered in a window, as some bases may be covered by both haplotypes.
The segregation algorithm removes spurious bins in particular the following two steps:
- Finds a separation threshold for bin count to remove noise reads.
- Remove orphan bins (bins with no neighbours).
Use the provided conda environment file pcp-env.yaml to install all the required software.
To execute, run:
nextflow run main.nf [parameters]
By default the results are written to the ./out folder.
Random testing data are provided as a part of the package. Run the following command to test the pipeline:
nextflow run main.nf -c test_data.config
NOTE: The parameters for the execution are stored in the Nextflow configuration file test_data.config.
--bam pathalignment files either as.bamor.sam. This can be a glob pattern (e.g.sample*.bam). All files matching the pattern are used then.--fa pathreference file either as.faor.fa.gz.--vcf patha variant call file either as.vcfor.vcf.gz. Must contain phased GT information.
Note: GATK requires .gz files to be compressed with bgzip, not gzip.
--name stringthe prefix that will be given to the samples,default=<the name of the reference file>,--bins [int]the bin sizes to be used,default=[50000, 100000, 200000],--out patha path to a folder where the output is stored,default=./out,--min_depth inta minimum read depth per SNP to be included,default=1,--min_ratio inttests that the dominant base for a SNP is at leastmin_ratiotimes more often present than the remaining observed bases BCFTools,default=5--cutoff intmaximum distance from a read to a closest so that the read is still matched to the SNP,default=0.
A sample visualization method is shown in the notebook NMPI_matrix_vis.ipynb. This notebook loads a singular segregation table and plots a contact map for one chromosome in a specified region.
This pipeline has been developed at Max Delbrück Center for Molecular Medicine, Berlin. Authors:
- Dr. Adam Streck: pipeline development,
- Dr. Julia Markowski: creator of the co-phasing method,
- Dr. Alexander Kukalev: creator of the separation algorithm.
Email questions, feature requests and bug reports to Adam Streck, adam.streck@iccb-cologne.org.
This repository is available under the MIT License.
