Cophasing (v1.0.0)

Nextflow-based pipeline for cophasing of GAM reads.

Pipeline

Input:

GAM experiment samples in the BAM format,
Reference genome in the FASTA format,
Phased SNPs in the VCF format.

Output

A set of tables named {genome_name}.{bin_size}.{hap}.coverage.tsv and {genome_name}.{bin_size}.{hap}.segregation.tsv, one pair per bin size and haplotype, with one value per bin per sample. The haplotype hap is one of hap1, hap2, or both. The columns are:

chrom: chromosome name
start: start position of the bin
stop: stop position of the bin
[sample]: for coverage.tsv, the number of reads in the bin; for segregation.tsv, 1 if the bin has been determined to be covered after filtering and 0 otherwise.
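As an illustration of the column layout, the sketch below reads such a table using only the Python standard library. It assumes the file is tab-separated with a header row and integer values; the table contents and the sample names sampleA and sampleB are made up for the example, and read_table is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

# Toy segregation table; contents are invented for illustration only.
TOY_TABLE = (
    "chrom\tstart\tstop\tsampleA\tsampleB\n"
    "chr1\t0\t50000\t1\t0\n"
    "chr1\t50000\t100000\t1\t1\n"
)

FIXED_COLUMNS = ("chrom", "start", "stop")


def read_table(handle):
    """Return a list of ((chrom, start, stop), {sample: value}) tuples.

    Assumes a tab-separated file with a header row and integer values.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        bin_key = (row["chrom"], int(row["start"]), int(row["stop"]))
        values = {
            name: int(value)
            for name, value in row.items()
            if name not in FIXED_COLUMNS
        }
        rows.append((bin_key, values))
    return rows


rows = read_table(io.StringIO(TOY_TABLE))
```

Every column other than chrom, start and stop is treated as a sample, so the same helper would work for coverage.tsv and segregation.tsv under the stated assumptions.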

Note: For segregation, only autosomes are considered. The filter requires that the chromosomes are named in either the UCSC (chr1-chr22) or Ensembl format (1-22).
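The naming requirement can be checked before running the pipeline. The following is a minimal sketch of such a check, assuming exactly the two conventions named in the note; is_autosome is a hypothetical helper and not the filter the pipeline itself uses.

```python
import re

# Accepts UCSC names (chr1..chr22) and Ensembl names (1..22), nothing else.
_AUTOSOME = re.compile(r"(?:chr)?(?:[1-9]|1[0-9]|2[0-2])")


def is_autosome(chrom: str) -> bool:
    """Return True if chrom is an autosome name in UCSC or Ensembl style."""
    return _AUTOSOME.fullmatch(chrom) is not None
```

Names such as chrX, chrM or chr23 are rejected, as are zero-padded forms like chr01.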

Process

Align reads to the reference haplotype

filter out positions without phasing information
convert VCF to BED format
find the closest SNP for each read and merge
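The last step, matching each read to its closest SNP, can be sketched as follows. This is not the pipeline's implementation; it assumes 0-based half-open read intervals, SNP positions sorted per chromosome, and a distance of 0 when the SNP lies inside the read. The function closest_snp and its cutoff argument (mirroring the --cutoff parameter) are illustrative.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def closest_snp(
    read_start: int,
    read_end: int,
    snps: Sequence[int],
    cutoff: int = 0,
) -> Optional[int]:
    """Return the SNP closest to the read [read_start, read_end).

    snps must be sorted ascending. Returns None if there is no SNP or
    the closest one is farther away than cutoff.
    """
    if not snps:
        return None

    def distance(pos: int) -> int:
        if pos < read_start:
            return read_start - pos
        if pos >= read_end:
            return pos - read_end + 1
        return 0  # SNP lies inside the read

    # Only the first SNP at or after read_start and the one just before
    # it can be the closest, so no full scan is needed.
    i = bisect_left(snps, read_start)
    candidates = []
    if i < len(snps):
        candidates.append(snps[i])
    if i > 0:
        candidates.append(snps[i - 1])

    best = min(candidates, key=distance)
    return best if distance(best) <= cutoff else None
```

Under these assumptions, a cutoff of 0 keeps only reads that overlap a SNP.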

Create bins

bin the genome into fixed-size, non-overlapping windows of the desired resolution, e.g. 50kb, 100kb, 200kb
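A minimal sketch of the binning step for a single chromosome is shown below. It assumes 0-based half-open coordinates and a truncated final window when the chromosome length is not a multiple of the bin size; make_bins is a hypothetical helper and the pipeline may handle either detail differently.

```python
from typing import List, Tuple


def make_bins(chrom_length: int, bin_size: int) -> List[Tuple[int, int]]:
    """Split [0, chrom_length) into consecutive, non-overlapping windows."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]
```

For example, a chromosome of 250000 bases at a resolution of 100000 gives two full windows and one shorter window at the end.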

Calculate coverage

for comparison, calculate the coverage of each window using the original unsplit GAM samples,
calculate coverage files of all split GAM samples for each haplotype,
combine coverage files of all samples into one coverage table, per resolution

The coverage table describes the number of bases covered by reads in each window; therefore, it is possible that the sum of coverage from both haplotypes is higher than the total number of bases covered in a window, as some bases may be covered by both haplotypes.
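The effect described above can be reproduced with a small sketch that counts the bases of a window covered by at least one read. It assumes half-open (start, end) read intervals; covered_bases is a hypothetical helper written for this illustration, not the tool the pipeline uses.

```python
from typing import Iterable, Tuple


def covered_bases(
    bin_start: int,
    bin_end: int,
    reads: Iterable[Tuple[int, int]],
) -> int:
    """Count bases in [bin_start, bin_end) covered by at least one read."""
    # Clip each overlapping read to the window, then merge intervals.
    clipped = sorted(
        (max(start, bin_start), min(end, bin_end))
        for start, end in reads
        if start < bin_end and end > bin_start
    )
    total = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Made-up reads: one assigned to each haplotype, overlapping on 40..60.
hap1_reads = [(10, 60)]
hap2_reads = [(40, 90)]

hap1 = covered_bases(0, 100, hap1_reads)
hap2 = covered_bases(0, 100, hap2_reads)
unsplit = covered_bases(0, 100, hap1_reads + hap2_reads)
```

Here each haplotype covers 50 bases, but the unsplit sample covers only 80, because the 20 bases between positions 40 and 60 are counted once for each haplotype.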

Segregation

The segregation algorithm removes spurious bins in the following two steps:

Find a separation threshold on the per-bin counts to remove noise reads.
Remove orphan bins (bins with no neighbours).
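The two steps can be sketched for one sample on one chromosome as below. This is only an illustration: how the pipeline derives the separation threshold is not described here, so the threshold is passed in as a given value, and an orphan is assumed to be a positive bin whose immediate left and right neighbours are both negative. The function segregate is hypothetical.

```python
from typing import List, Sequence


def segregate(counts: Sequence[int], threshold: int) -> List[int]:
    """Return a 0/1 vector: bins above threshold, with orphans removed.

    counts holds per-bin values in genomic order for one chromosome.
    threshold is supplied by the caller (assumption, see lead-in).
    """
    # Step 1: bins at or below the threshold are treated as noise.
    positive = [1 if count > threshold else 0 for count in counts]

    # Step 2: drop positive bins that have no positive direct neighbour.
    result = []
    for i, is_positive in enumerate(positive):
        left = positive[i - 1] if i > 0 else 0
        right = positive[i + 1] if i + 1 < len(positive) else 0
        result.append(1 if is_positive and (left or right) else 0)
    return result
```

Neighbours are evaluated on the thresholded vector in a single pass, so removing one orphan never turns another bin into an orphan.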

Requirements:

Use the provided conda environment file pcp-env.yaml to install all the required software.

Execution

To execute, run:

nextflow run main.nf [parameters]

Output

By default the results are written to the ./out folder.

Test run

Random test data are provided as part of the package. Run the following command to test the pipeline:

nextflow run main.nf -c test_data.config

NOTE: The parameters for the execution are stored in the Nextflow configuration file test_data.config.

Parameters

Mandatory

--bam path alignment files, either .bam or .sam. This can be a glob pattern (e.g. sample*.bam), in which case all matching files are used.
--fa path reference file, either .fa or .fa.gz.
--vcf path variant call file, either .vcf or .vcf.gz. Must contain phased GT information.

Note: GATK requires .gz files to be compressed with bgzip, not gzip.

Default

--name string the prefix that will be given to the samples, default=<the name of the reference file>,
--bins [int] the bin sizes to be used, default=[50000, 100000, 200000],
--out path a path to a folder where the output is stored, default=./out,
--min_depth int a minimum read depth per SNP to be included, default=1,
--min_ratio int requires that the dominant base at a SNP is observed at least min_ratio times more often than the remaining observed bases (BCFTools), default=5,
--cutoff int maximum distance from a read to its closest SNP at which the read is still matched to that SNP, default=0.
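One possible reading of --min_depth and --min_ratio is sketched below. It assumes that depth is the total number of observed bases at the SNP, that the remaining bases are summed together, and that "at least" is inclusive; the pipeline's actual test may differ. The function passes_snp_filters and the base counts are hypothetical.

```python
from typing import Mapping


def passes_snp_filters(
    base_counts: Mapping[str, int],
    min_depth: int = 1,
    min_ratio: int = 5,
) -> bool:
    """Check one SNP against the depth and dominance thresholds.

    base_counts maps each observed base to the number of reads showing it.
    """
    depth = sum(base_counts.values())
    if depth == 0 or depth < min_depth:
        return False
    dominant = max(base_counts.values())
    remaining = depth - dominant
    return dominant >= min_ratio * remaining
```

With the defaults, a SNP observed as 10 A and 2 G passes, while 9 A and 2 G does not.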

Visualization

A sample visualization method is shown in the notebook NMPI_matrix_vis.ipynb. The notebook loads a single segregation table and plots a contact map for a specified region of one chromosome.

Authors

This pipeline was developed at the Max Delbrück Center for Molecular Medicine, Berlin. Authors:

Dr. Adam Streck: pipeline development,
Dr. Julia Markowski: creator of the co-phasing method,
Dr. Alexander Kukalev: creator of the separation algorithm.

Contact

Email questions, feature requests and bug reports to Adam Streck, adam.streck@iccb-cologne.org.

License

This repository is available under the MIT License.
