nf-cmgg/preprocessing is a bioinformatics pipeline that demultiplexes and aligns raw sequencing data. It also performs basic QC and coverage analysis.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
Steps include:
- Demultiplexing using BCLconvert
- Read QC and trimming using fastp
- Alignment using either bwa, bwa-mem2, bowtie2, dragmap or snap for DNA-seq and STAR for RNA-seq
- Duplicate marking using bamsormadup or samtools markdup
- Coverage analysis using mosdepth and samtools coverage
- Alignment QC using samtools flagstat, samtools stats, samtools idxstats and picard CollectHsMetrics, picard CollectWgsMetrics, picard CollectMultipleMetrics
- QC aggregation using multiqc
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
The full documentation can be found here
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv for fastq inputs:
id,samplename,organism,library,fastq_1,fastq_2
sample1,sample1,Homo sapiens,Library_Name,reads1.fq.gz,reads2.fq.gz
samplesheet.csv for flowcell inputs:
id,samplesheet,lane,flowcell,sample_info
flowcell_id,/path/to/illumina_samplesheet.csv,1,/path/to/sequencer_uploaddir,/path/to/sampleinfo.csv
sampleinfo.csv for use with flowcell inputs:
samplename,library,organism,tag
fc_sample1,test,Homo sapiens,WES
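As a minimal sketch, the fastq samplesheet shown above can be written from the shell and given a quick sanity check that every row has as many fields as the header. The check is an illustrative assumption and not part of the pipeline; it splits naively on commas, so it would miscount quoted fields that contain commas.

```shell
# Write a samplesheet for fastq inputs (values copied from the example above).
cat > samplesheet.csv <<'EOF'
id,samplename,organism,library,fastq_1,fastq_2
sample1,sample1,Homo sapiens,Library_Name,reads1.fq.gz,reads2.fq.gz
EOF

# Illustrative check (not part of the pipeline): count the rows whose number
# of comma-separated fields differs from the header's.
bad_rows=$(awk -F',' 'NR == 1 { n = NF; next } NF != n { bad++ } END { print bad + 0 }' samplesheet.csv)
echo "rows with unexpected column count: ${bad_rows}"
```

The same check can be pointed at the flowcell samplesheet or sampleinfo.csv by changing the file name.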
Now, you can run the pipeline using:
nextflow run nf-cmgg/preprocessing \
-profile <docker/singularity/.../institute> \
--igenomes_base /path/to/genomes \
--input samplesheet.csv \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
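The parameters from the run command above can also be collected in a file and passed with -params-file. A minimal sketch, assuming a YAML file named params.yaml; the file name and the outdir value `results` are placeholders chosen for illustration, and the keys simply mirror the CLI flags without the leading `--`.

```shell
# Write the pipeline parameters to a YAML file. Paths and the output
# directory name are placeholders; replace them with your own values.
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
igenomes_base: /path/to/genomes
EOF

# Show what will be passed to Nextflow.
cat params.yaml
```

The file can then stand in for the individual parameter flags: nextflow run nf-cmgg/preprocessing -profile <docker/singularity/.../institute> -params-file params.yaml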
nf-cmgg/preprocessing was originally written by the CMGG ICT team.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.


