This document provides a comprehensive overview of the commands used in the MutMap pipeline. By understanding each step's purpose, usage, and the default parameters employed in MutMap, users may be able to run the bash commands independently to troubleshoot and resolve issues more effectively. The goal is to help users better understand the internal workings of MutMap and ensure efficient execution.
- Using the Test Dataset
- Step 1: Directory Setup
- Step 2: Quality Control and Trimming
- Step 3: Reference Genome Indexing
- Step 4: Read Alignment
- Step 5: BAM Sorting and Indexing
- Step 6: Variant Calling with mpileup
- Step 7: SNP Filtering, SNP Index Calculation, and Visualization with MutPlot
- Using SnpEff for Variant Annotation
To get started with MutMap and verify the pipeline's functionality, you can use the test dataset available in the MutMap GitHub repository. The test dataset is located here, and it includes all the necessary files to run the pipeline.
By using this test dataset, you can execute the commands in this document step by step to ensure that the pipeline runs smoothly and correctly.
Before starting the pipeline, ensure that all necessary directories are created. This prevents errors from occurring when output directories are missing. The following structure should be in place:
mkdir -p output_directory/00_fastq
mkdir -p output_directory/10_ref
mkdir -p output_directory/20_bam
mkdir -p output_directory/30_vcf- 00_fastq: For storing filtered FASTQ files.
- 10_ref: For storing reference genome index files.
- 20_bam: For storing BAM files after alignment.
- 30_vcf: For storing the VCF files generated by variant calling.
Description:
Trimming is the process of removing low-quality sequences and adapters from raw sequencing data. This step is essential to ensure that the data used for alignment is clean and of high quality. In MutMap, trimming is performed using Trimmomatic on both the cultivar and bulk (mutant bulk) in the same manner.
Usage:
trimmomatic PE -threads 4 \
-phred33 mutmap_cultivar.1.fastq.gz mutmap_cultivar.2.fastq.gz \
output_directory/00_fastq/cultivar_R1_paired.fastq.gz \
output_directory/00_fastq/cultivar_R1_unpaired.fastq.gz \
output_directory/00_fastq/cultivar_R2_paired.fastq.gz \
output_directory/00_fastq/cultivar_R2_unpaired.fastq.gz \
ILLUMINACLIP:adapter.fasta:2:30:10 \
LEADING:20 \
TRAILING:20 \
SLIDINGWINDOW:4:15 \
MINLEN:75
trimmomatic PE -threads 4 \
-phred33 mutmap_bulk.1.fastq.gz mutmap_bulk.2.fastq.gz \
output_directory/00_fastq/bulk_R1_paired.fastq.gz \
output_directory/00_fastq/bulk_R1_unpaired.fastq.gz \
output_directory/00_fastq/bulk_R2_paired.fastq.gz \
output_directory/00_fastq/bulk_R2_unpaired.fastq.gz \
ILLUMINACLIP:adapter.fasta:2:30:10 \
LEADING:20 \
TRAILING:20 \
SLIDINGWINDOW:4:15 \
MINLEN:75Explanation:
PE: Specifies paired-end mode.ILLUMINACLIP: Trims adapter sequences based on the adapter file provided.LEADING: Trims low-quality bases from the beginning of the read.TRAILING: Trims low-quality bases from the end of the read.SLIDINGWINDOW: Performs sliding window trimming, trimming when the average quality within the window falls below a threshold.MINLEN: Discards reads that are shorter than the specified length.
Additional Information:
-
Trimmomatic GitHub Repository: For more information on Trimmomatic, refer to its GitHub repository.
-
Adapter Sequences: The adapter sequences used in the trimming process can be found on the Trimmomatic GitHub.
-
Quality Control: After trimming, it is recommended to perform quality control (QC) on the resulting FASTQ files using software like FastQC.
-
Alternative Software - fastp: If you are unsure about which adapter sequences were used, the software fastp may be useful. It includes the
--detect_adapter_for_peoption, which automatically detects and removes adapter sequences. However, please note that MutMap has not been extensively tested with fastp, so its compatibility cannot be guaranteed.
Description:
Before aligning reads to the reference genome, the genome must be indexed. Indexing creates necessary data structures that make the alignment process more efficient. In MutMap, BWA and SAMtools are used for reference genome indexing.
For more information, please refer to the BWA manual here and the SAMtools manual here.
Reference Genome Indexing:
# Create a symbolic link to reference genome in 10_ref
ln -s $(pwd)/mutmap_ref.fasta output_directory/10_ref/mutmap_ref.fasta
# Index the reference genome using BWA
bwa index output_directory/10_ref/mutmap_ref.fasta
# Index the reference genome using SAMtools
samtools faidx output_directory/10_ref/mutmap_ref.fastaDescription:
Alignment is performed on both the cultivar and bulk in the same manner. After trimming and indexing the reference genome, the next step is to align the sequencing reads to the reference using BWA, followed by SAMtools for converting the output to BAM format.
For more information on BWA, refer to its manual here. Additionally, refer to the SAMtools manual here.
Usage:
bwa mem -t 4 output_directory/10_ref/mutmap_ref.fasta \
output_directory/00_fastq/cultivar_R1_paired.fastq.gz \
output_directory/00_fastq/cultivar_R2_paired.fastq.gz | \
samtools view -b \
-o output_directory/20_bam/cultivar.unsorted.bam
bwa mem -t 4 output_directory/10_ref/mutmap_ref.fasta \
output_directory/00_fastq/bulk_R1_paired.fastq.gz \
output_directory/00_fastq/bulk_R2_paired.fastq.gz | \
samtools view -b \
-o output_directory/20_bam/bulk.unsorted.bamExplanation:
mem: The algorithm used by BWA to align paired-end reads.-t 4: Specifies the number of threads to use (in this case, 4 threads).view -b: Converts the data into BAM format.-o: Specifies the output BAM file location.
Description:
Sorting and indexing are performed on the cultivar and bulk in the same manner. After aligning the reads, SAMtools is used to sort and index the resulting BAM files for efficient access and analysis. Sorting ensures that the reads are ordered by their position in the genome, and indexing creates a .bai or .csi file that allows for fast retrieval of specific regions during analysis.
Usage:
samtools fixmate -m output_directory/20_bam/cultivar.unsorted.bam - | \
samtools sort -m 1G -@ 4 | \
samtools markdup -r - - | \
samtools view -b -f 2 -F 2048 -o output_directory/20_bam/cultivar.bam
samtools fixmate -m output_directory/20_bam/bulk.unsorted.bam - | \
samtools sort -m 1G -@ 4 | \
samtools markdup -r - - | \
samtools view -b -f 2 -F 2048 -o output_directory/20_bam/bulk.bamExplanation:
fixmate -m: Adjusts mate-pair information and addsms(mate score) tags, used bymarkdupto select the best reads for duplicate marking.sort: Sorts the BAM file by genomic coordinates.markdup: Marks duplicate reads by setting a DUP flag (without-r, duplicate reads are marked but not removed). To remove duplicates, the-roption can be added.view -b: Outputs the sorted and duplicate-marked data in BAM format.
Indexing BAM files:
samtools index output_directory/20_bam/cultivar.bam
samtools index output_directory/20_bam/bulk.bamFor large genomes (e.g., wheat), attempting to create a standard .bai index file using samtools index without the -c option may result in an error due to file size limitations. In such cases, it is necessary to use the -c option, which generates a .csi index that supports large reference genomes:
samtools index -c output_directory/20_bam/cultivar.bam
samtools index -c output_directory/20_bam/bulk.bamThis .csi index is suitable for handling large genomes that exceed the size limitations of .bai.
Description:
Once the BAM files are indexed, the next step is to call variants using bcftools mpileup and bcftools call. The AD, ADF, and ADR fields are critical for MutMap, as they provide allele depth information that is essential for the analysis. This command identifies SNPs and other variants in the sequencing data.
For more information on BCFtools, refer to the manual here.
bcftools mpileup -a AD,ADF,ADR -B -q 40 -Q 18 -C 50 -O u -f output_directory/10_ref/mutmap_ref.fasta \
output_directory/20_bam/cultivar.bam output_directory/20_bam/bulk.bam | \
bcftools call -vm -f GQ,GP -O u | \
bcftools filter -i "INFO/MQ>=40" -O z -o output_directory/30_vcf/mutmap.vcf.gzmpileup: Generates per-base information for each position in the reference genome.-a AD,ADF,ADR: Includes allele depth information in the output. These fields are crucial for MutMap analysis.-B: Disables BAQ computation.-q 40: Filters reads with mapping quality less than 40.-Q 18: Filters bases with base quality less than 18.-C 50: Adjusts the mapping quality to account for the use of BWA during alignment.
When generating the VCF file, the order in which the cultivar and bulk BAM files are passed into the bcftools mpileup command is important. The cultivar should come first, followed by the bulk. This order is critical for downstream analysis, especially when using MutPlot, as the software expects the cultivar data to be the first sample in the VCF file.
The -r option in bcftools mpileup allows for chromosome-specific processing, which can be used to parallelize the VCF generation for each chromosome individually. This can help speed up the variant calling process for large genomes.
bcftools mpileup -r test_chr -a AD,ADF,ADR -B -q 40 -Q 18 -C 50 -O u -f output_directory/10_ref/mutmap_ref.fasta \
output_directory/20_bam/cultivar.bam output_directory/20_bam/bulk.bam | \
bcftools call -vm -f GQ,GP -O u | \
bcftools filter -i "INFO/MQ>=40" -O z -o output_directory/30_vcf/mutmap.test_chr.vcf.gzIf the VCF files are generated separately for each chromosome, you can use bcftools concat to concatenate them into a single VCF file for easier analysis.
bcftools concat -O z -o output_directory/30_vcf/mutmap_combined.vcf.gz output_directory/30_vcf/mutmap.*.vcf.gzThis command will concatenate all chromosome-specific VCF files (mutmap.*.vcf.gz) into a single compressed VCF file (mutmap_combined.vcf.gz).
For more information on Tabix, refer to the manual here.
tabix -f -p vcf output_directory/30_vcf/mutmap.vcf.gzFor large genomes like wheat, using the default tabix indexing without the -C option may result in an error due to file size limitations. In such cases, it is necessary to use the -C option, which generates a .csi index that supports large reference genomes:
tabix -C -f -p vcf output_directory/30_vcf/mutmap.vcf.gzThis .csi index is suitable for handling large genomes that exceed the size limitations of standard .tbi indexes.
Description:
MutPlot is used to filter SNPs, calculate SNP indices, and visualize the results. It helps identify significant SNPs and their impact on the genome, producing various files that assist in analyzing and visualizing the data.
Usage:
mutplot -v output_directory/30_vcf/mutmap.vcf.gz -o output_directory/40_plot -n 20Required Options:
-v: Specifies the input VCF file generated by the variant calling step.-o: Specifies the output directory for the visualization results.-n: Specifies the number of individuals in the mutant bulk.
Optional: Using MutPlot with SnpEff:
MutPlot can be run in combination with SnpEff to annotate variants for their potential effects. This allows for a more detailed understanding of how SNPs may impact gene function. If you would like to build your own SnpEff database, or learn more about SnpEff, please refer to the SnpEff Documentation. Additionally, the following guide may be useful: Guide to Build SnpEff Database.
When plotting the results, variants classified as MODERATE by SnpEff are marked with a + symbol, while variants classified as HIGH are marked with an x symbol in the plot.
Usage:
mutplot -v output_directory/30_vcf/mutmap.vcf.gz -o output_directory/40_plot -n 20 -e databaseAdditional Option:
-e database: Specifies the SnpEff database to use for annotating variants. Ensure that the correct SnpEff database is installed and available before running this command.
We would like to thank Shoya Komura for his helpful advice regarding the use of the -c option in samtools index and the -C option in tabix for handling large genomes.