Skip to content

Latest commit

 

History

History
306 lines (211 loc) · 14.6 KB

File metadata and controls

306 lines (211 loc) · 14.6 KB

MutMap Commands Breakdown

This document provides a comprehensive overview of the commands used in the MutMap pipeline. By understanding each step's purpose, usage, and the default parameters employed in MutMap, users may be able to run the bash commands independently to troubleshoot and resolve issues more effectively. The goal is to help users better understand the internal workings of MutMap and ensure efficient execution.

Table of Contents

  1. Using the Test Dataset
  2. Step 1: Directory Setup
  3. Step 2: Quality Control and Trimming
  4. Step 3: Reference Genome Indexing
  5. Step 4: Read Alignment
  6. Step 5: BAM Sorting and Indexing
  7. Step 6: Variant Calling with mpileup
  8. Step 7: SNP Filtering, SNP Index Calculation, and Visualization with MutPlot
  9. Using SnpEff for Variant Annotation

Using the Test Dataset

To get started with MutMap and verify the pipeline's functionality, you can use the test dataset available in the MutMap GitHub repository. The test dataset is located here, and it includes all the necessary files to run the pipeline.

By using this test dataset, you can execute the commands in this document step by step to ensure that the pipeline runs smoothly and correctly.


Step 1: Directory Setup

Before starting the pipeline, ensure that all necessary directories are created. This prevents errors from occurring when output directories are missing. The following structure should be in place:

mkdir -p output_directory/00_fastq
mkdir -p output_directory/10_ref
mkdir -p output_directory/20_bam
mkdir -p output_directory/30_vcf
  • 00_fastq: For storing filtered FASTQ files.
  • 10_ref: For storing reference genome index files.
  • 20_bam: For storing BAM files after alignment.
  • 30_vcf: For storing the VCF files generated by variant calling.

Step 2: Quality Control and Trimming

Description:
Trimming is the process of removing low-quality sequences and adapters from raw sequencing data. This step is essential to ensure that the data used for alignment is clean and of high quality. In MutMap, trimming is performed using Trimmomatic on both the cultivar and bulk (mutant bulk) in the same manner.

Usage:

trimmomatic PE -threads 4 \
               -phred33 mutmap_cultivar.1.fastq.gz mutmap_cultivar.2.fastq.gz \
               output_directory/00_fastq/cultivar_R1_paired.fastq.gz \
               output_directory/00_fastq/cultivar_R1_unpaired.fastq.gz \
               output_directory/00_fastq/cultivar_R2_paired.fastq.gz \
               output_directory/00_fastq/cultivar_R2_unpaired.fastq.gz \
               ILLUMINACLIP:adapter.fasta:2:30:10 \
               LEADING:20 \
               TRAILING:20 \
               SLIDINGWINDOW:4:15 \
               MINLEN:75

trimmomatic PE -threads 4 \
               -phred33 mutmap_bulk.1.fastq.gz mutmap_bulk.2.fastq.gz \
               output_directory/00_fastq/bulk_R1_paired.fastq.gz \
               output_directory/00_fastq/bulk_R1_unpaired.fastq.gz \
               output_directory/00_fastq/bulk_R2_paired.fastq.gz \
               output_directory/00_fastq/bulk_R2_unpaired.fastq.gz \
               ILLUMINACLIP:adapter.fasta:2:30:10 \
               LEADING:20 \
               TRAILING:20 \
               SLIDINGWINDOW:4:15 \
               MINLEN:75

Explanation:

  • PE: Specifies paired-end mode.
  • ILLUMINACLIP: Trims adapter sequences based on the adapter file provided.
  • LEADING: Trims low-quality bases from the beginning of the read.
  • TRAILING: Trims low-quality bases from the end of the read.
  • SLIDINGWINDOW: Performs sliding window trimming, trimming when the average quality within the window falls below a threshold.
  • MINLEN: Discards reads that are shorter than the specified length.

Additional Information:

  1. Trimmomatic GitHub Repository: For more information on Trimmomatic, refer to its GitHub repository.

  2. Adapter Sequences: The adapter sequences used in the trimming process can be found on the Trimmomatic GitHub.

  3. Quality Control: After trimming, it is recommended to perform quality control (QC) on the resulting FASTQ files using software like FastQC.

  4. Alternative Software - fastp: If you are unsure about which adapter sequences were used, the software fastp may be useful. It includes the --detect_adapter_for_pe option, which automatically detects and removes adapter sequences. However, please note that MutMap has not been extensively tested with fastp, so its compatibility cannot be guaranteed.


Step 3: Reference Genome Indexing

Description:
Before aligning reads to the reference genome, the genome must be indexed. Indexing creates necessary data structures that make the alignment process more efficient. In MutMap, BWA and SAMtools are used for reference genome indexing.

For more information, please refer to the BWA manual here and the SAMtools manual here.

Reference Genome Indexing:

# Create a symbolic link to reference genome in 10_ref
ln -s $(pwd)/mutmap_ref.fasta output_directory/10_ref/mutmap_ref.fasta

# Index the reference genome using BWA
bwa index output_directory/10_ref/mutmap_ref.fasta

# Index the reference genome using SAMtools
samtools faidx output_directory/10_ref/mutmap_ref.fasta

Step 4: Read Alignment

Description:
Alignment is performed on both the cultivar and bulk in the same manner. After trimming and indexing the reference genome, the next step is to align the sequencing reads to the reference using BWA, followed by SAMtools for converting the output to BAM format.

For more information on BWA, refer to its manual here. Additionally, refer to the SAMtools manual here.

Usage:

bwa mem -t 4 output_directory/10_ref/mutmap_ref.fasta \
    output_directory/00_fastq/cultivar_R1_paired.fastq.gz \
    output_directory/00_fastq/cultivar_R2_paired.fastq.gz | \
samtools view -b \
              -o output_directory/20_bam/cultivar.unsorted.bam

bwa mem -t 4 output_directory/10_ref/mutmap_ref.fasta \
    output_directory/00_fastq/bulk_R1_paired.fastq.gz \
    output_directory/00_fastq/bulk_R2_paired.fastq.gz | \
samtools view -b \
              -o output_directory/20_bam/bulk.unsorted.bam

Explanation:

  • mem: The algorithm used by BWA to align paired-end reads.
  • -t 4: Specifies the number of threads to use (in this case, 4 threads).
  • view -b: Converts the data into BAM format.
  • -o: Specifies the output BAM file location.

Step 5: BAM Sorting and Indexing

Description:
Sorting and indexing are performed on the cultivar and bulk in the same manner. After aligning the reads, SAMtools is used to sort and index the resulting BAM files for efficient access and analysis. Sorting ensures that the reads are ordered by their position in the genome, and indexing creates a .bai or .csi file that allows for fast retrieval of specific regions during analysis.

Usage:

samtools fixmate -m output_directory/20_bam/cultivar.unsorted.bam - | \
samtools sort -m 1G -@ 4 | \
samtools markdup -r - - | \
samtools view -b -f 2 -F 2048 -o output_directory/20_bam/cultivar.bam

samtools fixmate -m output_directory/20_bam/bulk.unsorted.bam - | \
samtools sort -m 1G -@ 4 | \
samtools markdup -r - - | \
samtools view -b -f 2 -F 2048 -o output_directory/20_bam/bulk.bam

Explanation:

  • fixmate -m: Adjusts mate-pair information and adds ms (mate score) tags, used by markdup to select the best reads for duplicate marking.
  • sort: Sorts the BAM file by genomic coordinates.
  • markdup: Marks duplicate reads by setting a DUP flag (without -r, duplicate reads are marked but not removed). To remove duplicates, the -r option can be added.
  • view -b: Outputs the sorted and duplicate-marked data in BAM format.

Indexing BAM files:

samtools index output_directory/20_bam/cultivar.bam
samtools index output_directory/20_bam/bulk.bam

For large genomes (e.g., wheat), attempting to create a standard .bai index file using samtools index without the -c option may result in an error due to file size limitations. In such cases, it is necessary to use the -c option, which generates a .csi index that supports large reference genomes:

samtools index -c output_directory/20_bam/cultivar.bam
samtools index -c output_directory/20_bam/bulk.bam

This .csi index is suitable for handling large genomes that exceed the size limitations of .bai.


Step 6: Variant Calling with mpileup

Description:
Once the BAM files are indexed, the next step is to call variants using bcftools mpileup and bcftools call. The AD, ADF, and ADR fields are critical for MutMap, as they provide allele depth information that is essential for the analysis. This command identifies SNPs and other variants in the sequencing data.

For more information on BCFtools, refer to the manual here.

Usage:

bcftools mpileup -a AD,ADF,ADR -B -q 40 -Q 18 -C 50 -O u -f output_directory/10_ref/mutmap_ref.fasta \
output_directory/20_bam/cultivar.bam output_directory/20_bam/bulk.bam | \
bcftools call -vm -f GQ,GP -O u | \
bcftools filter -i "INFO/MQ>=40" -O z -o output_directory/30_vcf/mutmap.vcf.gz

Explanation:

  • mpileup: Generates per-base information for each position in the reference genome.
  • -a AD,ADF,ADR: Includes allele depth information in the output. These fields are crucial for MutMap analysis.
  • -B: Disables BAQ computation.
  • -q 40: Filters reads with mapping quality less than 40.
  • -Q 18: Filters bases with base quality less than 18.
  • -C 50: Adjusts the mapping quality to account for the use of BWA during alignment.

Important Note:

When generating the VCF file, the order in which the cultivar and bulk BAM files are passed into the bcftools mpileup command is important. The cultivar should come first, followed by the bulk. This order is critical for downstream analysis, especially when using MutPlot, as the software expects the cultivar data to be the first sample in the VCF file.

Chromosome-specific processing:

The -r option in bcftools mpileup allows for chromosome-specific processing, which can be used to parallelize the VCF generation for each chromosome individually. This can help speed up the variant calling process for large genomes.

Usage for Chromosome-specific processing:

bcftools mpileup -r test_chr -a AD,ADF,ADR -B -q 40 -Q 18 -C 50 -O u -f output_directory/10_ref/mutmap_ref.fasta \
output_directory/20_bam/cultivar.bam output_directory/20_bam/bulk.bam | \
bcftools call -vm -f GQ,GP -O u | \
bcftools filter -i "INFO/MQ>=40" -O z -o output_directory/30_vcf/mutmap.test_chr.vcf.gz

Concatenating VCF Files:

If the VCF files are generated separately for each chromosome, you can use bcftools concat to concatenate them into a single VCF file for easier analysis.

Usage for Concatenating VCF Files:

bcftools concat -O z -o output_directory/30_vcf/mutmap_combined.vcf.gz output_directory/30_vcf/mutmap.*.vcf.gz

This command will concatenate all chromosome-specific VCF files (mutmap.*.vcf.gz) into a single compressed VCF file (mutmap_combined.vcf.gz).

Index the VCF file with Tabix:

For more information on Tabix, refer to the manual here.

tabix -f -p vcf output_directory/30_vcf/mutmap.vcf.gz

Note for Large Genomes (e.g., Wheat):

For large genomes like wheat, using the default tabix indexing without the -C option may result in an error due to file size limitations. In such cases, it is necessary to use the -C option, which generates a .csi index that supports large reference genomes:

tabix -C -f -p vcf output_directory/30_vcf/mutmap.vcf.gz

This .csi index is suitable for handling large genomes that exceed the size limitations of standard .tbi indexes.


Step 7: SNP Filtering, SNP Index Calculation, and Visualization with MutPlot

Description:
MutPlot is used to filter SNPs, calculate SNP indices, and visualize the results. It helps identify significant SNPs and their impact on the genome, producing various files that assist in analyzing and visualizing the data.

Usage:

mutplot -v output_directory/30_vcf/mutmap.vcf.gz -o output_directory/40_plot -n 20

Required Options:

  • -v: Specifies the input VCF file generated by the variant calling step.
  • -o: Specifies the output directory for the visualization results.
  • -n: Specifies the number of individuals in the mutant bulk.

Using SnpEff for Variant Annotation

Optional: Using MutPlot with SnpEff:
MutPlot can be run in combination with SnpEff to annotate variants for their potential effects. This allows for a more detailed understanding of how SNPs may impact gene function. If you would like to build your own SnpEff database, or learn more about SnpEff, please refer to the SnpEff Documentation. Additionally, the following guide may be useful: Guide to Build SnpEff Database.

When plotting the results, variants classified as MODERATE by SnpEff are marked with a + symbol, while variants classified as HIGH are marked with an x symbol in the plot.

Usage:

mutplot -v output_directory/30_vcf/mutmap.vcf.gz -o output_directory/40_plot -n 20 -e database

Additional Option:

  • -e database: Specifies the SnpEff database to use for annotating variants. Ensure that the correct SnpEff database is installed and available before running this command.

Acknowledgement

We would like to thank Shoya Komura for his helpful advice regarding the use of the -c option in samtools index and the -C option in tabix for handling large genomes.