Itsbosire/Mycobacterium_tuberculosis_Variant_calling

MTB Variant Calling Project

This repository contains scripts, data, and results for a bioinformatics pipeline that calls variants in Mycobacterium tuberculosis (MTB). The pipeline covers data preparation, read quality control, variant calling, and plots that summarize the called variants. It is modular and reproducible, so it can be adapted to similar datasets.


Project Structure

Data/

Contains raw genome files and reference data.

  • Mtb_ref.gbk: Reference genome in GenBank format.
  • Mtb_ref.fasta: Reference genome in FASTA format.
  • TB_R1.fastq: Raw sequencing reads (forward).
  • TB_R2.fastq: Raw sequencing reads (reverse).
  • urls.txt: URLs for downloading datasets.

Code/

Contains Bash scripts used in the pipeline.

  • Data Scripts:

    • get_data.sh: Script for downloading required datasets.
    • gbk_fasta.sh: Converts .gbk files to .fasta format using seqret.
  • Quality Control Scripts:

    • run_fastqc.sh: Runs FastQC for quality control of raw sequencing reads.
    • run_fastp.sh: Performs read trimming and filtering using Fastp.
    • multiqc.sh: Aggregates quality control results using MultiQC.
  • Variant Calling Scripts:

    • variant_calling.sh: Performs variant calling using BWA, Samtools, and BCFtools.

Quality_Control/

Contains results from quality control steps.

  • fastqc_results/: Results from FastQC.
  • fastp_results/: Results from Fastp.
  • multiqc_results/: Aggregated results from MultiQC.

Results/

Contains final results from the pipeline.

  • Var_Cal/:

    • sample_sorted.bam: Sorted BAM file.
    • variants.vcf: Variant call file.
  • Var_Outputs/:

    • all_variants_vs_transversions.png: Comparison of all variants vs transversions.
    • indels_vs_transversions.png: Comparison of indels vs transversions.
    • snps_summary.png: Summary of SNPs.
    • summarize_dist_snps.png: Distribution of SNPs across the genome.
    • summarize_dist_indels.png: Distribution of indels across the genome.
    • summarize_dist_vars.png: Overall distribution of variants.
    • summarize_overall.png: Overview of all variant types combined.
    • summarize_stat_one.png: Summary statistics for variants.
    • Rplot.png: Statistical plot generated using R.
    • Summary_Statistics.csv: Tabular data summarizing variant statistics.
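The summary files listed above are produced from variants.vcf. As a rough illustration of how such counts can be derived, the sketch below tallies SNPs, indels, transitions, and transversions with awk. The function name `summarize_vcf` and the CSV column names are hypothetical; the actual layout of Summary_Statistics.csv is not described in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count variant classes in an uncompressed VCF file.
# Multi-allelic records are counted once per ALT allele.
summarize_vcf() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            ref = toupper($4)
            n = split(toupper($5), alts, ",")
            for (i = 1; i <= n; i++) {
                alt = alts[i]
                # ignore missing and symbolic alleles
                if (alt == "." || alt == "*" || alt ~ /^</) continue
                if (ref ~ /^[ACGT]$/ && alt ~ /^[ACGT]$/) {
                    snps++
                    pair = ref alt
                    # purine<->purine or pyrimidine<->pyrimidine is a transition
                    if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
                    else tv++
                } else if (length(ref) != length(alt)) {
                    indels++
                }
            }
        }
        END {
            print "snps,indels,transitions,transversions"
            printf "%d,%d,%d,%d\n", snps, indels, ts, tv
        }
    ' "$1"
}
```

For example, `summarize_vcf Results/Var_Cal/variants.vcf` prints a two-line CSV to standard output. Multi-nucleotide substitutions of equal length are not counted in this sketch.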

Pipeline Overview

1. Data Preparation

  • Download raw sequencing reads and reference files using get_data.sh.
  • Convert reference genome from .gbk to .fasta using gbk_fasta.sh.
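The conversion step can be sketched as follows. This is not the content of gbk_fasta.sh, only a minimal illustration of calling EMBOSS seqret; the function name `gbk_to_fasta`, the `run` helper, and the `DRY_RUN` variable are hypothetical, and the default paths are taken from the Data/ listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert a GenBank file to FASTA with EMBOSS seqret.
# Default paths are assumptions based on the Data/ directory listing.
gbk_to_fasta() {
    local gbk="${1:-Data/Mtb_ref.gbk}"
    local fasta="${2:-Data/Mtb_ref.fasta}"
    # -osformat sets the output sequence format.
    run seqret -sequence "$gbk" -outseq "$fasta" -osformat fasta
}
```

Calling `gbk_to_fasta` with no arguments converts Data/Mtb_ref.gbk; setting `DRY_RUN=1` first shows the seqret command without executing it.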

2. Quality Control

  • Run FastQC (run_fastqc.sh) to assess raw read quality.
  • Perform read trimming and filtering using Fastp (run_fastp.sh).
  • Aggregate QC results using MultiQC (multiqc.sh).
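The three QC steps above might be chained roughly as shown here. This is a sketch under stated assumptions rather than the repository's scripts: the function `qc_reads`, the `run` helper, the `DRY_RUN` variable, and the trimmed-read and report file names are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run FastQC, fastp, and MultiQC on one pair of FASTQ files.
# Output file names below are illustrative assumptions.
qc_reads() {
    local r1="$1" r2="$2" outdir="${3:-Quality_Control}"
    run mkdir -p "$outdir/fastqc_results" "$outdir/fastp_results" "$outdir/multiqc_results" &&
    # Quality report for the raw reads
    run fastqc -o "$outdir/fastqc_results" "$r1" "$r2" &&
    # Paired-end trimming and filtering, with JSON and HTML reports
    run fastp -i "$r1" -I "$r2" \
        -o "$outdir/fastp_results/trimmed_R1.fastq" \
        -O "$outdir/fastp_results/trimmed_R2.fastq" \
        -j "$outdir/fastp_results/fastp.json" \
        -h "$outdir/fastp_results/fastp.html" &&
    # Aggregate the FastQC and fastp reports
    run multiqc "$outdir/fastqc_results" "$outdir/fastp_results" -o "$outdir/multiqc_results"
}
```

A typical call would be `qc_reads Data/TB_R1.fastq Data/TB_R2.fastq`. The `&&` chain stops at the first failing step.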

3. Variant Calling

  • Align reads to the reference genome using BWA.
  • Process alignments with Samtools.
  • Call variants using BCFtools (variant_calling.sh).
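As a minimal sketch of these three steps, the function below aligns paired reads, sorts and indexes the alignments, and calls variants. It is not a copy of variant_calling.sh: the function name `call_variants` and the choice of options (for example `--ploidy 1`, chosen here because MTB is haploid) are assumptions. The output file names follow the Results/Var_Cal/ listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the body runs in a subshell so that pipefail
# does not leak into the calling shell.
call_variants() (
    set -o pipefail
    ref="$1"; r1="$2"; r2="$3"; outdir="${4:-Results/Var_Cal}"

    # Fail early if any input file is missing.
    for f in "$ref" "$r1" "$r2"; do
        [ -f "$f" ] || { echo "missing input: $f" >&2; exit 1; }
    done
    mkdir -p "$outdir" || exit 1

    # Index the reference, then align and coordinate-sort in one pipe.
    bwa index "$ref" || exit 1
    bwa mem "$ref" "$r1" "$r2" \
        | samtools sort -o "$outdir/sample_sorted.bam" - || exit 1
    samtools index "$outdir/sample_sorted.bam" || exit 1

    # Pileup and call; -m is the multiallelic caller, -v keeps variant sites only.
    bcftools mpileup -Ou -f "$ref" "$outdir/sample_sorted.bam" \
        | bcftools call -mv --ploidy 1 -Ov -o "$outdir/variants.vcf"
)
```

For example: `call_variants Data/Mtb_ref.fasta Data/TB_R1.fastq Data/TB_R2.fastq`. In practice the trimmed reads from the fastp step would likely be passed instead of the raw reads.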

Summary of Outputs

[Figure: Rplot]

[Figure: Summary]

Requirements

Tools

  • FastQC
  • Fastp
  • MultiQC
  • BWA
  • Samtools
  • BCFtools
  • EMBOSS (for seqret)

Installation

conda create -n ngs_pipeline -c conda-forge -c bioconda fastqc fastp multiqc bwa samtools bcftools emboss
conda activate ngs_pipeline
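After activating the environment, it can be useful to confirm that every tool is on the PATH before running the pipeline. The helper below is a small sketch for that purpose; `check_tools` is a hypothetical name and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report any command that is not found on PATH.
# Returns 0 if all tools are available, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this project the call would be `check_tools fastqc fastp multiqc bwa samtools bcftools seqret`.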

Environment

  • Bash shell
  • Anaconda or similar environment for managing dependencies

Usage

1. Clone the Repository

git clone https://github.com/Itsbosire/Mycobacterium_tuberculosis_Variant_calling MTB
cd MTB

2. Run the Pipeline Step by Step

  • Download required datasets:
    bash Code/get_data.sh
  • Convert .gbk to .fasta:
    bash Code/Data/gbk_fasta.sh
  • Perform quality control:
    bash Code/Qc/run_fastqc.sh
    bash Code/Qc/run_fastp.sh
    bash Code/Qc/multiqc.sh
  • Perform variant calling:
    bash Code/variant/variant_calling.sh

Data Sources

The datasets used in this project are publicly available and can be downloaded using the URLs provided in Data/urls.txt.
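A download loop over that file might look like the sketch below. It assumes urls.txt holds one URL per line, which this README does not confirm, and it is not the content of get_data.sh. The names `download_urls`, `run`, and `DRY_RUN` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download every URL listed in a text file (assumed: one URL per line).
# Blank lines and lines starting with '#' are skipped.
download_urls() {
    local list="${1:-Data/urls.txt}" dest="${2:-Data}" url
    [ -f "$list" ] || { echo "URL list not found: $list" >&2; return 1; }
    mkdir -p "$dest" || return 1
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # -f fails on HTTP errors, -L follows redirects.
        # The local file name is taken from the last path component of the URL.
        run curl -fL -o "$dest/$(basename "$url")" "$url" || return 1
    done < "$list"
}
```

Running `DRY_RUN=1` followed by `download_urls` lists the curl commands without fetching anything. URLs that end in a query string would need a different way of choosing the local file name.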


Project Structure

MTB/
├── data/                  # Raw genome files and reference data
│   ├── Mtb_ref.gbk          # Reference genome in GenBank format
│   ├── Mtb_ref.fasta        # Reference genome in FASTA format
│   ├── TB_R1.fastq          # Raw sequencing reads (forward)
│   ├── TB_R2.fastq          # Raw sequencing reads (reverse)
│   ├── urls.txt             # URLs for downloading datasets
├── code/                  # Bash scripts used in the pipeline
│   ├── gbk_fasta.sh         # Convert `.gbk` files to `.fasta` format
│   ├── get_data.sh          # Download required datasets
│   ├── run_fastqc.sh        # Run FastQC for quality control
│   ├── run_fastp.sh         # Perform read trimming and filtering
│   ├── multiqc.sh           # Aggregate QC results using MultiQC
│   ├── variant_calling.sh   # Perform variant calling using BWA, Samtools, and BCFtools
├── qc_reports/            # Quality control outputs
│   ├── fastqc_results/      # FastQC results
│   ├── fastp_results/       # Fastp results
│   ├── multiqc_results/     # Aggregated QC reports
├── results/               # Final results from the pipeline
│   ├── Var_Cal/             # Variant calling outputs
│   │   ├── sample_sorted.bam # Sorted BAM file
│   │   ├── variants.vcf     # Variant call file
│   ├── Var_Outputs/         # Visualizations and summary statistics
│   │   ├── all_variants_vs_transversions.png  # Comparison of all variants vs transversions
│   │   ├── indels_vs_transversions.png        # Comparison of indels vs transversions
│   │   ├── snps_summary.png                   # Summary of SNPs
│   │   ├── summarize_dist_snps.png            # Distribution of SNPs across the genome
│   │   ├── summarize_dist_indels.png          # Distribution of indels across the genome
│   │   ├── summarize_dist_vars.png            # Overall distribution of variants
│   │   ├── summarize_overall.png              # Overview of all variant types combined
│   │   ├── summarize_stat_one.png             # Summary statistics for variants
│   │   ├── Rplot.png                          # Statistical plot generated using R
│   │   ├── Summary_Statistics.csv             # Tabular data summarizing variant statistics
└── README.md              # Project documentation

About

This repository provides an NGS workflow for variant calling and visualization on Mtb raw reads.
