This repository contains scripts, data, and results for a bioinformatics pipeline focused on variant calling for Mycobacterium tuberculosis (MTB). The pipeline includes data preprocessing, quality control, variant calling, and various output visualizations. The project is designed to be modular and reproducible, allowing for easy adaptation to similar datasets.
The `data/` directory contains raw genome files and reference data.
- Mtb_ref.gbk: Reference genome in GenBank format.
- Mtb_ref.fasta: Reference genome in FASTA format.
- TB_R1.fastq: Raw sequencing reads (forward).
- TB_R2.fastq: Raw sequencing reads (reverse).
- urls.txt: URLs for downloading datasets.
The `code/` directory contains the Bash scripts used in the pipeline.
- Data Scripts:
  - get_data.sh: Script for downloading required datasets.
  - gbk_fasta.sh: Converts `.gbk` files to `.fasta` format using `seqret`.
- Quality Control Scripts:
  - run_fastqc.sh: Runs FastQC for quality control of raw sequencing reads.
  - run_fastp.sh: Performs read trimming and filtering using Fastp.
  - multiqc.sh: Aggregates quality control results using MultiQC.
- Variant Calling Scripts:
  - variant_calling.sh: Performs variant calling using BWA, Samtools, and BCFtools.
The `qc_reports/` directory contains results from quality control steps.
- fastqc_results/: Results from FastQC.
- fastp_results/: Results from Fastp.
- multiqc_results/: Aggregated results from MultiQC.
The `results/` directory contains final results from the pipeline.
- Var_Cal/:
  - sample_sorted.bam: Sorted BAM file.
  - variants.vcf: Variant call file.
- Var_Outputs/:
  - all_variants_vs_transversions.png: Comparison of all variants vs transversions.
  - indels_vs_transversions.png: Comparison of indels vs transversions.
  - snps_summary.png: Summary of SNPs.
  - summarize_dist_snps.png: Distribution of SNPs across the genome.
  - summarize_dist_indels.png: Distribution of indels across the genome.
  - summarize_dist_vars.png: Overall distribution of variants.
  - summarize_overall.png: Overview of all variant types combined.
  - summarize_stat_one.png: Summary statistics for variants.
  - Rplot.png: Statistical plot generated using R.
  - Summary_Statistics.csv: Tabular data summarizing variant statistics.
- Download raw sequencing reads and reference files using get_data.sh.
- Convert the reference genome from `.gbk` to `.fasta` using gbk_fasta.sh.
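The conversion step could look roughly like the sketch below. It is not the contents of gbk_fasta.sh: the function name `gbk_to_fasta` and the default output path are assumptions made for illustration, and the only thing taken from this README is that `seqret` (from EMBOSS) does the conversion.

```bash
# Hypothetical helper: convert a GenBank file to FASTA with EMBOSS seqret.
# Usage: gbk_to_fasta INPUT.gbk [OUTPUT.fasta]
gbk_to_fasta() {
    local gbk="$1"
    # Default output: same path with the .gbk suffix swapped for .fasta
    local fasta="${2:-${gbk%.gbk}.fasta}"

    # Refuse to run on a missing or empty input file
    if [ ! -s "$gbk" ]; then
        echo "error: missing or empty input: $gbk" >&2
        return 1
    fi

    # -sequence: input file, -outseq: output file, -osformat: output format
    seqret -sequence "$gbk" -outseq "$fasta" -osformat fasta
}

# Example call (paths assumed from the directory layout in this README):
#   gbk_to_fasta data/Mtb_ref.gbk data/Mtb_ref.fasta
```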
- Run FastQC (run_fastqc.sh) to assess raw read quality.
- Perform read trimming and filtering using Fastp (run_fastp.sh).
- Aggregate QC results using MultiQC (multiqc.sh).
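As a rough illustration, the three QC steps can be chained in a single function. This is a sketch under assumptions, not a copy of the repository's scripts: the function name `run_qc`, the trimmed-read file names, and the report file names are hypothetical, while the output subdirectories follow the names listed in this README.

```bash
# Hypothetical helper: FastQC -> fastp -> MultiQC on one paired-end sample.
# Usage: run_qc READS_R1.fastq READS_R2.fastq [OUTPUT_DIR]
run_qc() {
    local r1="$1"
    local r2="$2"
    local out="${3:-qc_reports}"

    if [ ! -s "$r1" ] || [ ! -s "$r2" ]; then
        echo "error: both read files must exist and be non-empty" >&2
        return 1
    fi

    # FastQC does not create its output directory, so make all of them first
    mkdir -p "$out/fastqc_results" "$out/fastp_results" "$out/multiqc_results"

    # 1. Quality reports for the raw reads
    fastqc -o "$out/fastqc_results" "$r1" "$r2" || return 1

    # 2. Trim and filter the read pair; HTML and JSON reports are kept
    #    (output file names below are assumed for this sketch)
    fastp -i "$r1" -I "$r2" \
        -o "$out/fastp_results/trimmed_R1.fastq" \
        -O "$out/fastp_results/trimmed_R2.fastq" \
        -h "$out/fastp_results/fastp.html" \
        -j "$out/fastp_results/fastp.json" || return 1

    # 3. Collect the FastQC and fastp reports into one MultiQC report
    multiqc "$out/fastqc_results" "$out/fastp_results" -o "$out/multiqc_results"
}

# Example call:
#   run_qc data/TB_R1.fastq data/TB_R2.fastq qc_reports
```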
- Align reads to the reference genome using BWA.
- Process alignments with Samtools.
- Call variants using BCFtools (variant_calling.sh).
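The alignment and calling steps above might be combined as in the following sketch. It is not taken from variant_calling.sh: the function name `call_variants`, the indexing steps, and the specific flags are assumptions, and only the output names sample_sorted.bam and variants.vcf come from this README.

```bash
# Hypothetical helper: align paired reads with BWA, sort with Samtools,
# and call variants with BCFtools.
# Usage: call_variants REF.fasta READS_R1.fastq READS_R2.fastq [OUTPUT_DIR]
call_variants() {
    local ref="$1"
    local r1="$2"
    local r2="$3"
    local out="${4:-results/Var_Cal}"
    mkdir -p "$out"

    # Build the BWA index for the reference (required before bwa mem)
    bwa index "$ref" || return 1

    # Align the read pair and sort the alignments by coordinate.
    # Without `set -o pipefail`, only the exit status of the last
    # command in each pipe is checked here.
    bwa mem "$ref" "$r1" "$r2" \
        | samtools sort -o "$out/sample_sorted.bam" - || return 1

    # Index the sorted BAM so downstream tools can access it by region
    samtools index "$out/sample_sorted.bam" || return 1

    # Pile up reads against the reference, then call variants:
    # -m multiallelic caller, -v variant sites only, -Ov uncompressed VCF
    bcftools mpileup -f "$ref" "$out/sample_sorted.bam" \
        | bcftools call -mv -Ov -o "$out/variants.vcf"
}

# Example call:
#   call_variants data/Mtb_ref.fasta data/TB_R1.fastq data/TB_R2.fastq results/Var_Cal
```

In a real run the trimmed reads from the QC step would normally be passed in place of the raw FASTQ files.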
- FastQC
- Fastp
- MultiQC
- BWA
- Samtools
- BCFtools
- EMBOSS (for `seqret`)
Install the tools into a dedicated conda environment:
conda create -n ngs_pipeline -c bioconda fastqc fastp multiqc bwa samtools bcftools emboss
conda activate ngs_pipeline
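After activating the environment, it can be useful to confirm that every tool is actually on the `PATH` before starting a run. The helper below is a small sketch for that purpose; the function name `check_tools` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: report which of the given commands are not on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        # `command -v` succeeds only if the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_tools fastqc fastp multiqc bwa samtools bcftools seqret
```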
- Bash shell
- Anaconda or similar environment for managing dependencies
- Clone the repository:
  git clone https://github.com/Itsbosire/Mycobacterium_tuberculosis_Variant_calling MTB
  cd MTB
- Download required datasets:
  bash Code/get_data.sh
- Convert `.gbk` to `.fasta`:
  bash Code/Data/gbk_fasta.sh
- Perform quality control:
  bash Code/Qc/run_fastqc.sh
  bash Code/Qc/run_fastp.sh
  bash Code/Qc/multiqc.sh
- Perform variant calling:
  bash Code/variant/variant_calling.sh
The datasets used in this project are publicly available and can be downloaded using the URLs provided in Data/urls.txt.
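One possible way to fetch the files listed in urls.txt is sketched below. It assumes the file holds one URL per line, which this README does not state, and the function name `download_datasets` is hypothetical; get_data.sh may work differently.

```bash
# Hypothetical helper: download every URL listed in a text file.
# Assumes one URL per line in the list file.
# Usage: download_datasets [URL_FILE] [DEST_DIR]
download_datasets() {
    local url_file="${1:-data/urls.txt}"
    local dest="${2:-data}"

    if [ ! -s "$url_file" ]; then
        echo "error: URL list not found or empty: $url_file" >&2
        return 1
    fi

    mkdir -p "$dest"

    # -nc: skip files that already exist, -i: read URLs from a file,
    # -P: directory to save downloads into
    wget -nc -i "$url_file" -P "$dest"
}

# Example call:
#   download_datasets data/urls.txt data
```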
MTB/
├── data/ # Raw genome files and reference data
│ ├── Mtb_ref.gbk # Reference genome in GenBank format
│ ├── Mtb_ref.fasta # Reference genome in FASTA format
│ ├── TB_R1.fastq # Raw sequencing reads (forward)
│ ├── TB_R2.fastq # Raw sequencing reads (reverse)
│ ├── urls.txt # URLs for downloading datasets
├── code/ # Bash scripts used in the pipeline
│ ├── gbk_fasta.sh # Convert `.gbk` files to `.fasta` format
│ ├── get_data.sh # Download required datasets
│ ├── run_fastqc.sh # Run FastQC for quality control
│ ├── run_fastp.sh # Perform read trimming and filtering
│ ├── multiqc.sh # Aggregate QC results using MultiQC
│ ├── variant_calling.sh # Perform variant calling using BWA, Samtools, and BCFtools
├── qc_reports/ # Quality control outputs
│ ├── fastqc_results/ # FastQC results
│ ├── fastp_results/ # Fastp results
│ ├── multiqc_results/ # Aggregated QC reports
├── results/ # Final results from the pipeline
│ ├── Var_Cal/ # Variant calling outputs
│ │ ├── sample_sorted.bam # Sorted BAM file
│ │ ├── variants.vcf # Variant call file
│ ├── Var_Outputs/ # Visualizations and summary statistics
│ │ ├── all_variants_vs_transversions.png # Comparison of all variants vs transversions
│ │ ├── indels_vs_transversions.png # Comparison of indels vs transversions
│ │ ├── snps_summary.png # Summary of SNPs
│ │ ├── summarize_dist_snps.png # Distribution of SNPs across the genome
│ │ ├── summarize_dist_indels.png # Distribution of indels across the genome
│ │ ├── summarize_dist_vars.png # Overall distribution of variants
│ │ ├── summarize_overall.png # Overview of all variant types combined
│ │ ├── summarize_stat_one.png # Summary statistics for variants
│ │ ├── Rplot.png # Statistical plot generated using R
│ │ ├── Summary_Statistics.csv # Tabular data summarizing variant statistics
└── README.md # Project documentation

