Abbas24-AI/Genomics-Part-1-Automated-RNA-seq-data-analysis-from-Data-to-Differential-Expression-analysis

Genomics Part 1 — Automated RNA-seq Analysis (FASTQ → Differential Expression)

This repository provides a beginner-friendly RNA-seq pipeline that processes all FASTQ samples automatically and outputs a gene×sample count matrix suitable for downstream differential expression analysis (DESeq2, edgeR, and limma).


Installation

1. Install Conda (if not installed)

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

Verify installation:

conda --version

2. Create RNA-seq Environment & Install Tools

conda create -n rnaseq python=3.12 -y
conda activate rnaseq

conda install -c bioconda -c conda-forge \
    fastqc \
    trim-galore \
    cutadapt \
    multiqc \
    star \
    bowtie2 \
    samtools \
    subread \
    sra-tools -y
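
After installation it can help to confirm that every executable is actually on the PATH of the active environment. The sketch below is an illustration, not part of the repository: `check_tools` is a hypothetical helper name, and the executable names are the ones these conda packages normally provide (the subread package supplies `featureCounts`, sra-tools supplies `fasterq-dump`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each command that is missing from PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executables expected from the packages installed above
if check_tools fastqc trim_galore cutadapt multiqc STAR bowtie2 samtools featureCounts fasterq-dump; then
    echo "All tools found"
else
    echo "The tools listed above are missing; check that the rnaseq environment is active" >&2
fi
```

An empty list followed by "All tools found" means the environment is ready.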

Project Directory Setup

This pipeline assumes the following directory structure:

rnaseq_project/
├── data/
│   ├── fastq/
│   ├── trimmed/
│   └── bam/
├── reference/
│   ├── star_index/
│   └── annotation.gtf
├── scripts/
│   ├── 1_QC_trim.sh
│   ├── 2_run_star_alignment.sh
│   └── 3_featurescount.sh
├── results/
│   ├── qc/
│   ├── counts/
│   └── multiqc/
└── run_pipeline.sh

Create directories:

mkdir -p rnaseq_project/{data/{fastq,trimmed,bam},reference/star_index,results/{qc,counts,multiqc},scripts}
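
To confirm the layout before adding data, a small check along these lines could be used. It is a sketch under the assumption that the project root is `rnaseq_project`; `check_layout` is a hypothetical helper, not a script shipped with the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected directory that is missing
# under the given project root. Returns non-zero if any is missing.
check_layout() {
    local root=$1 status=0 path
    for path in data/fastq data/trimmed data/bam reference \
                results/qc results/counts results/multiqc scripts; do
        if [ ! -d "$root/$path" ]; then
            echo "missing: $root/$path"
            status=1
        fi
    done
    return "$status"
}

check_layout rnaseq_project || echo "Create the missing directories before running the pipeline" >&2
```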

Reference Genome Setup

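The pipeline requires a reference genome (FASTA) and a matching gene annotation (GTF). Decompress both files before use, because STAR reads plain-text FASTA and GTF.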

Human (Option A)

Download:

  • Genome FASTA: GRCh38 / hg38
  • GTF annotation: GENCODE v44

Place inside:

reference/genome.fa
reference/annotation.gtf

Mouse (Option B)

Download:

  • Genome FASTA: GRCm38 / mm10
  • GTF annotation: GENCODE vM25

Place inside:

reference/genome.fa
reference/annotation.gtf

Other Species (Option C)

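Download the species-specific:

  • Genome FASTA
  • GTF annotation (a GFF file should first be converted to GTF, for example with gffread, because the STAR and featureCounts commands below expect GTF gene_id attributes)

Place them at the same paths as above:

reference/genome.fa
reference/annotation.gtf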
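
Whichever option is used, the FASTA and the GTF must name their sequences the same way (for example `chr1` in both, not `chr1` in one and `1` in the other), otherwise few or no reads will be assigned to genes. The following sketch lists sequence names that occur in the GTF but not in the FASTA. `gtf_only_contigs` is a hypothetical helper, and the paths assume the layout described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list sequence names present in the GTF (column 1)
# but absent from the FASTA headers. Empty output means the names agree.
gtf_only_contigs() {
    local fasta=$1 gtf=$2
    comm -13 \
        <(awk '/^>/ { sub(/^>/, ""); print $1 }' "$fasta" | sort -u) \
        <(grep -v '^#' "$gtf" | cut -f1 | sort -u)
}

if [ -f reference/genome.fa ] && [ -f reference/annotation.gtf ]; then
    gtf_only_contigs reference/genome.fa reference/annotation.gtf
fi
```

Any name printed here points to a genome and annotation that come from different sources or releases.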

STAR Index Requirement

STAR needs a prebuilt index. If you do not have one, build it:

STAR --runThreadN 8 \
     --runMode genomeGenerate \
     --genomeDir reference/star_index \
     --genomeFastaFiles reference/genome.fa \
     --sjdbGTFfile reference/annotation.gtf \
     --sjdbOverhang 100

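Set --sjdbOverhang to the maximum read length minus 1 (for example, 100 for 101 bp reads). According to the STAR manual, the default of 100 works about as well as the ideal value in most cases.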
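
If the read length is not known, it can be estimated from the data. The sketch below scans the first reads of a gzipped FASTQ and prints the longest read length minus 1. `suggest_overhang` and its default of 10000 reads are illustrative assumptions, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (longest read length - 1) among the first
# N reads of a gzipped FASTQ file (N defaults to 10000).
suggest_overhang() {
    local fastq=$1 n_reads=${2:-10000}
    gzip -cd "$fastq" | awk -v n="$n_reads" '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }  # sequence lines
        NR >= 4 * n { exit }                                    # stop after N reads
        END { print max - 1 }'
}

for fq in data/fastq/*_R1.fastq.gz; do
    if [ -e "$fq" ]; then
        echo "$fq: suggested --sjdbOverhang $(suggest_overhang "$fq")"
    fi
done
```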


Preparing Input FASTQ Files

Place raw FASTQ files into:

data/fastq/

Required naming format (paired-end):

SAMPLE_R1.fastq.gz
SAMPLE_R2.fastq.gz

Example:

SRR28119110_R1.fastq.gz
SRR28119110_R2.fastq.gz

The pipeline automatically detects and processes all samples matching this pattern.
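
Files downloaded with sra-tools do not follow this pattern: for a paired-end run, `fasterq-dump --split-files` writes `ACC_1.fastq` and `ACC_2.fastq`. A rename step along these lines could bring compressed downloads into the expected format. `rename_sra_pairs` is a hypothetical helper, and the download commands in the comment are only an example of typical usage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename ACC_1.fastq.gz / ACC_2.fastq.gz in a directory
# to ACC_R1.fastq.gz / ACC_R2.fastq.gz. Files already named *_R1/_R2 are untouched.
rename_sra_pairs() {
    local dir=$1 f mate
    for mate in 1 2; do
        for f in "$dir"/*_"$mate".fastq.gz; do
            [ -e "$f" ] || continue   # the glob matched nothing
            mv -- "$f" "${f%_${mate}.fastq.gz}_R${mate}.fastq.gz"
        done
    done
}

# Typical download beforehand (not run here), assuming a paired-end accession:
#   fasterq-dump --split-files SRR28119110 -O data/fastq
#   gzip data/fastq/SRR28119110_*.fastq
rename_sra_pairs data/fastq
```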


Running the Pipeline

Activate environment:

conda activate rnaseq
cd rnaseq_project
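
Before starting Step 1, it may be worth checking that every R1 file has its R2 mate, since the scripts assume complete pairs. This is a minimal sketch; `list_samples` is a hypothetical helper that prints the detected sample names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every complete sample (R1 + R2).
# Warns on stderr and returns non-zero if an R2 file is missing.
list_samples() {
    local dir=$1 status=0 r1 base
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue   # the glob matched nothing
        base=$(basename "$r1" _R1.fastq.gz)
        if [ -e "$dir/${base}_R2.fastq.gz" ]; then
            echo "$base"
        else
            echo "no R2 file for sample: $base" >&2
            status=1
        fi
    done
    return "$status"
}

list_samples data/fastq || echo "Fix unpaired samples before running Step 1" >&2
```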

Step 1 — QC & Trimming

Purpose

This step ensures raw sequencing data quality and removes adapter contamination.

Actions:

  • Runs FastQC to assess read quality
  • Runs Trim Galore to trim adapters
  • Runs FastQC again on trimmed reads for validation

Script: scripts/1_QC_trim.sh

Contents:

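#!/usr/bin/env bash
# Run FastQC on the raw reads, then trim adapters with Trim Galore.
set -euo pipefail

RAW="data/fastq"
TRIM="data/trimmed"
QC="results/qc"

mkdir -p "$TRIM" "$QC"

for R1 in "$RAW"/*_R1.fastq.gz; do
    # Stop if the glob matched nothing (the loop would otherwise see the literal pattern)
    [ -e "$R1" ] || { echo "No *_R1.fastq.gz files found in $RAW" >&2; exit 1; }
    BASE=$(basename "$R1" _R1.fastq.gz)
    R2="$RAW/${BASE}_R2.fastq.gz"

    echo "[QC] Running FastQC on raw reads for sample: $BASE"
    fastqc "$R1" "$R2" -o "$QC"

    echo "[Trim] Running Trim Galore for sample: $BASE"
    # --fastqc runs FastQC on the trimmed reads; those reports are written to $TRIM
    trim_galore --paired --fastqc \
        "$R1" "$R2" \
        -o "$TRIM"
done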

Run:

bash scripts/1_QC_trim.sh

Outputs:

results/qc/       → FastQC reports for raw reads (HTML + ZIP)
data/trimmed/     → trimmed FASTQs (*_val_1.fq.gz, *_val_2.fq.gz), trimming reports, and FastQC reports for trimmed reads

Step 2 — STAR Alignment

Purpose

Align the trimmed reads to the reference genome and produce coordinate-sorted BAM files.

STAR:

  • Aligns each read pair to the genome using the prebuilt index
  • Writes one sorted BAM per sample for counting in Step 3

Script: scripts/2_run_star_alignment.sh

Contents:

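#!/usr/bin/env bash
# Align trimmed read pairs with STAR and write coordinate-sorted BAM files.
set -euo pipefail

TRIM="data/trimmed"
BAM="data/bam"
IDX="reference/star_index"

mkdir -p "$BAM"

# Match only the R1 files written by Trim Galore in paired mode
for R1 in "$TRIM"/*_R1_val_1.fq.gz; do
    [ -e "$R1" ] || { echo "No trimmed R1 files found in $TRIM" >&2; exit 1; }
    BASE=$(basename "$R1" _R1_val_1.fq.gz)
    R2="$TRIM/${BASE}_R2_val_2.fq.gz"

    echo "[STAR] Aligning sample: $BASE"

    # zcat decompresses .gz input on Linux; on macOS use "gunzip -c" instead
    STAR --runThreadN 8 \
         --genomeDir "$IDX" \
         --readFilesIn "$R1" "$R2" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "$BAM/${BASE}_"
done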

Run:

bash scripts/2_run_star_alignment.sh

Outputs:

data/bam/SAMPLE_Aligned.sortedByCoord.out.bam
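
Alongside each BAM, STAR writes a `SAMPLE_Log.final.out` summary with the same prefix. A quick look at the uniquely mapped percentage per sample can flag failed alignments before counting. The sketch below assumes the logs sit in `data/bam/` as produced by the script above; `star_unique_rates` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "sample<TAB>uniquely mapped %" for every
# STAR Log.final.out file in a directory.
star_unique_rates() {
    local dir=$1 log sample
    for log in "$dir"/*_Log.final.out; do
        [ -e "$log" ] || continue   # the glob matched nothing
        sample=$(basename "$log" _Log.final.out)
        # The log line looks like: "Uniquely mapped reads % |<TAB>91.50%"
        awk -F'|' -v s="$sample" '
            /Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print s "\t" $2 }' "$log"
    done
}

star_unique_rates data/bam
```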

Step 3 — Gene Counting (featureCounts)

Purpose

Convert aligned reads into gene-level counts using GTF annotation.

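featureCounts:

  • Assigns reads to genes based on overlap with annotated exons
  • Writes one table with a count column per BAM file, preceded by gene annotation columns (Chr, Start, End, Strand, Length)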

Script: scripts/3_featurescount.sh

Contents:

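#!/usr/bin/env bash
# Count read pairs per gene across all STAR BAM files.
set -euo pipefail

BAM="data/bam"
GTF="reference/annotation.gtf"
OUT="results/counts/gene_counts.txt"

mkdir -p results/counts

echo "[featureCounts] Quantifying reads..."
# -p marks the input as paired-end. From Subread 2.0.2 onward, --countReadPairs
# is also needed to count fragments instead of individual reads; remove it on
# older versions, where -p alone counts fragments.
featureCounts -T 8 -p --countReadPairs -t exon -g gene_id \
    -a "$GTF" \
    -o "$OUT" \
    "$BAM"/*.bam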

Run:

bash scripts/3_featurescount.sh

Outputs:

results/counts/gene_counts.txt
results/counts/gene_counts.txt.summary
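
The raw featureCounts table starts with a comment line, carries five annotation columns between the gene ID and the counts, and uses full BAM paths as column names. For tools that expect a plain gene×sample matrix, it can be reduced as sketched below. `counts_to_matrix` and the output name `counts_matrix.tsv` are hypothetical, and the suffix stripped from the header assumes the STAR file names from Step 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce featureCounts output to Geneid + count columns
# and shorten the header from BAM paths to sample names.
counts_to_matrix() {
    local counts=$1
    grep -v '^#' "$counts" | cut -f1,7- | awk 'BEGIN { FS = OFS = "\t" }
        NR == 1 {
            for (i = 2; i <= NF; i++) {
                sub(/.*\//, "", $i)                              # drop directory
                sub(/_Aligned\.sortedByCoord\.out\.bam$/, "", $i) # drop STAR suffix
            }
        }
        { print }'
}

if [ -f results/counts/gene_counts.txt ]; then
    counts_to_matrix results/counts/gene_counts.txt > results/counts/counts_matrix.tsv
fi
```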

Step 4 — MultiQC Summary (Optional)

Purpose

Aggregate QC metrics from:

  • FastQC
  • Trim Galore
  • STAR alignment

Run:

multiqc results/qc data/trimmed data/bam -o results/multiqc/

Output:

results/multiqc/multiqc_report.html

Open report in browser for QC summary across all samples.


Batch Execution (One Command)

To run all steps automatically, create:

File: run_pipeline.sh

Contents:

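#!/usr/bin/env bash
set -eo pipefail

# Run from the project root so the relative paths in the scripts resolve
cd "$(dirname "${BASH_SOURCE[0]}")"

echo "[PIPELINE] Starting RNA-seq pipeline..."
# "conda activate" needs conda's shell functions, which non-interactive
# scripts do not load by default
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate rnaseq

bash scripts/1_QC_trim.sh
bash scripts/2_run_star_alignment.sh
bash scripts/3_featurescount.sh

echo "[PIPELINE] Running MultiQC..."
multiqc results/qc data/trimmed data/bam -o results/multiqc/

echo "[DONE] Pipeline Completed Successfully!"
echo "Counts → results/counts/gene_counts.txt"
echo "MultiQC → results/multiqc/multiqc_report.html"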

Make it executable:

chmod +x run_pipeline.sh

Run the entire workflow:

./run_pipeline.sh

Main Outputs

| Output | Location | Usage |
| --- | --- | --- |
| Gene count matrix | results/counts/gene_counts.txt | DESeq2 / edgeR |
| Sorted BAM files | data/bam/*.bam | IGV / QC |
| QC reports | results/qc/ (raw reads), data/trimmed/ (trimmed reads) | Raw & trimmed QC |
| MultiQC report | results/multiqc/multiqc_report.html | Aggregate QC |

Downstream Analysis (DESeq2)

Part 4 — Differential Expression & Functional Enrichment (R Script)

Once you have generated the gene-level counts using featureCounts (Step 3), you can perform:

  • Differential expression analysis (DESeq2)
  • Heatmap visualization of top DEGs
  • Functional enrichment (g:Profiler: GO/KEGG/Reactome)
  • Simple correlation-based regulatory network inference

This is implemented in the R script:

4_diff_exp.R


## Tool Citations

- Andrews S. (2010) *FastQC*
- Krueger F. *Trim Galore*
- Dobin et al. (2013) *STAR*
- Liao et al. (2014) *featureCounts*
- Ewels et al. (2016) *MultiQC*

## Acknowledgment
I gratefully acknowledge the tutorials and resources of the NGS Learning Hub (https://ngs101.com), which provides beginner-friendly guides to next-generation sequencing (NGS) workflows, including bulk RNA-seq and transcriptomics. These materials aided the development and refinement of this pipeline and helped put each step, from raw data processing to downstream interpretation and enrichment analysis, in context.
