Genomics Part 1 — Automated RNA-seq Analysis (FASTQ → Differential Expression)

This repository provides a beginner-friendly RNA-seq pipeline that processes all FASTQ samples automatically and outputs a gene×sample count matrix suitable for downstream differential expression analysis (DESeq2, edgeR, and limma).

Installation

1. Install Conda (if not installed)

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

Verify installation:

conda --version

2. Create RNA-seq Environment & Install Tools

conda create -n rnaseq python=3.12 -y
conda activate rnaseq

conda install -c bioconda -c conda-forge \
    fastqc \
    trim-galore \
    cutadapt \
    multiqc \
    star \
    bowtie2 \
    samtools \
    subread \
    sra-tools -y

Project Directory Setup

This pipeline assumes the following directory structure:

rnaseq_project/
├── data/
│   ├── fastq/
│   ├── trimmed/
│   └── bam/
├── reference/
│   ├── star_index/
│   └── annotation.gtf
├── scripts/
│   ├── 1_QC_trim.sh
│   ├── 2_run_star_alignment.sh
│   └── 3_featurescount.sh
├── results/
│   ├── qc/
│   ├── counts/
│   └── multiqc/
└── run_pipeline.sh

Create directories:

mkdir -p rnaseq_project/{data/{fastq,trimmed,bam},reference/star_index,results/{qc,counts,multiqc},scripts}

Reference Genome Setup

This pipeline requires a reference genome FASTA file and a matching GTF annotation file.

Human (Option A)

Download:

Genome FASTA: GRCh38 / hg38
GTF annotation: GENCODE v44

Place inside:

reference/genome.fa
reference/annotation.gtf

Mouse (Option B)

Download:

Genome FASTA: GRCm38 / mm10
GTF annotation: GENCODE vM25

Place inside:

reference/genome.fa
reference/annotation.gtf
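
For either GENCODE option above, the download locations can be assembled from the species and release. The following is a minimal sketch only: `gencode_urls` is a hypothetical helper, and the URL layout on the EBI FTP mirror is an assumption that should be confirmed on the GENCODE site before use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the GENCODE genome and annotation URLs for a species/release.
# The URL layout below is an assumption; confirm it on the GENCODE site first.
gencode_urls() {
    local species="$1" release="$2" assembly
    case "$species" in
        human) assembly="GRCh38" ;;
        mouse) assembly="GRCm38" ;;   # matches vM25; later mouse releases use GRCm39
        *) echo "unsupported species: $species" >&2; return 1 ;;
    esac
    local base="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_${species}/release_${release}"
    echo "${base}/${assembly}.primary_assembly.genome.fa.gz"
    echo "${base}/gencode.v${release}.annotation.gtf.gz"
}

# Example (commented out because the downloads are large):
# curl -L "$(gencode_urls human 44 | sed -n 1p)" | gunzip -c > reference/genome.fa
# curl -L "$(gencode_urls human 44 | sed -n 2p)" | gunzip -c > reference/annotation.gtf
```

Both files are decompressed on the way in because the index-building command expects plain `reference/genome.fa` and `reference/annotation.gtf`.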

Other Species (Option C)

Download species-specific:

Genome FASTA
GTF or GFF annotation

STAR Index Requirement

STAR needs a prebuilt index. If you do not have one, build it:

STAR --runThreadN 8 \
     --runMode genomeGenerate \
     --genomeDir reference/star_index \
     --genomeFastaFiles reference/genome.fa \
     --sjdbGTFfile reference/annotation.gtf \
     --sjdbOverhang 100

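Set --sjdbOverhang to the read length minus 1 (for example, 100 for 101 bp reads). According to the STAR manual, the default of 100 works well in most cases.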
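
If the read length is not known in advance, it can be estimated from a raw (untrimmed) FASTQ file. The sketch below is illustrative: `max_read_length` and `sjdb_overhang` are hypothetical helper names, not part of STAR or this repository, and the code assumes gzip-compressed FASTQ with four lines per record.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of STAR): estimate read length, derive --sjdbOverhang.

# Print the longest read among the first N records (default 10000) of a gzipped FASTQ.
max_read_length() {
    local fq="$1" n="${2:-10000}"
    gzip -cd -- "$fq" | awk -v n="$n" '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }  # sequence lines only
        NR >= 4 * n { exit }                                     # stop after N records
        END         { print max + 0 }'
}

# sjdbOverhang = read length - 1
sjdb_overhang() {
    echo $(( $1 - 1 ))
}

# Example (commented out so the block has no side effects):
# len=$(max_read_length data/fastq/SRR28119110_R1.fastq.gz)
# echo "Use --sjdbOverhang $(sjdb_overhang "$len")"
```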

Preparing Input FASTQ Files

Place raw FASTQ files into:

data/fastq/

Required naming format (paired-end):

SAMPLE_R1.fastq.gz
SAMPLE_R2.fastq.gz

Example:

SRR28119110_R1.fastq.gz
SRR28119110_R2.fastq.gz

The pipeline automatically detects and processes all samples matching this pattern.
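
Before starting a run, it can help to confirm that every R1 file has its R2 mate. A minimal sketch, assuming the naming format above; `list_paired_samples` is a hypothetical helper and is not one of the repository scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list sample names that have both R1 and R2 files.
# Samples with a missing mate are reported on stderr and omitted from the output.
list_paired_samples() {
    local dir="$1" r1 base
    for r1 in "$dir"/*_R1.fastq.gz; do
        [[ -e "$r1" ]] || continue            # no matches: the glob stays literal
        base=$(basename "$r1" _R1.fastq.gz)
        if [[ -f "$dir/${base}_R2.fastq.gz" ]]; then
            echo "$base"
        else
            echo "[WARN] $base has no R2 file" >&2
        fi
    done
}

# Example: list_paired_samples data/fastq
```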

Running the Pipeline

Activate environment:

conda activate rnaseq
cd rnaseq_project

Step 1 — QC & Trimming

Purpose

This step ensures raw sequencing data quality and removes adapter contamination.

Actions:

Runs FastQC to assess read quality
Runs Trim Galore to trim adapters
Runs FastQC again on trimmed reads for validation

Script: `scripts/1_QC_trim.sh`

Contents:

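#!/usr/bin/env bash
# Run FastQC on raw reads, then trim adapters with Trim Galore (paired-end).
set -euo pipefail
shopt -s nullglob   # an empty data/fastq/ yields zero iterations, not a literal pattern

RAW="data/fastq"
TRIM="data/trimmed"
QC="results/qc"

mkdir -p "$TRIM" "$QC"

for R1 in "$RAW"/*_R1.fastq.gz; do
    BASE=$(basename "$R1" _R1.fastq.gz)
    R2="$RAW/${BASE}_R2.fastq.gz"

    # Skip samples whose mate file is missing instead of failing mid-run
    if [[ ! -f "$R2" ]]; then
        echo "[WARN] Missing R2 for sample: $BASE (skipping)" >&2
        continue
    fi

    echo "[QC] Running FastQC on raw reads for sample: $BASE"
    fastqc "$R1" "$R2" -o "$QC"

    echo "[Trim] Running Trim Galore for sample: $BASE"
    # --fastqc re-runs FastQC on the trimmed reads
    trim_galore --paired --fastqc \
        "$R1" "$R2" \
        -o "$TRIM"
done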

Run:

bash scripts/1_QC_trim.sh

Outputs:

results/qc/       → QC reports (HTML + ZIP)
data/trimmed/     → trimmed FASTQs (*_val_1.fq.gz, *_val_2.fq.gz)
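
Because alignment depends on the `*_val_1.fq.gz` / `*_val_2.fq.gz` names, a quick check that every raw sample produced both trimmed files can catch a failed trim early. This is a sketch under the naming shown above; `missing_trimmed` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print samples in RAW that lack one or both trimmed outputs in TRIM.
# Assumes Trim Galore's paired-end naming: <sample>_R1_val_1.fq.gz and <sample>_R2_val_2.fq.gz.
missing_trimmed() {
    local raw="$1" trim="$2" r1 base
    for r1 in "$raw"/*_R1.fastq.gz; do
        [[ -e "$r1" ]] || continue
        base=$(basename "$r1" _R1.fastq.gz)
        if [[ ! -f "$trim/${base}_R1_val_1.fq.gz" || ! -f "$trim/${base}_R2_val_2.fq.gz" ]]; then
            echo "$base"
        fi
    done
}

# Example: missing_trimmed data/fastq data/trimmed
```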

Step 2 — STAR Alignment

Purpose

Align trimmed reads to the reference genome and produce coordinate-sorted BAM files.

STAR:

Aligns reads to the genome
Produces a sorted BAM for counting

Script: `scripts/2_run_star_alignment.sh`

Contents:

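#!/usr/bin/env bash
# Align each trimmed read pair with STAR and write a coordinate-sorted BAM.
set -euo pipefail
shopt -s nullglob

TRIM="data/trimmed"
BAM="data/bam"
IDX="reference/star_index"

mkdir -p "$BAM"

# Trim Galore names paired outputs <sample>_R1_val_1.fq.gz and <sample>_R2_val_2.fq.gz
for R1 in "$TRIM"/*_R1_val_1.fq.gz; do
    BASE=$(basename "$R1" _R1_val_1.fq.gz)
    R2="$TRIM/${BASE}_R2_val_2.fq.gz"

    echo "[STAR] Aligning sample: $BASE"

    # zcat here is the GNU gzip version; on macOS use "gunzip -c" instead
    STAR --runThreadN 8 \
         --genomeDir "$IDX" \
         --readFilesIn "$R1" "$R2" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "$BAM/${BASE}_"
done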

Run:

bash scripts/2_run_star_alignment.sh

Outputs:

data/bam/SAMPLE_Aligned.sortedByCoord.out.bam
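
STAR also writes a per-sample `SAMPLE_Log.final.out` next to each BAM, which includes the uniquely mapped read percentage. The sketch below pulls that value out for a quick check; `unique_map_rate` is a hypothetical helper, and it assumes the usual `label | value` layout of that log.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the "Uniquely mapped reads %" value from a STAR Log.final.out.
# Assumes lines of the form "<label> |<TAB><value>%".
unique_map_rate() {
    awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Example: summarize all samples (commented out so the block has no side effects)
# for log in data/bam/*_Log.final.out; do
#     printf '%s\t%s\n' "$(basename "$log" _Log.final.out)" "$(unique_map_rate "$log")"
# done
```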

Step 3 — Gene Counting (featureCounts)

Purpose

Convert aligned reads into gene-level counts using GTF annotation.

featureCounts:

Groups reads per gene
Produces gene×sample count matrix

Script: `scripts/3_featurescount.sh`

Contents:

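#!/usr/bin/env bash
# Count read pairs (fragments) per gene across all BAMs with featureCounts.
set -euo pipefail

BAM="data/bam"
GTF="reference/annotation.gtf"
OUT="results/counts/gene_counts.txt"

mkdir -p results/counts

echo "[featureCounts] Quantifying reads..."
# -p marks the input as paired-end; since Subread 2.0.2, --countReadPairs is
# also needed to count fragments rather than individual reads.
# Strandedness stays at the default (-s 0, unstranded); use -s 1 or -s 2 for stranded libraries.
featureCounts -T 8 -p --countReadPairs -t exon -g gene_id \
    -a "$GTF" \
    -o "$OUT" \
    "$BAM"/*.bam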

Run:

bash scripts/3_featurescount.sh

Outputs:

results/counts/gene_counts.txt
results/counts/gene_counts.txt.summary
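
The raw featureCounts table carries a comment line and five annotation columns (Chr, Start, End, Strand, Length) before the counts, so it usually needs trimming before it is loaded as a gene×sample matrix. A minimal sketch of that step follows; `counts_matrix` is a hypothetical helper, and the `_Aligned.sortedByCoord.out.bam` suffix it strips from the header is the STAR output name used in Step 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reduce featureCounts output to Geneid + per-sample count columns.
# Assumes the standard layout: "#" comment line, then a header row with Geneid, Chr,
# Start, End, Strand, Length, followed by one column per BAM file.
counts_matrix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # drop the program/command comment line
         !seen_header {
             # Turn BAM paths in the header into bare sample names
             for (i = 7; i <= NF; i++) {
                 sub(/.*\//, "", $i)
                 sub(/_Aligned\.sortedByCoord\.out\.bam$/, "", $i)
             }
             seen_header = 1
         }
         {
             line = $1
             for (i = 7; i <= NF; i++) line = line OFS $i
             print line
         }' "$1"
}

# Example:
# counts_matrix results/counts/gene_counts.txt > results/counts/counts_matrix.tsv
```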

Step 4 — MultiQC Summary (Optional)

Purpose

Aggregate QC metrics from:

FastQC
Trim Galore
STAR alignment

Run:

multiqc results/qc data/trimmed data/bam -o results/multiqc/

Output:

results/multiqc/multiqc_report.html

Open report in browser for QC summary across all samples.

Batch Execution (One Command)

To run all steps automatically, create:

File: `run_pipeline.sh`

Contents:

#!/usr/bin/env bash
set -e

echo "[PIPELINE] Starting RNA-seq pipeline..."
# "conda activate" is a shell function that non-interactive scripts do not load
# by default, so source conda's shell hook first.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate rnaseq

bash scripts/1_QC_trim.sh
bash scripts/2_run_star_alignment.sh
bash scripts/3_featurescount.sh

echo "[PIPELINE] Running MultiQC..."
multiqc results/qc data/trimmed data/bam -o results/multiqc/

echo "[DONE] Pipeline Completed Successfully!"
echo "Counts → results/counts/gene_counts.txt"
echo "MultiQC → results/multiqc/multiqc_report.html"

Make it executable:

chmod +x run_pipeline.sh

Run the entire workflow:

./run_pipeline.sh
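
A long run that fails halfway because one tool is not installed is easy to avoid with a preflight check. The sketch below is one possible way to do this; `missing_tools` is a hypothetical helper, and the tool list in the example is simply the set of commands the scripts above call.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper: print any of the given commands that are not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: stop early if anything the pipeline calls is missing.
# missing=$(missing_tools fastqc trim_galore STAR featureCounts multiqc)
# [[ -z "$missing" ]] || { echo "Missing tools: $missing" >&2; exit 1; }
```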

Main Outputs

| Output | Location | Usage |
|---|---|---|
| Gene count matrix | `results/counts/gene_counts.txt` | DESeq2 / edgeR |
| Sorted BAM files | `data/bam/*.bam` | IGV / QC |
| QC reports | `results/qc/` | Raw & trimmed QC |
| MultiQC report | `results/multiqc/multiqc_report.html` | Aggregate QC |

Downstream Analysis (DESeq2)

Part 4 — Differential Expression & Functional Enrichment (R Script)

Once you have generated the gene-level counts using featureCounts (Step 3), you can perform:

Differential expression analysis (DESeq2)
Heatmap visualization of top DEGs
Functional enrichment (g:Profiler: GO/KEGG/Reactome)
Simple correlation-based regulatory network inference

This is implemented in the R script:

4_diff_ex.R


## Tool Citations

- Andrews S. (2010) *FastQC*
- Krueger F. *Trim Galore*
- Dobin et al. (2013) *STAR*
- Liao et al. (2014) *featureCounts*
- Ewels et al. (2016) *MultiQC*

## Acknowledgment
I gratefully acknowledge the NGS Learning Hub (https://ngs101.com) for its beginner-friendly tutorials on next-generation sequencing (NGS) workflows, including bulk RNA-seq and transcriptomics. These tutorials and resource collections significantly aided the development and refinement of this RNA-seq analysis pipeline, from raw data processing through downstream interpretation and enrichment analysis.
