This repository provides a beginner-friendly RNA-seq pipeline that processes all FASTQ samples automatically and outputs a gene×sample count matrix suitable for downstream differential expression analysis (DESeq2, edgeR, and limma).
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash "Miniforge3-$(uname)-$(uname -m).sh"
Verify installation:
conda --version
conda create -n rnaseq python=3.12 -y
conda activate rnaseq
conda install -c bioconda -c conda-forge \
fastqc \
trim-galore \
cutadapt \
multiqc \
star \
bowtie2 \
samtools \
subread \
sra-tools -y
This pipeline assumes the following directory structure:
rnaseq_project/
├── data/
│ ├── fastq/
│ ├── trimmed/
│ └── bam/
├── reference/
│ ├── star_index/
│ └── annotation.gtf
├── scripts/
│ ├── 1_QC_trim.sh
│ ├── 2_run_star_alignment.sh
│ └── 3_featurescount.sh
├── results/
│ ├── qc/
│ ├── counts/
│ └── multiqc/
└── run_pipeline.sh
Create directories:
mkdir -p rnaseq_project/{data/{fastq,trimmed,bam},reference/star_index,results/{qc,counts,multiqc},scripts}
Alignment and quantification require a reference genome FASTA file and a GTF annotation file.
Download (human):
- Genome FASTA: GRCh38/hg38
- GTF annotation: GENCODE v44

Place inside:
reference/genome.fa
reference/annotation.gtf

Download (mouse):
- Genome FASTA: GRCm38/mm10
- GTF annotation: GENCODE vM25

Place inside:
reference/genome.fa
reference/annotation.gtf

For other species, download the species-specific:
- Genome FASTA
- GTF or GFF annotation
STAR needs a prebuilt index. If you do not have one, build it:
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir reference/star_index \
--genomeFastaFiles reference/genome.fa \
--sjdbGTFfile reference/annotation.gtf \
--sjdbOverhang 100
Set --sjdbOverhang to the read length minus 1 (for example, 100 for 101 bp reads).
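
If the read length is not known in advance, it can be estimated from the data. The sketch below is a hypothetical helper (the name `sjdb_overhang` and its default of 10000 sampled reads are assumptions, not part of this repository) that prints the longest read length minus 1 for a gzipped FASTQ file:

```shell
# Estimate a value for --sjdbOverhang: longest read length minus 1.
# Scans only the first N records (default 10000) of a gzipped FASTQ.
sjdb_overhang() {
    local fastq="$1" n_reads="${2:-10000}"
    gzip -cd "$fastq" \
        | awk -v n="$n_reads" '
            NR % 4 == 2 { if (length($0) > max) max = length($0) }  # sequence lines only
            NR >= 4 * n { exit }                                     # stop after N records
            END { if (max > 0) print max - 1; else exit 1 }'
}
```

For example, `sjdb_overhang data/fastq/SRR28119110_R1.fastq.gz` prints the value to pass to `--sjdbOverhang`. On large files the early exit closes the pipe before `gzip` finishes, so a caller running under `set -o pipefail` may see a non-zero status even though the printed value is valid.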
Place raw FASTQ files into:
data/fastq/
Required naming format (paired-end):
SAMPLE_R1.fastq.gz
SAMPLE_R2.fastq.gz
Example:
SRR28119110_R1.fastq.gz
SRR28119110_R2.fastq.gz
The pipeline automatically detects and processes all samples matching this pattern.
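
Because detection relies on the `_R1`/`_R2` naming pattern, a missing or misnamed mate file only surfaces later as a tool error. A minimal pre-flight check, assuming the naming convention above (the function name `find_unpaired` is hypothetical), could look like this:

```shell
# Print every *_R1.fastq.gz in a directory that has no matching *_R2.fastq.gz.
# Returns non-zero if any mate is missing or if no R1 file is found at all.
find_unpaired() {
    local dir="${1:-data/fastq}" r1 r2 found=0 missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        found=1
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # swap the R1 suffix for R2
        if [ ! -e "$r2" ]; then
            echo "$r1"
            missing=1
        fi
    done
    [ "$found" -eq 1 ] && [ "$missing" -eq 0 ]
}
```

Running `find_unpaired data/fastq` before the pipeline lists any R1 files whose R2 mate is absent.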
Activate environment:
conda activate rnaseq
cd rnaseq_project
This step checks the quality of the raw sequencing data and removes adapter contamination.
Actions:
- Runs FastQC to assess read quality
- Runs Trim Galore to trim adapters
- Runs FastQC again on trimmed reads for validation
Contents of scripts/1_QC_trim.sh:
#!/usr/bin/env bash
set -euo pipefail
RAW="data/fastq"
TRIM="data/trimmed"
QC="results/qc"
mkdir -p "$TRIM" "$QC"
for R1 in "$RAW"/*_R1.fastq.gz; do
    BASE=$(basename "$R1" _R1.fastq.gz)
    R2="$RAW/${BASE}_R2.fastq.gz"
    echo "[QC] Running FastQC on raw reads for sample: $BASE"
    fastqc "$R1" "$R2" -o "$QC"
    echo "[Trim] Running Trim Galore for sample: $BASE"
    # --fastqc reruns FastQC on the trimmed reads
    trim_galore --paired --fastqc \
        "$R1" "$R2" \
        -o "$TRIM"
done
Run:
bash scripts/1_QC_trim.sh
Outputs:
results/qc/ → FastQC reports for the raw reads (HTML + ZIP)
data/trimmed/ → trimmed FASTQs (*_val_1.fq.gz, *_val_2.fq.gz), plus the Trim Galore trimming reports and the FastQC reports for the trimmed reads
Align trimmed reads to reference genome and produce sorted BAM files.
STAR aligns:
- Reads → Genome
- Produces sorted BAM for further counting
Contents of scripts/2_run_star_alignment.sh:
#!/usr/bin/env bash
set -euo pipefail
TRIM="data/trimmed"
BAM="data/bam"
IDX="reference/star_index"
mkdir -p "$BAM"
# Match exactly the R1 files written by Trim Galore in paired-end mode
for R1 in "$TRIM"/*_R1_val_1.fq.gz; do
    BASE=$(basename "$R1" _R1_val_1.fq.gz)
    R2="$TRIM/${BASE}_R2_val_2.fq.gz"
    echo "[STAR] Aligning sample: $BASE"
    STAR --runThreadN 8 \
        --genomeDir "$IDX" \
        --readFilesIn "$R1" "$R2" \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$BAM/${BASE}_"
done
Run:
bash scripts/2_run_star_alignment.sh
Outputs:
data/bam/SAMPLE_Aligned.sortedByCoord.out.bam
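
STAR also writes a per-sample `Log.final.out` summary next to each BAM file (here `data/bam/SAMPLE_Log.final.out`, given the output prefix used above). As a quick sanity check before counting, the sketch below pulls the uniquely mapped percentage out of each log. The function name `star_unique_rates` is hypothetical, and the parsing assumes the usual `label | value` layout of that file:

```shell
# Print "sample<TAB>uniquely mapped %" for every STAR Log.final.out in a directory.
star_unique_rates() {
    local dir="${1:-data/bam}" log sample
    for log in "$dir"/*_Log.final.out; do
        [ -e "$log" ] || continue                 # the glob matched nothing
        sample=$(basename "$log" _Log.final.out)
        awk -F'|' -v s="$sample" '
            /Uniquely mapped reads %/ {
                gsub(/[ \t]/, "", $2)             # strip padding around the value
                print s "\t" $2
            }' "$log"
    done
}
```

Calling `star_unique_rates data/bam` prints one line per sample. MultiQC reports the same metric, so this is only a shortcut for checking alignments from the command line.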
Convert aligned reads into gene-level counts using GTF annotation.
featureCounts:
- Groups reads per gene
- Produces gene×sample count matrix
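Contents of scripts/3_featurescount.sh:
#!/usr/bin/env bash
set -euo pipefail
BAM="data/bam"
GTF="reference/annotation.gtf"
OUT="results/counts/gene_counts.txt"
mkdir -p results/counts
echo "[featureCounts] Quantifying reads..."
# -p marks the input as paired-end; --countReadPairs (subread >= 2.0.2)
# counts fragments instead of individual reads
featureCounts -T 8 -p --countReadPairs -t exon -g gene_id \
    -a "$GTF" \
    -o "$OUT" \
    "$BAM"/*.bam
Run:
bash scripts/3_featurescount.sh
Outputs: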
results/counts/gene_counts.txt
results/counts/gene_counts.txt.summary
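
The raw featureCounts table starts with a `#` comment line and carries five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the counts, and its sample columns are named after the BAM paths. Tools such as DESeq2 expect a plain gene×sample matrix, so a small reshaping step is often useful. The following is a sketch only: `counts_matrix` is a hypothetical helper, and it assumes the BAM file names produced by the STAR step above.

```shell
# Reduce featureCounts output to "Geneid + one count column per sample".
counts_matrix() {
    local infile="${1:-results/counts/gene_counts.txt}"
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                           # skip the command-line comment
         {
             line = $1
             for (i = 7; i <= NF; i++) {         # columns 2-6 are annotation
                 v = $i
                 if (!header_done) {             # header row: shorten BAM paths
                     sub(/^.*\//, "", v)
                     sub(/_Aligned\.sortedByCoord\.out\.bam$/, "", v)
                 }
                 line = line OFS v
             }
             header_done = 1
             print line
         }' "$infile"
}
```

For example, `counts_matrix results/counts/gene_counts.txt > results/counts/gene_counts_matrix.tsv` writes a tab-separated matrix whose header row contains only the sample names.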
Aggregate QC metrics from:
- FastQC
- Trim Galore
- STAR alignment
Run:
multiqc results/qc data/trimmed data/bam -o results/multiqc/
Output:
results/multiqc/multiqc_report.html
Open report in browser for QC summary across all samples.
To run all steps automatically, create run_pipeline.sh in the project root:
Contents:
#!/usr/bin/env bash
set -e
echo "[PIPELINE] Starting RNA-seq pipeline..."
# Load conda's shell functions so that "conda activate" works in a non-interactive script
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate rnaseq
bash scripts/1_QC_trim.sh
bash scripts/2_run_star_alignment.sh
bash scripts/3_featurescount.sh
echo "[PIPELINE] Running MultiQC..."
multiqc results/qc data/trimmed data/bam -o results/multiqc/
echo "[DONE] Pipeline Completed Successfully!"
echo "Counts → results/counts/gene_counts.txt"
echo "MultiQC → results/multiqc/multiqc_report.html"
Make it executable:
chmod +x run_pipeline.sh
Run the entire workflow:
./run_pipeline.sh

| Output | Location | Usage |
|---|---|---|
| Gene count matrix | results/counts/gene_counts.txt | DESeq2 / edgeR |
| Sorted BAM files | data/bam/*.bam | IGV / QC |
| QC reports | results/qc/ (raw), data/trimmed/ (trimmed) | Raw & trimmed QC |
| MultiQC report | results/multiqc/multiqc_report.html | Aggregate QC |
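
After a run, it can be worth confirming that the key outputs listed in the table actually exist and are non-empty. The sketch below is one way to do that; `check_outputs` is a hypothetical helper, and the paths are taken from the table above.

```shell
# Report whether the main pipeline outputs exist under a project root.
# Returns non-zero if anything is missing or empty.
check_outputs() {
    local root="${1:-.}" status=0 path
    for path in \
        "results/counts/gene_counts.txt" \
        "results/counts/gene_counts.txt.summary" \
        "results/multiqc/multiqc_report.html"; do
        if [ -s "$root/$path" ]; then             # exists and is not empty
            echo "OK       $path"
        else
            echo "MISSING  $path"
            status=1
        fi
    done
    # At least one BAM file is expected under data/bam/.
    if ! ls "$root"/data/bam/*.bam >/dev/null 2>&1; then
        echo "MISSING  data/bam/*.bam"
        status=1
    fi
    return "$status"
}
```

Run `check_outputs .` from inside `rnaseq_project` once `./run_pipeline.sh` has finished.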
Once you have generated the gene-level counts using featureCounts (Step 3), you can perform:
- Differential expression analysis (DESeq2)
- Heatmap visualization of top DEGs
- Functional enrichment (g:Profiler: GO/KEGG/Reactome)
- Simple correlation-based regulatory network inference
This is implemented in the R script:
4_diff_exp.R
## Tool Citations
- Andrews S. (2010) *FastQC*
- Krueger F. *Trim Galore*
- Dobin et al. (2013) *STAR*
- Liao et al. (2014) *featureCounts*
- Ewels et al. (2016) *MultiQC*
## Acknowledgment
I gratefully acknowledge the NGS Learning Hub (https://ngs101.com), whose beginner-friendly tutorials on next-generation sequencing (NGS) workflows, including bulk RNA-seq and transcriptomics, significantly aided the development and refinement of this RNA-seq pipeline and helped contextualize the steps from raw data processing to downstream interpretation and enrichment analysis.