RNA-seq Analysis Pipeline

End-to-end bulk RNA-seq analysis workflow used for Plasmodium berghei – WT vs SMC4 knockdown transcriptomics experiments. Public data obtained from Tewari, R., et al. Cell Rep., 2020

Reads must be downloaded from SRA PRJNA542367 and placed under reads/untrimmed/ – We do this below.

Pipeline steps:

Quality control (FastQC)
Adapter and reads trimming (fastp)
Contamination screening (FastQ Screen)
Alignment (HISAT2)
Deduplication (marking only)
Feature counting (featureCounts)
Differential expression (DESeq2 + edgeR)
Visualization (heatmap, volcano plots)

Downloading the data

You need the following reads, or just use your own – Adjust config/samples.tsv and config/background.tsv for a custom analysis

WT: SRR9041561 – SRR9041562
SMC4KD: SRR9041565 – SRR9041566

First, download the SRA toolkit from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

Download the SRA toolkit

wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

Extract the contents of the tar file:

tar -xzf sratoolkit.3.4.0-ubuntu64.tar.gz

Add the binaries to your PATH (adjust the file path for yourself)

echo 'export PATH=$PATH:$HOME/projects/research/NGS/sratoolkit.3.4.0-ubuntu64/bin' >> ~/.bashrc

source ~/.bashrc

Check that the system recognizes the new version
```
prefetch --version
```
Configure the toolkit
```
vdb-config --interactive
```
- In the text-based menu, navigate to the "Cache" tab (press C).
- Ensure "Enable Local Caching" is toggled on.
- Check the "Location of user-repository" to make sure it points to a directory where you have plenty of disk space (important for large RNA-seq datasets).
- Press "S" to save and "X" to exit.

Add exec priviliges to the download script and run it

chmod +x ./download_sra.sh

./download_sra.sh accessions.txt

Download the background genomes for contamination analysis

chmod +x ./download_background.sh

./download_background.sh

Run the pipeline

Create and activate the rnaseq environment

conda env create -f environment.yaml
conda activate rnaseq

Configure paths in: workflow/00_vars.sh

Run workflow

chmod +x ./run_workflow.sh
bash ./run_workflow.sh

Note 1: If you are using your own data, it is important to run the scripts one by one:

chmod +x ./workflow/02_fastqc.sh
./workflow/02_fastqc.sh

Then, check results/untrimmed/multiqc_report.html to see if you need to trim the ends of the reads, and by how much. Once you determine this, open workflow/03_trimming.sh and edit

fastp \
...
--trim_tail1 1 \
--trim_tail2 1 \
--trim_front1 11 \
--trim_front2 11 \

to trim off your read ends however you see fit.

Note 2: If you are using your own data, it is important to check RNA strandedness after running workflow/08_find_rna_strand.sh. Once this script finishes running, you will have sample_strandedness.txt under results/trimmed/strand/sample_name/sample_strandedness.txt. Inspect this file. For my data, I had:

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0328
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671

which is RF for hisat2, and 2 for featureCounts, for reversely stranded. The script in workflows/01_common.sh automatically infers this for you with a default threshold of 0.8. So, in this case, Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671 wins and strandedness is set to RF and 2. You may modify this threshold for downstream analysis after running workflow/08_find_rna_strand.sh. The inference function is used in workflow/09_hisat2.sh and workflow/14_featurecounts.sh.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RNA-seq Analysis Pipeline

Pipeline steps:

Downloading the data

Run the pipeline

Example Output

Volcano plot

Differential expression heatmap

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
config		config
data/genomic		data/genomic
results		results
workflow		workflow
.gitignore		.gitignore
README.md		README.md
accessions.txt		accessions.txt
download_background.sh		download_background.sh
download_sra.sh		download_sra.sh
environment.yaml		environment.yaml
run_workflow.sh		run_workflow.sh

Folders and files

Latest commit

History

Repository files navigation

RNA-seq Analysis Pipeline

Pipeline steps:

Downloading the data

Run the pipeline

Example Output

Volcano plot

Differential expression heatmap

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages