End-to-end bulk RNA-seq analysis workflow used for Plasmodium berghei – WT vs SMC4 knockdown transcriptomics experiments. Public data obtained from Tewari, R., et al. Cell Rep., 2020
Reads must be downloaded from SRA PRJNA542367 and placed under reads/untrimmed/ – We do this below.
- Quality control (FastQC)
- Adapter and reads trimming (fastp)
- Contamination screening (FastQ Screen)
- Alignment (HISAT2)
- Deduplication (marking only)
- Feature counting (featureCounts)
- Differential expression (DESeq2 + edgeR)
- Visualization (heatmap, volcano plots)
You need the following reads, or just use your own – Adjust config/samples.tsv and config/background.tsv for a custom analysis
WT: SRR9041561 – SRR9041562
SMC4KD: SRR9041565 – SRR9041566
-
First, download the SRA toolkit from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
-
Download the SRA toolkit
wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz -
Extract the contents of the tar file:
tar -xzf sratoolkit.3.4.0-ubuntu64.tar.gz -
Add the binaries to your PATH (adjust the file path for yourself)
echo 'export PATH=$PATH:$HOME/projects/research/NGS/sratoolkit.3.4.0-ubuntu64/bin' >> ~/.bashrc source ~/.bashrc -
Check that the system recognizes the new version
prefetch --version -
Configure the toolkit
vdb-config --interactive- In the text-based menu, navigate to the "Cache" tab (press C).
- Ensure "Enable Local Caching" is toggled on.
- Check the "Location of user-repository" to make sure it points to a directory where you have plenty of disk space (important for large RNA-seq datasets).
- Press "S" to save and "X" to exit.
-
Add exec priviliges to the download script and run it
chmod +x ./download_sra.sh ./download_sra.sh accessions.txt -
Download the background genomes for contamination analysis
chmod +x ./download_background.sh ./download_background.sh
- Create and activate the
rnaseqenvironmentconda env create -f environment.yaml conda activate rnaseq - Configure paths in:
workflow/00_vars.sh - Run workflow
chmod +x ./run_workflow.sh bash ./run_workflow.sh
Note 1: If you are using your own data, it is important to run the scripts one by one:
chmod +x ./workflow/02_fastqc.sh
./workflow/02_fastqc.sh
Then, check results/untrimmed/multiqc_report.html to see if you need to trim the ends of the reads, and by how much. Once you determine this, open workflow/03_trimming.sh and edit
fastp \
...
--trim_tail1 1 \
--trim_tail2 1 \
--trim_front1 11 \
--trim_front2 11 \
to trim off your read ends however you see fit.
Note 2: If you are using your own data, it is important to check RNA strandedness after running workflow/08_find_rna_strand.sh. Once this script finishes running, you will have sample_strandedness.txt under results/trimmed/strand/sample_name/sample_strandedness.txt. Inspect this file. For my data, I had:
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0328
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671
which is RF for hisat2, and 2 for featureCounts, for reversely stranded. The script in workflows/01_common.sh automatically infers this for you with a default threshold of 0.8. So, in this case, Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671 wins and strandedness is set to RF and 2. You may modify this threshold for downstream analysis after running workflow/08_find_rna_strand.sh. The inference function is used in workflow/09_hisat2.sh and workflow/14_featurecounts.sh.

