Skip to content

AmirUCR/RNA-seq-Pipeline

Repository files navigation

RNA-seq Analysis Pipeline

End-to-end bulk RNA-seq analysis workflow used for Plasmodium berghei – WT vs SMC4 knockdown transcriptomics experiments. Public data obtained from Tewari, R., et al. Cell Rep., 2020

Reads must be downloaded from SRA PRJNA542367 and placed under reads/untrimmed/ – We do this below.

Pipeline steps:

  1. Quality control (FastQC)
  2. Adapter and reads trimming (fastp)
  3. Contamination screening (FastQ Screen)
  4. Alignment (HISAT2)
  5. Deduplication (marking only)
  6. Feature counting (featureCounts)
  7. Differential expression (DESeq2 + edgeR)
  8. Visualization (heatmap, volcano plots)

Downloading the data

You need the following reads, or just use your own – Adjust config/samples.tsv and config/background.tsv for a custom analysis

WT: SRR9041561 – SRR9041562
SMC4KD: SRR9041565 – SRR9041566
  1. First, download the SRA toolkit from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

  2. Download the SRA toolkit

    wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
    
  3. Extract the contents of the tar file:

    tar -xzf sratoolkit.3.4.0-ubuntu64.tar.gz
    
  4. Add the binaries to your PATH (adjust the file path for yourself)

    echo 'export PATH=$PATH:$HOME/projects/research/NGS/sratoolkit.3.4.0-ubuntu64/bin' >> ~/.bashrc
    
    source ~/.bashrc
    
  5. Check that the system recognizes the new version

    prefetch --version
    
  6. Configure the toolkit

    vdb-config --interactive
    
    • In the text-based menu, navigate to the "Cache" tab (press C).
    • Ensure "Enable Local Caching" is toggled on.
    • Check the "Location of user-repository" to make sure it points to a directory where you have plenty of disk space (important for large RNA-seq datasets).
    • Press "S" to save and "X" to exit.
  7. Add exec priviliges to the download script and run it

    chmod +x ./download_sra.sh
    
    ./download_sra.sh accessions.txt
    
  8. Download the background genomes for contamination analysis

    chmod +x ./download_background.sh
    
    ./download_background.sh
    

Run the pipeline

  1. Create and activate the rnaseq environment
    conda env create -f environment.yaml
    conda activate rnaseq
    
  2. Configure paths in: workflow/00_vars.sh
  3. Run workflow
    chmod +x ./run_workflow.sh
    bash ./run_workflow.sh
    

Note 1: If you are using your own data, it is important to run the scripts one by one:

chmod +x ./workflow/02_fastqc.sh
./workflow/02_fastqc.sh

Then, check results/untrimmed/multiqc_report.html to see if you need to trim the ends of the reads, and by how much. Once you determine this, open workflow/03_trimming.sh and edit

fastp \
...
--trim_tail1 1 \
--trim_tail2 1 \
--trim_front1 11 \
--trim_front2 11 \

to trim off your read ends however you see fit.

Note 2: If you are using your own data, it is important to check RNA strandedness after running workflow/08_find_rna_strand.sh. Once this script finishes running, you will have sample_strandedness.txt under results/trimmed/strand/sample_name/sample_strandedness.txt. Inspect this file. For my data, I had:

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0328
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671

which is RF for hisat2, and 2 for featureCounts, for reversely stranded. The script in workflows/01_common.sh automatically infers this for you with a default threshold of 0.8. So, in this case, Fraction of reads explained by "1+-,1-+,2++,2--": 0.9671 wins and strandedness is set to RF and 2. You may modify this threshold for downstream analysis after running workflow/08_find_rna_strand.sh. The inference function is used in workflow/09_hisat2.sh and workflow/14_featurecounts.sh.

Example Output

Volcano plot

Volcano plot

Differential expression heatmap

DE Heatmap

About

RNA-seq pipeline for bulk transcriptomics – From quality control to volcano plots

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors