rmskProfiler

This package provides an end-to-end solution for accurately quantifying transposable elements from RNA-seq data at the locus-level and importing into R for downstream analysis.

Installation

You can install the development version of rmskProfiler from GitHub with:

# install.packages("pak")
pak::pak("coriell-research/rmskProfiler")

Salmon dependency

This package assumes that you have a recent version of Salmon installed and available on your PATH. If not, please follow the latest installation instructions before using this package.

Overview of the method

This package simply creates a Salmon index using GENCODE v48 annotated transcripts, unique RepeatMasker repeat regions, and a full genomic decoy and provides utilities for importing Salmon quants into R. The imported counts are stored along with their annotation information (including GenomicRanges) as a SummarizedExperiment object that is ready for downstream analysis.

By default, not all repeats are included in the generated index. “Simple_repeat”, “Low_complexity”, “Satellite”, “RNA”, “rRNA”, “snRNA”, “scRNA”, “srpRNA”, “tRNA”, and “Unknown” loci are excluded. Additionally, TE loci with lengths <=31 bp are removed (as these are the size of the minimum k-mer length used by Salmon).

After the index has been created, reads are then quantified using a minimum of 30 Gibbs re-samples. The Gibbs re-sampled quants are imported into R using edgeR::catchSalmon(). The original counts from Salmon along with the overdispersion corrected counts from catchSalmon() are stored as separate assays in the resulting SummarizedExperiment object.

All intermediate files and and the Salmon index are stored in the created resources directory for further inspection and re-use.

Usage

The package has three main components:

Generate a Salmon index using unique RepeatMasker elements + transcripts + genomic decoy
Quantify reads with Salmon using Gibbs sampling
Import the Salmon quants as a SummarizedExperiment for downstream differential expression analysis

A full pipeline for generation of the index, quantification, and importing of quants looks like:

library(rmskProfiler)


# Generate the Salmon index for humans using 12 threads 
# and save to directory named 'hg38-resources'
generateIndex("hg38-resources", species = "Hs", threads = 12)

fq1 <- c("/path/to/sample1.R1.fq.gz", "/path/to/sample2.R1.fq.gz", "/path/to/sample3.R1.fq.gz")
fq2 <- c("/path/to/sample1.R2.fq.gz", "/path/to/sample2.R2.fq.gz", "/path/to/sample3.R2.fq.gz")
sample_names <- c("sample1", "sample2", "sample3")

# Perform quantification with Salmon on fastq files
# NOTE: Once the index is built you can quantify reads using any workflow
#       just make sure you perform Gibbs re-sampling
salmonQuant(
  fq1 = fq1, 
  fq2 = fq2, 
  sample_names = sample_names, 
  resource_dir = "hg38-resources", 
  out_dir = "quants", 
  n_gibbs = 30,
  "--gcBias",       # Additional arguments can be passed as character strings
  "--seqBias",
  "--posBias",
  "--threads 12"
  )

# Import the transcripts and TE loci counts as a SummarizedExperiment object
# rowData contains transcript and TE annotations and GRanges
se <- importQuants("quants", resource_dir = "hg38-resources")

# Proceed to downstream analysis using edgeR

Filtering and aggregating

The imported object contains information on the TE-loci and transcript levels but sometimes a gene/subfamily-level analysis is desired. The following code can be used to filter out TE-loci assigned to exons and then summarize counts for TE-loci to the subfamily level and transcripts to the gene-level.

Below, any TE-loci that overlaps with an exon, 3’UTR, 5’UTR on either strand or has more than one exact sequence location in the genome are removed from downstream analysis. Protein coding transcripts are selected for downstream analysis.

library(rmskProfiler)
library(SummarizedExperiment)


# Select loci to keep
loci <- subset(
  rowData(se), 
  !is.na(Hash) & 
  hasUnstrandedExonic == FALSE & 
  hasUnstranded3UTR == FALSE & 
  hasUnstranded5UTR == FALSE &
  N_Loci == 1,
  select = "Hash",
  drop = TRUE
  )

# Select only protein coding transcripts
tx <- subset(
  rowData(se), 
  !is.na(transcript_id) & gene_type == "protein_coding", 
  "transcript_id", 
  drop = TRUE
  )

# Filter the Summarized experiment object
keep <- rownames(se) %in% c(tx, loci)
filtered <- se[keep, ]

# Sum assays to the gene/subfamily level
aggregated <- aggregateCounts(filtered, resource_dir = "hg38-resources", level = "subfamily")

Name		Name	Last commit message	Last commit date
Latest commit History 106 Commits
R		R
inst/python		inst/python
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
rmskProfiler.Rproj		rmskProfiler.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rmskProfiler

Installation

Salmon dependency

Overview of the method

Usage

Filtering and aggregating

About

Licenses found

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

rmskProfiler

Installation

Salmon dependency

Overview of the method

Usage

Filtering and aggregating

About

Topics

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages