-
Notifications
You must be signed in to change notification settings - Fork 17
Detailed usage: RNA seq coverage tracks
The BAMscale "scale" program can generate a scaled coverage track directly from a BAM file.
We implemented three options to generate coverage tracks from RNA-seq data. A more detailed description about the algorithm differences compared to the default version can be found below, as well as some normalization thoughts and suggestions.
In short, to run BAMscale with RNA-seq data, the operation flag has to be set with the following possibilities:
To set the RNA-seq options use the operation parameter:
1. --operation rna
2. --operation strandrna
3. --operation strandrnaR
"rna" denotes unstranded RNA-seq coverage. "strandrna" outputs stranded coverage of positive and negative strands. "strandrnaR" outputs stranded coverages, and the negative strand is negated.
In this case, one bigWig file will be created, where the coverages are scaled to the genome size. This means that the total number of aligned bases are scaled to the number of bases in the genome.
./BAMscale scale --bam <RNAseq.bam> --operation rna
When stranded RNA-seq is available, stranded coverage tracks have a better representation of gene expression. To obtain the stranded coverage tracks, simply run:
./BAMscale scale --bam <RNAseq.bam> --operation strandrna
In this case two bigwig files will be created for reads aligning on the positive and negative strand.
In some cases, better visualization can be achieved if the negative strand coverages have negative values. This allows the possibility to overlay the two tracks with a genome browser (such as IGV). Additionally, we can specify a custom scaling factor (eg. obtained with DESeq2) with the following command:
./BAMscale scale --bam <RNAseq.bam> --operation strandrnaR --scale custom --factor 0.76
In this case two bigwig files will be created for reads aligning on the positive and negative strand. The negative strand coverage track will have negative values. The coverage of both tracks will be scaled to "0.76".
The default version of BAMscale simply splits the genome into equal size bins. This is good for non-RNA based sequencing, but in case of RNA-seq, base coverages change at single base resolution mainly at exon-intron boundaries. For this reason, we implemented RNA-seq coverage track creation to BAMscale. The main difference compared to the default version is that in case where the coverage of two adjacent bases in a bin have a coverage different of at least 5 reads, that bin will be outputted in single base resolution (summarized in the figure below).
The BAMscale coverage track can be run in two main ways for RNA-seq: unstranded, stranded. There are two options for stranded coverages: in one option both the forward and reverse strand coverage tracks have positive values, in the second option the negative strand scores are negated (this allows the possibility to overlay the two strands when visualizing data).
To set the RNA-seq options use the operation parameter:
1. --operation rna
2. --operation strandrna
3. --operation strandrnaR
"rna" denotes unstranded RNA-seq coverage. "strandrna" outputs stranded coverage of positive and negative strands. "strandrnaR" outputs stranded coverages, and the negative strand is negated.
In this case, the scale parameter has to be set "--scale custom" and the factor has to be changed as well to the scaling factor. Lets say we want to scale the coverage by "0.76". In this case the scaling command woulf be:
./BAMscale scale --bam <RNAseq.bam> --scale custom --factor 0.76 --operation strandrna
which creates stranded coverage track, where the coverages were downscaled by "0.76".
Currently we tested two ways for normalization. The most basic method is to normalize the sequenced bases to the genome size. The advantage of this approach is the simplicity, as no other calculations have to be performed before we create the scaled coverage tracks.
The second approach involves using the scaling of DESeq2 or any other RNA-seq differential expression analysis tools. This is probably way better, but requires more steps as well. A major drawback is that samples can not be normalized independently, since tools like DESeq2 calculate the scaling factors for a set of samples.
This approach is slightly more complex, as first we need to calculate raw reads counts for every gene and every sample. We will prepare a detailed manual for an experiment below (work in progress).
The main steps are:
1. count raw read counts of every gene in every sample (eg. TPMcalculator [https://github.com/ncbi/TPMCalculator])
2. Merge gene level read counts to a matrix (columns are samples, rows are genes)
3. Import data to R and loading the DESeq2 library
4. Create a DESeq object (in R)
5. Get size factors using the 1/sizeFactors() function (in R)
Once the size factors are obtained, these can be used as scaling factors for the samples.
- Main page
- Home
- Installation
- Benchmarking
- Brief examples
- Detailed Manuals
- Visualization scripts