diff --git a/vignettes/01_04_rmarkdown.Rmd b/vignettes/01_04_rmarkdown.Rmd index 55c4cc1..3397a35 100644 --- a/vignettes/01_04_rmarkdown.Rmd +++ b/vignettes/01_04_rmarkdown.Rmd @@ -47,7 +47,7 @@ knit_hooks$set(setPch = function(before, options, envir) { opts_chunk$set(setPch = TRUE) ``` -We will again hide this in future documents, but this shows how to set a default figure width and height, as well as setting plot margians. +We will again hide this in future documents, but this shows how to set a default figure width and height, as well as setting plot margins. ```{r global-plot,warning=FALSE, message=FALSE} diff --git a/vignettes/01_10_exploratory-analysis.Rmd b/vignettes/01_10_exploratory-analysis.Rmd index 03de4cb..8c0dd5c 100644 --- a/vignettes/01_10_exploratory-analysis.Rmd +++ b/vignettes/01_10_exploratory-analysis.Rmd @@ -53,11 +53,11 @@ biocLite("alyssafrazee/RSkittleBrewer") * Use plots as often as possible * Use tables for phenotype data -* Look for - * Missing values - * Outlier values - * Mislabeled samples - * Naming consistency +* Look for: + * Missing values + * Outlier values + * Mislabeled samples + * Naming consistency ### Make the plots pretty @@ -142,15 +142,15 @@ dim(edata) Here we see that there are a lot of outliers ```{r boxplot} -boxplot(log2(edata+1),col=2,range=0) +boxplot(log2(edata+1),col=2,range=0, las=2, cex.axis=0.8) ``` We can also look at this sample by sample with histograms ```{r histograms} par(mfrow=c(1,2)) -hist(log2(edata[,1]+1),col=2) -hist(log2(edata[,2]+1),col=2) +hist(log2(edata[,1]+1),col=2, cex.main=0.8) +hist(log2(edata[,2]+1),col=2, cex.main=0.8) ``` Or with density plots @@ -180,7 +180,7 @@ We can remove rows that are mostly zero and notice any differences in the distri ```{r} edata = as.data.frame(edata) filt_edata = filter(edata,rowMeans(edata) > 1) -boxplot(as.matrix(log2(filt_edata+1)),col=2) +boxplot(as.matrix(log2(filt_edata+1)),col=2, las=2, cex.axis=0.8) ``` ### Check for obvious data mixups diff --git a/vignettes/01_11_transforms.html b/vignettes/01_11_transforms.html deleted file mode 100644 index cc05719..0000000 --- a/vignettes/01_11_transforms.html +++ /dev/null @@ -1,236 +0,0 @@ - - - - - - - - - - - - - -Data transforms - - - - - - - - - - - - - - - - - - - - - -
- - - - -
- -
- -
-

Dependencies

-

This document depends on the following packages:

-
  library(devtools)
-  library(Biobase)
-  library(dplyr)
-

To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)

-
install.packages(c("devtools"))
-source("http://www.bioconductor.org/biocLite.R")
-biocLite(c("Biobase"))
-
-
-

General principles

- -
-
-

Load some data

-

We will use this expression set to look at how we use plots and tables to check for different characteristics

-
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
-load(file=con)
-close(con)
-bm = bodymap.eset
-pdata=pData(bm)
-edata=as.data.frame(exprs(bm))
-fdata = fData(bm)
-ls()
-
## [1] "bm"           "bodymap.eset" "con"          "edata"       
-## [5] "fdata"        "pdata"        "tropical"
-
-
-

Skewed distributions

-

We would like continuous data to be nice and symmetric like a normal distribution for two reasons:(1) plots are easier to see this way and (2) most statistical methods are designed to work better for non-skewed data.

-
hist(rnorm(1000),col=2)
-

-

Realistically though most genomic data is skewed

-
hist(edata[,1],col=2,breaks=100)
-

-

One way to address this skew is to use a transformation. One common transformation is the log transform.

-
hist(log(edata[,1]),col=2,breaks=100)
-

-

One thing to be careful with is that values of zero become -inf when you apply the log transform because the log of zero isn’t defined.

-
min(log(edata))
-
## [1] -Inf
-

You can remove this problem by adding a small number to all the counts before taking the log. This doesn’t change the overall distribution much but resolves the zero problem

-
min(log(edata[,1] + 1))
-
## [1] 0
-
hist(log(edata[,1] + 1),breaks=100,col=2)
-

-

A common choice is to use the log2 transform. The reason is that data on the log2 scale can be expressed as fold changes. If log2(x) - log2(y) = z then there is a z fold change difference between x and y.

-
hist(log2(edata[,1] + 1),breaks=100,col=2)
-

-

Let’s zoom in on the values greater than zero (we’ll come back to the zero values in a minute)

-
hist(log2(edata[,1] + 1),breaks=100,col=2,xlim=c(1,15),ylim=c(0,400))
-

-

This is starting to look a little better but we still have the zeros problem.

-
-
-

Removing features without much data

-

In many types of sequencing data we observe that there is not much data for many features.

-
hist(rowSums(edata==0),col=2)
-

-

A common pre-processing technique is to remove features that don’t have much data. One way to identify and remove them is to use rowMeans and rowMedians

-
low_genes = rowMeans(edata) < 5
-table(low_genes)
-
## low_genes
-## FALSE  TRUE 
-## 11413 41167
-
filt_edata = filter(edata,!low_genes)
-dim(filt_edata)
-
## [1] 11413    19
-
low_genes2 = rowMedians(as.matrix(edata)) < 5
-table(low_genes2,low_genes)
-
##           low_genes
-## low_genes2 FALSE  TRUE
-##      FALSE  9844     2
-##      TRUE   1569 41165
-
filt_edata2 = filter(edata,!low_genes2)
-dim(filt_edata2)
-
## [1] 9846   19
-

After filtering we observe that using the log transform makes the distribution much more symmetric but there are still zeros. We could increase the filter to remove this issue if we needed to.

-
hist(log2(filt_edata[,1] + 1),col=2)
-

-
-
-

More transforms

-

There are a large number of other transforms depending on the type of data you are using. Some common ones are:

- -
-
-

Session information

-

Here is the session information

-
devtools::session_info()
-
##  setting  value                       
-##  version  R version 3.2.2 (2015-08-14)
-##  system   x86_64, darwin13.4.0        
-##  ui       X11                         
-##  language (EN)                        
-##  collate  en_GB.UTF-8                 
-##  tz       Europe/London               
-##  date     2015-09-17                  
-## 
-##  package      * version date       source        
-##  assertthat     0.1     2013-12-06 CRAN (R 3.2.0)
-##  Biobase      * 2.28.0  2015-04-17 Bioconductor  
-##  BiocGenerics * 0.14.0  2015-04-17 Bioconductor  
-##  DBI            0.3.1   2014-09-24 CRAN (R 3.2.0)
-##  devtools     * 1.9.1   2015-09-11 CRAN (R 3.2.0)
-##  digest         0.6.8   2014-12-31 CRAN (R 3.2.0)
-##  dplyr        * 0.4.3   2015-09-01 CRAN (R 3.2.2)
-##  evaluate       0.7.2   2015-08-13 CRAN (R 3.2.0)
-##  formatR        1.2     2015-04-21 CRAN (R 3.2.0)
-##  htmltools      0.2.6   2014-09-08 CRAN (R 3.2.0)
-##  knitr        * 1.11    2015-08-14 CRAN (R 3.2.2)
-##  lazyeval       0.1.10  2015-01-02 CRAN (R 3.2.0)
-##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)
-##  memoise        0.2.1   2014-04-22 CRAN (R 3.2.0)
-##  R6             2.1.1   2015-08-19 CRAN (R 3.2.2)
-##  Rcpp           0.12.1  2015-09-10 CRAN (R 3.2.0)
-##  rmarkdown      0.8     2015-08-30 CRAN (R 3.2.2)
-##  stringi        0.5-5   2015-06-29 CRAN (R 3.2.0)
-##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)
-##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)
-

It is also useful to compile the time the document was processed. This document was processed on: 2015-09-17.

-
- - -
- - - - - - - - diff --git a/vignettes/02_03_dimension-reduction.Rmd b/vignettes/02_03_dimension-reduction.Rmd index 4f121d5..2a520da 100644 --- a/vignettes/02_03_dimension-reduction.Rmd +++ b/vignettes/02_03_dimension-reduction.Rmd @@ -74,7 +74,7 @@ names(svd1) ## Look at the percent variance explained -The percent of variance explained is given by $\frac{d_{ii}}{\sum_{j}d_{jj}^2}$ +The percent of variance explained is given by $\frac{d_{ii}^2}{\sum_{j}d_{jj}^2}$ ```{r} plot(svd1$d,ylab="Singular value",col=2) plot(svd1$d^2/sum(svd1$d^2),ylab="Percent Variance Explained",col=2) diff --git a/vignettes/02_03_dimension-reduction.html b/vignettes/02_03_dimension-reduction.html new file mode 100644 index 0000000..ca6b71d --- /dev/null +++ b/vignettes/02_03_dimension-reduction.html @@ -0,0 +1,236 @@ + + + + + + + + + + + + + +Dimension reduction for genomics + + + + + + + + + + + + + + + + + + + + + +
+ + + + +
+ +
+ +
+

Dependencies

+

This document depends on the following packages:

+
  library(devtools)
+  library(Biobase)
+

To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)

+
install.packages(c("devtools"))
+source("http://www.bioconductor.org/biocLite.R")
+biocLite(c("Biobase"))
+
+
+

General principles

+ +
+
+

Load some data

+

We will use this expression set that combines two studies Transcriptome genetics using second generation sequencing in a Caucasian population. and Understanding mechanisms underlying human gene expression variation with RNA sequencing.. These studies are different populations but we counted the same genes for both. Then we’ll explore the differences.

+
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
+load(file=con)
+close(con)
+mp = montpick.eset
+pdata=pData(mp)
+edata=as.data.frame(exprs(mp))
+fdata = fData(mp)
+ls()
+
## [1] "con"           "edata"         "fdata"         "montpick.eset"
+## [5] "mp"            "pdata"         "tropical"
+
+
+

Calculate the singular vectors

+

Here we calculate the singular vectors:

+
edata = edata[rowMeans(edata) > 100, ]
+edata = log2(edata + 1)
+edata_centered = edata - rowMeans(edata)
+svd1 = svd(edata_centered)
+names(svd1)
+
## [1] "d" "u" "v"
+
+
+

Look at the percent variance explained

+

The percent of variance explained is given by \(\frac{d_{ii}^2}{\sum_{j}d_{jj}^2}\)

+
plot(svd1$d,ylab="Singular value",col=2)
+

+
plot(svd1$d^2/sum(svd1$d^2),ylab="Percent Variance Explained",col=2)
+

+
+
+

Plot top two principal components

+
par(mfrow=c(1,2))
+plot(svd1$v[,1],col=2,ylab="1st PC")
+plot(svd1$v[,2],col=2,ylab="2nd PC")
+

+
+
+

Plot PC1 vs. PC2

+

A very common plot is to plot PC1 versus PC2 to see if you can see any “clusters” or “groups”.

+
plot(svd1$v[,1],svd1$v[,2],col=2,ylab="2nd PC",xlab="1st PC")
+

+

One thing you can do is color them by different variables to see if clusters stand out.

+
plot(svd1$v[,1],svd1$v[,2],ylab="2nd PC",
+     xlab="1st PC",col=as.numeric(pdata$study))
+

+

Another common plot is to make boxplots comparing the PC for different levels of known covariates (don’t forget to show the actual data!).

+
boxplot(svd1$v[,1] ~ pdata$study,border=c(1,2))
+points(svd1$v[,1] ~ jitter(as.numeric(pdata$study)),col=as.numeric(pdata$study))
+

+
+
+

PCs versus SVs

+

What we have been plotting is not exactly the principal components.

+
pc1 = prcomp(edata)
+plot(pc1$rotation[,1],svd1$v[,1])
+

+

To get the actual PCs you have to subtract the column means rather than the row means when normalizing.

+
edata_centered2 = t(t(edata) - colMeans(edata))
+svd2 = svd(edata_centered2)
+plot(pc1$rotation[,1],svd2$v[,1],col=2)
+

+

Despite this, it is most common for people to perform row-centering and then plot the singular vectors (sometimes labeling them PCs like I have done in this document)

+
+
+

Outliers

+

What happens if we introduce a single outlying gene

+
edata_outlier = edata_centered
+edata_outlier[6,] = edata_centered[6,] * 10000
+svd3 = svd(edata_outlier)
+plot(svd1$v[,1],svd3$v[,1],xlab="Without outlier",ylab="With outlier")
+

+

It turns out the new top singular vector is perfectly correlated with the outlying gene

+
plot(svd3$v[,1],edata_outlier[6,],col=4)
+

+
+
+

Further resources

+

There are a large number of resources available about PCA and SVD but the lecture notes from Advanced Statistics for the Life Sciences are the best set of lecture notes focused on genomics currently available.

+
+
+

Session information

+

Here is the session information

+
devtools::session_info()
+
##  setting  value                       
+##  version  R version 3.2.2 (2015-08-14)
+##  system   x86_64, darwin13.4.0        
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_GB.UTF-8                 
+##  tz       Europe/London               
+##  date     2015-09-19                  
+## 
+##  package      * version date       source        
+##  Biobase      * 2.28.0  2015-04-17 Bioconductor  
+##  BiocGenerics * 0.14.0  2015-04-17 Bioconductor  
+##  devtools     * 1.9.1   2015-09-11 CRAN (R 3.2.0)
+##  digest         0.6.8   2014-12-31 CRAN (R 3.2.0)
+##  evaluate       0.8     2015-09-18 CRAN (R 3.2.2)
+##  formatR        1.2.1   2015-09-18 CRAN (R 3.2.2)
+##  htmltools      0.2.6   2014-09-08 CRAN (R 3.2.0)
+##  knitr        * 1.11    2015-08-14 CRAN (R 3.2.2)
+##  magrittr       1.5     2014-11-22 CRAN (R 3.2.0)
+##  memoise        0.2.1   2014-04-22 CRAN (R 3.2.0)
+##  rmarkdown      0.8     2015-08-30 CRAN (R 3.2.2)
+##  stringi        0.5-5   2015-06-29 CRAN (R 3.2.0)
+##  stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)
+##  yaml           2.1.13  2014-06-12 CRAN (R 3.2.0)
+

It is also useful to compile the time the document was processed. This document was processed on: 2015-09-19.

+
+ + +
+ + + + + + + +