diff --git a/vignettes/01_04_rmarkdown.Rmd b/vignettes/01_04_rmarkdown.Rmd index 55c4cc1..3397a35 100644 --- a/vignettes/01_04_rmarkdown.Rmd +++ b/vignettes/01_04_rmarkdown.Rmd @@ -47,7 +47,7 @@ knit_hooks$set(setPch = function(before, options, envir) { opts_chunk$set(setPch = TRUE) ``` -We will again hide this in future documents, but this shows how to set a default figure width and height, as well as setting plot margians. +We will again hide this in future documents, but this shows how to set a default figure width and height, as well as setting plot margins. ```{r global-plot,warning=FALSE, message=FALSE} diff --git a/vignettes/01_10_exploratory-analysis.Rmd b/vignettes/01_10_exploratory-analysis.Rmd index 03de4cb..8c0dd5c 100644 --- a/vignettes/01_10_exploratory-analysis.Rmd +++ b/vignettes/01_10_exploratory-analysis.Rmd @@ -53,11 +53,11 @@ biocLite("alyssafrazee/RSkittleBrewer") * Use plots as often as possible * Use tables for phenotype data -* Look for - * Missing values - * Outlier values - * Mislabeled samples - * Naming consistency +* Look for: + * Missing values + * Outlier values + * Mislabeled samples + * Naming consistency ### Make the plots pretty @@ -142,15 +142,15 @@ dim(edata) Here we see that there are a lot of outliers ```{r boxplot} -boxplot(log2(edata+1),col=2,range=0) +boxplot(log2(edata+1),col=2,range=0, las=2, cex.axis=0.8) ``` We can also look at this sample by sample with histograms ```{r histograms} par(mfrow=c(1,2)) -hist(log2(edata[,1]+1),col=2) -hist(log2(edata[,2]+1),col=2) +hist(log2(edata[,1]+1),col=2, cex.main=0.8) +hist(log2(edata[,2]+1),col=2, cex.main=0.8) ``` Or with density plots @@ -180,7 +180,7 @@ We can remove rows that are mostly zero and notice any differences in the distri ```{r} edata = as.data.frame(edata) filt_edata = filter(edata,rowMeans(edata) > 1) -boxplot(as.matrix(log2(filt_edata+1)),col=2) +boxplot(as.matrix(log2(filt_edata+1)),col=2, las=2, cex.axis=0.8) ``` ### Check for obvious data mixups diff --git a/vignettes/01_11_transforms.html b/vignettes/01_11_transforms.html deleted file mode 100644 index cc05719..0000000 --- a/vignettes/01_11_transforms.html +++ /dev/null @@ -1,236 +0,0 @@ - - - - -
- - - - - - - - -This document depends on the following packages:
- library(devtools)
- library(Biobase)
- library(dplyr)
-To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)
install.packages(c("devtools"))
-source("http://www.bioconductor.org/biocLite.R")
-biocLite(c("Biobase"))
-We will use this expression set to look at how we use plots and tables to check for different characteristics
-con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
-load(file=con)
-close(con)
-bm = bodymap.eset
-pdata=pData(bm)
-edata=as.data.frame(exprs(bm))
-fdata = fData(bm)
-ls()
-## [1] "bm" "bodymap.eset" "con" "edata"
-## [5] "fdata" "pdata" "tropical"
-We would like continuous data to be nice and symmetric like a normal distribution for two reasons:(1) plots are easier to see this way and (2) most statistical methods are designed to work better for non-skewed data.
-hist(rnorm(1000),col=2)
-Realistically though most genomic data is skewed
-hist(edata[,1],col=2,breaks=100)
-One way to address this skew is to use a transformation. One common transformation is the log transform.
hist(log(edata[,1]),col=2,breaks=100)
-One thing to be careful with is that values of zero become -inf when you apply the log transform because the log of zero isn’t defined.
min(log(edata))
-## [1] -Inf
-You can remove this problem by adding a small number to all the counts before taking the log. This doesn’t change the overall distribution much but resolves the zero problem
-min(log(edata[,1] + 1))
-## [1] 0
-hist(log(edata[,1] + 1),breaks=100,col=2)
-A common choice is to use the log2 transform. The reason is that data on the log2 scale can be expressed as fold changes. If log2(x) - log2(y) = z then there is a z fold change difference between x and y.
hist(log2(edata[,1] + 1),breaks=100,col=2)
-Let’s zoom in on the values greater than zero (we’ll come back to the zero values in a minute)
-hist(log2(edata[,1] + 1),breaks=100,col=2,xlim=c(1,15),ylim=c(0,400))
-This is starting to look a little better but we still have the zeros problem.
-In many types of sequencing data we observe that there is not much data for many features.
-hist(rowSums(edata==0),col=2)
-A common pre-processing technique is to remove features that don’t have much data. One way to identify and remove them is to use rowMeans and rowMedians
low_genes = rowMeans(edata) < 5
-table(low_genes)
-## low_genes
-## FALSE TRUE
-## 11413 41167
-filt_edata = filter(edata,!low_genes)
-dim(filt_edata)
-## [1] 11413 19
-low_genes2 = rowMedians(as.matrix(edata)) < 5
-table(low_genes2,low_genes)
-## low_genes
-## low_genes2 FALSE TRUE
-## FALSE 9844 2
-## TRUE 1569 41165
-filt_edata2 = filter(edata,!low_genes2)
-dim(filt_edata2)
-## [1] 9846 19
-After filtering we observe that using the log transform makes the distribution much more symmetric but there are still zeros. We could increase the filter to remove this issue if we needed to.
hist(log2(filt_edata[,1] + 1),col=2)
-There are a large number of other transforms depending on the type of data you are using. Some common ones are:
-Here is the session information
-devtools::session_info()
-## setting value
-## version R version 3.2.2 (2015-08-14)
-## system x86_64, darwin13.4.0
-## ui X11
-## language (EN)
-## collate en_GB.UTF-8
-## tz Europe/London
-## date 2015-09-17
-##
-## package * version date source
-## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
-## Biobase * 2.28.0 2015-04-17 Bioconductor
-## BiocGenerics * 0.14.0 2015-04-17 Bioconductor
-## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
-## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.0)
-## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
-## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.2)
-## evaluate 0.7.2 2015-08-13 CRAN (R 3.2.0)
-## formatR 1.2 2015-04-21 CRAN (R 3.2.0)
-## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
-## knitr * 1.11 2015-08-14 CRAN (R 3.2.2)
-## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.0)
-## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
-## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
-## R6 2.1.1 2015-08-19 CRAN (R 3.2.2)
-## Rcpp 0.12.1 2015-09-10 CRAN (R 3.2.0)
-## rmarkdown 0.8 2015-08-30 CRAN (R 3.2.2)
-## stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
-## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
-## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
-It is also useful to compile the time the document was processed. This document was processed on: 2015-09-17.
-This document depends on the following packages:
+ library(devtools)
+ library(Biobase)
+To install these packages you can use the code (or if you are compiling the document, remove the eval=FALSE from the chunk.)
install.packages(c("devtools"))
+source("http://www.bioconductor.org/biocLite.R")
+biocLite(c("Biobase"))
+We will use this expression set that combines two studies Transcriptome genetics using second generation sequencing in a Caucasian population. and Understanding mechanisms underlying human gene expression variation with RNA sequencing.. These studies are different populations but we counted the same genes for both. Then we’ll explore the differences.
+con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
+load(file=con)
+close(con)
+mp = montpick.eset
+pdata=pData(mp)
+edata=as.data.frame(exprs(mp))
+fdata = fData(mp)
+ls()
+## [1] "con" "edata" "fdata" "montpick.eset"
+## [5] "mp" "pdata" "tropical"
+Here we calculate the singular vectors:
+edata = edata[rowMeans(edata) > 100, ]
+edata = log2(edata + 1)
+edata_centered = edata - rowMeans(edata)
+svd1 = svd(edata_centered)
+names(svd1)
+## [1] "d" "u" "v"
+The percent of variance explained is given by \(\frac{d_{ii}^2}{\sum_{j}d_{jj}^2}\)
+plot(svd1$d,ylab="Singular value",col=2)
+plot(svd1$d^2/sum(svd1$d^2),ylab="Percent Variance Explained",col=2)
+par(mfrow=c(1,2))
+plot(svd1$v[,1],col=2,ylab="1st PC")
+plot(svd1$v[,2],col=2,ylab="2nd PC")
+A very common plot is to plot PC1 versus PC2 to see if you can see any “clusters” or “groups”.
+plot(svd1$v[,1],svd1$v[,2],col=2,ylab="2nd PC",xlab="1st PC")
+One thing you can do is color them by different variables to see if clusters stand out.
+plot(svd1$v[,1],svd1$v[,2],ylab="2nd PC",
+ xlab="1st PC",col=as.numeric(pdata$study))
+Another common plot is to make boxplots comparing the PC for different levels of known covariates (don’t forget to show the actual data!).
+boxplot(svd1$v[,1] ~ pdata$study,border=c(1,2))
+points(svd1$v[,1] ~ jitter(as.numeric(pdata$study)),col=as.numeric(pdata$study))
+What we have been plotting is not exactly the principal components.
+pc1 = prcomp(edata)
+plot(pc1$rotation[,1],svd1$v[,1])
+To get the actual PCs you have to subtract the column means rather than the row means when normalizing.
+edata_centered2 = t(t(edata) - colMeans(edata))
+svd2 = svd(edata_centered2)
+plot(pc1$rotation[,1],svd2$v[,1],col=2)
+Despite this, it is most common for people to perform row-centering and then plot the singular vectors (sometimes labeling them PCs like I have done in this document)
+What happens if we introduce a single outlying gene
+edata_outlier = edata_centered
+edata_outlier[6,] = edata_centered[6,] * 10000
+svd3 = svd(edata_outlier)
+plot(svd1$v[,1],svd3$v[,1],xlab="Without outlier",ylab="With outlier")
+It turns out the new top singular vector is perfectly correlated with the outlying gene
+plot(svd3$v[,1],edata_outlier[6,],col=4)
+There are a large number of resources available about PCA and SVD but the lecture notes from Advanced Statistics for the Life Sciences are the best set of lecture notes focused on genomics currently available.
+Here is the session information
+devtools::session_info()
+## setting value
+## version R version 3.2.2 (2015-08-14)
+## system x86_64, darwin13.4.0
+## ui X11
+## language (EN)
+## collate en_GB.UTF-8
+## tz Europe/London
+## date 2015-09-19
+##
+## package * version date source
+## Biobase * 2.28.0 2015-04-17 Bioconductor
+## BiocGenerics * 0.14.0 2015-04-17 Bioconductor
+## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.0)
+## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
+## evaluate 0.8 2015-09-18 CRAN (R 3.2.2)
+## formatR 1.2.1 2015-09-18 CRAN (R 3.2.2)
+## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
+## knitr * 1.11 2015-08-14 CRAN (R 3.2.2)
+## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
+## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
+## rmarkdown 0.8 2015-08-30 CRAN (R 3.2.2)
+## stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
+## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
+## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
+It is also useful to compile the time the document was processed. This document was processed on: 2015-09-19.
+