-
Notifications
You must be signed in to change notification settings - Fork 14
Part 1: Exploratory Compositional PCA Biplot
Biplots summarize the variance relationships between genes, and distances between samples in the data
-
Gloor GB, Reid G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can J Microbiol. 2016 Apr 12:1-12. (http://www.nrcresearchpress.com/doi/pdfplus/10.1139/cjm-2015-0821)
-
Gloor GB, Wu JR, Pawlowsky-Glahn V, Egozcue JJ. It's all relative: analyzing microbiome data as compositions. Ann Epidemiol. 2016 May;26(5):322-9. doi: 10.1016/j.annepidem.2016.03.003. (http://www.sciencedirect.com/science/article/pii/S1047279716300734)
-
Aitchison J, Greenacre M. Biplots of Compositional Data. Journal of the Royal Statistical Society: Series C 2002 51(4): 375-392 (http://onlinelibrary.wiley.com/doi/10.1111/1467-9876.00275/abstract)
In R, you load your OTU table into a data frame. You need to have some idea of what your table looks like before attempting to import. You may want to open the table in Excel to see what format it has and what headers. We will import both an OTU table and a table with corresponding taxon names. The OTU table is derived from the HMP oral dataset and was produced using the code in chunk/setup.R.
# This is the read counts table
d<-read.table("data/barton_agg.tsv", header=T, sep="\t", stringsAsFactors=F, quote = "", check.names=F, row.names=1, comment.char="")
-
dis the name we are using for our dataframe -
header=Tour table has a header row (with sample names) -
sep="\t"The columns are tab-separated -
stringsAsFactors=F, quote = "", check.names=Fthese are here for parsing -
row.names=1we are using the gene IDs as rownames -
comment.char=""this ignores the comment # in the headers -
check.names=Fthis tells R not to modify our column and rownames starting with a numeric
Whenever you import a data table, you'll want to inspect it to ensure the data looks correct. Try some of the following commands:
head(d)dim(d)nrow(d)colnames(d)-
d[1:3,1:5]# display the first 3 rows and first 5 columns of the data
Q: How many samples are there in this data table? How many genes?
We are going to generate a smaller random set of samples, and then filter our OTU table to remove very low count reads and zeros.
# get a small sample of the data so that we are working with a manageable table size
# working with the full table for some manipulations can take up to 5 minutes,
# so we are going to keep this quick.
# get the columns containing the two groups
# Grep can search and match strings. In this case, we are searching the
# column headers for samples named with "ak" or "op"
ak <- grep("ak", colnames(d))
op <- grep("op", colnames(d))
# pull a random sample of 10 each
d.r <- data.frame(d[,sample(ak, 10)], d[,sample(op, 10)], stringsAsFactors=FALSE)
# We now have 10 "ak" samples and 10 "op" samples in a new data frame called "d.r"Count the number of zeros in the table:
sum(d.r == 0)
Since there are lots of zeros in this table, we should filter out low abundance OTUs. There are a number of ways you could filter some examples here
# Remove OTUs with a mean read count across all samples less than or equal to our cutoff
# Note: we are now working with our subsampled data frame (d.r),
# the original data (d) remains unaltered
count <- 10 #this is the chosen cutoff
d.1 <- data.frame(d.r[which(apply(d.r, 1, function(x){mean(x)}) > count),],
check.names=F)Q: How many OTUs are left?
We are using the the centred log-ratio, or CLR (Aitchison). One of the issues when working with proportions, relative abundances, or normalized counts is that the differences between OTU values is not linear. This will turn our read counts into a ratio abundance (abundance as a ratio to the geometric mean of all OTUs per sample). This will allow us to retain the relationships between the individual components, but also puts the data in linear space where we can perform familiar statistics (and not violate CoDa). The geometric mean will also sale our values so that zero is average relative abundance, and OTUs will be higher (positive) or lower (negative) than the geometric mean of zero.
We are taking a log so the zeros must be replaced with an estimate value
# samples must be ROWS, so the data frame will be transposed using t()
# These functions are from the zCompositons package,
# remember to load it!: library(zCompositions)!
d.czm <- cmultRepl(t(d.1), label=0, method="CZM")Calculate the CLR
# The table needs to be transposed again (samples as COLUMNS)
d.clr <- t(apply(d.czm, 1, function(x){log(x) - mean(log(x))}))
# The output will have samples as ROWSPrincipal component analysis (PCA)
# Samples must be ROWs and features/OTUs as COLUMNS
d.pcx <- prcomp(d.clr)Calculate the total variance and percent variance explained for our PCA axes. The total variance is the sum of the squared standard deviations for each principal component.
# Sum the total variance
d.mvar <- sum(d.pcx$sdev^2)
# Calculate the PC1 and PC2 variance
PC1 <- paste("PC1: ", round(sum(d.pcx$sdev[1]^2)/d.mvar, 3))
PC2 <- paste("PC2: ", round(sum(d.pcx$sdev[2]^2)/d.mvar, 3))
# We are pasting together the component name and the variance to make an axes label
# Look at the PC1 variable by typing it in your consoleLet's plot the CoDa PCA biplot
# The basic biplot function:
# Plus we are printing the variance explained on the axes labels (as calculated before)
biplot(d.pcx, var.axes=T, scale=0, xlab=PC1, ylab=PC2)-
scale=0will scale the plot based on the samples. Usescale=1to scale based on features -
var.axes=TIf TRUE the second set of points have arrows representing them as (unscaled) axes
The basic plot is a little hard to read and interpret. We will modify the code to make it look better
# These are some hacks to get the plot to look OK
# Make the number of points equal to the number of features (for labels)
# Use: "O" for the op samples, "A" for the ak samples, and "." for the taxa
points <- c(rep(".", length(dimnames(d.pcx$rotation)[[1]])))
# We can also make the samples as points instead of labels
samples <- c(rep("A", length(grep("ak",rownames(d.pcx$x)))),
rep("O", length(grep("op",rownames(d.pcx$x)))) )
# Color and text size for labels and points (vector of 2)
# The first is the sample lables, the second is the points (OTUs).
col=c("black",rgb(1,0,0,0.2))
size=c(0.5, 2) #Relative scale, 1 is 100%
# Add all the label information into the biplot code
biplot(d.pcx, cex=size, col=col, var.axes=T,
xlab=PC1, ylab=PC2,
scale=0, ylabs=points, xlabs=samples
)It's ALWAYS important to keep checking your data. Try:
summary(d.pcx)
str(d.pcx)We can also generate a screeplot:
screeplot(d.pcx)
Q: What do these outputs tell you?
The second biplot contains the genus names, and no arrows.
col.g=c("black",rgb(1,0,0,0.4))
size.g=c(0.5, 0.4) #Relative scale, 1 is 100%
genera <- taxon[rownames(d.pcx$rotation), "genus"]
short.gen <- gsub("(^[A-Za-z]{3}).+", "\\1", genera, perl=TRUE)
biplot(d.pcx, cex=size.g, col=col.g, var.axes=F,
xlab=paste("PC1: ", round(sum(d.pcx$sdev[1]^2)/d.mvar, 3)),
ylab=paste("PC2: ", round(sum(d.pcx$sdev[2]^2)/d.mvar, 3)),
scale=0, ylabs=short.gen, xlabs=samples
)
abline(h=0, lty=2, col=rgb(0,0,0,0.1))
abline(v=0, lty=2, col=rgb(0,0,0,0.1))If you would like a PDF of either plot, do the following:
pdf("PCA_plot.pdf")
# The plotting code goes here
dev.off()You should also try a dendrogram
hc <- hclust(dist(d.clr))
plot(hc)
# We are using default parameters (dist=euclidean, hclust=complete linkage)
#line up labels
plot(hc, hang=-1)Q: How does this compare to the biplot?