Skip to content

Part 1: Exploratory Compositional PCA Biplot

Jean M Macklaim edited this page Oct 18, 2018 · 18 revisions

Biplots summarize the variance relationships between genes, and distances between samples in the data

References:


Read in the data table

In R, you load your OTU table into a data frame. You need to have some idea of what your table looks like before attempting to import. You may want to open the table in Excel to see what format it has and what headers. We will import both an OTU table and a table with corresponding taxon names. The OTU table is derived from the HMP oral dataset and was produced using the code in chunk/setup.R.

# This is the read counts table
d<-read.table("data/barton_agg.tsv", header=T, sep="\t", stringsAsFactors=F, quote = "", check.names=F, row.names=1, comment.char="")
  • d is the name we are using for our dataframe
  • header=T our table has a header row (with sample names)
  • sep="\t" The columns are tab-separated
  • stringsAsFactors=F, quote = "", check.names=F these are here for parsing
  • row.names=1 we are using the gene IDs as rownames
  • comment.char="" this ignores the comment # in the headers
  • check.names=F this tells R not to modify our column and rownames starting with a numeric

Whenever you import a data table, you'll want to inspect it to ensure the data looks correct. Try some of the following commands:

  • head(d)
  • dim(d)
  • nrow(d)
  • colnames(d)
  • d[1:3,1:5] # display the first 3 rows and first 5 columns of the data

Q: How many samples are there in this data table? How many genes?

We are going to generate a smaller random set of samples, and then filter our OTU table to remove very low count reads and zeros.

# get a small sample of the data so that we are working with a manageable table size
# working with the full table for some manipulations can take up to 5 minutes, 
# so we are going to keep this quick.

# get the columns containing the two groups
# Grep can search and match strings. In this case, we are searching the 
# column headers for samples named with "ak" or "op"
ak <- grep("ak", colnames(d))
op <- grep("op", colnames(d))

# pull a random sample of 10 each
d.r <- data.frame(d[,sample(ak, 10)], d[,sample(op, 10)], stringsAsFactors=FALSE)

# We now have 10 "ak" samples and 10 "op" samples in a new data frame called "d.r"

Count the number of zeros in the table:
sum(d.r == 0)

Since there are lots of zeros in this table, we should filter out low abundance OTUs. There are a number of ways you could filter some examples here

# Remove OTUs with a mean read count across all samples less than or equal to our cutoff
# Note: we are now working with our subsampled data frame (d.r), 
# the original data (d) remains unaltered

count <- 10 #this is the chosen cutoff
d.1 <- data.frame(d.r[which(apply(d.r, 1, function(x){mean(x)}) > count),], 
    check.names=F)

Q: How many OTUs are left?

Now we will transform the data

We are using the the centred log-ratio, or CLR (Aitchison). One of the issues when working with proportions, relative abundances, or normalized counts is that the differences between OTU values is not linear. This will turn our read counts into a ratio abundance (abundance as a ratio to the geometric mean of all OTUs per sample). This will allow us to retain the relationships between the individual components, but also puts the data in linear space where we can perform familiar statistics (and not violate CoDa). The geometric mean will also sale our values so that zero is average relative abundance, and OTUs will be higher (positive) or lower (negative) than the geometric mean of zero.

We are taking a log so the zeros must be replaced with an estimate value

# samples must be ROWS, so the data frame will be transposed using t()
# These functions are from the zCompositons package, 
# remember to load it!: library(zCompositions)!
d.czm <- cmultRepl(t(d.1),  label=0, method="CZM")

Calculate the CLR

# The table needs to be transposed again (samples as COLUMNS)
d.clr <- t(apply(d.czm, 1, function(x){log(x) - mean(log(x))}))
# The output will have samples as ROWS

Principal component analysis (PCA)

# Samples must be ROWs and features/OTUs as COLUMNS
d.pcx <- prcomp(d.clr)

Calculate the total variance and percent variance explained for our PCA axes. The total variance is the sum of the squared standard deviations for each principal component.

# Sum the total variance
d.mvar <- sum(d.pcx$sdev^2)
# Calculate the PC1 and PC2 variance
PC1 <- paste("PC1: ", round(sum(d.pcx$sdev[1]^2)/d.mvar, 3))
PC2 <- paste("PC2: ", round(sum(d.pcx$sdev[2]^2)/d.mvar, 3))
# We are pasting together the component name and the variance to make an axes label
# Look at the PC1 variable by typing it in your console

Let's plot the CoDa PCA biplot

# The basic biplot function:
# Plus we are printing the variance explained on the axes labels (as calculated before)
biplot(d.pcx, var.axes=T, scale=0, xlab=PC1, ylab=PC2)
  • scale=0 will scale the plot based on the samples. Use scale=1 to scale based on features
  • var.axes=T If TRUE the second set of points have arrows representing them as (unscaled) axes

The basic plot is a little hard to read and interpret. We will modify the code to make it look better

# These are some hacks to get the plot to look OK
# Make the number of points equal to the number of features (for labels)
#    Use: "O" for the op samples, "A" for the ak samples, and "." for the taxa

points <- c(rep(".", length(dimnames(d.pcx$rotation)[[1]])))

# We can also make the samples as points instead of labels

samples <- c(rep("A", length(grep("ak",rownames(d.pcx$x)))), 
    rep("O", length(grep("op",rownames(d.pcx$x)))) )

# Color and text size for labels and points (vector of 2)
#   The first is the sample lables, the second is the points (OTUs). 
col=c("black",rgb(1,0,0,0.2))
size=c(0.5, 2) #Relative scale, 1 is 100%

# Add all the label information into the biplot code
biplot(d.pcx, cex=size, col=col, var.axes=T,
    xlab=PC1, ylab=PC2,
    scale=0, ylabs=points, xlabs=samples
)

It's ALWAYS important to keep checking your data. Try:

summary(d.pcx)
str(d.pcx)

We can also generate a screeplot:
screeplot(d.pcx)

Q: What do these outputs tell you?

The second biplot contains the genus names, and no arrows.

col.g=c("black",rgb(1,0,0,0.4))
size.g=c(0.5, 0.4) #Relative scale, 1 is 100%
genera <- taxon[rownames(d.pcx$rotation), "genus"]
short.gen <- gsub("(^[A-Za-z]{3}).+", "\\1", genera, perl=TRUE)

biplot(d.pcx, cex=size.g, col=col.g, var.axes=F,
    xlab=paste("PC1: ", round(sum(d.pcx$sdev[1]^2)/d.mvar, 3)),
    ylab=paste("PC2: ", round(sum(d.pcx$sdev[2]^2)/d.mvar, 3)),
    scale=0, ylabs=short.gen, xlabs=samples
)
abline(h=0, lty=2, col=rgb(0,0,0,0.1))
abline(v=0, lty=2, col=rgb(0,0,0,0.1))

If you would like a PDF of either plot, do the following:

pdf("PCA_plot.pdf")
# The plotting code goes here
dev.off()

You should also try a dendrogram

hc <- hclust(dist(d.clr))
plot(hc)
# We are using default parameters (dist=euclidean, hclust=complete linkage)

#line up labels
plot(hc, hang=-1)

Q: How does this compare to the biplot?