CoDa_microbiome_tutorial/make_interpret_biplot.Rmd at master · ggloor/CoDa_microbiome_tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: "Simple biplot"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
    pdf_document:
        fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('make_interpret_biplot.Rmd')"
# The compositional biplot

When analyzing and interpreting compositional data, it is important to remember that we are examining the variance in the ratios of the underlying data, and not directly examining abundance. The first tool that we will use is the compositional biplot. This is generated by the following set of steps:

1. remove essential 0 values (0s that are in all samples. i.e. nondetects)
2. perform any additional filtering (sparsity, minimal abundance, minimum sample count, etc)
3. adjust remaining 0 values with the zCompositions package
4. perform the clr transform on the data
5. conduct a singular value decomposition using prcomp
6. display the results in a principle component plot

Let us see how this works in principle. We will make a sample dataset that has 30 samples and only eight features. Samples will be in two groups. The first group of 20 will differ from the last group of 10 . Feature A will be more abundant in the first 20 samples, and less abundant in the last 10 , features C and D will be the opposite, with feature D having a greater difference between groups than feature C. Features B, and E to H will be highly variable, but the variation will be associated with samples randomly. For simplicity we will use a random-uniform distribution, but any other distribution would work as well.

```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, message=FALSE,fig.cap="Counts were generated randomly for eight features labeled a_h and were free to range from the values indicated for each feature. Thirty samples were generated, and  features A,C and D were differentially abundant in the first twenty and last 10 samples. The others were of widely different abundances, but with totally random change between samples. Feature D is a slightly randomized version of feature C that is scaled 100 fold. From this we can see that the absolute abundance is not represented in the biplot."}

set.seed(7)
# runif(n, min, max)
A <- c(runif(20, 15, 30), runif(10,5,10))
B <- runif(30, 1, 50)
C <- c(runif(20, 5,10), runif(10,15,30))
D <- C * runif(30, 1000, 1250)
# D <- c(runif(20, 10, 20), runif(10,40,80))
E <- runif(30, 1, 50)
F <- runif(30, 1, 2)
G <- runif(30, 1000, 1200)
H <- runif(30, 1000, 2000)

a_h <- cbind(A,B,C,D,E,F,G,H)
a_h[a_h<0] <- 0.00001

# convert to proportions
a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)}))

# clr transform
a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))}))

# extract the principle components and loadings
a_h.pcx <- prcomp(a_h.clr)

# make a compositional biplot
#biplot(a_h.pcx, var.axes=TRUE, scale=0) # form
biplot(a_h.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ")
) # covariance

```

Principle component plots display the projection (shadow) of your multi-dimensional data onto a lower number of dimensions, and here the maximum number of dimensions possible is seven. The compositional biplot  displays the results of your experiment in a semi-quantitative way.

The first dimension is the one that displays the largest amount of variation in the data: for example, if your data had four features, it would have 3 dimensions (like a rugby ball) and the first dimension would represent the long axis.

The principle components are arranged in decreasing order of the variance explained. If you have a strong effect driven by a single experimental feature, then you will have a large amount of variation explained on the first axis, and a much smaller

From this plot we can determine the following:

1. The first two principle components explain about 80\% of the variance in the dataset. This is very good. The greater the variance explained, the greater the confidence one can have in the projection. If we thought of this as a shadow of the data, then most features and samples would have fairly distinct locations.
2. The samples (in black) partition into two groups: samples 1:20 on the left and samples 21-30 on the right with a clear separation between them. They are separated on PC1, which indicates a lack of confounding effects on the split between samples.
3. The length and direction of the arrows (feature locations) is proportional to the standard deviation of the feature in the dataset. So we can see that feature A is highly variable along the same direction as are samples 1-20. We can interpret this as feature A is relatively more abundant in these samples than in samples 21-30. Likewise for the other features.
4. features C,D are very close together; this is referred to as having a short link. The length of a link is proportional to the variance in their ratios. In other words, the variance of the ratios of these two features will be fairly constant. In other words again, these features have a high compositional association. These are the types of relationships we aim to identify with the propr package later on.
5. We would expect that features A, C and D are the most variable between samples, and this is the type of result we would test with ALDEx2 later on.


```{r, echo=FALSE, results='show', fig.width=8, fig.height=8, warning=FALSE, message=FALSE,fig.cap="The advantage of compositions is that the results are largely invariant to subsetting. That is, we get essentially the same answer with all (A-H), and with a subset of the data (A-G). We also see that in these data, the variance of the clr transformed values is much more informative than the absolute, or proportional values, and better represents the known structure of the data. "}

a_h <- cbind(A,B,C,D,E,F,G,H)
a_h.prop <- t(apply(a_h, 1, function(x){x/sum(x)}))
a_h.clr <- t(apply(a_h.prop, 1, function(x){log(x) - mean(log(x))}))
a_h.pcx <- prcomp(a_h.clr)
a_h.p.pcx <- prcomp(a_h.prop)
a_h.c.pcx <- prcomp(a_h)

a_g <- cbind(A,B,C,D,E,F,G)
a_g.prop <- t(apply(a_g, 1, function(x){x/sum(x)}))
a_g.clr <- t(apply(a_g.prop, 1, function(x){log(x) - mean(log(x))}))
a_g.pcx <- prcomp(a_g.clr)
a_g.p.pcx <- prcomp(a_g.prop)
a_g.c.pcx <- prcomp(a_g)

a_h.z <- apply(a_h, 2, function(x){ (x-mean(x))/sqrt(var(x)) })
a_h.z.prop <- apply(a_h.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) })
a_h.z.prop.pcx <- prcomp(a_h.z.prop)
a_h.z.pcx <- prcomp(a_h.z)

a_g.z <- apply(a_g, 2, function(x){ (x-mean(x))/sqrt(var(x)) })
a_g.z.prop <- apply(a_g.prop, 2, function(x){ (x-mean(x))/sqrt(var(x)) })
a_g.z.prop.pcx <- prcomp(a_g.z.prop)
a_g.z.pcx <- prcomp(a_g.z)

par(mfrow=c(2,3))
#biplot(a_h.pcx, var.axes=TRUE, scale=0) # form
biplot(a_h.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_h.pcx$sdev[1]^2 / sum(a_h.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_h.pcx$sdev[2]^2 / sum(a_h.pcx$sdev^2),3), sep=": ")
    , main="A-H clr"
) # covariance

biplot(a_h.p.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_h.p.pcx$sdev[1]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_h.p.pcx$sdev[2]^2 / sum(a_h.p.pcx$sdev^2),3), sep=": ")
    , main="A-H prop"
) # covariance

biplot(a_h.c.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_h.c.pcx$sdev[1]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_h.c.pcx$sdev[2]^2 / sum(a_h.c.pcx$sdev^2),3), sep=": ")
    , main="A-H count"
) # covariance

biplot(a_g.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_g.pcx$sdev[1]^2 / sum(a_g.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_g.pcx$sdev[2]^2 / sum(a_g.pcx$sdev^2),3), sep=": ")
    , main="A-G clr"
) # covariance

biplot(a_g.p.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_g.p.pcx$sdev[1]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_g.p.pcx$sdev[2]^2 / sum(a_g.p.pcx$sdev^2),3), sep=": ")
    , main="A-G prop"
) # covariance

biplot(a_g.c.pcx, var.axes=TRUE, scale=0,
    xlab=paste("PC1", round(a_g.c.pcx$sdev[1]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": "),
    ylab=paste("PC2", round(a_g.c.pcx$sdev[2]^2 / sum(a_g.c.pcx$sdev^2),3), sep=": ")
    , main="A-G count"
) # covariance

```