CoDa_microbiome_tutorial/Exploratory_biplot.Rmd at master · ggloor/CoDa_microbiome_tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
---
title: "Exploratory biplot"
author: "gg"
date: '`r format(Sys.time(), "%d %B, %Y")`'
bibliography: /Users/ggloor/Library/texmf/bibtex/bib/bibdesk_refs.bib
fig_caption: true
output:
    pdf_document:
        fig_caption: yes
---

To run this file:
Rscript -e "rmarkdown::render('Exploratory_biplot.Rmd')"

## R packages needed

`zCompositions, ppclust, factoextra, cluster, fclust` CRAN

## The dataset and first biplot

We will use as an example a transcriptome dataset [@Schurch:2016aa;@Gierlinski:2015aa] containing 96 samples, 48 each from wt and SNF2 knockout strain These data have been filtered to include only those features that are present with a mean count of at least 0.1 across all samples.

The compositional biplot is the first exploratory data analysis tool that should be used whenever exploring a  dataset. It shows, in one plot, the essences of your results. Do my samples separate into groups? features are driving this separation? what features are irrelevant to the analysis?

Compositional biplots appear to be complex and intimidating, but with a little patience and practice they are easily interpretable [@aitchison2002biplots]. They are based on the variance of the ratios of the parts, and are substantially more informative that the commonly used PCoA plots that are driven largely by abundance [@Gorvitovskaia:2016aa].

```{r outlier_function, echo=F}
codaSeq.outlier <- function(x, plot.me=TRUE, col=rgb(1,0,0,0.3)){

  pcx <- prcomp(x)
  mv <- sum(pcx$sdev^2)

  sample.var <-  apply(pcx$x,1,function(y){sum(y^2/mv)})

  cut <- median(apply(pcx$x,1,function(x){sum(x^2/mv)})) + 2 * IQR(apply(pcx$x,1,function(x){sum(x^2/mv)}))

  bad <- names(which(apply(pcx$x,1,function(x){sum(x^2/mv)}) > cut))
  good <- names(which(apply(pcx$x,1,function(x){sum(x^2/mv)}) <= cut))
  if(plot.me == TRUE){
    hist(sample.var, breaks=100)
    boxplot(sample.var, horizontal=TRUE, col=col, add=TRUE)
    abline(v=cut, lty=2)
  }
  return(list(sample.var=sample.var, bad=bad, good=good) )
}
```

```{r biplot, echo=TRUE, results='as.is', fig.width=7, fig.height=7, error=FALSE, message=FALSE, warning=FALSE, fig.cap="The compositional biplot is the workhorse tool for CoDa. This plot summarizes the entire analysis in a qualitative manner. We can see that the op and ak samples separate very well, although the proportion of variance explained on component 1 is small. Furthermore, we can see the genus names of some of the features that are driving this divide. Finally, component 1 has substantially more variance than does componet 2, and we can explain this experiment as a simple two part comparison with the largest variance along the axis of the comparison."}

# read in the dataset and associated taxonomy file
# samples by row - now correct: thx VGalata
d.agg <- read.table("data/barton_agg.tsv", sep="\t", header=T, row.names=1)

# load the library zCompositions to perform 0 replacement
library(zCompositions)

# it is important to first filter to remove rows that are exclusively 0 values
d.filt <- d.agg[,colSums(d.agg) > 0,]

# we are using the Count Zero Multiplicative approach
d.n0 <- cmultRepl(d.filt, method="CZM", label=0)

# generate the centered log-ratio transformed data
# samples by row
d.clr <- apply(d.n0, 1, function(x) log(x) - mean(log(x)))

# apply a singular value decomposition to the dataset
# do not use princomp function in R!!
pcx <- prcomp(t(d.clr))

# get the labels for the first two components
PC1 <- paste("PC1: ", round(pcx$sdev[1]^2/sum(pcx$sdev^2),3), sep="")
PC2 <- paste("PC2: ", round(pcx$sdev[2]^2/sum(pcx$sdev^2),3), sep="")

par(fig=c(0,1,0,1), new=TRUE)
# generate a scree plot
par(fig=c(0,0.8,0,1), new=TRUE)
biplot(pcx, cex=c(0.6,0.6), col=c("black", rgb(1,0,0,0.2)), var.axes=F, scale=0,
    xlab=PC1, ylab=PC2)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

par(fig=c(0.8,1,0,1), new=TRUE)
plot(pcx, main="hist")

```

\newpage

### Rules for interpreting compositional biplots:

- All interpretations are up to the limit of the variance explained. We can think of this as a shadow of the multidimensional dataset (4545 dimensions!) projected onto two dimensions. If the variance explained is high ( > 0.8) then the edges of the shadows are sharp, however, if the variance explained is low, as it is here, then we have little confidence in the exact placement of any individual sample or feature.

- The distance between samples is related to their multivariate similarity of the parts as ratios. If all components are relatively the same (ie, the ratios between all parts are identical), then two samples are in the same location.

- We must interpret the features as ratios. Abundance information is not directly available on these plots.

- The distance and direction of an feature from the origin is the standard deviation of the ratio of that feature to the geometric mean of all features.

- The line between any set of features is called a link. Links that pass through more than one feature are permitted and do not change the interpretation.

- Short links indicate a constant or near constant ratio between the two (or more) linked features in the dataset. This dataset is too complex to identify links easily

- Long links indicate a non-constant ratio between the joined features, and define a ratio relationship that can be inverse or random. There is no principled method to determine which is the case.

\newpage

## Finding outliers

We can see that there are a number of samples that appear to be outlier samples. Should we include SNF2.6 in the analysis or not? One of the messages of the Barton papers [@Schurch:2016aa;@Gierlinski:2015aa] was that about 10% of samples, even carefully prepared samples can be outliers for unknown methodological reasons. We approach outliers by finding those samples that contribute more variance than expected to the variance of the group. Outliers are defined as those samples that contribute greater than the median plus twice the interquartile range of the sample variance to the total variance of the group.


```{r outlier, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=4, fig.width=4}
# get the outliers from each group. See codaSeq.outlier function
# get WT indices
WT <- grep("WT", rownames(d.agg))
# subset
WT.agg <- d.agg[WT,]

# filter, samples by row
wt.gt0 <- WT.agg[,colSums(WT.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
wt.agg.n0 <- cmultRepl(wt.gt0, method="CZM", label=0)

# clr transform
wt.agg.n0.clr <- t(apply(wt.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.wt  <- prcomp(wt.agg.n0.clr)
mvar.wt.clr <- sum(pcx.wt$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.wt, var.axes=FALSE, scale=0,  cex=c(1,.05))

# make a list of names to keep. found in $good
WT.g <- codaSeq.outlier(wt.agg.n0.clr, plot.me=TRUE)

SNF <- grep("SNF", rownames(d.agg))
# subset
SNF.agg <- d.agg[SNF,]

# filter, samples by row
SNF.gt0 <- SNF.agg[,colSums(SNF.agg) > 0]

# estimate 0 values (zCompositions)
# samples by row
SNF.agg.n0 <- cmultRepl(SNF.gt0, method="CZM", label=0)

# clr transform
SNF.agg.n0.clr <- t(apply(SNF.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.SNF  <- prcomp(SNF.agg.n0.clr)
mvar.SNF.clr <- sum(pcx.SNF$sdev^2)

# plot
par(mfrow=c(1,1))
biplot(pcx.SNF, var.axes=FALSE, scale=0,  cex=c(1,.05))

# make a list of names to keep. found in $good
SNF.g <- codaSeq.outlier(SNF.agg.n0.clr, plot.me=TRUE)

```
\clearpage
\newpage

## Biplot of non-outlier samples only

Now we can make a biplot of only those samples that are non-outliers. We see that the SNF2 KO group is more homogeneous than is the WT group. Almost certainly since the SNF2 group is clonal, and the WT group is likely grown from a frozen culture.

```{r good_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=7, fig.width=7}

# make a dataset of only the non-outlier samples
d.good <- rbind(d.agg[SNF.g$good,],d.agg[WT.g$good,])

# filter
#d.good.gt0 <- codaSeq.filter(d.good,  min.count=1, samples.by.row=TRUE)
d.good.gt0 <- d.good[,colSums(d.good) > 0]

# estimate 0 values (zCompositions)
d.good.agg.n0 <- cmultRepl(d.good.gt0, method="CZM", label=0)

# clr transform
d.good.agg.n0.clr <- t(apply(d.good.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.good  <- prcomp(d.good.agg.n0.clr)
mvar.good <- sum(pcx.good$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.good$sdev[1]^2/sum(pcx.good$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.good$sdev[2]^2/sum(pcx.good$sdev^2),3), sep="")

# plot and save
par(mfrow=c(1,1))
biplot(pcx.good, var.axes=FALSE, scale=0, cex=c(1,.5),  xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# optionally save
# write.table(d.good.gt0, file="data/filtered_table.txt", sep="\t", quote=F, col.names=NA)
```
\clearpage

## We can also plot the biplot of the samples that were removed.

We can also examine those samples that were deemed to be outliers. Here we can see that the outlier samples have the axis of the experiment on PC2, which means that there problems in the dataset that led to these samples being outliers overwhelms the actual signal in the dataset. This confirms that these samples should be removed from the dataset as they are adding either noise, or some systematic information to the dataset.

```{r bad_data_pca, message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=7, fig.width=7}

# make a dataset of only the non-outlier samples
d.bad <- rbind(d.agg[SNF.g$bad,],d.agg[WT.g$bad,])

# filter
d.bad.gt0 <- d.bad[,colSums(d.bad) > 0]

# estimate 0 values (zCompositions)
d.bad.agg.n0 <- cmultRepl(d.bad.gt0, method="CZM", label=0)

# clr transform
d.bad.agg.n0.clr <- t(apply(d.bad.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.bad  <- prcomp(d.bad.agg.n0.clr)
mvar.bad <- sum(pcx.bad$sdev^2)
# get the labels for the first two components
PC1.g <- paste("PC1: ", round(pcx.bad$sdev[1]^2/sum(pcx.bad$sdev^2),3), sep="")
PC2.g <- paste("PC2: ", round(pcx.bad$sdev[2]^2/sum(pcx.bad$sdev^2),3), sep="")

# plot and save
par(mfrow=c(1,1))
biplot(pcx.bad, var.axes=FALSE, scale=0, cex=c(1,.5),  xlab=PC1.g, ylab=PC2.g)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))

# optionally save
# write.table(d.good.gt0, file="data/filtered_table.txt", sep="\t", quote=F, col.names=NA)
```

\clearpage

## Additional filtering does not change the conclusions

We can do additional filtering. Examining the features, most contribute little, if anything, to the separation. These can be removed by filtering out low variance features. Note that we lose some resolution, but that we recapitulate the dataset with only half the features. We could do this iteratively.

```{r lowvar, , message=FALSE, warning=FALSE, echo=FALSE, fig.cap='outliers', fig.height=7, fig.width=7}
var.clr <- apply(d.good.agg.n0.clr, 2, var)
nms <- which(var.clr > median(var.clr)) #

d.lv <- d.good[, names(nms)]
# filter

# estimate 0 values (zCompositions)
d.lv.agg.n0 <- cmultRepl(d.lv, method="CZM", label=0)

# clr transform
d.lv.agg.n0.clr <- t(apply(d.lv.agg.n0, 1, function(x) log(x) - mean(log(x))))

# SVD
pcx.lv  <- prcomp(d.lv.agg.n0.clr)
mvar.lv <- sum(pcx.lv$sdev^2)
PC1.lv <- paste("PC1: ", round(pcx.lv$sdev[1]^2/sum(pcx.lv$sdev^2),3), sep="")
PC2.lv <- paste("PC2: ", round(pcx.lv$sdev[2]^2/sum(pcx.lv$sdev^2),3), sep="")

# plot and save
par(mfrow=c(1,1))
biplot(pcx.lv, var.axes=FALSE, scale=0, cex=c(1,.5),  xlab=PC1.lv, ylab=PC2.lv)
abline(h=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
abline(v=0, lty=2, lwd=2, col=rgb(0,0,0,0.3))
# get the labels for the first two components
```
\clearpage

## FUZZY CLUSTERING

We can plot the samples according to their kmeans cluster membership. For this we are using the fuzzy clustering package ppclust [@fuzzy:2018]. There is a good introduction to fuzzy clustering in  [@fernandez:2012]. Essentially, we are using a probabilistic (or possibilistic) approach to determine the number of clusters, and the cluster memberships. The vignette for this approach is at: https://cran.r-project.org/web/packages/ppclust/vignettes/fcm.html. As noted in the workshop, we get two clusters if we choose centers=2 or =3, but the SNF2 and WT groups split if we choose centers=4.

```{r fuzzy}
library(ppclust)
library(factoextra)
library(cluster)
library(fclust)


res.fcm <- fcm(d.lv.agg.n0.clr, centers=2)
#as.data.frame(res.fcm$u)
#summary(res.fcm)

res.fcm2 <- ppclust2(res.fcm, "kmeans")

factoextra::fviz_cluster(res.fcm2, data = d.lv.agg.n0.clr,
  ellipse.type = "norm", labelsize=10,  palette = "jco",
  repel = TRUE)
```
############

## References