microarray-and-RNAseq-analysis/microarrayLimma.Rmd at main · lmsalaudeen/microarray-and-RNAseq-analysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
---
title: "Microarray Analysis with Limma"
author: "Latifah Mojisola Salaudeen"
date: "2024-03-23"
output:
  html_document:
    toc: true
    toc_float: true
editor_options:
  markdown:
    wrap: 100
---

```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Microarray Analysis with Limma

## Setup

For this, I'll be using 3 libraries:

-   Biobase: to define an expression set

-   limma: the main library for the analysis

-   ggplot2: for data visualisation

```{r libraries, include=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("Biobase", "limma"))

```

## Load Files

Files needed:

-   The series matrix, which contains the gene expression data.
-   The metadata which provides information about each sample and the disease status.
-   The annotation data which shows the corresponding gene name of each probe (Some gene expression
    data have a column with gene ids or other gene identifier, so this file may not be necessary).

For expression data from a repository like GEO, the metadata and expression data are usually in one
txt file. The metadata usually begins with "!", so this can be easily extracted (see python script).
[metadataScript.py](metadataScript.py)

```{r}
GSE20966SeqData <- read.delim("Data/GSE20966/GSE20966_series_matrix.txt", header=T, comment.char="!", row.names=1)

geneAnnotationFile   <- read.delim("Data/GSE20966/GPL1352-9802.txt", header=T, row.names = 1,sep="\t",comment.char = "#")
geneAnnotation <- geneAnnotationFile[row.names(GSE20966SeqData),]    # arrange according to expression row names

metadata <- read.delim("Data/GSE20966/GSE20966_metadata.txt", sep = "\t", header=F, row.names = 1)
```

The imported metadata file has two rows, hence, I'll transpose and label the columns

```{r}
metadata = data.frame(t(metadata)) #made it a dataframe so it's easy to access the columns with $

colnames(metadata) = c("diseaseStatus", "ID")
rownames(metadata) = metadata[,2]
```

## Group Samples

My dataset has 2 sample groups (non-diabetic and diabetic patients)

```{r Patient status}
groups = rep(NA, length(metadata$diseaseStatus))      # NA - intialize a variable with NAs, the length of which is the length of the column diseaseStatus in metadata

groups[grep('beta-cells_diabetic', metadata$diseaseStatus)] = 'T2diabetes'  #grep gets the index of where beta-cells_non-diabetic is found, and then replace those same index in groups variable with "T2diabetes"

groups[grep('beta-cells_non-diabetic', metadata$diseaseStatus)] = 'control'

metadata$status = groups  #forms a new column in metadata and fills it with items in groups

```

Alternaively, the diseaseStatus column can be updated instead of creating a new column
`metadata$diseaseStatus = groups`

## Normalization

Here, I check the distribution of the data using a boxplot.

One way to transform data is through log2 transformation (REF)

```{r boxplot, warning=FALSE}
#is there a need for log transformation/normalization?

nonlog_expression <- GSE20966SeqData

palette(c("yellow",  "orange"))
par(mar=c(6,3,3,2)) #boxplot margins: bottom, left, to, right

boxplot(nonlog_expression, main = "Gene expression of Control and diabetes Samples", notch=T, outline=F, las = 2, col = factor(metadata$status))


#log_expression = log2(nonlog_expression)
palette(c("pink",  "peachpuff"))
log_expression <- log2(GSE20966SeqData)
boxplot(log_expression, main = "Log 2 Gene expression of Control and diabetes Samples", notch=T, outline=F, las=2, col = factor(metadata$status))

```

## Differential Expression Analysis

### Expression set

The Biobase library has an ExpressionSet class, which can be used to combine the entities of our
microarray data into a data object.

To create an expression set, we need to define the metadata and feature data as an annotated data
frame. Then create an ExpressionSet using the `ExpressionSet(assayData, phenoData, featureData)`
function, and give our expression values as a matrix.

```{r Expression set, message=FALSE}
library(Biobase)

#Expression set
pData <- new("AnnotatedDataFrame",data = metadata)
fData <- new("AnnotatedDataFrame", data = geneAnnotation)

gse20966Set <- ExpressionSet(assayData = as.matrix(log_expression), phenoData = pData, featureData = fData)

#Removing missing values
gse20966Set <- gse20966Set[complete.cases(exprs(gse20966Set)), ] # skip missing values
```

### Linear model

We'll create a design matrix which includes separate coefficients for diabetes samples and normal
samples. This can be done using the `model.matrix()` function

```{r limma design matrix, message=FALSE}
library(limma)

design <- model.matrix(~ 0 + status, metadata)
```

status is a column in the metadata dataframe.

Another way to do this is by creating another variable:

`Group <- factor(metadata$status, levels=c("control","T2diabetes"))`

Then: `design <- model.matrix(~0 + Group)`

Reassigning the column names of design

```{r design colnames, message=FALSE}
colnames(design) = levels(factor(metadata$status))
```

When reassigning column names, careful not to mis-label and mix-up the control and the test
(diabetes).

The following 4 lines of codes computes the differentially expressed genes.

-   First, we call the `lmfit()` function and give it the expression set and the design matrix as
    arguments.
-   With the `makeContrasts()` function, we define the groups we're comparing. Here it's
    "T2diabetes" against the "control"
-   `contrasts.fit()` matrix can be used to expand the linear model fit to make comparison between
    the two groups
-   `eBayes()` computes the empirical Bayes statistics.

```{r limma model}
#fitting the model
fit <- lmFit(gse20966Set, design)
cont.matrix <- makeContrasts(contrasts = "T2diabetes-control", levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
```

A list of top genes differential expressed in T2diabetes versus control can be obtained from
`topTable()`

```{r top genes}
topGenes <- topTable(fit2)

topGenes[1:5, 16:21]
```

`topTable()` returns the top 10 results. One can parse in maximum number of genes to list with the
`number` argument.

With the `adjust.method` argument, one can specify what method to be used to adjust the p-values for
multiple testing. The default is the Benjamini & Hochberg method -- "BH" (a.k.a "fdr").

```{r top table Inf number}
limmaResult <- topTable(fit2, adjust.method ="fdr", number = Inf)
```

we can obtain the outcome of each hypothesis with `decideTest()`, then use a Venn diagram to show
the number of significant genes.

```{r Venn diagram}
vennDiagram(decideTests(fit2, lfc = 0.5, p.value = 0.05))
```

lfc and p.value arguments specify the desired log fold change and adjusted p value.

```{r}

#cutoff is logFC>o.5|<-0.5 & pval <0.05
dysregulated <- limmaResult[abs(limmaResult$logFC)> 0.5 & limmaResult$adj.P.Val < 0.05,]
cat("There are", nrow(dysregulated), "DEGs discovered (before filtering out empty entrez ids)\n")

# removing results with empty Entrez ids
limmaResult <- limmaResult[limmaResult$ENTREZ_GENE_ID !="", ]

#cutoff is logFC>o.5|<-0.5 & pval <0.05
dysregulated <- limmaResult[abs(limmaResult$logFC)> 0.5 & limmaResult$adj.P.Val < 0.05,]
cat("There are", nrow(dysregulated), "DEGs discovered\n")

upregulated <- dysregulated[dysregulated$logFC > 0.5,]
cat("There are", nrow(upregulated), "DEGs discovered\n")

downregulated <- dysregulated[dysregulated$logFC < -0.5,]
cat("There are", nrow(downregulated), "DEGs discovered")
```

## Plot

```{r}
library(ggplot2)

# volcano plot
limmaResult$dysregulated <- "not significant"
limmaResult$dysregulated[limmaResult$logFC > 0.5 & limmaResult$adj.P.Val <0.05] <- 'upregulated'
limmaResult$dysregulated[limmaResult$logFC < -0.5 & limmaResult$adj.P.Val <0.05] <- 'downregulated'

ggplot(limmaResult, aes(x = logFC, y= -log10(P.Value), col = dysregulated)) +
  geom_point() +
  scale_color_manual(values = c('purple', 'snow3', 'salmon3')) +
  ggtitle("Dysregulated genes in diabetes vs control (GSE20966)")


```