bulk_RNA_seq_workshop/tertiary_analysis_Part3.Rmd at dev · U-BDS/bulk_RNA_seq_workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "RNA-Seq Tertiary Analysis: Part 3"
author: "Bharat Mishra, Ph.D., Austyn Trull, Lara Ianov, Ph.D."
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Packages loaded globally

```{r message=FALSE}
# Set the seed so our results are reproducible:
set.seed(2020)

# Required packages

# pathway analysis
library(gprofiler2)

# We will need them for data handling
library(magrittr)
library(ggrepel)
library(dplyr)
library(tidyverse)
library(readr)

# plotting
library(ggplot2)
library(ComplexHeatmap)
library(RColorBrewer)
```


# Gene Ontology (GO)

Gene Ontology (GO) enrichment analysis is a computational method used to identify which biological, cellular, or molecular processes are over-represented (or under-represented) in a given list of genes. It leverages the structured vocabulary of the Gene Ontology to interpret the functions of genes within a larger biological context.

How it works:

* Input: You start with a list of genes of interest, often derived from experimental data like differentially expressed genes (DEGs) from a RNA-Seq experiment.

* Annotation: Each gene in your list is associated with GO terms that describe its known functions.

* Statistical Test: A statistical test (e.g., Fisher's exact test, hypergeometric test) is applied to determine if any GO terms are enriched in your gene list compared to what would be expected by chance.

* Output: The result is a list of GO terms along with their associated p-values (or adjusted p-values). Significantly enriched terms suggest that the biological processes they represent are particularly relevant to your gene list.

## Categories of GO Analysis

The Gene Ontology is organized into three main categories:

* Biological Process (BP): Represents a series of events accomplished by one or more ordered assemblies of molecular functions (e.g., DNA replication, cell cycle, signal transduction).

* Molecular Function (MF): Describes the elemental activities of a gene product at the molecular level, such as binding or catalysis (e.g., kinase activity, DNA binding, transporter activity).

* Cellular Component (CC): Refers to the location relative to cellular structures in which a gene product performs a function (e.g., nucleus, cytoplasm, plasma membrane).
Types of GO Enrichment Analysis:

For more information, we encourage further reading at:
<https://geneontology.org/docs/ontology-documentation/>

**There are two main approaches to GO enrichment analysis:**

* Over-Representation Analysis (ORA): This is the simpler method and focuses on identifying GO terms that are significantly over-represented in your gene list. It is based on the assumption that all genes in the list are independent.

* Functional Class Scoring (FCS): This approach considers the relative ranking of genes in your list (e.g., based on expression levels) and tests if gene sets associated with specific GO terms are enriched towards the top or bottom of the list. This can reveal subtle patterns that ORA might miss.

## Interpreting Results:

The enriched GO terms provide valuable insights into the biological functions that are most relevant to your gene list. This can help generate hypotheses about the underlying mechanisms involved and guide further research.


### Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks.

### Other gene sets
GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.

Other gene sets include but are not limited to Disease Ontology `DO`, Disease Gene Network `DisGeNET`, `wikiPathways`, Molecular Signatures Database `MSigDb`.

# Input DEG list and DEG filtering

```{r}
Transplant_vs_Naive_annotated <- read.csv(file = "./results/Transplant_vs_Naive/Transplant_vs_Naive_annotated_DEGlist.csv",
                                          row.names = 1)

normalized_counts_annotated <- read.csv(file="./results/Transplant_vs_Naive/normalized_counts.csv")
```


For our enrichment analysis, we will use DEGs filtered by the following cutoff:
`adj.P.Val <= 0.05, logFC <= -1 | logFC >= 1`.

```{r}
Transplant_vs_Naive_annotated_DEGs <- Transplant_vs_Naive_annotated %>%
  dplyr::filter(str_trim(external_gene_name) != "") %>%  # Remove empty strings
  dplyr::filter(padj <= 0.05, log2FoldChange <= -1 | log2FoldChange >= 1) %>% # THIS IS FILTERED ANALYSIS
  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
  # Filter out the duplicated rows using `dplyr::distinct()`
  dplyr::distinct(external_gene_name, .keep_all = TRUE)
```

Quick look at the number of DEGs.

```{r}
nrow(Transplant_vs_Naive_annotated_DEGs)
```

Check for duplicated gene_ids

```{r}
any(duplicated(Transplant_vs_Naive_annotated_DEGs$external_gene_name))
```

# GO by `gprofiler2`

`gprofiler2` performs functional enrichment analysis, also known as over-representation analysis (ORA) on input gene list. It maps genes to known functional information sources and detects statistically significantly enriched terms.

In other words, `gprofiler2` takes a list of genes and analyzes them to see if they are enriched for certain functions. This can be helpful for understanding the biological processes that the genes are involved in.

Many GO enrichment analysis tools allow you to archive your results for future reference or sharing. `gprofiler2` maintains archived versions based on genome versions used for secondary analysis, specifically Ensembl. Thus, here we also set the archive
to Ensembl 109 version.

Archiving results is often optional, but it can be a valuable practice for reproducibility and collaboration.

```{r}
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e109_eg56_p17")
```


Further, note that although `gprofiler2` accepts Ensembl IDs,
most biologists/scientist prefer gene names as they are more interpretable.
One caveat, is that the use of gene names can can lead to redundancies for cases
where a name corresponds to more than one id (e.g: haplotypes). However, for the
most part it is acceptable to use gene names. The example below is executed
with gene names.

Lastly, note we set the `custom_bg` (custom background) to the gene names present in our
normalized counts which is derived from the raw data that was pre-filtered. This is important due to the following reasons:

* Statistical Significance: The background set defines the domain of genes from which your input list was derived. By comparing the frequency of GO terms in your gene list to their frequency in the background set, `gprofiler2` can calculate p-values that indicate the likelihood of observing such enrichment by chance. Without a proper background set, these p-values would be unreliable.

* Context-Specific Analysis: The choice of background set can dramatically influence the results of your GO analysis. Using the entire genome as the background might not be appropriate if your gene list comes from a specific tissue, cell type, or experimental condition. In such cases, it's often more meaningful to use a custom background set that reflects the relevant biological context (in our case this corresponds to the total gene number after pre-filtering).

* Controlling False Positives: A well-defined background set helps control the false positive rate in your analysis. If your background set is too small or biased, you might detect spurious enrichment of GO terms that are not truly relevant to your gene list.

`gprofiler2` offers several options for choosing a background set: annotated genes, custom background, and predefined backgrounds.

```{r}
# select species
species <- "mmusculus"

#select data sources, these are our min. standards:

data_sources <- c("GO", "KEGG", "REAC", "MIRNA", "HP", "HPA", "WP")

gost_results <- gost(query = Transplant_vs_Naive_annotated_DEGs$external_gene_name,
                     organism = species,
                     ordered_query = FALSE,
                     multi_query = FALSE,
                     significant = TRUE,
                     exclude_iea = FALSE,
                     measure_underrepresentation = FALSE,
                     user_threshold = 0.05,
                     correction_method = "g_SCS",
                     domain_scope = "custom",
                     custom_bg = normalized_counts_annotated$external_gene_name, # genes names from normalized counts
                     numeric_ns = "",
                     sources = data_sources,
                     as_short_link = FALSE,
                     evcodes = TRUE)
```


**Challenge: Enable the url option for gost**

<details><summary>Click here for solution</summary>

explore the `gprofiler` results in detail through the url.

```{r}
gost_results_url <- gost(query = Transplant_vs_Naive_annotated_DEGs$external_gene_name,
                         organism = species,
                         ordered_query = FALSE,
                         multi_query = FALSE,
                         significant = TRUE,
                         exclude_iea = FALSE,
                         measure_underrepresentation = FALSE,
                         user_threshold = 0.05,
                         correction_method = "g_SCS",
                         domain_scope = "custom",
                         custom_bg = normalized_counts_annotated$external_gene_name, # genes names from normalized counts
                         numeric_ns = "",
                         sources = data_sources,
                         as_short_link = TRUE,
                         evcodes = TRUE)

gost_results_url
```

</details>


# Explore `gprofiler2` results

```{r}
gostplot(gost_results, capped = TRUE, interactive = TRUE)
```


```{r warning=FALSE}
publish_gosttable(gost_results, highlight_terms = gost_results$result[c(1:10),],
                  use_colors = TRUE,
                  show_columns = c("source", "term_name", "term_size", "intersection_size"),
                  filename = NULL)
```

Write `gprofiler2` results to csv

```{r}
go_results <- gost_results$result

go_results$parents <- as.character(gost_results$result$parents)
```


```{r}
write.csv(go_results,
          file = "./results/Transplant_vs_Naive/Transplant_vs_Naive_Gprofiler_padj_fc.csv",
          row.names = FALSE)
```

## session info

```{r}
sessionInfo()
```