---
title: "State-of-the-art protein modeling (in 2022)"
author: "Modesto Redrejo Rodríguez"
date: "`r Sys.Date()`"
toc: true
toc_float: true
format:
  html:
    theme: simplex
    toc: true
    toc-location: right
    toc-depth: 4
    number-sections: false
    code-overflow: wrap
    link-external-icon: true
    link-external-newwindow: true
bibliography: references.bib
editor_options:
  markdown:
    wrap: 80
---

::: callout-warning
## Under construction
:::

![](pics/8275a.jpg "Under construction...")

# The recent history of protein structure modeling told by a contest (CASP)

Every two years since 1994, structural bioinformatics groups carry out a
worldwide experiment, predicting a set of unknown protein structures in a
controlled, blind-test-like competition and comparing their output with the
experimentally obtained structures. This is **CASP**, the [*Critical Assessment
of protein Structure Prediction*](https://predictioncenter.org/).

![Comparative Z-score of CASP13 participants. The score is based on the GDT_TS
(Global Distance Test, total score).](pics/casp13.png "CASP13 results")
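
To make the GDT_TS score more concrete, here is a minimal Python sketch. It
assumes the model has already been superposed on the experimental structure and
that we hold one Cα-Cα distance per residue; the real GDT_TS also searches for
the best superposition at each cutoff, which this sketch skips. The function
name and the example distances are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff (in Å)."""
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues that fall within each cutoff
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical per-residue Cα deviations (Å) between model and experiment
example_distances = [0.5, 0.9, 1.5, 3.0, 7.0, 12.0]
score = gdt_ts(example_distances)
print(f"GDT_TS = {score:.1f}")
```

A perfect model scores 100, and a model with every residue more than 8 Å away
scores 0.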

The best research groups in the field test their new methods and protocols in
CASP. However, in CASP13 (2018) an AI company called
[DeepMind](https://en.wikipedia.org/wiki/DeepMind) (a Google subsidiary) entered
the scene. Their method, named Alphafold [@senior2020], clearly won CASP13.
Alphafold implemented improvements in a few recently used approaches, creating a
whole new pipeline. Basically, instead of creating contact maps from the
alignment and then folding the structure, they used an MRF unit (Markov Random
Field) to extract in advance the main features of the sequence and the MSA, and
fed all of that information into a multilayer neural network (a ResNet) that
provides the distance map and other information. Then, Alphafold uses all the
information obtained to build the structure and improves it by energy
minimization and substitution of portions with fragments from a selected
database of protein fragments.

![Workflow of the first Alphafold method presented in CASP13. MSA stands for
multiple sequence alignment, PSSM for position-specific scoring matrix, and MRF
for Markov Random Field (or Potts model). From the Sergey Ovchinnikov and Martin
Steinegger presentation of Alphafold2 to the Boston Protein Design Group (slides
and video in the [link
below](#links)).](pics/alphafold1.png "Alphafold1"){#fig-alphafold1 .figure}

After Alphafold, similar methods were also developed and made available to the
general public, like *trRosetta* from the Baker lab [@yang2020], available on
the [Robetta](https://robetta.bakerlab.org/) server. This led to some
controversy (mostly on Twitter) about open access to the CASP software, and
later on DeepMind published all the code on GitHub.

# CASP14 or when protein structure prediction came of age for (non structural) biologists

In CASP14 the expectation was very high and the guys from DeepMind did not
disappoint anyone. Alphafold2 clearly outperformed all competitors, both in
relative terms (score with respect to the other groups) and in absolute terms
(lowest alpha-carbon RMSD). As has been highlighted, the accuracy of many of the
predicted structures was within the error margin of experimental determination
methods [see for instance @mirdita2022].

[![Comparative CASP14 scores](pics/casp14.png "CASP14 scores"){#fig-casp14
.figure}](https://predictioncenter.org/casp14/zscores_final.cgi?formula=gdt_ts)

[![Performance of Alphafold2 on the CASP14 dataset relative to the top-15
entries. Data are the median and the 95% confidence interval of the median, for
alpha-carbon RMSD. Panels b-c-d show example comparisons between model and
experimental structures.](pics/jumper2021.png "Alphafold2"){#fig-alphafold2bench
.figure}](https://www.nature.com/articles/s41586-021-03819-2/figures/1)

DeepMind took some time (eight months, an eternity nowadays) to publish the
method [@jumper2021] and make it available on
[GitHub](https://github.com/deepmind/alphafold), but other new methods, like
RoseTTAfold [@baek2021] and C-I-Tasser [@zheng2021], could come close to their
results and were available on public servers, which may have pushed DeepMind to
make everything available to the scientific community (see Grace Huckins'
[article](https://www.wired.com/story/without-code-for-deepminds-protein-ai-this-lab-wrote-its-own/)
in Wired). Not surprisingly (at least for me), a group of independent scientists
([Sergey Ovchinnikov](https://twitter.com/sokrypton), [Milot
Mirdita](https://twitter.com/milot_mirdita), and [Martin
Steinegger](https://twitter.com/thesteinegger)) decided to implement Alphafold2
in a [Colab notebook](https://github.com/sokrypton/ColabFold), named ColabFold
[@mirdita2022], freely available online. Other free implementations of Alphafold
have been and are available, but ColabFold has been the most widely discussed
and known. They implemented some tricks to accelerate the modeling, mainly the
use of
[MMSeqs2](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI/present#slide=id.ge58c80b745_0_15)
(developed by Martin Steinegger's group) to search for homologous sequences in
Uniref30, which made Colabfold a quick method that made all the previous
advanced methods almost useless. This was the real breakthrough in the protein
structure field: it made Alphafold2 available to everyone and, also very
important, it facilitated the evolution of the method with new features, like
the prediction of protein complexes [@evans2022], which was actually mentioned
first on
[Twitter](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI/present#slide=id.ge58c80b745_0_15).

# Alphafold2 as the paradigm of a New Era {#sec-AF2}

## Why is Alphafold2 so freaking accurate?

The philosophy behind Alphafold and related methods is to treat the protein
folding problem as a machine learning problem, similar to image processing. In
all these problems, the input to the Deep Learning model is a volume (3D
tensor). In the case of computer vision, 2D images expand into a volume because
of the RGB or HSV channels. Similarly, in the case of distance prediction,
predicted 1D and 2D features are transformed and packed into a 3D volume with
many channels of inter-residue information [@pakhrin2021].
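
A minimal sketch of that packing step, in plain Python, may help. It assumes a
common layout in which the channels for a residue pair (i, j) are the 1D
features of i, the 1D features of j, and the 2D features of the pair. The
function name, the feature values and the channel order are illustrative
assumptions, not the layout of any particular method.

```python
def pack_features(per_residue, pairwise):
    """Build an L x L x C volume from 1D (L x F) and 2D (L x L x P) features.

    Channels for pair (i, j): features of i, then features of j, then the
    pairwise features of (i, j), so C = 2 * F + P.
    """
    length = len(per_residue)
    if len(pairwise) != length or any(len(row) != length for row in pairwise):
        raise ValueError("pairwise features must be L x L")
    return [
        [per_residue[i] + per_residue[j] + pairwise[i][j] for j in range(length)]
        for i in range(length)
    ]


# Toy input: 3 residues, 2 per-residue features, 1 pairwise feature
per_residue = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0]]
pairwise = [
    [[0.0], [0.8], [0.1]],
    [[0.8], [0.0], [0.5]],
    [[0.1], [0.5], [0.0]],
]
volume = pack_features(per_residue, pairwise)
shape = (len(volume), len(volume[0]), len(volume[0][0]))
print(shape)
```

The result plays the same role as an image with many channels: two spatial axes
(residue i, residue j) and a stack of channels at every position.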

![From the perspective of Deep Learning method development, the problem of
protein distogram or real-valued distance prediction (bottom row) is similar to
the 'depth prediction problem' in computer vision. From
@pakhrin2021.](pics/machine_fold.png){#fig-deep .figure}

Alphafold2 can be explained as a pipeline with three interconnected tasks (see
picture below). First, it queries several databases of protein sequences and
constructs an MSA that is used to select templates. In the second part of the
diagram, Alphafold2 takes the multiple sequence alignment and the templates,
and processes them in a *transformer*. This process has been referred to by some
authors as inter-residue interaction map-threading [@bhattacharya2021]. The
objective of this part is to extract layers of information to generate residue
interaction maps. A better model of the MSA will improve the network's
characterization of the geometry, which simultaneously will help refine the
model of the MSA. Importantly, in the AF2 Evoformer, this process is iterative
and the information goes back and forth throughout the network. At every
recycling step, the complexity of the map increases and thus, the model improves
(the original model uses 3 cycles). As explained in the great
[post](https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/)
from Carlos Outeiral at the OPIG site:

> *This is easier to understand as an example. Suppose that you look at the
> multiple sequence alignment and notice a correlation between a pair of amino
> acids. Let's call them A and B. You hypothesize that A and B are close, and
> translate this assumption into your model of the structure. Subsequently, you
> examine said model and observe that, since A and B are close, there is a good
> chance that C and D should be close. This leads to another hypothesis, based
> on the structure, which can be confirmed by searching for correlations between
> C and D in the MSA. By repeating this several times, you can build a pretty
> good understanding of the structure.*
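
The "correlation between a pair of amino acids" in the quote can be illustrated
with a classic coevolution statistic, the mutual information between two MSA
columns. This is only a sketch of the idea: Alphafold2 does not compute mutual
information explicitly but learns from the raw alignment, and the toy alignment
and function name below are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Compare the observed pair frequency with the one expected
        # if the two columns varied independently.
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 2 change together, column 1 is constant
msa = ["AKE", "AKE", "DKR", "DKR"]
coupled = column_mutual_information(msa, 0, 2)
uncoupled = column_mutual_information(msa, 0, 1)
print(coupled, uncoupled)
```

Columns that mutate together score high and are candidates for residues in
contact, while a constant column carries no pairing signal.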

The third part of the pipeline is the structure building module, which uses the
information from the previous steps to construct a 3D structure model of the
query protein. This network will give you a single model, without any energy
optimization step. Model building is based on a new concept for generating 3D
structures, named IPA (Invariant Point Attention), and on a curated,
parametrised list of torsion angles to generate the side chains.

![Modified from the Oxford Protein Informatics Group
blog.](pics/alphafols2.png "Alphafold2"){#fig-af2 .figure}

As with most of the previous methods, Alphafold will give you better results
for proteins with known related structures and with a lot of homologs in Uniref
databases. However, compared to having nothing, it will likely give you
(limited) useful results for the so-called "dark genome". I work with phages and
bacterial mobile elements, and sequencing those is often frustrating, as more
than 50% of the proteins have no homologs in the database. So you have a bunch
of proteins of unknown function... However, as we do know that structure is more
conserved than sequence, we may use the structure to find out the function of
our dark proteins. There are a few resources for this; I'd suggest you try the
[FoldSeek](https://search.foldseek.com/search) [@kempen] and
[Dali](http://ekhidna2.biocenter.helsinki.fi/dali/) [@holm2022] servers. You can
upload the PDB file of your model and search for related structures in the PDB
and also in the Alphafold database.

As mentioned above, Colabfold aims to make the process faster by using MMSeqs2
in the first block. Additionally, the number of recycling steps can also be
adapted. Moreover, different Colabfold notebooks have been developed (and have
evolved) to allow some customization and other features, like batch processing
of multiple proteins while avoiding recompilation, and identification of
protein-protein interactions [@mirdita2022].

Alphafold models can be evaluated by the mean **pLDDT**, a per-residue
confidence metric on a 0-100 scale. It is stored in the B-factor fields of the
mmCIF and PDB files available for download (although unlike a B-factor, higher
pLDDT is better). The model confidence can vary greatly along a chain, so it is
important to consult the confidence when interpreting structural features. Very
often, the lower confidence fragments are not the product of a poor prediction
but an indicator of protein disorder [@wilson2022].
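
As a minimal sketch of how to read those values, the Python below pulls the
pLDDT of every residue from the B-factor column of a PDB-format model and
reports the mean. It assumes the standard fixed-column PDB layout and uses the
confidence bands shown on the Alphafold database pages (above 90 very high,
70-90 confident, 50-70 low, below 50 very low). The helper `toy_atom_line` and
its values are hypothetical and only exist to build a small test input; with a
real model you would read the file contents instead.

```python
def plddt_per_residue(pdb_text):
    """Return the pLDDT of each residue, read from the B-factor of its CA atom."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: atom name in columns 13-16,
        # B-factor (here the pLDDT) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to the bands used on the Alphafold database pages."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def toy_atom_line(serial, atom, res_name, res_seq, plddt):
    """Hypothetical helper: one fixed-column ATOM record with dummy coordinates."""
    return (
        f"ATOM  {serial:>5}  {atom:<3} {res_name:>3} A{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


toy_pdb = "\n".join([
    toy_atom_line(1, "N", "MET", 1, 95.2),
    toy_atom_line(2, "CA", "MET", 1, 95.2),
    toy_atom_line(3, "CA", "LYS", 2, 88.1),
    toy_atom_line(4, "CA", "GLY", 3, 42.0),
])
scores = plddt_per_residue(toy_pdb)
mean_plddt = sum(scores) / len(scores)
bands = [confidence_band(s) for s in scores]
print(f"mean pLDDT = {mean_plddt:.1f}", bands)
```

Looking at the per-residue bands, and not only at the mean, is what lets you
spot a confident core flanked by low-confidence (possibly disordered) tails.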

DeepMind also partnered with EMBL-EBI and Uniprot and generated a huge curated
database of proteins from model organisms [@varadi2022], the [Alphafold
database](https://alphafold.ebi.ac.uk/). This is an amazing resource that may
also be very helpful for you. Just consider that this database increased the
fraction of the human proteome with structural data from 48% to 76%, and it also
brought great increases for other model organisms, including microorganisms and
plants [@portapardo2022].

![Changes in protein structural coverage in model
organisms.](pics/journal.png "Changes in protein structural coverage in model organisms."){#fig-afDDBB
.figure width="433"}

## Let's try Alphafold2.

*Section under construction!*

As mentioned above, the grand breakthrough of Alphafold would not have been the
same without Colabfold, a free and open tool that made state-of-the-art
AI-fueled protein prediction available to everyone.

[![ColabFold GitHub
repository](pics/qrcode.png "ColabFold"){width="441"}](https://github.com/sokrypton/ColabFold)

The Colabfold repository on GitHub contains links to several Python "notebooks"
developed on [Google Colab](https://colab.research.google.com/), a platform to
develop and share Python scripts in a Jupyter Notebook format. Notebooks are
also very important for reproducibility in computer science, as they let you
keep the background, the details and the actual code in a single document, and
also execute it. You can share those notebooks very easily and also update them
quickly, as they are stored in your Google Drive.

Colabfold allows you to run Alphafold and RoseTTAfold notebooks for specific
applications, and even to run a bunch of proteins in batch. You can see a
more detailed description in @mirdita2022. We are using the
[Alphafold2_mmseqs2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb)
notebook, which offers most of the common features. You need to allow Colab to
use your Google account.

![Introducing your sequence in Colabfold](pics/colab1.jpg "Input sequence")

Then paste your sequence and choose a name. For more accurate models you can
tick the "use_amber" option. It will run a short relaxation with the Amber force
field (the kind used in [*Molecular
Dynamics*](https://en.wikipedia.org/wiki/Molecular_dynamics)) that ultimately
optimizes the model, but it will also take some more time, so better try it at
home.
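
Sequences copied from a database page often carry line breaks, position numbers
or lowercase letters. The sketch below is a hypothetical pre-check you could
run on a single-chain sequence before pasting it; it is not part of Colabfold,
and it assumes you only want the 20 standard amino acids.

```python
import re

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def clean_query(raw_sequence):
    """Strip whitespace and digits, uppercase, and reject unexpected letters."""
    sequence = re.sub(r"[\s\d]", "", raw_sequence).upper()
    unknown = sorted(set(sequence) - VALID_RESIDUES)
    if unknown:
        raise ValueError(f"unexpected characters: {''.join(unknown)}")
    return sequence


# Hypothetical sequence copied with numbering and line breaks
cleaned = clean_query(" 1 mkt aya\n10 kql ")
print(cleaned, len(cleaned))
```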

As you can see, and this is a recent feature, you can also add your own
template. That will save time, but of course without any guarantee. If you have
a template of a related protein, like an alternative splicing isoform or a
disease mutant, I'd advise you to try with and without the template. You may be
surprised.

![Executing Colabfold](pics/colab_execute.jpg)

At this point, you may execute the whole pipeline or do some more
customization. The MSA stage can also be optimized to reduce execution time, by
reducing the database or even by providing your own MSA. Very often you may want
to fold a protein with different parameters, particularly in the [Advanced
Colabfold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb#scrollTo=rowN0bVYLe9n),
so it may be very convenient to reuse an MSA from a previous run (although they
recently updated the MMSeqs2 servers and made the search much faster). If your
proteins are in the same operon, or for any other reason you think that they
should have co-evolved, you may prefer a "paired" alignment. But you can always
do both.

Advanced settings are especially needed for protein-protein complexes. Also,
increasing the number of recycling steps may improve your model, particularly
for targets with little or no MSA information from the databases used. Then you
can just get your model (and companion info and plots) in your Google Drive or
download it.

[**What do you think is the ideal protein for Alphafold2? Do you think homology
modeling is dead?**]{style="color:green"}

## Corollary: Has Levinthal's paradox "folded"?

The development of Alphafold and the [Alphafold structures
Database](https://alphafold.ebi.ac.uk/) in collaboration with
[EMBL-EBI](https://www.ebi.ac.uk/about) has been the origin of a New Era.
Scientific journals and news media worldwide published long articles about the
meaning of this breakthrough in science and its applications in biotechnology
and biomedicine[^1] and DeepMind claimed to have [solved a 50-year-old grand
challenge in
biology](https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology).
The coverage of the protein structure space has been greatly increased
[@portapardo2022].

[^1]: https://www.bbc.com/news/science-environment-57929095

    https://www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/

    https://elpais.com/ciencia/2021-07-22/la-forma-de-los-ladrillos-basicos-de-la-vida-abre-una-nueva-era-en-la-ciencia.html

However, some scientists have claimed that Alphafold2 and RoseTTAfold actually
"cheat" a bit, as they do not really solve the problem but build a deep
learning pipeline that "bypasses" it [@pederson2021]. In agreement with
that, it has been shown that machine learning methods do not actually reproduce
the expected folding pathways while improving the structures during the
recycling steps [@outeiral].
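
For context, the paradox itself is usually stated as a back-of-the-envelope
calculation: a chain cannot find its fold by trying every conformation. The
short Python sketch below redoes that arithmetic. All three numbers (three
conformations per residue, a 100-residue chain, 10^13 conformations sampled per
second) are illustrative assumptions commonly used to state the paradox, not
measurements.

```python
# Illustrative assumptions, not measured values
CONFORMATIONS_PER_RESIDUE = 3
CHAIN_LENGTH = 100
SAMPLES_PER_SECOND = 1e13
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Size of the conformational space for the whole chain
total_conformations = CONFORMATIONS_PER_RESIDUE ** CHAIN_LENGTH
# Time needed to visit every conformation once
years_to_enumerate = total_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"{total_conformations:.2e} conformations")
print(f"{years_to_enumerate:.2e} years of exhaustive search")
```

Under these assumptions the search space is about 5 x 10^47 conformations and
an exhaustive search would take on the order of 10^27 years, while real proteins
fold in milliseconds to seconds. Predicting the final structure, as Alphafold2
does, does not by itself explain how the chain avoids that search.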

In conclusion, I do believe that Levinthal's paradox has not (yet) been fully
solved, although we are clearly close [@aljanabi2022], and solving it will
probably reduce the limitations of Alphafold2. However,
[CASP15](https://predictioncenter.org/casp15/index.cgi) is currently being held
and maybe I will have to change my mind later this year.

# Useful links {#links}

-   Introductory article to Neural Networks at the IBM site:
    <https://www.ibm.com/cloud/learn/neural-networks>

-   ColabFold Tutorial presented by Sergey Ovchinnikov and Martin
    Steinegger at the Boston Protein Design and Modeling Club (6 Aug 2021).
    [\[video\]](https://www.youtube.com/watch?v=Rfw7thgGTwI)
    [\[slides\]](https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI).

-   Post about Alphafold2 in the Oxford Protein Informatics Group site:
    <https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/>

-   A very good digest article about the Alphafold2 paper:
    <https://moalquraishi.wordpress.com/2021/07/25/the-alphafold2-method-paper-a-fount-of-good-ideas/>

-   Post on what the Alphafold2 revolution means for biomedicine at the UK
    Institute of Cancer Research website:
    <https://www.icr.ac.uk/blogs/the-drug-discoverer/page-details/reflecting-on-deepmind-s-alphafold-artificial-intelligence-success-what-s-the-real-significance-for-protein-folding-research-and-drug-discovery>

-   A post that explains how the Alphafold2 and RoseTTAfold code became publicly
    available:
    <https://www.wired.com/story/without-code-for-deepminds-protein-ai-this-lab-wrote-its-own/>

# References