README.md: 20 additions & 20 deletions
@@ -82,7 +82,7 @@ docker run -it samirchar/dayhoff:latest

 ## Data and model availability

-All Dayhoff models are available on [AzureAIFoundry](https://ai.azure.com/labs)
+All Dayhoff models are available on [Azure AI Foundry](https://aka.ms/dayhoff/foundry)

 Additionally, all Dayhoff models are also hosted on [Hugging Face](https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969) 🤗. All datasets used in the paper, with the exception of OpenProteinSet are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.

@@ -92,34 +92,34 @@ GigaRef, BackboneRef, and DayhoffRef are available under [CC BY License](https:/
 ### Training datasets
 The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:

-**[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
-*_Splits: train (25 GB), test (26 MB), valid (26 MB)_
+***[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
+*_Splits: train (25 GB), test (26 MB), valid (26 MB)_

-**[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
-*_Splits: train (83 GB), test (90 MB), valid (87 MB)_
+***[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
+*_Splits: train (83 GB), test (90 MB), valid (87 MB)_


-**GigaRef** (**GR**)– 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of gigaref:
-***GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
-*_Splits: train (433 GB), test (22 MB)_
-***GigaRef-singletons** (**GR-s**) - Only includes singletons
-*_Splits: train (282 GB)_
+***GigaRef** (**GR**)– 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of gigaref:
+***GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
+* _Splits: train (433 GB), test (22 MB)_
+***GigaRef-singletons** (**GR-s**) - Only includes singletons
+*_Splits: train (282 GB)_

-**BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from c.a. 240,000 de novo backbones, with three subsets containing 10M sequences each:
-***BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
-*_Splits: train (3 GB)_
-***BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
-*_Splits: train(3 GB)_
-***BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
-*_Splits: train (3GB)_
+***BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from c.a. 240,000 de novo backbones, with three subsets containing 10M sequences each:
+***BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
+* _Splits: train (3 GB)_
+***BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
+*_Splits: train(3 GB)_
+***BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
+*_Splits: train (3GB)_

-**[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.
+***[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.

 ### DayhoffRef
 Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences

-**DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
-*_Splits: train (5 GB)_
+***DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
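
The README text in this diff states that the datasets are published in three formats: FASTA, Arrow, and JSONL. As a rough illustration of how the FASTA and JSONL forms relate, here is a minimal Python sketch that converts FASTA records to one JSON object per line. The function names and the JSONL field names (`id`, `sequence`) are assumptions made for this example; they are not taken from the Dayhoff repository and may not match the schema of the published files.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_jsonl(fasta: TextIO, out: TextIO) -> int:
    """Write one JSON object per FASTA record; return the record count.

    The field names "id" and "sequence" are hypothetical.
    """
    count = 0
    for header, sequence in read_fasta(fasta):
        out.write(json.dumps({"id": header, "sequence": sequence}) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Toy input, not real Dayhoff data.
    demo = io.StringIO(">seq1 toy record\nMKT\nAYI\n>seq2\nGSH\n")
    result = io.StringIO()
    fasta_to_jsonl(demo, result)
    print(result.getvalue(), end="")
```

Because both functions stream line by line, the same approach should scale to large training splits without holding a whole file in memory, although a dedicated library may be preferable for real workloads.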