Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-based synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.
The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
If you use the code, data, models, or results, please cite our [preprint](https://aka.ms/dayhoff/preprint).
## Table of Contents
* [Dayhoff](#Dayhoff)
* [Usage](#Usage)
* [Installation](#Installation)
* [Data and Model availability](#Data-and-model-availability)
* [Datasets](#Datasets)
* [Contributing](#Contributing)
* [Trademarks](#Trademarks)
## Usage
The simplest way to use these models and datasets is via the Hugging Face interface; a minimal loading sketch is shown below. Alternatively, you can install this package or use our Docker image. Either way, you will need PyTorch, mamba-ssm, causal-conv1d, and flash-attn.
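The sketch below assumes the checkpoints load through `transformers` with `trust_remote_code=True`; the repository id is an assumption built from the model names in this README, so check the Hugging Face collection linked under [Data and model availability](#Data-and-model-availability) for the actual names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id, built from the model names listed in this README.
model_id = "microsoft/Dayhoff-3b-GR-HM-c"

# trust_remote_code is assumed to be needed for the custom Mamba/Transformer hybrid.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```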
It is sometimes challenging to install these packages correctly using pip alone. The following two errors are common with a plain `pip install`:
* Packages install correctly, but loading a model raises `ValueError: Fast Mamba kernels are not available. Make sure they are installed and that the mamba module is on a CUDA device.`
* Installation of causal-conv1d or mamba-ssm fails during the build.
If you encounter either of these errors, try installing with `uv`:

```bash
uv pip install dayhoff
```
Or, to be able to run the example scripts, clone the repo and install it from source; one possible flow is sketched below.
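The repo URL comes from the links in this README; the editable install (`-e`) is an assumption, not a documented requirement.

```bash
git clone https://github.com/microsoft/dayhoff.git
cd dayhoff
uv pip install -e .  # editable install is an assumption
```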
For a fully functional containerized environment without needing to install dependencies manually, you can use the provided Docker image instead:

```bash
docker run -it samirchar/dayhoff:latest
```
## Data and model availability
All Dayhoff models are available on [Azure AI Foundry](https://ai.azure.com/labs).
Additionally, all Dayhoff models are hosted on [Hugging Face](https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969) 🤗. All datasets used in the paper, with the exception of OpenProteinSet, are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.
GigaRef, BackboneRef, and DayhoffRef are available under a [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
## Datasets
### Training datasets
The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes that include:
* **[UniRef50](https://www.uniprot.org/)** (**UR50**) – dataset from UniProt, clustered at 50% sequence identity; contains only cluster representatives.
  * _Splits: train (25 GB), test (26 MB), valid (26 MB)_
* **[UniRef90](https://www.uniprot.org/)** (**UR90**) – dataset from UniProt, clustered at 90% sequence identity; contains cluster representatives and members.
  * _Splits: train (83 GB), test (90 MB), valid (87 MB)_
* **GigaRef** (**GR**) – 3.34B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
  * **GigaRef-clusters** (**GR**) – only includes cluster representatives and members, no singletons.
    * _Splits: train (433 GB), test (22 MB)_
  * **GigaRef-singletons** (**GR-s**) – only includes singletons.
    * _Splits: train (282 GB)_
* **BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
  * **BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
    * _Splits: train (3 GB)_
  * **BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
    * _Splits: train (3 GB)_
  * **BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
    * _Splits: train (3 GB)_
* **[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.
### DayhoffRef
Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences.
* **DayhoffRef** – a dataset of 16 million synthetic protein sequences generated by the Dayhoff models Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
  * _Splits: train (5 GB)_
### Loading datasets in HuggingFace
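A minimal sketch with the `datasets` library. The repository id below is an assumption based on the dataset names above; see the Hugging Face collection linked earlier for the real ids. Streaming avoids downloading a full split up front.

```python
from datasets import load_dataset

# Assumed repository id; check the Hugging Face collection for the actual name.
dayhoffref = load_dataset("microsoft/DayhoffRef", split="train", streaming=True)

# Inspect the first record without downloading the whole split.
print(next(iter(dayhoffref)))
```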
Weights are available for all Dayhoff models described in the [paper](https://aka.ms/dayhoff/preprint).
## Unconditional generation
For most cases, use [examples/generate.py](https://github.com/microsoft/dayhoff/blob/main/src/generate.py) to generate new protein sequences. Below is a sample command to generate 10 sequences with at most 100 residues and place them in a FASTA file in the directory `generations/`.
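The flag names in this sketch are illustrative assumptions rather than the script's documented interface; check `examples/generate.py` for the actual arguments.

```bash
# Flag names are assumptions; see examples/generate.py for the real interface.
python examples/generate.py \
    --model Dayhoff-3b-GR-HM-c \
    --num-seqs 10 \
    --max-seq-len 100 \
    --out-dir generations/
```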
The [generate_from_homologs](https://github.com/microsoft/dayhoff/blob/main/src/generate_from_homologs.py) script performs sequence generation conditioned on evolutionarily related homologous sequences modeled as multiple sequence alignments (MSAs).
`examples/generate.py` also accepts a FASTA file, in which case it performs sequence generation conditioned on the sequences in that file. The order of the conditioning sequences is randomly shuffled for each generation.
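A sketch of FASTA-conditioned generation; the flag name for the conditioning file is an assumption:

```bash
# --fasta is an assumed flag name for the conditioning file.
python examples/generate.py --fasta my_family.fasta --num-seqs 10 --out-dir generations/
```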
The following command specifies the folder where MSAs in FASTA format are stored and selects two specific MSAs for conditional generation. The list of MSAs within the MSA directory can also be specified via an `--include-pattern` argument.
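In this sketch, only `--include-pattern` is named in the text above; the other flag names and file names are assumptions:

```bash
# --msa-dir and --msas are assumed flag names; --include-pattern is documented above.
python examples/generate_from_homologs.py \
    --msa-dir path/to/msas/ \
    --msas family_A.fasta family_B.fasta

# Alternatively, select MSAs by pattern:
python examples/generate_from_homologs.py \
    --msa-dir path/to/msas/ \
    --include-pattern "family_*.fasta"
```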
`examples/score.py` computes forward and backward average log-likelihoods for every sequence in a FASTA file.
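For example (the flag name is an assumption; see the script for its actual arguments):

```bash
# Scores each sequence in both N→C and C→N directions; --fasta is an assumed flag name.
python examples/score.py --fasta generations/generated.fasta
```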
## Contributing

Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).