
Commit e6157a9

Merge pull request #29 from microsoft/kky/dev
Examples and instructions that actually work.
2 parents df949c0 + 36c1e88 commit e6157a9

6 files changed

Lines changed: 342 additions & 97 deletions

File tree

README.md

Lines changed: 91 additions & 54 deletions
@@ -2,7 +2,7 @@

Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-based synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.

-The Dayhoff model architecture combines state-space Mamba layers with Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
+The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.

If you use the code, data, models, or results, please cite our [preprint](https://aka.ms/dayhoff/preprint).

@@ -12,6 +12,7 @@ If you use the code, data, models, or results. please cite our [preprint](https:

## Table of Contents
* [Dayhoff](#Dayhoff)
+* [Usage](#Usage)
* [Installation](#Installation)
* [Data and Model availability](#Data-and-model-availability)
* [Datasets](#Datasets)
@@ -29,48 +30,77 @@ If you use the code, data, models, or results. please cite our [preprint](https:
* [Contributing](#Contributing)
* [Trademarks](#Trademarks)

+## Usage
+
+The simplest way to use these models and datasets is via the HuggingFace interface. Alternatively, you can install this package or use our Docker image. Either way, you will need PyTorch, mamba-ssm, causal-conv1d, and flash-attn.
+
+**Prerequisites**

-## Installation
**Requirements**:
-* PyTorch: 2.2 and above (2.7 recommended)
-* CUDA 12.0 and above
-* Optionally install Flash Attention 2 following installation instructions here: https://github.com/Dao-AILab/flash-attention
+* PyTorch: 2.7.1
+* CUDA 12.8 and above
+
+We recommend using [uv](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) and creating a clean environment.

-We recommend creating a clean conda environment with Python 3.10

```bash
-conda create --name dayhoff python=3.10
+uv venv dayhoff
+source dayhoff/bin/activate
```

-In that new environment, install PyTorch, mamba-ssm, and causal-conv1d, then install Dayhoff. Optionally, install Flash Attention 2.
+In that new environment, install PyTorch 2.7.1.
+```bash
+uv pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
+```
+
+Now, we need to install mamba-ssm, flash-attn, causal-conv1d, and their prerequisites.

```bash
-pip install dayhoff
+uv pip install wheel packaging
+uv pip install --no-build-isolation flash-attn causal-conv1d mamba-ssm
+```
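
If those builds succeed, the compiled extensions should import cleanly. A quick sanity check (a hedged sketch, not from the README; the module names are the import names these packages ship under):

```python
# Hedged sanity check: verify the compiled CUDA extensions import cleanly.
import causal_conv1d  # noqa: F401
import flash_attn     # noqa: F401
import mamba_ssm      # noqa: F401

print("mamba-ssm / flash-attn / causal-conv1d import OK")
```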
+
+To import from HuggingFace, you will need to install these versions:
+
+```bash
+uv pip install datasets==3.2.0  # for HF datasets
+uv pip install transformers==4.51.0
+uv pip install huggingface_hub~=0.34.4
+```
+
+Now, you can simply import the models or datasets into your code.
+
+```python
+from transformers import SuppressTokensLogitsProcessor
+from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
+from datasets import load_dataset

-# For bleeding edge:
-pip install git+https://github.com/microsoft/dayhoff.git
+model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c')
+tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c',
+                                          trust_remote_code=True)
+
+gigaref_clustered_train = load_dataset("microsoft/DayhoffDataset",
+                                       name="gigaref_no_singletons",
+                                       split="train")
```
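
From here, sampling works the same way as `examples/generate.py` added in this commit. A minimal sketch under that script's assumptions (`constants.START` and the 100-residue cap are borrowed from it; this is not an officially documented snippet):

```python
# Minimal unconditional-sampling sketch, mirroring examples/generate.py;
# constants.START is the N-terminal start token assumed by that script.
import torch
from dayhoff.constants import constants
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
device = torch.device("cuda:0")  # the fast Mamba kernels require a CUDA device
model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c')
model = model.to(device).to(torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c',
                                          trust_remote_code=True)

start = tokenizer(constants.START, return_tensors="pt",
                  return_token_type_ids=False)['input_ids'].to(device)
generated = model.generate(start, do_sample=True, temperature=1.0,
                           max_new_tokens=100, use_cache=True)
print(tokenizer.batch_decode(generated, skip_special_tokens=False)[0])
```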

-**Mamba-ssm and causal-conv1d recommendations**
+## Installation
+
+Now, we can either install from PyPI:

-It is sometimes challenging to properly install these packages just using pip. The following two errors are common when simply using pip install:
-* packages installed correctly, but when loading models you get "ValueError: Fast Mamba kernels are not available. Make sure they are installed and that the mamba module is on a CUDA device."
-* Package installation of causal-conv1d or mamba-ssm fails during the build
+```bash
+uv pip install dayhoff
+```

-If you encounter any of these errors, try installing using the following commands:
+Or, to be able to run the example scripts, clone the repo and install:

```bash
-git clone https://github.com/Dao-AILab/causal-conv1d.git
-cd causal-conv1d
-git checkout v1.4.0 # current latest version tag
-CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
-cd ..
-git clone https://github.com/state-spaces/mamba.git
-cd mamba
-git checkout v2.2.4 # current latest version tag
-CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE CAUSAL_CONV1D_FORCE_CXX11_ABI=TRUE pip install --no-build-isolation .
+git clone https://github.com/microsoft/dayhoff.git
+cd dayhoff
+uv pip install -e .
```

+
+
**Docker**

For a fully functional containerized environment without needing to install dependencies manually, you can use the provided Docker image instead:
@@ -82,44 +112,44 @@ docker run -it samirchar/dayhoff:latest

## Data and model availability

-All Dayhoff models are available on [Azure AI Foundry](https://aka.ms/dayhoff/foundry)
+All Dayhoff models are available on [Azure AI Foundry](https://ai.azure.com/labs)

-Additionally, all Dayhoff models are also hosted on [Hugging Face](https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969) 🤗. All datasets used in the paper, with the exception of OpenProteinSet, are available on Hugging Face in three formats: FASTA, Arrow, and JSONL. The PDB files for structures used to generate BackboneRef are available in Parquet format.
+Additionally, all Dayhoff models are also hosted on [Hugging Face](https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969) 🤗. All datasets used in the paper, with the exception of OpenProteinSet, are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.

GigaRef, BackboneRef, and DayhoffRef are available under the [CC BY License](https://creativecommons.org/licenses/by/4.0/)

## Datasets
### Training datasets
The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:

-* **[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
-* _Splits: train (25 GB), test (26 MB), valid (26 MB)_
+**[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
+* _Splits: train (25 GB), test (26 MB), valid (26 MB)_

-* **[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
-* _Splits: train (83 GB), test (90 MB), valid (87 MB)_
+**[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
+* _Splits: train (83 GB), test (90 MB), valid (87 MB)_


-* **GigaRef** (**GR**) – 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
-  * **GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
-    * _Splits: train (433 GB), test (22 MB)_
-  * **GigaRef-singletons** (**GR-s**) - Only includes singletons
-    * _Splits: train (282 GB)_
+**GigaRef** (**GR**) – 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
+* **GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
+  * _Splits: train (433 GB), test (22 MB)_
+* **GigaRef-singletons** (**GR-s**) - Only includes singletons
+  * _Splits: train (282 GB)_

-* **BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
-  * **BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
-    * _Splits: train (3 GB)_
-  * **BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
-    * _Splits: train (3 GB)_
-  * **BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
-    * _Splits: train (3 GB)_
+**BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
+* **BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
+  * _Splits: train (3 GB)_
+* **BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
+  * _Splits: train (3 GB)_
+* **BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
+  * _Splits: train (3 GB)_

-* **[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.
+**[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.

### DayhoffRef
Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences.

-* **DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
-* _Splits: train (5 GB)_
+**DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
+* _Splits: train (5 GB)_

### Loading datasets in HuggingFace

@@ -166,20 +196,27 @@ Weights are available for the following models, as described in the [paper](http

## Unconditional generation

-For most cases, use [src/generate.py](https://github.com/microsoft/dayhoff/blob/main/src/generate.py) to generate new protein sequences. Below is sample code to generate 10 sequences with at most 100 residues:
+For most cases, use [examples/generate.py](https://github.com/microsoft/dayhoff/blob/main/examples/generate.py) to generate new protein sequences. Below is a sample command to generate 10 sequences with at most 100 residues and place them in a fasta file in the directory `generations/`:

```bash
-python src/generate.py --out-dir generations --model 170m-UR50-BBR-n --max-length 100 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1
+python examples/generate.py generations/ --model-name Dayhoff-170m-UR50-BBR-n --max-length 100 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0
```
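
As `examples/generate.py` shows, each run writes its samples to a timestamped FASTA file in the output directory, named from the sampling parameters (`<model-name>_<temp>_minp<min-p>_<random-seed>_<timestamp>.fasta`).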

## Homolog-conditioned generation

-The [generate_from_homologs](https://github.com/microsoft/dayhoff/blob/main/src/generate_from_homologs.py) script performs sequence generation conditioned on evolutionarily-related homologous sequences modeled as multiple sequence alignments (MSAs).
+`examples/generate.py` includes an option to pass a fasta file, in which case it performs sequence generation conditioned on the sequences in the fasta file. The order of the conditioning sequences is randomly shuffled for each generation.
+
+
+```bash
+python examples/generate.py generations/ --fasta-file example.fasta --model-name Dayhoff-3b-GR-HM-c --max-length 128 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0
+```
+
+## Zero-shot fitness scoring

-The following code specifies the folder where MSAs in fasta format are stored and selects two specific MSAs for conditional generation. The list of MSAs within the MSAs dir can also be specified via an --include-pattern argument.
+`examples/score.py` computes forward and backward average log-likelihoods for every sequence in a fasta file.

```bash
-python src/generate_from_homologs.py --model 3b-GGR-MSA --msas-dir MSAs --task sequence --out-dir generations --msa-file-names 100220484.fasta 10123434.fasta --temp 1.0 --min-p 0.0 --max-length 768 --random-seed 1
+python examples/score.py example.fasta output_dir/ --model-name Dayhoff-3b-GR-HM-c --gpu 0
```
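
The reported score is an average per-token log-likelihood under the model. A minimal forward-direction sketch using the HuggingFace interface above (hedged; `examples/score.py` is the supported path and also scores the reversed C→N direction):

```python
# Hedged sketch: forward (N->C) average log-likelihood of one sequence.
import torch
from dayhoff.constants import constants
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda:0")
model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c').to(device)
tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c',
                                          trust_remote_code=True)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # any query sequence
ids = tokenizer(constants.START + seq + constants.STOP, return_tensors="pt",
                return_token_type_ids=False)['input_ids'].to(device)
with torch.no_grad():
    logits = model(ids).logits
# Position t predicts token t+1, so shift before averaging log-probabilities.
logp = torch.log_softmax(logits[:, :-1].float(), dim=-1)
token_ll = logp.gather(-1, ids[:, 1:, None]).squeeze(-1)
print("average log-likelihood:", token_ll.mean().item())
```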

## Analysis scripts
@@ -238,8 +275,8 @@ Evolution guided generation:
* [reorg_msas.py](https://github.com/microsoft/dayhoff/blob/main/analysis/reorg_msas.py)

Cas9 evals:
-* [generate_cas9.py](https://github.com/microsoft/dayhoff/blob/main/analysis/generate_cas9.py)
-* [compile_cas9_fidelity.py](https://github.com/microsoft/dayhoff/blob/main/analysis/compile_cas9_fidelity.py)
+* [generate_cas9.py](https://github.com/microsoft/dayhoff/blob/main/analysis/generate_cas9.py):
+* [compile_cas9_fidelity.py](https://github.com/microsoft/dayhoff/blob/main/analysis/compile_cas9_fidelity.py):

## Out-of-Scope Use Cases

@@ -272,4 +309,4 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
-Any use of third-party trademarks or logos is subject to those third parties' policies.
+Any use of third-party trademarks or logos is subject to those third parties' policies.

dayhoff/utils.py

Lines changed: 0 additions & 1 deletion
@@ -199,7 +199,6 @@ def get_logger():
    logger.addHandler(console_handler)

    # Example usage
-    logger.info("This is an info message.")
    return logger

examples/generate.py

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+import argparse
+import datetime
+import os
+
+import numpy as np
+import torch
+from dayhoff.constants import START_UL, constants
+from sequence_models.utils import parse_fasta
+from tqdm import tqdm
+from transformers import SuppressTokensLogitsProcessor
+from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
+
+CAN_AAS = constants.CAN_AAS
+STOP = constants.STOP
+SEP = constants.SEP
+START = constants.START
+GAP = constants.GAP
+
+
+def generate(args: argparse.Namespace) -> None:
+    DEVICE = torch.device("cuda:" + str(args.gpu))
+
+    set_seed(args.random_seed)  # seed the RNG so runs are reproducible
+    torch.set_default_device(DEVICE)
+
+    model = AutoModelForCausalLM.from_pretrained('microsoft/%s' % args.model_name)
+    tokenizer = AutoTokenizer.from_pretrained('microsoft/%s' % args.model_name,
+                                              trust_remote_code=True)
+
+    model = model.to(DEVICE)
+    model = model.to(torch.bfloat16)
+    out_dir = args.out_dir
+    alphabet = tokenizer.alphabet
+
+    # Restrict sampling to canonical amino acids plus the appropriate stop token
+    all_tokens = list(range(len(alphabet)))
+    allowed_tokens = [alphabet.index(aa) for aa in CAN_AAS]
+    if args.fasta_file is not None:
+        if 'HM' not in args.model_name:
+            raise ValueError(args.model_name + " cannot use homolog conditioning.")
+        seqs, names = parse_fasta(args.fasta_file, return_names=True)
+        stop = SEP
+        eos_id = alphabet.index(SEP)
+    else:
+        eos_id = alphabet.index(STOP)
+        seqs = None
+        stop = STOP
+    params = args.model_name + "_%.1f_minp%.2f_%d" % (args.temp, args.min_p, args.random_seed)
+    allowed_tokens += [eos_id]
+    model.generation_config.eos_token_id = eos_id
+    sup = SuppressTokensLogitsProcessor([t for t in all_tokens if t not in allowed_tokens], device=DEVICE)
+    os.makedirs(out_dir, exist_ok=True)
+    n_generated = 0
+    now = str(datetime.datetime.now()).replace(' ', '_').replace(':', '.')
+    with open(os.path.join(out_dir, params + '_' + now + ".fasta"), "w") as f:
+        for _ in tqdm(range(args.n_generations)):
+            if seqs is not None:
+                # Shuffle the conditioning sequences for each generation
+                idx = np.arange(len(seqs))
+                np.random.shuffle(idx)
+                shuffled_seqs = [seqs[i] for i in idx]
+                ul = START_UL + SEP.join(shuffled_seqs) + SEP
+                start = tokenizer(ul, return_tensors="pt", return_token_type_ids=False)['input_ids']
+            else:
+                start = tokenizer(START, return_tensors="pt", return_token_type_ids=False)['input_ids']
+
+            generated = model.generate(start, do_sample=True, logits_processor=[sup],
+                                       temperature=args.temp, min_p=args.min_p, num_beams=1,
+                                       max_new_tokens=args.max_length,
+                                       use_cache=True)
+            untokenized = tokenizer.batch_decode(generated, skip_special_tokens=False)[0]
+            if seqs is None:
+                new_seq = untokenized.replace(START, "").replace(STOP, "")
+            else:
+                # Keep only the newly generated sequence after the conditioning context
+                if untokenized[-1] == stop:
+                    new_seq = untokenized.split(stop)[-2]
+                else:
+                    new_seq = untokenized.split(stop)[-1]
+                    # TODO: Deal with max token termination
+            f.write(">" + params + "_%d\n" % n_generated)
+            f.write(new_seq + "\n")
+            n_generated += 1
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("out_dir", type=str)  # location to write generations to
+    parser.add_argument("--model-name", type=str, default='Dayhoff-170m-UR90')
+    parser.add_argument("--fasta-file", type=str, default=None)
+    parser.add_argument("--max-length", type=int, default=2048)
+    parser.add_argument("--n-generations", type=int, default=32)
+    parser.add_argument("--temp", type=float, default=1.0)
+    parser.add_argument("--random-seed", type=int, default=0)
+    parser.add_argument("--min-p", type=float, default=0.)
+    parser.add_argument("--gpu", type=int, default=0)
+    args = parser.parse_args()
+    generate(args)
+
+
+if __name__ == "__main__":
+    main()
