Commit 3409d91 ("clean")
Parent: 2cab4b8

3 files changed: 57 additions & 75 deletions

README.md

Lines changed: 20 additions & 20 deletions
@@ -82,7 +82,7 @@ docker run -it samirchar/dayhoff:latest
 
 ## Data and model availability
 
-All Dayhoff models are available on [AzureAIFoundry](https://ai.azure.com/labs)
+All Dayhoff models are available on [Azure AI Foundry](https://aka.ms/dayhoff/foundry)
 
 Additionally, all Dayhoff models are also hosted on [Hugging Face](https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969) 🤗. All datasets used in the paper, with the exception of OpenProteinSet are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.
 
@@ -92,34 +92,34 @@ GigaRef, BackboneRef, and DayhoffRef are available under [CC BY License](https:/
 ### Training datasets
 The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:
 
-**[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
-* _Splits: train (25 GB), test (26 MB), valid (26 MB)_
+* **[UniRef50](https://www.uniprot.org/)** (**UR50**) - dataset from UniProt, clustered at 50% sequence identity, contains only cluster representatives.
+  * _Splits: train (25 GB), test (26 MB), valid (26 MB)_
 
-**[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
-* _Splits: train (83 GB), test (90 MB), valid (87 MB)_
+* **[UniRef90](https://www.uniprot.org/)** (**UR90**) - dataset from UniProt, clustered at 90% sequence identity, contains cluster representatives and members.
+  * _Splits: train (83 GB), test (90 MB), valid (87 MB)_
 
 
-**GigaRef** (**GR**) – 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
-* **GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
-  * _Splits: train (433 GB), test (22 MB)_
-* **GigaRef-singletons** (**GR-s**) - Only includes singletons
-  * _Splits: train (282 GB)_
+* **GigaRef** (**GR**) – 3.43B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:
+  * **GigaRef-clusters** (**GR**) - Only includes cluster representatives and members, no singletons
+    * _Splits: train (433 GB), test (22 MB)_
+  * **GigaRef-singletons** (**GR-s**) - Only includes singletons
+    * _Splits: train (282 GB)_
 
-**BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
-* **BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
-  * _Splits: train (3 GB)_
-* **BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
-  * _Splits: train (3 GB)_
-* **BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
-  * _Splits: train (3 GB)_
+* **BackboneRef** (**BR**) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:
+  * **BackboneRef unfiltered** (**BRu**) – 10M sequences randomly sampled from all 46M designs.
+    * _Splits: train (3 GB)_
+  * **BackboneRef quality** (**BRq**) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
+    * _Splits: train (3 GB)_
+  * **BackboneRef novelty** (**BRn**) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
+    * _Splits: train (3 GB)_
 
-**[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.
+* **[OpenProteinSet](https://arxiv.org/abs/2308.05326)** (**HM**) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.
 
 ### DayhoffRef
 Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences
 
-**DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
-* _Splits: train (5 GB)_
+* **DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models: Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.
+  * _Splits: train (5 GB)_
 
 ### Loading datasets in HuggingFace

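The README hunk above notes that the datasets ship in three formats: FASTA, Arrow, and JSONL. As a rough sketch of what consuming the two text formats involves (the record layout and field names here are hypothetical, not taken from the actual Dayhoff files):

```python
import io
import json

def parse_fasta_records(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical records; real Dayhoff Atlas headers/fields may differ.
fasta_text = ">seq1 example\nMKV\nLLT\n>seq2\nGAS\n"
records = list(parse_fasta_records(io.StringIO(fasta_text)))

# JSONL: one JSON object per line.
jsonl_text = '{"id": "seq1", "sequence": "MKVLLT"}\n{"id": "seq2", "sequence": "GAS"}\n'
rows = [json.loads(line) for line in jsonl_text.splitlines()]

print(records)  # [('seq1 example', 'MKVLLT'), ('seq2', 'GAS')]
```

Arrow files, by contrast, are typically read through `pyarrow` or the Hugging Face `datasets` library rather than parsed by hand.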
datasets/openproteinset/utils.py

Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
 import glob
-from tqdm import tqdm
 import os
-import pandas as pd
-import numpy as np
-from sequence_models.utils import parse_fasta
 import subprocess
 from multiprocessing.pool import ThreadPool
 
+import pandas as pd
+from sequence_models.utils import parse_fasta
+from tqdm import tqdm
+
 
 def make_uniref_fasta_splits(alignments):
     for x in alignments:

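The change above regroups the imports in PEP 8 order (standard library first, a blank line, then third-party packages, each group sorted by module) and drops the unused `numpy` import. A minimal sketch of that grouping rule; the stdlib set is hard-coded here for illustration, whereas a tool like isort derives it automatically:

```python
# Hard-coded stdlib names for illustration only.
STDLIB = {"glob", "os", "subprocess", "multiprocessing"}

def top_level_module(import_line):
    """'from a.b import c' or 'import a.b as x' -> 'a'."""
    parts = import_line.split()
    name = parts[1] if parts[0] in ("import", "from") else parts[0]
    return name.split(".")[0]

def regroup(lines):
    """Return imports as a stdlib group, a blank line, then third-party."""
    std = sorted((l for l in lines if top_level_module(l) in STDLIB),
                 key=top_level_module)
    third = sorted((l for l in lines if top_level_module(l) not in STDLIB),
                   key=top_level_module)
    return std + [""] + third

lines = [
    "import glob",
    "from tqdm import tqdm",
    "import os",
    "import pandas as pd",
    "from sequence_models.utils import parse_fasta",
]
print("\n".join(regroup(lines)))
```

Run on the file's original imports, this reproduces the ordering the commit arrives at: `pandas`, `sequence_models`, `tqdm` after the stdlib block.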
pyproject.toml

Lines changed: 33 additions & 51 deletions
@@ -33,64 +33,46 @@ classifiers = [
     "Operating System :: OS Independent"
 ]
 
-dependencies = [ #TODO: can add optional dependencies
-    # Basics
-    'pandas~=2.2.3',
-    'numpy~=1.26.4',
-    # 'lmdb~=1.4.1',
-    'mdanalysis~=2.7.0',
-    'python-dotenv~=1.0.1', # to load env variables
-    'matplotlib~=3.10.1',
-    'seaborn~=0.13.2',
-    'h5py~=3.13.0',
-    'scikit-learn~=1.5.0',
-    'scipy~=1.15.2',
-
-    # Pytorch
-    'torch-geometric~=2.5.2',
-    'torch-scatter~=2.1.2',
-    'transformers~=4.42.4',
-    'datasets~=3.2.0', # for HF datasets
-    # 'causal-conv1d>=1.4.0', # For jamba/mamba
-    # 'mamba-ssm~=2.2.4', # For jamba/mamba
-    # 'flash-attn~=2.7.4.post1', # Flash Attention
-
-    # Bio
-    'biopython~=1.83',
-    # 'biotite~=0.40.0',
-    'blosum~=2.0.3',
-    'fair-esm~=2.0.0',
-    'evodiff~=1.1.0',
-    'sequence-models~=1.8.0',
-
-
-    # logging / debugging
-    # 'mlflow',
-    'pdb-tools~=2.5.0',
-    'wandb~=0.16.6',
-    'tqdm~=4.67.1',
-
-
-    # Huggingface Hub
-    'ijson~=3.3.0',
-    'pyfastx~=2.2.0',
-    'huggingface_hub~=0.27.1',
-
-    # Azure
-    'azure-identity~=1.21.0'
+# only the packages imported by dayhoff/:
+dependencies = [
+    "numpy>=1.26",
+    "torch>=2.7",
+    "transformers>=4.49",
+    "pandas>=2.3",
+    "biopython>=1.85",
+    "sequence-models>=1.8",
+    "scipy>=1.13",
 ]
 
-# [project.optional-dependencies]
-
+[project.optional-dependencies]
+# everything else in your monorepo lives here
+full = [
+    "mdanalysis>=2.7",
+    "python-dotenv>=1.0",
+    "matplotlib>=3.10",
+    "seaborn>=0.13",
+    "h5py>=3.13",
+    "scikit-learn>=1.5",
+    "torch-geometric>=2.5",
+    "torch-scatter>=2.1",
+    "datasets>=3.2",
+    "blosum>=2.0",
+    "fair-esm>=2.0",
+    "evodiff>=1.1",
+    "pdb-tools>=2.5",
+    "wandb>=0.16",
+    "tqdm>=4.67",
+    "ijson>=3.3",
+    "pyfastx>=2.2",
+    "huggingface_hub>=0.27",
+    "azure-identity>=1.21",
+]
 
 [project.urls]
 Homepage = "https://github.com/microsoft/dayhoff"
 Repository = "https://github.com/microsoft/dayhoff"
 Issues = "https://github.com/microsoft/dayhoff/issues"
-
-# May need to remove this
-HuggingFaceDatasets = "https://huggingface.co/datasets/microsoft/DayhoffDataset"
-HuggingFaceModels = "https://huggingface.co/microsoft/Dayhoff"
+HuggingFace = "https://huggingface.co/collections/microsoft/dayhoff-atlas-6866d679465a2685b06ee969"
 
 
 [tool.setuptools.packages.find]
