Commit df949c0

Kevin Kaichuang Yang authored and sarahalamdari committed
First commit.
0 parents  commit df949c0

123 files changed

Lines changed: 15113 additions & 0 deletions


CODE_OF_CONDUCT.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns

Dockerfile

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel

# Set environment to non-interactive for clean installs
ENV DEBIAN_FRONTEND=noninteractive

# Install git and other system dependencies
RUN apt-get update && apt-get install -y \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Clone and install causal-conv1d
RUN git clone https://github.com/Dao-AILab/causal-conv1d.git && \
    cd causal-conv1d && \
    git checkout v1.4.0 && \
    CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install . && \
    cd ..

# Clone and install mamba
RUN git clone https://github.com/state-spaces/mamba.git && \
    cd mamba && \
    git checkout v2.2.4 && \
    CAUSAL_CONV1D_FORCE_BUILD=TRUE \
    CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE \
    CAUSAL_CONV1D_FORCE_CXX11_ABI=TRUE \
    pip install --no-build-isolation . && \
    cd ..

# Install dayhoff package from PyPI
RUN pip install dayhoff

# Add GitHub to known_hosts to avoid host verification error
RUN mkdir -p /root/.ssh && \
    ssh-keyscan github.com >> /root/.ssh/known_hosts

# Clone the private or public Dayhoff repo
RUN --mount=type=ssh git clone git@github.com:microsoft/dayhoff.git /dayhoff

# Set working directory to inside the cloned repo
WORKDIR /dayhoff

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE

README.md

Lines changed: 275 additions & 0 deletions
Large diffs are not rendered by default.

SECURITY.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->

## Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.

## Reporting Security Issues

**Please do not report security vulnerabilities through public GitHub issues.**

Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).

If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.

## Preferred Languages

We prefer all communications to be in English.

## Policy

Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).

<!-- END MICROSOFT SECURITY.MD BLOCK -->

SUPPORT.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Support

## How to file issues and get help

This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.

For help and questions about using this project, please contact the authors.

## Microsoft Support Policy

Support for this project is limited to the resources listed above.

analysis/ccmgen/Snakefile

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
import os

# alignments = os.listdir('selected_alignments')
# names = []
# for alignment in alignments:
#     with open(os.path.join("selected_alignments", alignment)) as f:
#         lines = f.readlines()
#     if len(lines) > 2:
#         names.append(alignment[:-6])

alignments = os.listdir('ccmgen_models')
names = [a[:-8] for a in alignments]  # strip the ".braw.gz" suffix



rule cat:
    input: ["ccmgen_outputs_short/" + name + ".fasta" for name in names]
    output: "ccmgen_short.fasta"
    run:
        with open(output[0], "w") as out_file:
            for in_file in input:
                with open(in_file) as f:
                    _ = f.readline()  # skip the original header line
                    seq = f.readlines()
                seq = "".join([s[:-1] for s in seq])  # drop the trailing newline of each line
                name = in_file.split("/")[1][:-6]  # file name without ".fasta"
                out_file.write(">" + name + "\n" + seq + "\n")


rule ccmgen:
    input: "single_sequences/{name}.fasta", "ccmgen_models_short/{name}.braw.gz"
    output: "ccmgen_outputs_short/{name}.fasta"
    conda: "ccmgen"
    shell:
        "ccmgen ccmgen_models_short/{wildcards.name}.braw.gz {output} --mcmc-sampling --alnfile single_sequences/{wildcards.name}.fasta --mcmc-sample-random-gapped --mcmc-burn-in 500 --num-sequences 1"


rule get_56:
    input: "selected_alignments/{name}.fasta"
    output: "conditioning_sequences/{name}.fasta"
    run:
        with open(input[0]) as f_in, open(output[0], 'w') as f_out:
            for i, line in enumerate(f_in):
                if i == 57 * 2:  # stop after 114 lines (57 two-line records)
                    break
                f_out.write(line)


rule get_first:
    input: "selected_alignments/{name}.fasta"
    output: "single_sequences/{name}.fasta"
    shell:
        "head -n 2 {input} > {output}"


rule ccmpred:
    input: "conditioning_sequences/{name}.fasta"
    output: "ccmgen_models_short/{name}.braw.gz"
    threads: 2
    conda: "ccmgen"
    shell:
        "ccmpred {input} --no-logo --num-threads {threads} -b {output}"
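
The `get_56` rule above copies the first 114 lines of each alignment, which corresponds to 57 records only if every record is exactly one header line followed by one sequence line. A minimal Python sketch of that truncation under the same assumption; `first_records` is a hypothetical helper for illustration and is not part of the repository:

```python
import io
from itertools import islice


def first_records(lines, n_records):
    """Return the lines of the first n_records two-line FASTA records.

    Assumes one header line plus one sequence line per record, which is
    what a fixed line-count cutoff relies on.
    """
    return list(islice(lines, 2 * n_records))


# Toy alignment with three records; keep the first two.
toy = io.StringIO(">a\nMK-L\n>b\nMKAL\n>c\nMRAL\n")
kept = first_records(toy, 2)
print("".join(kept), end="")
```

If an alignment holds fewer records than requested, `islice` simply stops at the end of the file, which matches the behaviour of the loop in the rule.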

analysis/clusters.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
import os
import jsonlines
from tqdm import tqdm

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

_ = sns.set_style("white")
input_dir = "/home/kevyan/generations/dayhoffref/"

df_50 = pd.read_csv(os.path.join(input_dir, 'c50.tsv'), sep='\t', header=None)
c50 = [None] * len(df_50)
current_pos = 0
current_cluster_rep = df_50.loc[0, 0]
current_cluster = []
for row in tqdm(df_50.itertuples()):
    if row._1 != current_cluster_rep:  # column 0: cluster representative
        c50[current_pos] = {"rep_id": current_cluster_rep, "ids": current_cluster, "n": len(current_cluster)}
        current_pos += 1
        current_cluster = []
        current_cluster_rep = row._1
    current_cluster.append(row._2)  # column 1: cluster member
c50[current_pos] = {"rep_id": current_cluster_rep, "ids": current_cluster, "n": len(current_cluster)}
c50 = c50[:current_pos + 1]
with jsonlines.open(os.path.join(input_dir, 'c50.jsonl'), 'w') as f:
    for key in c50:
        f.write(key)

counts = {}
total = 0
for key in c50:
    c = key["n"]
    if c not in counts:
        counts[c] = 1
    else:
        counts[c] += 1
    total += c

x = np.array(list(counts.keys()))
x = np.sort(x)
y = np.array([counts[xx] * xx for xx in x])  # sequences held in clusters of size xx
y = np.cumsum(y)
fig, ax = plt.subplots(1, 1)
_ = ax.plot(x, y, '.-')
_ = fig.savefig(os.path.join(input_dir, 'c50_cumsum.pdf'), dpi=300, bbox_inches='tight')

df = pd.DataFrame(columns=["model", "temp", "direction", "cluster_size"])
model = [None] * total
temp = [None] * total
direction = [None] * total
cluster_size = [None] * total
current_row = 0
for c in tqdm(c50):
    for name in c["ids"]:
        broken = name.split('_')
        model[current_row] = broken[0]
        temp[current_row] = float(broken[2][1:])
        direction[current_row] = broken[1]
        cluster_size[current_row] = c["n"]
        current_row += 1
df["model"] = model
df["temp"] = temp
df["direction"] = direction
df["cluster_size"] = cluster_size
df.to_csv(os.path.join(input_dir, 'c50_sizes.csv'), index=False)


df = pd.read_csv(os.path.join(input_dir, 'c50_sizes.csv'))
model_names = list(set(model))
grouped = df.groupby(["model", "temp", "direction"])
grouped["cluster_size"].mean()

def f(sizes):
    return (sizes == 1).mean()  # fraction of sequences in singleton clusters

agged = grouped.agg(
    cluster_size_mean=('cluster_size', np.mean),
    cluster_size_std=('cluster_size', np.std),
    frac_singleton=("cluster_size", f),
    n=("cluster_size", "count"),
)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
print(agged.reset_index())

df.groupby("model").agg(
    cluster_size_mean=('cluster_size', np.mean),
    cluster_size_std=('cluster_size', np.std),
    frac_singleton=("cluster_size", f),
    n=("cluster_size", "count"),
).reset_index()
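
The first loop in this file folds consecutive rows that share a representative into one cluster record, and the later aggregation reports the fraction of sequences that sit in a cluster of size 1. A small standard-library sketch of both steps follows. It assumes, as the loop does, that rows for the same representative are contiguous; the function names and toy rows are hypothetical and only illustrate the idea, they are not the script's own code:

```python
from collections import Counter
from itertools import groupby


def group_clusters(rows):
    """Group consecutive (rep_id, member_id) rows that share a representative."""
    clusters = []
    for rep_id, members in groupby(rows, key=lambda r: r[0]):
        ids = [member for _, member in members]
        clusters.append({"rep_id": rep_id, "ids": ids, "n": len(ids)})
    return clusters


def singleton_fraction(clusters):
    """Fraction of sequences (not clusters) that belong to a size-1 cluster."""
    sizes = [c["n"] for c in clusters for _ in c["ids"]]
    return sum(s == 1 for s in sizes) / len(sizes)


# Toy rows: cluster A has 3 members, B has 1, C has 2.
rows = [("A", "A"), ("A", "x1"), ("A", "x2"), ("B", "B"), ("C", "C"), ("C", "y1")]
clusters = group_clusters(rows)
size_counts = Counter(c["n"] for c in clusters)  # cluster size -> number of clusters
print([c["n"] for c in clusters], singleton_fraction(clusters))
```

Note that the fraction is taken over sequences, so one singleton among six sequences gives 1/6 even though it is one of three clusters.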

analysis/compile_cas9_fidelity.py

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
from tqdm import tqdm
import os
import pandas as pd  # needed for pd.read_csv, pd.DataFrame and pd.merge below
from Bio.Align import PairwiseAligner, substitution_matrices
from sequence_models.utils import parse_fasta


base_path = "/home/kevyan/generations/cas9-no-order/"

model = "short_cas9s_1.0_minp0.00_new"
folding_df = pd.read_csv(os.path.join(base_path, 'esmfold_proteinmpnn_merge_data.csv'))
seqs, names = parse_fasta(os.path.join(base_path, "%s.fasta" % model), return_names=True)
df = folding_df[folding_df['if_temp'] == 1.0]
name_df = pd.DataFrame()
name_df['sequence'] = seqs
name_df['file'] = names
df = pd.merge(name_df, df, how='left', on='file')
# for m in models:
#     pdb_paths, mpnn_paths = get_all_paths(os.path.join(base_path, "%s_structures/pdb/esmfold/" %m), os.path.join(base_path, "%s_structures/esmfoldmpnn_iftemp_1" %m))
#     fold_df, mpnn_df, df = results_to_pandas(pdb_paths, mpnn_paths, name="")
#     df['model'] = m

aligner = PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0
with tqdm(total=len(df)) as pbar:
    homologs, homolog_names = parse_fasta(os.path.join('/home/kevyan/data/characterized_cas9s', "naturals.fasta"), return_names=True)
    for idx, row in df.iterrows():
        s = row['sequence']
        s = s.replace("-", "")
        s = s.replace("<mask2>", "")
        s = s.replace("<mask1>", "")
        s = s.replace("<mask3>", "")
        s = s.replace("<eos>", "")
        best_matches = -1
        best_homolog_sequence = None
        best_homolog_name = None
        best_cterm_gaps = None
        for hs, hn in zip(homologs, homolog_names):
            alignment = aligner.align(s, hs)
            if alignment.score > best_matches:
                best_matches = alignment.score
                best_homolog_sequence = hs
                best_homolog_name = hn
                best_cterm_gaps = len(hs) - alignment[0].aligned[1, -1, 1]
        df.loc[idx, 'gen_length'] = len(s)
        df.loc[idx, 'best_matches'] = best_matches
        df.loc[idx, 'match_length'] = len(best_homolog_sequence)
        df.loc[idx, 'homolog_name'] = best_homolog_name
        df.loc[idx, 'homolog_sequence'] = best_homolog_sequence
        df.loc[idx, 'cterm_gaps'] = best_cterm_gaps
        pbar.update(1)

df['plddt'] = df['esmfoldplddt']
df['scperplexity'] = df['proteinmpnnperplexity']
df['seq_id'] = df['best_matches'] / df['gen_length']
df = df.sort_values(['cterm_gaps', 'plddt'], ascending=[True, False])
df['name'] = [f.split('_')[-1] for f in df['file']]
df.to_csv(os.path.join(base_path, "%s_fidelity.csv" %model), index=False)

df = pd.read_csv(os.path.join(base_path, "%s_fidelity.csv" %model))

df[df['plddt'] > .70].head(10)[['name', 'match_length', 'gen_length', 'plddt', 'cterm_gaps', 'best_matches']]
# 52, 8, and 50 have the most domain hits
df[df['plddt'] > 0.7].shape
df.loc[[0, 1, 2, 18, 19, 21], ['name', 'sequence']].values
df.loc[[0, 1, 2, 18, 19, 21], ['name', 'homolog_name', 'homolog_sequence']].values
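
The inner loop of this script does two things for every generated sequence: it strips gap characters and special tokens, then keeps the natural homolog with the highest alignment score. The sketch below isolates that control flow with the standard library only. The scorer is a deliberately simple stand-in (position-wise identity), not the BLOSUM62 alignment used above, and `clean_sequence`, `best_homolog` and `toy_score` are hypothetical names:

```python
SPECIAL_TOKENS = ("-", "<mask1>", "<mask2>", "<mask3>", "<eos>")


def clean_sequence(s):
    """Remove gap characters and special tokens from a generated sequence."""
    for token in SPECIAL_TOKENS:
        s = s.replace(token, "")
    return s


def best_homolog(query, homologs, score_fn):
    """Return (score, name, sequence) of the highest-scoring homolog.

    score_fn stands in for the alignment score. Ties keep the first hit,
    matching a strict `>` comparison.
    """
    best = (float("-inf"), None, None)
    for name, seq in homologs:
        score = score_fn(query, seq)
        if score > best[0]:
            best = (score, name, seq)
    return best


def toy_score(a, b):
    """Assumed scorer for illustration: identical residues at matching positions."""
    return sum(x == y for x, y in zip(a, b))


query = clean_sequence("MK-AL<mask1>T<eos>")
hit = best_homolog(query, [("h1", "MRALT"), ("h2", "MKALT"), ("h3", "MKALS")], toy_score)
print(query, hit)
```

Starting the running best at negative infinity, rather than at -1 as in the script, is a choice made here so that a homolog is still selected when every score is negative.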

analysis/compile_dayhoffref.py

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
import os

individual_dir = "/home/kevyan/generations/dayhoffref/dayhoff_generations/"
individual_files = os.listdir(individual_dir)

out_file = "/home/kevyan/generations/dayhoffref/dayhoffref.fasta"
with open(out_file, 'w') as out:
    for individual_file in individual_files:
        name = individual_file.replace(".fasta", "")
        name = name.replace("jamba-", "")
        name = name.replace("10mbothfilter", "bbr-novel-sc")
        print(name)
        with open(os.path.join(individual_dir, individual_file), 'r') as infile:
            for line in infile:
                if line.startswith(">"):
                    out.write(">" + name + "_" + line[1:])
                else:
                    out.write(line)
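
The script above merges per-model FASTA files and prefixes every header with a name derived from the file name. The same transformation can be sketched as two pure functions that are easy to check without touching the file system. The function names and the example file name are hypothetical:

```python
def normalize_name(file_name):
    """Derive the header prefix from a file name via the same three replacements."""
    name = file_name.replace(".fasta", "")
    name = name.replace("jamba-", "")
    return name.replace("10mbothfilter", "bbr-novel-sc")


def prefix_headers(lines, prefix):
    """Prepend `prefix_` to each FASTA header line; sequence lines pass through."""
    return [
        ">" + prefix + "_" + line[1:] if line.startswith(">") else line
        for line in lines
    ]


prefix = normalize_name("jamba-3b-10mbothfilter.fasta")  # hypothetical file name
out = prefix_headers([">seq0\n", "MKAL\n"], prefix)
print(prefix, out)
```

Keeping the renaming separate from the I/O makes it straightforward to verify the resulting headers before writing a large combined file.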
