GitHub - yishakbililign/DeNovoFiltering: UC Berkeley MSSE Capstone Project: Evaluating & refining filtering strategies for de novo protein binders

Authors

Yishak Bililign - yishakbililign
Luis Hernandez - lhmag89
Sheyda Nazarian - SheydaNazarian
Natalia Rivera - nataliariverax

Overview

This project analyzes and refines filtering strategies for de novo protein binders using model confidence metrics, structural interface descriptors, and sequence-based biophysical characterization, with the goal of improving generalization across targets and design classes. We explore filtering for both protein expression & target binding using open-source de novo design datasets from Proteinbase. We propose independent LDA scoring approaches for the two filtering tasks, which together yield a 6-fold improvement in binder success rate on a new target dataset that was not included in model development.
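To make the "fold improvement" figure concrete, here is a minimal sketch of how such a number can be computed: the binder success rate among designs that pass both filters, divided by the success rate of the unfiltered set. It assumes the two scores are applied as independent cutoffs, which this README does not specify. The dictionary keys, cutoffs, and toy data are hypothetical and are not project results.

```python
def success_rate(labels):
    """Fraction of designs with a positive experimental outcome."""
    return sum(labels) / len(labels) if labels else 0.0


def fold_improvement(designs, expr_cutoff, bind_cutoff):
    """Compare binder success rate before and after applying both filters.

    Each design is a dict with hypothetical keys:
      'expr_score' - score from the expression model
      'bind_score' - score from the binding model
      'binds'      - 1 if the design bound the target experimentally, else 0
    """
    baseline = success_rate([d["binds"] for d in designs])
    kept = [
        d for d in designs
        if d["expr_score"] >= expr_cutoff and d["bind_score"] >= bind_cutoff
    ]
    filtered = success_rate([d["binds"] for d in kept])
    if baseline == 0:
        raise ValueError("baseline success rate is zero; fold change is undefined")
    return filtered / baseline, len(kept)


# Toy data, invented for illustration only (not project results).
toy = [
    {"expr_score": 0.9, "bind_score": 0.8, "binds": 1},
    {"expr_score": 0.7, "bind_score": 0.9, "binds": 1},
    {"expr_score": 0.8, "bind_score": 0.2, "binds": 0},
    {"expr_score": 0.1, "bind_score": 0.9, "binds": 0},
    {"expr_score": 0.2, "bind_score": 0.1, "binds": 0},
    {"expr_score": 0.3, "bind_score": 0.4, "binds": 0},
    {"expr_score": 0.6, "bind_score": 0.7, "binds": 0},
    {"expr_score": 0.4, "bind_score": 0.3, "binds": 0},
]

fold, n_kept = fold_improvement(toy, expr_cutoff=0.5, bind_cutoff=0.5)
print(f"kept {n_kept} of {len(toy)} designs, fold improvement = {fold:.2f}")
```

In the toy set the baseline rate is 2/8 and the filtered rate is 2/3, so the ratio is about 2.67. The notebooks in this repository remain the reference for how the reported figure was actually obtained.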

Repository Map

root/
├── boltz/
├── data/
├── figures/
├── notebooks/
├── scripts/
├── utils/
├── environment.yml
└── README.md

Data

Datasets & protein structures used for statistical analysis & modeling.

root/
└── data/
    ├── processed/
    │   ├── full/
    │   │   ├── full_binder_confidence_metrics.csv
    │   │   ├── full_binder_dataset.csv
    │   │   ├── full_binder_lda_scores.csv
    │   │   └── full_binder_structural_features.csv
    │   ├── nipah/
    │   │   ├── nipah_binder_dataset.csv
    │   │   └── nipah_binder_structural_features.csv
    │   └── rbx1/
    │       ├── rbx1_test_dataset.csv
    │       └── rbx1_test_predictions.csv
    ├── structures/
    │   ├── nipah/
    │   │   ├── <id>_boltz2.cif
    │   │   └── <id>_esmfold.cif
    │   └── full/
    │       └── <id>_boltz2.cif
    ├── proteinbase_collection_full_28_01_2026.csv
    ├── proteinbase_collection_nipah-binder-competition-results.csv
    └── target_sequences.csv

Notebooks

Notebooks used for EDA, statistical analysis, & modeling.

root/
└── notebooks/
    ├── binding/
    │   ├── ANOVA.ipynb
    │   ├── GAM.ipynb
    │   ├── Thresholding.ipynb
    │   ├── Full_Binder_LDA_Analysis.ipynb
    │   └── Nipah_Binder_LDA_Analysis.ipynb
    └── expression/
        ├── Full_Expression_LDA_Analysis.ipynb
        └── Nipah_Expression_LDA_Analysis.ipynb

Scripts

Python scripts with CLI for data processing & feature extraction.

root/
└── scripts/
    ├── analyze_structure.py
    ├── parse_proteinbase.py
    └── read_boltz.py

Boltz

Pipeline for generating structure predictions using Boltz2. See boltz/README.md for more details.

Configuration

Create a conda environment with all required package dependencies for Python scripts in this repository with the following command:

conda env create -f environment.yml -n <env_name>

The Boltz2 structure prediction pipeline uses a Docker container which can be built from boltz/Dockerfile with the following command:

cd boltz
docker build -t boltz-image .

Usage

Parsing Proteinbase Datasets

usage: parse_proteinbase.py [-h] --input FILE --output FILE [--target NAME] [--download DIR]

Flag	Type	Required	Default	Description
`-i`, `--input`	str	✅	—	Input csv path
`-o`, `--output`	str	✅	—	Output csv path
`-t`, `--target`	str	❌	`None`	Binder target
`-d`, `--download`	str	❌	—	Directory to download structures

scripts/parse_proteinbase.py can be used to parse the Nipah competition dataset from Proteinbase saved in data/proteinbase_collection_nipah-binder-competition-results.csv to produce a new csv in data/processed/nipah/nipah_binder_dataset.csv with the following command:

python scripts/parse_proteinbase.py -i data/proteinbase_collection_nipah-binder-competition-results.csv -o data/processed/nipah/nipah_binder_dataset.csv -t nipah-glycoprotein-g

The full Proteinbase collection saved in data/proteinbase_collection_full_28_01_2026.csv can similarly be processed to produce a new csv in data/processed/full/full_binder_dataset.csv with the following command:

python scripts/parse_proteinbase.py -i data/proteinbase_collection_full_28_01_2026.csv -o data/processed/full/full_binder_dataset.csv
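For readers who want to call the parser from their own tooling, the flag table above can be mirrored with argparse roughly as follows. This is a sketch reconstructed from the documented flags only, not the source of scripts/parse_proteinbase.py. The `None` default for `--download` is an assumption, since the table lists no default for it.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (a reconstruction)."""
    parser = argparse.ArgumentParser(prog="parse_proteinbase.py")
    parser.add_argument("-i", "--input", metavar="FILE", required=True,
                        help="Input csv path")
    parser.add_argument("-o", "--output", metavar="FILE", required=True,
                        help="Output csv path")
    parser.add_argument("-t", "--target", metavar="NAME", default=None,
                        help="Binder target")
    # Assumption: no default is documented for --download, so None is used here.
    parser.add_argument("-d", "--download", metavar="DIR", default=None,
                        help="Directory to download structures")
    return parser


# Parse the same arguments as the Nipah example command above.
args = build_parser().parse_args([
    "-i", "data/proteinbase_collection_nipah-binder-competition-results.csv",
    "-o", "data/processed/nipah/nipah_binder_dataset.csv",
    "-t", "nipah-glycoprotein-g",
])
print(args.target, args.download)
```

Omitting `-t` leaves `args.target` as `None`, which matches the second command above, where the full collection is processed without a target.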

Extracting Structural Features

usage: analyze_structures.py [-h] --input DIR --output FILE [--source NAME] [--target TARGET_CHAIN_ID] [--binder BINDER_CHAIN_ID]

Flag	Type	Required	Default	Description
`-i`, `--input`	str	✅	—	Input directory of CIF files
`-o`, `--output`	str	✅	—	Output csv path
`-s`, `--source`	str	❌	`boltz2`	Structure prediction source
`-t`, `--target`	str	❌	`A`	Chain id of target
`-b`, `--binder`	str	❌	`B`	Chain id of binder

scripts/analyze_structure.py can be used to extract features from bound structures in CIF files. Files should be named in the format <id>_<source>.cif where id is a unique identifier for the binder submission. This can be used to extract features from the Nipah competition Boltz structure predictions in data/structures/nipah/ with the following command:

python scripts/analyze_structures.py -i data/structures/nipah/ -o data/processed/nipah/nipah_binder_structural_features.csv
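The `<id>_<source>.cif` naming convention can be checked before running the extraction. The sketch below splits a file name into its id and source parts and selects the files for one source. The helper names and example file names are hypothetical, and splitting on the last underscore assumes the source label contains no underscore.

```python
from pathlib import Path


def split_structure_name(path):
    """Split '<id>_<source>.cif' into (id, source).

    Splits on the last underscore so ids that contain underscores survive;
    this assumes the source label itself has no underscore.
    """
    p = Path(path)
    if p.suffix.lower() != ".cif":
        raise ValueError(f"not a CIF file: {p.name}")
    binder_id, sep, source = p.stem.rpartition("_")
    if not sep or not binder_id or not source:
        raise ValueError(f"expected <id>_<source>.cif, got: {p.name}")
    return binder_id, source


def select_by_source(paths, source="boltz2"):
    """Return {id: path} for files whose source label matches."""
    selected = {}
    for path in paths:
        binder_id, file_source = split_structure_name(path)
        if file_source == source:
            selected[binder_id] = str(path)
    return selected


# Hypothetical file names, for illustration only.
names = ["design_001_boltz2.cif", "design_001_esmfold.cif", "abc_boltz2.cif"]
print(select_by_source(names))
```

A check like this catches misnamed files early, for example a directory that mixes Boltz2 and ESMFold predictions when only one source is wanted.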

Compiling Boltz2 Prediction Results

usage: read_boltz.py [-h] --input DIR --output FILE [--copy DIR]

Flag	Type	Required	Default	Description
`-i`, `--input`	str	✅	—	Directory containing Boltz prediction output
`-o`, `--output`	str	✅	—	Output path for compiled CSV
`-c`, `--copy`	str	❌	—	Directory to copy output CIF files

scripts/read_boltz.py can be used to compile & calculate confidence metrics from Boltz2 predictions. This uses code from the Adaptyv Bio Nipah competition for calculating ipSAE & LIS (see: utils/ipsae.py). Example results can be compiled to data/processed/full/full_binder_confidence_metrics.csv with the structures copied to data/structures/full/ using the following command:

python scripts/read_boltz.py -i boltz/results -o data/processed/full/full_binder_confidence_metrics.csv -c data/structures/full/

Full results for the Boltz2 predictions used in this project can be accessed here.
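As a rough illustration of the compile step, the sketch below gathers per-prediction confidence JSON files into a single CSV. It is not the logic of scripts/read_boltz.py and does not compute ipSAE or LIS. It assumes each prediction directory holds a file matching `confidence_*.json` with scalar top-level values. The function name, the demo directory layout, and the demo keys are illustrative assumptions.

```python
import csv
import json
import tempfile
from pathlib import Path


def compile_confidence(results_dir, output_csv):
    """Collect scalar values from confidence_*.json files into one CSV.

    Assumes one JSON object per file; nested values (dicts, lists) are skipped.
    """
    rows = []
    for json_path in sorted(Path(results_dir).rglob("confidence_*.json")):
        with open(json_path) as handle:
            data = json.load(handle)
        row = {"file": json_path.name}
        row.update({k: v for k, v in data.items()
                    if isinstance(v, (int, float, str))})
        rows.append(row)

    # Union of keys across files, keeping first-seen order.
    fields = list(dict.fromkeys(key for row in rows for key in row))
    with open(output_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on a temporary directory with invented values (not project data).
with tempfile.TemporaryDirectory() as tmp:
    pred_dir = Path(tmp) / "results" / "predictions" / "demo"
    pred_dir.mkdir(parents=True)
    (pred_dir / "confidence_demo_model_0.json").write_text(
        json.dumps({"iptm": 0.5, "ptm": 0.75, "pair_chains_iptm": {"0": {"1": 0.5}}})
    )
    out_path = Path(tmp) / "metrics.csv"
    n_rows = compile_confidence(Path(tmp) / "results", out_path)
    csv_text = out_path.read_text()

print(n_rows)
print(csv_text)
```

Taking the union of keys across files keeps the CSV rectangular even if some predictions lack a metric; missing values are written as empty cells.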
