- Yishak Bililign - yishakbililign
- Luis Hernandez - lhmag89
- Sheyda Nazarian - SheydaNazarian
- Natalia Rivera - nataliariverax

This project analyzes and refines filtering strategies for de novo protein binders using model confidence metrics, structural interface descriptors, and sequence-based biophysical characterization, with the goal of improving generalization across targets and design classes. We explore filtering for both protein expression & target binding using open-source de novo design datasets from Proteinbase. We propose independent linear discriminant analysis (LDA) scoring approaches for the two filtering tasks, which together yield a 6-fold improvement in binder success rate on a new target dataset that was not used in model development.
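
To make the LDA scoring idea concrete, here is a minimal sketch of a two-class Fisher linear discriminant used as a filter. It is not the project's model: the two features, their values, the labels, and the zero score threshold are all made up for illustration, and the toy evaluates on its own training data rather than on a held-out target.

```python
from statistics import mean


def fit_lda_2d(X, y):
    """Fit a two-class Fisher linear discriminant on rows with two features."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [mean(col) for col in zip(*pos)]
    mu_neg = [mean(col) for col in zip(*neg)]

    # Pooled within-class scatter matrix (2x2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, mu in ((pos, mu_pos), (neg, mu_neg)):
        for x in rows:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]

    # Discriminant direction w = S^-1 (mu_pos - mu_neg), via the 2x2 inverse.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [mu_pos[0] - mu_neg[0], mu_pos[1] - mu_neg[1]]
    w = [
        (s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
        (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det,
    ]

    # Shift scores so that zero sits midway between the two class means.
    mid = [(a + b) / 2 for a, b in zip(mu_pos, mu_neg)]
    bias = -(w[0] * mid[0] + w[1] * mid[1])
    return w, bias


def lda_score(w, bias, x):
    """Project one design onto the discriminant axis."""
    return w[0] * x[0] + w[1] * x[1] + bias


def success_rate(labels):
    """Fraction of designs labeled as successes (1)."""
    return sum(labels) / len(labels)


# Made-up toy data: (confidence-like feature, interface-like feature), 1 = success.
X = [(0.82, 0.9), (0.75, 0.8), (0.88, 0.7), (0.40, 0.3),
     (0.35, 0.5), (0.50, 0.2), (0.45, 0.4), (0.30, 0.1)]
y = [1, 1, 1, 0, 0, 0, 0, 0]

w, bias = fit_lda_2d(X, y)
scores = [lda_score(w, bias, x) for x in X]

# Keep designs scoring above the midpoint, then compare success rates.
kept = [label for score, label in zip(scores, y) if score > 0]
fold_improvement = success_rate(kept) / success_rate(y)
print(f"fold improvement on toy data: {fold_improvement:.2f}")
```

The fold improvement here is the success rate among designs that pass the filter divided by the success rate of the unfiltered set.
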

    root/
    ├── boltz/
    ├── data/
    ├── figures/
    ├── notebooks/
    ├── scripts/
    ├── utils/
    ├── environment.yml
    └── README.md

Datasets & protein structures used for statistical analysis & modeling.

    root/
    └── data/
        ├── processed/
        |   ├── full/
        |   |   ├── full_binder_confidence_metrics.csv
        |   |   ├── full_binder_dataset.csv
        |   |   ├── full_binder_lda_scores.csv
        |   |   └── full_binder_structural_features.csv
        |   ├── nipah/
        |   |   ├── nipah_binder_dataset.csv
        |   |   └── nipah_binder_structural_features.csv
        |   └── rbx1/
        |       ├── rbx1_test_dataset.csv
        |       └── rbx1_test_predictions.csv
        ├── structures/
        |   ├── nipah/
        |   |   ├── <id>_boltz2.cif
        |   |   └── <id>_esmfold.cif
        |   └── full/
        |       └── <id>_boltz2.cif
        ├── proteinbase_collection_full_28_01_2026.csv
        ├── proteinbase_collection_nipah-binder-competition-results.csv
        └── target_sequences.csv

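
Structure files under data/structures/ are named `<id>_<source>.cif`. The sketch below shows one way such a name could be split back into its parts. The function name and the example id are hypothetical, and splitting on the last underscore is an assumption (it allows ids that contain underscores but not sources that do); the repository's scripts may handle this differently.

```python
from pathlib import Path


def parse_structure_name(path):
    """Split a '<id>_<source>.cif' file name into (id, source)."""
    p = Path(path)
    if p.suffix != ".cif":
        raise ValueError(f"not a CIF file: {path}")
    # Assumption: the source is whatever follows the last underscore.
    binder_id, sep, source = p.stem.rpartition("_")
    if not sep or not binder_id or not source:
        raise ValueError(f"expected <id>_<source>.cif, got: {p.name}")
    return binder_id, source


# Hypothetical id 'design_0042' used purely for illustration.
binder_id, source = parse_structure_name("data/structures/nipah/design_0042_boltz2.cif")
print(binder_id, source)
```
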
Notebooks used for EDA, statistical analysis, & modeling.

    root/
    └── notebooks/
        ├── binding/
        |   ├── ANOVA.ipynb
        |   ├── GAM.ipynb
        |   ├── Thresholding.ipynb
        |   ├── Full_Binder_LDA_Analysis.ipynb
        |   └── Nipah_Binder_LDA_Analysis.ipynb
        └── expression/
            ├── Full_Expression_LDA_Analysis.ipynb
            └── Nipah_Expression_LDA_Analysis.ipynb

Python scripts with CLI for data processing & feature extraction.

    root/
    └── scripts/
        ├── analyze_structure.py
        ├── parse_proteinbase.py
        └── read_boltz.py

Pipeline for generating structure predictions using Boltz2. See boltz/README.md for more details.
Create a conda environment with all required package dependencies for Python scripts in this repository with the following command:

    conda env create -f environment.yml -n <env_name>

The Boltz2 structure prediction pipeline uses a Docker container which can be built from boltz/Dockerfile with the following command:

    cd boltz
    docker build -t boltz-image .

`usage: parse_proteinbase.py [-h] --input FILE --output FILE [--target NAME] [--download DIR]`

| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| `-i, --input` | str | ✅ | — | Input csv path |
| `-o, --output` | str | ✅ | — | Output csv path |
| `-t, --target` | str | ❌ | `None` | Binder target |
| `-d, --download` | str | ❌ | — | Directory to download structures |

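
For readers who want to see how a command line like this maps onto code, the following is a minimal `argparse` sketch that mirrors the flags in the table. It is an illustration only, not the contents of scripts/parse_proteinbase.py: the `build_parser` helper is hypothetical, and using `None` as the default for `--download` is an assumption, since the table lists no default for it.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented parse_proteinbase.py flags."""
    parser = argparse.ArgumentParser(prog="parse_proteinbase.py")
    parser.add_argument("-i", "--input", required=True, metavar="FILE",
                        help="Input csv path")
    parser.add_argument("-o", "--output", required=True, metavar="FILE",
                        help="Output csv path")
    parser.add_argument("-t", "--target", default=None, metavar="NAME",
                        help="Binder target")
    # Assumption: no download directory means structures are not downloaded.
    parser.add_argument("-d", "--download", default=None, metavar="DIR",
                        help="Directory to download structures")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "-i", "data/proteinbase_collection_nipah-binder-competition-results.csv",
    "-o", "data/processed/nipah/nipah_binder_dataset.csv",
    "-t", "nipah-glycoprotein-g",
])
print(args.input, args.output, args.target, args.download)
```
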
scripts/parse_proteinbase.py can be used to parse the Nipah competition dataset from Proteinbase saved in data/proteinbase_collection_nipah-binder-competition-results.csv to produce a new csv in data/processed/nipah/nipah_binder_dataset.csv with the following command:

    python scripts/parse_proteinbase.py -i data/proteinbase_collection_nipah-binder-competition-results.csv -o data/processed/nipah/nipah_binder_dataset.csv -t nipah-glycoprotein-g

The full Proteinbase collection saved in data/proteinbase_collection_full_28_01_2026.csv can similarly be processed to produce a new csv in data/processed/full/full_binder_dataset.csv with the following command:

    python scripts/parse_proteinbase.py -i data/proteinbase_collection_full_28_01_2026.csv -o data/processed/full/full_binder_dataset.csv

`usage: analyze_structures.py [-h] --input DIR --output FILE [--source NAME] [--target TARGET_CHAIN_ID] [--binder BINDER_CHAIN_ID]`

| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| `-i, --input` | str | ✅ | — | Input directory of CIF files |
| `-o, --output` | str | ✅ | — | Output csv path |
| `-s, --source` | str | ❌ | `boltz2` | Structure prediction source |
| `-t, --target` | str | ❌ | `A` | Chain id of target |
| `-b, --binder` | str | ❌ | `B` | Chain id of binder |

scripts/analyze_structure.py can be used to extract features from bound structures in CIF files. Files should be named in the format <id>_<source>.cif where id is a unique identifier for the binder submission. This can be used to extract features from the Nipah competition Boltz structure predictions in data/structures/nipah/ with the following command:

    python scripts/analyze_structures.py -i data/structures/nipah/ -o data/processed/nipah/nipah_structural_features.csv

`usage: read_boltz.py [-h] --input DIR --output FILE [--copy DIR]`

| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| `-i, --input` | str | ✅ | — | Directory containing Boltz prediction output |
| `-o, --output` | str | ✅ | — | Output path for compiled CSV |
| `-c, --copy` | str | ❌ | — | Directory to copy output CIF files |

scripts/read_boltz.py can be used to compile & calculate confidence metrics from Boltz2 predictions. This uses code from the Adaptyv Bio Nipah competition for calculating ipSAE & LIS (see: utils/ipsae.py). Example results can be compiled to data/processed/full/full_binder_confidence_metrics.csv with the structures copied to data/structures/full/ using the following command:

    python scripts/read_boltz.py -i boltz/results -o data/processed/full/full_binder_confidence_metrics.csv -c data/structures/full/

Full results for the Boltz2 predictions used in this project can be accessed here.
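
As a rough illustration of the kind of interface confidence metric compiled by scripts/read_boltz.py, the sketch below computes a simplified LIS-style score from a predicted aligned error (PAE) matrix. It assumes LIS is the mean of `(cutoff - PAE) / cutoff` over inter-chain residue pairs whose PAE is below a 12 Å cutoff. The function name, the pooling of both chain directions, and the toy PAE values are assumptions made for this example; utils/ipsae.py remains the reference for what this repository actually computes.

```python
def local_interaction_score(pae, chain_ids, chain_a="A", chain_b="B", cutoff=12.0):
    """Simplified LIS-style score over inter-chain residue pairs.

    pae is a square matrix (list of lists) of predicted aligned errors,
    and chain_ids gives the chain label of each residue index.
    """
    rescaled = []
    for i, chain_i in enumerate(chain_ids):
        for j, chain_j in enumerate(chain_ids):
            # Keep only pairs spanning the two chains, in either direction.
            if {chain_i, chain_j} != {chain_a, chain_b}:
                continue
            if pae[i][j] < cutoff:
                # Map low error to a score near 1 and error near the cutoff to 0.
                rescaled.append((cutoff - pae[i][j]) / cutoff)
    # No confident inter-chain pair means no detectable local interaction.
    return sum(rescaled) / len(rescaled) if rescaled else 0.0


# Toy example: two residues in chain A, two in chain B (values are made up).
chain_ids = ["A", "A", "B", "B"]
pae = [
    [0.5, 1.0, 3.0, 15.0],
    [1.0, 0.5, 6.0, 20.0],
    [3.0, 9.0, 0.5, 1.0],
    [18.0, 25.0, 1.0, 0.5],
]
lis = local_interaction_score(pae, chain_ids)
print(f"toy LIS: {lis:.4f}")
```
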

