
yishakbililign/DeNovoFiltering


Authors

Overview

This project analyzes and refines filtering strategies for de novo protein binders using model confidence metrics, structural interface descriptors, and sequence-based biophysical characterization, with the goal of improving generalization across targets and design classes. We explore filtering for both protein expression & target binding using open-source de novo design datasets from Proteinbase. We propose independent linear discriminant analysis (LDA) scoring approaches for the two filtering tasks; together they yield a 6-fold improvement in binder success rate on a new target dataset that was not included in model development.
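As a rough illustration of what an LDA score is, the sketch below fits a two-class Fisher linear discriminant on two features in plain Python and projects a new design onto the discriminant axis. The feature meanings, the toy values, and the helper names (`fit_lda`, `score`) are placeholders for illustration only; they are not data or code from this project, and the notebooks may fit their models differently.

```python
from statistics import fmean

def score(w, x):
    """Project a 2-feature row onto the discriminant direction w."""
    return w[0] * x[0] + w[1] * x[1]

def fit_lda(X, y):
    """Fit a two-class Fisher LDA on 2-feature rows; return (weights, threshold)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [fmean(col) for col in zip(*pos)]
    mu_neg = [fmean(col) for col in zip(*neg)]

    # Pooled within-class scatter matrix (2x2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, mu in ((pos, mu_pos), (neg, mu_neg)):
        for x in rows:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]

    # Invert the 2x2 scatter matrix and compute w = S^-1 (mu_pos - mu_neg).
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [mu_pos[0] - mu_neg[0], mu_pos[1] - mu_neg[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]

    # Midpoint of the projected class means: a simple cutoff assuming equal priors.
    threshold = 0.5 * (score(w, mu_pos) + score(w, mu_neg))
    return w, threshold

# Toy values only: (confidence-like metric, interface-like descriptor), label 1 = success.
X = [(0.8, 1.0), (0.9, 2.0), (0.7, 1.5), (0.3, 1.2), (0.4, 0.4), (0.2, 0.8)]
y = [1, 1, 1, 0, 0, 0]

w, threshold = fit_lda(X, y)
candidate = (0.75, 1.2)  # hypothetical new design
passes = score(w, candidate) > threshold
```

Fitting one such model for expression and another for binding, then keeping designs that clear both cutoffs, is one way to read "independent" scoring; the midpoint cutoff here is a simplification chosen for the sketch.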

Repository Map

root/
├── boltz/
├── data/
├── figures/
├── notebooks/
├── scripts/
├── utils/
├── environment.yml
└── README.md

Data

Datasets & protein structures used for statistical analysis & modeling.

root/
└── data/
    ├── processed/
    │   ├── full/
    │   │   ├── full_binder_confidence_metrics.csv
    │   │   ├── full_binder_dataset.csv
    │   │   ├── full_binder_lda_scores.csv
    │   │   └── full_binder_structural_features.csv
    │   ├── nipah/
    │   │   ├── nipah_binder_dataset.csv
    │   │   └── nipah_binder_structural_features.csv
    │   └── rbx1/
    │       ├── rbx1_test_dataset.csv
    │       └── rbx1_test_predictions.csv
    ├── structures/
    │   ├── nipah/
    │   │   ├── <id>_boltz2.cif
    │   │   └── <id>_esmfold.cif
    │   └── full/
    │       └── <id>_boltz2.cif
    ├── proteinbase_collection_full_28_01_2026.csv
    ├── proteinbase_collection_nipah-binder-competition-results.csv
    └── target_sequences.csv

Notebooks

Notebooks used for EDA, statistical analysis, & modeling.

root/
└── notebooks/
    ├── binding/
    │   ├── ANOVA.ipynb
    │   ├── GAM.ipynb
    │   ├── Thresholding.ipynb
    │   ├── Full_Binder_LDA_Analysis.ipynb
    │   └── Nipah_Binder_LDA_Analysis.ipynb
    └── expression/
        ├── Full_Expression_LDA_Analysis.ipynb
        └── Nipah_Expression_LDA_Analysis.ipynb

Scripts

Python scripts with CLI for data processing & feature extraction.

root/
└── scripts/
    ├── analyze_structure.py
    ├── parse_proteinbase.py
    └── read_boltz.py

Boltz

Pipeline for generating structure predictions using Boltz2. See boltz/README.md for more details.

Configuration

Create a conda environment with all required package dependencies for Python scripts in this repository with the following command:

conda env create -f environment.yml -n <env_name>

The Boltz2 structure prediction pipeline uses a Docker container which can be built from boltz/Dockerfile with the following command:

cd boltz
docker build -t boltz-image .

Usage

Parsing Proteinbase Datasets

usage: parse_proteinbase.py [-h] --input FILE --output FILE [--target NAME] [--download DIR]
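| Flag | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| -i, --input | str | Yes | | Input csv path |
| -o, --output | str | Yes | | Output csv path |
| -t, --target | str | No | None | Binder target |
| -d, --download | str | No | | Directory to download structures |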
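For readers who want to see how the usage synopsis maps onto flags, here is a minimal `argparse` sketch that reproduces the same interface. It is reconstructed from the usage line only and is not the code in scripts/parse_proteinbase.py; the `build_parser` name is illustrative, and the `None` default for `--download` is an assumption.

```python
import argparse

def build_parser():
    """Build a parser matching the documented flags (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(prog="parse_proteinbase.py")
    parser.add_argument("-i", "--input", metavar="FILE", required=True,
                        help="Input csv path")
    parser.add_argument("-o", "--output", metavar="FILE", required=True,
                        help="Output csv path")
    parser.add_argument("-t", "--target", metavar="NAME", default=None,
                        help="Binder target")
    # Assumption: no download happens when this flag is omitted.
    parser.add_argument("-d", "--download", metavar="DIR", default=None,
                        help="Directory to download structures")
    return parser

# Parse an argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "-i", "data/proteinbase_collection_full_28_01_2026.csv",
    "-o", "data/processed/full/full_binder_dataset.csv",
])
```

Omitting either `--input` or `--output` makes `argparse` exit with an error, which matches the unbracketed (required) flags in the synopsis.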

scripts/parse_proteinbase.py can be used to parse the Nipah competition dataset from Proteinbase saved in data/proteinbase_collection_nipah-binder-competition-results.csv to produce a new csv in data/processed/nipah/nipah_binder_dataset.csv with the following command:

python scripts/parse_proteinbase.py -i data/proteinbase_collection_nipah-binder-competition-results.csv -o data/processed/nipah/nipah_binder_dataset.csv -t nipah-glycoprotein-g

The full Proteinbase collection saved in data/proteinbase_collection_full_28_01_2026.csv can similarly be processed to produce a new csv in data/processed/full/full_binder_dataset.csv with the following command:

python scripts/parse_proteinbase.py -i data/proteinbase_collection_full_28_01_2026.csv -o data/processed/full/full_binder_dataset.csv

Extracting Structural Features

usage: analyze_structure.py [-h] --input DIR --output FILE [--source NAME] [--target TARGET_CHAIN_ID] [--binder BINDER_CHAIN_ID]
| Flag | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| -i, --input | str | Yes | | Input directory of CIF files |
| -o, --output | str | Yes | | Output csv path |
| -s, --source | str | No | boltz2 | Structure prediction source |
| -t, --target | str | No | A | Chain id of target |
| -b, --binder | str | No | B | Chain id of binder |

scripts/analyze_structure.py can be used to extract features from bound structures in CIF files. Files should be named in the format <id>_<source>.cif where id is a unique identifier for the binder submission. This can be used to extract features from the Nipah competition Boltz structure predictions in data/structures/nipah/ with the following command:

python scripts/analyze_structure.py -i data/structures/nipah/ -o data/processed/nipah/nipah_binder_structural_features.csv
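The `<id>_<source>.cif` naming convention can be checked before running the extraction. The helpers below are a hypothetical sketch and not part of the repository; they assume the source name contains no underscore, so the id is everything before the last underscore.

```python
from pathlib import Path

def split_structure_name(path):
    """Split '<id>_<source>.cif' into (id, source); return None if it does not match."""
    p = Path(path)
    if p.suffix.lower() != ".cif":
        return None
    # rpartition splits on the LAST underscore, so ids may contain underscores.
    binder_id, sep, source = p.stem.rpartition("_")
    if not sep or not binder_id or not source:
        return None
    return binder_id, source

def index_structures(directory, source="boltz2"):
    """Map binder id -> CIF path for files from the requested prediction source."""
    index = {}
    for cif in sorted(Path(directory).glob("*.cif")):
        parsed = split_structure_name(cif)
        if parsed and parsed[1] == source:
            index[parsed[0]] = cif
    return index
```

For example, `index_structures("data/structures/nipah/", source="esmfold")` would list only the ESMFold predictions in that directory, skipping any file that does not follow the convention.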

Compiling Boltz2 Prediction Results

usage: read_boltz.py [-h] --input DIR --output FILE [--copy DIR]
| Flag | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| -i, --input | str | Yes | | Directory containing Boltz prediction output |
| -o, --output | str | Yes | | Output path for compiled CSV |
| -c, --copy | str | No | | Directory to copy output CIF files |

scripts/read_boltz.py can be used to compile & calculate confidence metrics from Boltz2 predictions. This uses code from the Adaptyv Bio Nipah competition for calculating ipSAE & LIS (see: utils/ipsae.py). Example results can be compiled to data/processed/full/full_binder_confidence_metrics.csv with the structures copied to data/structures/full/ using the following command:

python scripts/read_boltz.py -i boltz/results -o data/processed/full/full_binder_confidence_metrics.csv -c data/structures/full/
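To give a feel for the compile step, the sketch below walks a results directory, reads every `confidence_*.json` file it finds, and writes the scalar fields to one CSV. It is not the logic of scripts/read_boltz.py: it assumes the prediction output contains per-design `confidence_*.json` files, it makes no assumption about which keys they hold, and it does not compute ipSAE or LIS. The function names and the demo value are placeholders.

```python
import csv
import json
import tempfile
from pathlib import Path

def collect_confidence(results_dir):
    """Gather scalar numeric fields from every confidence_*.json under results_dir."""
    rows = []
    for path in sorted(Path(results_dir).rglob("confidence_*.json")):
        with open(path) as handle:
            data = json.load(handle)
        row = {"file": path.name}
        # Keep only scalar numbers; nested per-chain dicts are skipped in this sketch.
        row.update({k: v for k, v in data.items()
                    if isinstance(v, (int, float)) and not isinstance(v, bool)})
        rows.append(row)
    return rows

def write_csv(rows, output_path):
    """Write rows to CSV; the header is the sorted union of all keys."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Demo on a throwaway directory with one placeholder prediction.
with tempfile.TemporaryDirectory() as tmp:
    pred_dir = Path(tmp) / "predictions" / "design_1"
    pred_dir.mkdir(parents=True)
    (pred_dir / "confidence_design_1_model_0.json").write_text(
        json.dumps({"iptm": 0.5, "chains": {"0": 0.4}})  # placeholder values
    )
    rows = collect_confidence(tmp)
    write_csv(rows, Path(tmp) / "metrics.csv")
    header = (Path(tmp) / "metrics.csv").read_text().splitlines()[0]
```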

Full results for the Boltz2 predictions used in this project can be accessed here.

About

UC Berkeley MSSE Capstone Project: Evaluating & refining filtering strategies for de novo protein binders
