Scaffold-Lab

Official implementation for Scaffold-Lab: Critical Evaluation and Ranking of Protein Backbone Generation Methods in A Unified Framework.

Description

Scaffold-Lab is the first unified framework for evaluating different protein backbone generation methods.

We present the benchmark for both unconditional generation and conditional generation in terms of designability, diversity, novelty, efficiency and structural properties. Currently evaluated methods are listed below:

Unconditional Generation

Conditional Generation


Updates

  • July 26th, 2024: A guideline for designing proteins from scratch using different baseline methods has been added here. We expect it to serve as a reference for reproducing and running the methods benchmarked in our work with minimal effort.
  • July 19th, 2024: Motif positions can now be partially redesigned with ProteinMPNN. Check out here to see how to specify them.
  • June 19th, 2024: Scaffold-Lab now supports AlphaFold2 for evaluation! The implementation of AF2 is built upon LocalColabFold. We refer interested users to here for more details.

Note

You can also try our notebook in Colab. This is a beta version, so bug reports and pull requests are especially welcome.


Installation

We recommend using Conda to set up dependencies. To quickly set up an environment, run:

# Clone this repository and set up virtual environment
git clone https://github.com/Immortals-33/Scaffold-Lab.git
cd Scaffold-Lab
# Create and activate environment
conda env create -f scaffold_lab.yml
source activate scaffold_lab

# Install scaffold_lab as a package.
pip install -e .

You may also need to build a Foldseek database for diversity and novelty evaluation.

Within the conda environment, run:

mkdir <foldseek_pdb_database_path>
cd <foldseek_pdb_database_path>
foldseek databases PDB pdb tmp

After successfully building the Foldseek PDB database, note down <foldseek_pdb_database_path>. You will later specify it as your Foldseek database path, either in the config or directly on the command line, as demonstrated below.

[!TIP]
When specifying the path of the Foldseek database, please add the database name after the path. For example, the Foldseek database built above is named "pdb", so you should set evaluation.foldseek_database=<foldseek_pdb_database_path>/pdb.
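
As a minimal sketch of the tip above, the snippet below composes the override string from a database directory and a database name. The helper name `foldseek_override` and the example directory are hypothetical and not part of Scaffold-Lab; only the option name `evaluation.foldseek_database` and the database name "pdb" come from this README.

```python
from pathlib import PurePosixPath


def foldseek_override(database_dir: str, database_name: str = "pdb") -> str:
    """Build the command-line override for the Foldseek database.

    Hypothetical helper for illustration only.
    """
    # The option expects <directory>/<database name>, not the bare directory.
    full_path = PurePosixPath(database_dir) / database_name
    return f"evaluation.foldseek_database={full_path}"


# Example with a made-up directory:
print(foldseek_override("/data/foldseek_pdb"))
# evaluation.foldseek_database=/data/foldseek_pdb/pdb
```

The resulting string can be appended to the evaluation command in the same way as the overrides shown in the Usage section.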


Outline

Here is a guide about how you can go through this repository. We aim to provide an easy-to-use evaluation pipeline as well as maximize the utility of individual scripts. Let's go through the structure of this repository as a start:

  • scaffold_lab: This is the main directory to run different evaluations described in our paper.

  • analysis: Scripts for calculating several metrics, including diversity, novelty and structural properties.

  • baselines: To generate protein backbones directly inside this repository, you can find the code of the different baseline methods for unconditional and conditional generation and clone their repositories under this directory. We highly recommend running inference for each baseline inside its own virtual environment to avoid conflicts between dependencies.

    • Inside the experiment folder we provide scripts for performing motif-scaffolding experiments with Chroma using its SubstructureConditioner. Refer to the script for detailed information.
  • config: We place the Hydra configuration files for the evaluations here. Hydra is a hierarchical configuration framework that helps users organize different experimental settings. Though it might be confusing at first, it is a powerful tool for running experiments efficiently with different combinations of parameters, for example the number of sequences to generate. We refer readers to the Hydra docs for advanced usage.


Usage

Unconditional Generation

Let's start by running a simple evaluation here:

python scaffold_lab/unconditional/refolding.py 

This performs a simple refolding analysis for the proteins we put inside demo/unconditional/.


Conditional Generation (Motif-scaffolding)

To run a minimal version of the motif-scaffolding task, run:

python scaffold_lab/motif_scaffolding/motif_refolding.py evaluation.foldseek_database=<foldseek_pdb_database_path>/pdb # Specify the path of your Foldseek database directly, including the database name

This performs an evaluation on demo/motif_scaffolding/2KL8/, and the outputs are saved under outputs/2KL8/.

Scaffold information file (motif_info.csv)

motif_refolding.py requires a metadata file, motif_info.csv, with information relating generated scaffolds to the motif. Within our grammar system, a complete contig includes two to four parts separated by commas. For example, a row of this file is 0,2KL8,2,A1-7/20/A28-79,A3-5;A33;A36,A;B. After the leading row index (0 here), the CSV file has the following fields for each scaffold:

  • pdb_name: The reference PDB name, used to extract reference motifs for calculation and identification, e.g. 2KL8 in this case.
  • sample_num: The sample ID number, for cases where multiple scaffolds are evaluated.
  • contig (motif placement): Where the motifs and scaffolds are placed, e.g. A1-7/20/A28-79 in this case.
    • The motif parts start with an uppercase chain letter and describe the corresponding native motifs. Continuous residue numbers are joined by hyphens. The boundaries between motifs and scaffolds are marked by slashes.
    • The scaffold parts are single numbers. They are deterministic because the scaffold parts of the uploaded PDB files were already placed during the design process. The motif parts index residues in the native PDBs, not in the scaffold PDBs. We chose this convention because it makes it easy for users to locate which part of the reference PDB each motif mimics.
    • Together, the motif placement part tells which parts are motifs (indicated by a chain letter), how they correspond to the native ones, and the overall length of the designed scaffold. For example, A1-7/20/A28-79 means the scaffold contains:
      • First, a motif segment containing residues 1-7 of chain A of 2KL8;
      • Followed by a 20-residue scaffold segment;
      • Finally, a 52-residue motif segment (residues 28-79) of chain A of 2KL8.
  • Redesigned positions: Which positions of the reference protein are to be redesigned, e.g. A3-5;A33;A36 in this case indicates residues 3, 4, 5, 33, and 36 of chain A in 2KL8. Different redesigned positions are separated by semicolons; continuous positions are connected by hyphens; each entry always starts with an uppercase chain letter.
  • Segment order: The order of multiple motif segments in the backbone. This may be used when each motif segment is on its own chain in the reference PDB file.
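
The grammar described above can be illustrated with a small parser. This is only a sketch under the rules stated in this section, not the parser used by motif_refolding.py; the function names `parse_span`, `parse_contig`, `scaffold_length` and `parse_redesign` are hypothetical, and chain IDs are assumed to be single uppercase letters.

```python
import re
from typing import List, Tuple, Union

# A motif segment is (chain, first residue, last residue) in native numbering;
# a scaffold segment is just its length.
MotifSegment = Tuple[str, int, int]
Segment = Union[MotifSegment, int]

_SPAN = re.compile(r"([A-Z])(\d+)(?:-(\d+))?")


def parse_span(text: str) -> MotifSegment:
    """Parse 'A28-79' or 'A33' into (chain, start, end)."""
    match = _SPAN.fullmatch(text)
    if match is None:
        raise ValueError(f"Malformed residue span: {text!r}")
    chain, start, end = match.groups()
    return chain, int(start), int(end) if end else int(start)


def parse_contig(contig: str) -> List[Segment]:
    """Split a motif placement string on slashes into motif and scaffold parts."""
    segments: List[Segment] = []
    for part in contig.split("/"):
        # Pure numbers are scaffold lengths; everything else must be a motif span.
        segments.append(int(part) if part.isdigit() else parse_span(part))
    return segments


def scaffold_length(segments: List[Segment]) -> int:
    """Overall length of the design: scaffold lengths plus motif span sizes."""
    return sum(s if isinstance(s, int) else s[2] - s[1] + 1 for s in segments)


def parse_redesign(positions: str) -> List[Tuple[str, int]]:
    """Expand 'A3-5;A33;A36' into individual (chain, residue) pairs."""
    residues: List[Tuple[str, int]] = []
    for item in positions.split(";"):
        chain, start, end = parse_span(item.strip())
        residues.extend((chain, number) for number in range(start, end + 1))
    return residues


segments = parse_contig("A1-7/20/A28-79")
print(segments)                   # [('A', 1, 7), 20, ('A', 28, 79)]
print(scaffold_length(segments))  # 79
print(parse_redesign("A3-5;A33;A36"))
# [('A', 3), ('A', 4), ('A', 5), ('A', 33), ('A', 36)]
```

For the 2KL8 example the three parts add up to 7 + 20 + 52 = 79 residues, which matches the segment breakdown given in the list.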

Specify through PDB Header

Users can specify the contig string in the “classification” part of the PDB header. There are two ways to have the contig parsed:

  • A complete contig string: It should follow the format described above, with two or three parts separated by commas. The native PDB ID and the motif placement are always required; the redesigned positions are added only when needed.
  • For specifying redesigned positions, another straightforward way is to mark them as “UNK” residues. If the code finds that the contig string has only two parts, it automatically looks for “UNK” residues inside the PDB file and treats them as the positions to be redesigned.
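
The fallback logic described above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the function names are hypothetical, the residue name, chain ID and residue number are read from the standard fixed columns of PDB ATOM/HETATM records, and the sketch does not address how positions found in the scaffold file map back to native numbering.

```python
from typing import List, Optional, Tuple


def split_header_contig(contig: str) -> Tuple[str, str, Optional[str]]:
    """Split a header contig into (pdb_id, motif placement, redesigned positions)."""
    parts = [part.strip() for part in contig.split(",")]
    if len(parts) == 2:
        return parts[0], parts[1], None  # redesigned positions not given
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"Expected two or three comma-separated parts, got {len(parts)}")


def find_unk_residues(pdb_text: str) -> List[Tuple[str, int]]:
    """Collect (chain, residue number) for residues named UNK, in file order."""
    found: List[Tuple[str, int]] = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: residue name 18-20, chain ID 22, residue number 23-26.
        if line.startswith(("ATOM", "HETATM")) and line[17:20] == "UNK":
            key = (line[21], int(line[22:26]))
            if key not in found:
                found.append(key)
    return found


def redesign_source(contig: str, pdb_text: str):
    """Use the third contig part if present, otherwise fall back to UNK residues."""
    _, _, redesign = split_header_contig(contig)
    return redesign if redesign is not None else find_unk_residues(pdb_text)


print(split_header_contig("2KL8,A1-7/20/A28-79"))
# ('2KL8', 'A1-7/20/A28-79', None)
print(split_header_contig("2KL8,A1-7/20/A28-79,A3-5;A33;A36"))
# ('2KL8', 'A1-7/20/A28-79', 'A3-5;A33;A36')
```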

Output Visualization

We provide optional visualization outputs for motif-scaffolding tasks. In brief, several figures and sessions will be created. We next demonstrate the output items using a motif case from PDB 6E6R:

  • Designability Metrics: The sc-RMSD and motif-RMSD of each evaluated scaffold.

  • Novelty Metric: The TM-scores of evaluated scaffolds against PDB (pdbTM). The purple vertical dashed lines denote 25%, 50% and 75% quartile values across the whole scaffold sets.

  • Unique Successful Scaffolds: All unique successful scaffolds in a PyMOL session. The orange segment displayed in the first grid cell is the reference motif to be mimicked. It is followed by the unique solutions, in which green and blue parts correspond to the motifs and the rest of the scaffold, respectively.

Auxiliary Metrics

Aside from the main metrics in the summary.txt files, a set of auxiliary metrics is stored in the auxiliary_metrics.txt files:

  • Closest Contender: For each case, we additionally report the designable scaffold that best mimics the corresponding motif, namely the closest contender. This gives users a closer look at the best design outcome of a method for a particular motif, whether or not it counts as a success, which is informative for both method development and candidate selection. The scaffold PDB file, along with a comparison to the native scaffold, can be found in the folder named {folding_method}_closest_contender.
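
One way to read the closest-contender definition is sketched below: keep only designable scaffolds, then take the one with the lowest motif-RMSD. This is an assumption-laden illustration rather than the repository's selection code. The `ScaffoldResult` fields, the function name, and the 2.0 Å sc-RMSD cutoff used to decide designability are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScaffoldResult:
    """Per-scaffold metrics (hypothetical container for this sketch)."""
    name: str
    sc_rmsd: float     # self-consistency RMSD, in angstroms
    motif_rmsd: float  # RMSD of the motif region, in angstroms


def closest_contender(
    results: Iterable[ScaffoldResult],
    sc_rmsd_cutoff: float = 2.0,  # assumed designability cutoff, not from the README
) -> Optional[ScaffoldResult]:
    """Return the designable scaffold with the lowest motif-RMSD, if any."""
    designable = [r for r in results if r.sc_rmsd < sc_rmsd_cutoff]
    if not designable:
        return None  # no designable scaffold for this motif
    return min(designable, key=lambda r: r.motif_rmsd)


# Made-up numbers purely for illustration:
example = [
    ScaffoldResult("sample_0", sc_rmsd=1.4, motif_rmsd=1.8),
    ScaffoldResult("sample_1", sc_rmsd=3.1, motif_rmsd=0.6),  # not designable
    ScaffoldResult("sample_2", sc_rmsd=1.9, motif_rmsd=1.2),
]
best = closest_contender(example)
print(best.name if best else None)  # sample_2
```

Note that sample_2 is selected even though its motif-RMSD may not pass a success threshold, which mirrors the point that the closest contender is reported whether or not the design is a success.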

Customize Methods for Structure Prediction

We support both AlphaFold2 (single-sequence version) and ESMFold for structure prediction during refolding.

ESMFold

Scaffold-Lab performs evaluation using ESMFold by default. Once you set up the environment this should work.

AlphaFold2 (single-chain version)

The implementation of AlphaFold2 is based on LocalColabFold, a local version of ColabFold. Here is a brief guide to enabling AlphaFold2 during evaluation:

  • Install LocalColabFold. Please follow the installation guide on its official page based on your specific OS. Note that it might take a few tries for a complete installation.

  • Add the ColabFold executable to your PATH. This allows ColabFold to run during the refolding pipeline. Suppose the root directory of your LocalColabFold is {LocalColabFold}; then you can set PATH in two ways:

    • Set it inside the config (Recommended). There are two ways to do so:

      • Inside config/unconditional.yaml and config/motif_scaffolding.yaml (Recommended):

        inference:
          af2:
            executive_colabfold_path: {LocalColabFold}/colabfold-conda/bin # Replace {LocalColabFold} by your actual path of LocalColabFold
      • Alternatively, set this in a command-line way:

        python scaffold_lab/unconditional/refolding.py inference.af2.executive_colabfold_path='{LocalColabFold}/colabfold-conda/bin'
    • Directly set the PATH variable before running the evaluation script, similar to what is done in #5 inside this guide.

  • Set AlphaFold2 as your forward folding method when running evaluation. Inside the config, keep one of the two predict_method lines:

    inference:
    ...
      predict_method: [AlphaFold2] # Option 1: only run AF2 for evaluation
      predict_method: [AlphaFold2, ESMFold] # Option 2: run both AF2 and ESMFold for evaluation
    ...

And voilà!
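
For the option of setting PATH directly, a minimal Python sketch is shown below. It assumes the layout used in the config example above ({LocalColabFold}/colabfold-conda/bin) and that the ColabFold executable is named colabfold_batch; the helper name and the example root path are hypothetical. Changing os.environ only affects the current Python process and the subprocesses it launches, so this is useful mainly when the evaluation is started from the same process.

```python
import os
import shutil


def add_colabfold_to_path(localcolabfold_root: str) -> str:
    """Prepend the ColabFold bin directory to PATH for this process.

    Hypothetical helper; the directory layout is taken from the config example.
    """
    bin_dir = os.path.join(localcolabfold_root, "colabfold-conda", "bin")
    current = os.environ.get("PATH", "")
    entries = current.split(os.pathsep) if current else []
    if bin_dir not in entries:
        # Put the ColabFold directory first so it takes precedence.
        os.environ["PATH"] = os.pathsep.join([bin_dir] + entries)
    return bin_dir


# Replace the placeholder with the actual LocalColabFold root directory.
bin_dir = add_colabfold_to_path("/path/to/localcolabfold")

# shutil.which returns None if the executable is not found on PATH.
if shutil.which("colabfold_batch") is None:
    print(f"colabfold_batch not found; check that {bin_dir} exists")
```

Running the check before a long evaluation makes a missing or misspelled path visible early instead of failing in the middle of refolding.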


Contact


Citation

If you use Scaffold-Lab in your research or find it helpful, please cite:

@article{zheng2024scaffoldlab,
title = {Scaffold-Lab: Critical Evaluation and Ranking of Protein Backbone Generation Methods in A Unified Framework},
author = {Zheng, Zhuoqi and Zhang, Bo and Zhong, Bozitao and Liu, Kexin and Li, Zhengxin and Zhu, Junjie and Yu, Jinyu and Wei, Ting and Chen, Haifeng},
year = {2024},
journal = {bioRxiv},
url = {https://www.biorxiv.org/content/10.1101/2024.02.10.579743v3}
}

Acknowledgments

Open-source Projects

This codebase benefits a lot from FrameDiff, OpenFold, ProteinMPNN and other amazing open-source projects. Take a look at their work if you find Scaffold-Lab helpful!

Individuals

We thank the following people for contributing or pointing out potential bugs:
