This repository contains a tool that can be used to calculate the volume of a pocket within a protein.
This tool streamlines the process of aligning a set of protein structures to a common reference via USalign, running fpocket on the aligned structures, and calculating the volume based on a user-defined bounding box.
There are many tools available that can identify protein cavities and estimate volume, but many seem to struggle with consistently defining the boundary between bulk solvent and the inner cavity. To make fair comparisons of pocket volume between protein homologs, this boundary should be defined consistently.
The protein_pocket_volume tool in this repo tries to address this challenge through two key steps:
- Structural Alignment: All proteins must be superimposed onto a common reference frame.
- User-Defined Bounding Box: A bounding box is set appropriately on the reference protein to define the cavity of interest.
- Python 3.11++
- Docker: The Docker daemon must be running.
- Git: For cloning the
USalignrepository. - UV: For installing this package.
1. Build the USalign Docker Image
This repo uses docker for running USalign. You must build this image first.
# Clone the official USalign repository
git clone https://github.com/pylelab/USalign.git
# Navigate into the directory and build the Docker image
cd USalign
docker build -t usalign:latest .2. Install protein-pocket-volume
You can now install the protein-pocket-volume package into a new virtual environment with:
# Clone this repository
git clone https://github.com/Alberto024/protein-pocket-volume.git
cd protein-pocket-volume
# Create virtual environment for protein-pocket-volume
uv sync
# Activate the virtual environment
source .venv/bin/activateAlternatively, you can install it into an existing virtual environment via uv by running:
# Install without adding to pyproject.toml
uv pip install 'git+https://github.com/Alberto024/protein-pocket-volume.git'
# Or add to pyproject.toml
uv add 'git+https://github.com/Alberto024/protein-pocket-volume.git'You can check if the installation was successful by running:
protein_pocket_volume --helpThe main command is protein_pocket_volume. You must provide an input directory of target PDBs and a reference PDB for alignment.
Additionally, you must define the bounding box parameters using the --grid-center and --grid-size options which each
accept 3 floating point numbers. Furthermore, the fpocket parameters should be tuned such that your cavity of interest
is appropriately recognized and mapped out using the --fpocket-min-spheres-per-pocket, --fpocket-min-sphere-size, and
--fpocket-max-sphere-size options.
Command Line Usage:
╰─➤ protein_pocket_volume --help
Usage: protein_pocket_volume [OPTIONS] INPUT_DIR REFERENCE_PDB
Calculate active site volumes for all PDB files in a directory.
This command performs the following steps: 1. Sets up output directories. 2. Aligns all PDBs in the input
directory to the reference PDB using USalign. 3. Runs fpocket on each newly aligned PDB file. 4. Concatenates the
resulting pocket PQR files for each protein. 5. Calculates the volume for each protein based on the defined grid.
6. Saves the results to a CSV file.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_dir DIRECTORY Directory containing PDB files. [default: None] [required] │
│ * reference_pdb FILE Path to the reference PDB file for alignment. [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --output-dir DIRECTORY Directory to save fpocket outputs │
│ and intermediate files. │
│ [default: pocket_volume_outputs] │
│ --grid-center <FLOAT FLOAT FLOAT>... Grid center coordinates (X Y Z). │
│ [default: 0.0, 0.0, 0.0] │
│ --grid-size <FLOAT FLOAT FLOAT>... Grid size in angstroms (Length │
│ Width Height). │
│ [default: 16.0, 16.0, 16.0] │
│ --grid-resolution INTEGER Number of points along each grid │
│ axis. │
│ [default: 30] │
│ --fpocket-min-spheres-per-pocket INTEGER fpocket -i: min spheres per │
│ pocket. │
│ [default: 15] │
│ --fpocket-min-sphere-size FLOAT fpocket -m: min sphere size. │
│ [default: 3.4] │
│ --fpocket-max-sphere-size FLOAT fpocket -M: max sphere size. │
│ [default: 6.2] │
│ --visualize --no-visualize Visualize point cloud. │
│ [default: no-visualize] │
│ --install-completion Install completion for the current │
│ shell. │
│ --show-completion Show completion for the current │
│ shell, to copy it or customize the │
│ installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The protein_pocket_volume tool once executed will produce output in the following format:
pocket_volume_outputs/
aligned_pdbs/
example_protein.aligned.pdb
fpocket_outputs/
example_protein.aligned_out/
example_protein.aligned_pockets.pqr
pockets/
pocket1_vert.pqr
pocket2_vert.pqr
...
pqr_files/
example_protein.pqr
logs/
fpocket.log
usalign.lot
pocket_volumes.csv
Choosing the centroid and box dimensions for the pocket bounding box is critical. It should be chosen such that it will capture all the variation in your pocket of interest.
The recommended approach is to first optimize the fpocket parameters on your reference protein.
You can easily run fpocket on a protein via docker. For example, if you are in a directory with the file
protein.pdb then you can run fpocket with:
docker run --rm -v `pwd`:/workdir fpocket/fpocket fpocket -f /workdir/protein.pdb -i 30 -m 3.4 -M 6.2Take note of the parameters -i, -m, and -M which can be used to tune the fpocket algorithm. There are
other options as well that you can see by running:
docker run --rm fpocket/fpocket fpocket -hOr going to the fpocket documentation.
Fpocket will produce an output directory in your current directory that should contain a file named protein_out/protein.pml which you can open via pymol with
pymol protein_out/protein.pmlMake sure to choose fpocket parameters such that your active site is fully mapped out. It's okay if there are off-target pockets detected as they will be ignored once you select a bounding box.
In pymol, you can draw a bounding box by loading the script named scripts/draw_box.py located in this repository.
For example, first load the script in the pymol command line:
PyMOL>load protein-pocket-volume/scripts/draw_box.pyNote, the path to the draw_box.py should be correctly set. Then you can visualize a bounding box with:
PyMOL>cgo_box(center=[0,0,0], size=[16,16,16], name='box1', color='blue')Adjust the center and size parameters until your bounding box covers your pocket of interest, but be sure to provide it extra space to account for the variation that other protein homologs will have.
One way to get a good starting point is to select 2 atoms that can act as vertices on opposite corners of the bounding box
and then use the cmd.get_atom_coords command to get their coordinates individually, then use those coordinates to define
a bounding box centroid and dimensions.
PyMOL>print(cmd.get_atom_coords('sele'))
[123.03800201416016, 113.74600219726562, 3.4489998817443848]Alternatively, you can use the calculate_box_dimensions command after loading the draw_box.py script to calculate the centroid
and dimensions of the bounding box. The calculate_box_dimensions command requires a selection of exactly 2 atoms and can
optionally draw a bounding box automatically. For example:
PyMOL>calculate_box_dimensions(selection='sele', draw=True)
Analysis for selection: 'sele'
- Point 1: [123.038 113.746 3.449]
- Point 2: [136.215 98.486 12.934]
- Box Centroid: [129.6265 106.116 8.1915]
- Box Dimensions: [13.176994 15.260002 9.485001]
Bounding box drawn with name: box_4909The bounding box will be used to define the pocket for every homolog of interest, so it is important to define it in a way that will not cut off the pocket for other homologs with larger pockets.
Alternatively, the AutoDockTools pymol plugin can be used to visualize a bounding box.
Lastly, you can visualize the point cloud defined by the intersection of the bounding box and the fpocket mapped spheres by
using the --visualize flag of the protein_pocket_volume command. An example plot will be made like the one below:
The calculation of the pocket volume is only an estimate for several reasons:
- Proteins are dynamic and measuring the volume of an individual structure will only be a snapshot of possible conformations
- The bounding box is user defined to capture variation in the cavity of interest, but its not necessarily the true bulk solvent/pocket boundary
- Defining the bounding box for a single reference protein that is applied to all homologs of interest may not always be appropriate. Doing this carries the assumption that the cavity of interest is conserved enough across your protein homologs that a single bounding box would provide a reasonable boundary for all of them.
- Choosing a single reference protein to align every homolog to carries bias. Proteins that are more structurally similar to the reference protein will have a bounding box applied more consistently than proteins that are more structurally distant. An alternative approach would be to perform a multiple structure alignment from all homologs of interest and using that alignment to define a bounding box. A challenge with that approach is that multiple structure alignments for hundreds to thousands of proteins is computationally intensive. It is possible to take an intermediate approach and select a handful of representatives that could be used to perform a smaller but still representative multiple structural alignment.
- Using a plane to define the bulk solvent boundary (via a bounding box) is possibly not the ideal geometric shape. It's possible that some kind of curved surface would be more appropriate.
Also, defining the bounding box using a centroid and box dimensions is not the most user-friendly and requires careful consideration and tuning from the user. A more automated approach could be to automatically define a bounding box using structurally conserved points on the proteins themselves.
Furthermore, volume itself is only a proxy for characterizing steric interactions in a pocket. There are numerous other factors that play a functional role in protein pockets including morphology, electrostatics, hydrophobicity, etc. To get a more holistic view of the properties of a pocket these should all be considered.
If you use this code, please cite:
Englund E, Schmidt M, Nava AA, Lechner A, Deng K, Jocic R, Lin Y, Roberts J, Benites VT, Kakumanu R, Gin JW.
Expanding extender substrate selection for unnatural polyketide biosynthesis by acyltransferase domain exchange within a modular polyketide synthase.
Journal of the American Chemical Society. 2023 Apr 14;145(16):8822-32.
https://doi.org/10.1021/jacs.2c11027
Additionally, this code relies on the following programs:
USalign
Zhang, C., Shine, M., Pyle, A.M. et al.
US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes.
Nat Methods 19, 1109–1115 (2022).
https://doi.org/10.1038/s41592-022-01585-1
FPocket
Le Guilloux, V., Schmidtke, P. & Tuffery, P.
Fpocket: An open source platform for ligand pocket detection.
BMC Bioinformatics 10, 168 (2009).
https://doi.org/10.1186/1471-2105-10-168
- docker
- numpy
- pandas
- pydantic
- rich
- tqdm
- typer
- matplotlib
- ruff
- uv
We thank all the authors and maintainers for their open-source contributions!
