A complex-free deep learning model for protein-ligand binding affinity prediction with intrinsic binding site detection.
Key Features:
- No molecular docking required
- Robust performance regardless of binding site determination method
- Robust performance on imperfect structural inputs
1. Clone the repository

```
git clone https://github.com/KU-MedAI/InSiteDTA.git
cd InSiteDTA
```

2. Create and activate the conda environment

```
conda env create -f environment.yml
conda activate insite
```

3. Install P2Rank (Optional, Recommended; Krivák & Hoksza, 2018)

```
mkdir src/p2rank && cd src/p2rank
wget https://github.com/rdk/p2rank/releases/download/2.5.1/p2rank_2.5.1.tar.gz
tar -xzf p2rank_2.5.1.tar.gz -C ./ --strip-components=1
```

Why P2Rank? InSiteDTA predicts the binding site internally and uses it as a feature for affinity prediction, so P2Rank is not strictly required. However, providing a P2Rank-predicted pocket helps guide the voxelization step, so the sampled protein voxel grid is more likely to include the true binding site. This can yield more accurate predictions, especially when running inference on large proteins.
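The effect of pocket guidance on voxel-grid placement can be sketched as follows. This is illustrative Python only: `grid_center` is a hypothetical helper, not InSiteDTA's actual code.

```python
# Sketch of guided vs. unguided voxel-grid centering (illustrative only;
# not InSiteDTA's actual implementation).

def grid_center(atom_coords, pocket_center=None):
    """Return the center of the voxel grid.

    Unguided: centroid of all protein atoms.
    Guided:   the P2Rank-predicted pocket center, so the sampled
              voxel grid is more likely to contain the true site.
    """
    if pocket_center is not None:
        return pocket_center
    n = len(atom_coords)
    return tuple(sum(c[i] for c in atom_coords) / n for i in range(3))

# Toy example: four atoms, with a predicted pocket near one corner.
atoms = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
print(grid_center(atoms))                    # centroid -> (1.0, 1.0, 1.0)
print(grid_center(atoms, (3.5, 0.2, 0.1)))   # guided   -> (3.5, 0.2, 0.1)
```

For a large protein, the centroid can sit far from the pocket, so a fixed-size grid centered there may miss the true site entirely; centering on the predicted pocket avoids that failure mode.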
Our tested environment:
- Python: 3.9.19
- PyTorch: 2.5.1
- PyTorch Geometric: 2.6.1
- CUDA: 11.8
- P2Rank: 2.5.1
Without pocket guidance (unguided voxelization):
```
python 01-inference.py \
--pdb_path ./src/data/samples/4gkm/4gkm_protein.pdb \
--smiles "Cc1ccc(c(c1)C(=O)[O-])Nc1ccccc1C(=O)[O-]"
```

With P2Rank guidance (guided voxelization, recommended):

```
python 01-inference.py \
--pdb_path ./src/data/samples/4gkm/4gkm_protein.pdb \
--smiles "Cc1ccc(c(c1)C(=O)[O-])Nc1ccccc1C(=O)[O-]" \
--use_p2rank
```

Organize your data in a nested structure (PDBbind format):
```
raw_data/
├── {pdb_id}/
│   ├── {pdb_id}_protein.pdb
│   └── {pdb_id}_pocket.pdb
...
```
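A quick stdlib check that a directory follows the nested layout above can look like this; `find_complexes` is an assumed helper, not part of the repo.

```python
# Quick layout check for the nested PDBbind-style structure
# (find_complexes is an assumed helper, not part of the repo).
from pathlib import Path
import tempfile

def find_complexes(raw_dir):
    """Return pdb_ids whose folder holds both expected .pdb files."""
    ok = []
    for entry in sorted(Path(raw_dir).iterdir()):
        pid = entry.name
        if (entry.is_dir()
                and (entry / f"{pid}_protein.pdb").is_file()
                and (entry / f"{pid}_pocket.pdb").is_file()):
            ok.append(pid)
    return ok

# Demo on a throwaway directory:
root = Path(tempfile.mkdtemp())
for pid in ("1abc", "1def"):
    (root / pid).mkdir()
    (root / pid / f"{pid}_protein.pdb").touch()
    (root / pid / f"{pid}_pocket.pdb").touch()
(root / "1ghi").mkdir()  # incomplete entry, will be skipped
print(find_complexes(root))  # -> ['1abc', '1def']
```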
Prepare a SMILES CSV file (smiles.csv):

```
PDB_ID,Canonical SMILES
1abc,CCO
1def,c1ccccc1
```

For affinity prediction, prepare an affinity index JSON (affinity.json):

```
{"1abc": 5.2, "1def": 7.8}
```

Note: If you only want to train binding site prediction, omit the `--index_file` argument in preprocessing.
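One way to produce the two input files with the Python stdlib alone; this is an assumed convenience script, and any tool producing the same format works.

```python
# Write smiles.csv and affinity.json in the format shown above
# (assumed helper script; not part of the repo).
import csv
import json
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())  # use your own data directory here

with open(out / "smiles.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["PDB_ID", "Canonical SMILES"])
    w.writerows([("1abc", "CCO"), ("1def", "c1ccccc1")])

with open(out / "affinity.json", "w") as f:
    json.dump({"1abc": 5.2, "1def": 7.8}, f)

print((out / "smiles.csv").read_text().splitlines()[0])
# -> PDB_ID,Canonical SMILES
```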
```
python 02-preprocess.py \
--raw_dir ./raw_data \
--save_dir ./preprocessed \
--smiles_csv ./smiles.csv \
--index_file ./affinity.json \
--test_key_file ./test_keys.txt \
--voxel_size 2 \
--n_voxels 32 \
--device 0
```

This generates the preprocessed data and a `data_config_*.json` file in `./preprocessed/`.
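Back-of-the-envelope arithmetic for the `--voxel_size 2` / `--n_voxels 32` settings: the grid spans a 64 Å cube. The sketch below maps a coordinate to voxel indices; it is illustrative only, and the repo's actual voxelizer may differ.

```python
# Voxel-grid arithmetic for --voxel_size 2 and --n_voxels 32
# (illustrative; the repo's voxelizer may differ).
voxel_size, n_voxels = 2.0, 32
edge = voxel_size * n_voxels  # 64.0 Å cube around the grid center

def voxel_index(coord, center):
    """Map a 3D coordinate (Å) to integer voxel indices, or None if outside."""
    idx = []
    for x, c in zip(coord, center):
        i = int((x - c + edge / 2) // voxel_size)
        if not 0 <= i < n_voxels:
            return None  # atom falls outside the sampled cube
        idx.append(i)
    return tuple(idx)

print(edge)                                      # -> 64.0
print(voxel_index((1.0, 1.0, 1.0), (0, 0, 0)))   # -> (16, 16, 16)
print(voxel_index((100.0, 0.0, 0.0), (0, 0, 0))) # -> None (outside grid)
```

The `None` case is exactly why pocket-guided grid centering matters: atoms of a large protein far from the grid center are simply not sampled.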
```
python 03-train.py \
--data_config ./preprocessed/data_config_*.json \
--save_dir ./checkpoints \
--device 0 \
--epochs 300 \
--batch_size 48
```

The trained model will be saved as `./checkpoints/{timestamp}_{data_config_name}.pt`.
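The `{timestamp}_{data_config_name}.pt` pattern can be formed as below; the exact timestamp format is an assumption, not taken from the repo.

```python
# Sketch of the checkpoint naming pattern (timestamp format is assumed).
from datetime import datetime

def ckpt_path(save_dir, data_config_name, now=None):
    ts = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{save_dir}/{ts}_{data_config_name}.pt"

print(ckpt_path("./checkpoints", "cfg", datetime(2024, 1, 2, 3, 4, 5)))
# -> ./checkpoints/20240102_030405_cfg.pt
```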
```
python 04-evaluate.py \
--ckpt ./checkpoints/{experiment_name}.pt \
--result_file ./checkpoints/{experiment_name}_results.json \
--save_dir ./evaluation \
--device 0
```

The script will:
- Load the test split defined in the training result file
- Run inference on the test set
- Report performance metrics (PCC, RMSE, MAE, DCC, DVO)
- Save detailed results to `{save_dir}/{experiment_name}_test_results.csv`
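For reference, the affinity metrics can be computed as in this plain-Python sketch of the standard formulas; the script's own implementation may differ, and DCC/DVO are binding-site metrics that additionally require 3D coordinates.

```python
# Standard formulas for the reported affinity metrics
# (reference sketch; the evaluation script's implementation may differ).
import math

def pcc(y, p):
    """Pearson correlation coefficient."""
    n = len(y)
    my, mp = sum(y) / n, sum(p) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    return cov / math.sqrt(sum((a - my) ** 2 for a in y)
                           * sum((b - mp) ** 2 for b in p))

def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def mae(y, p):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

y_true = [5.2, 7.8, 6.1]  # toy pK values
y_pred = [5.0, 8.0, 6.5]
print(round(pcc(y_true, y_pred), 3),
      round(rmse(y_true, y_pred), 3),
      round(mae(y_true, y_pred), 3))  # -> 0.985 0.283 0.267
```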
Run evaluation on three benchmark datasets:

```
# Evaluate on Coreset_crystal
python 05-reproduce.py --data crystal --batch_size 64 --device 0

# Evaluate on Coreset_redocked
python 05-reproduce.py --data redocked --batch_size 64 --device 0

# Evaluate on Coreset_p2rank
python 05-reproduce.py --data p2rank --batch_size 64 --device 0
```

The script will:
- Prepare ligand features from SMILES
- Voxelize protein structures
- Evaluate with three trained models
- Report performance metrics (PCC, RMSE, MAE)
Inference (`01-inference.py`):
- Predicted binding affinity on the pK scale (higher values = stronger binding)

Training (`03-train.py`):
- Model checkpoint: `{save_dir}/{timestamp}_{data_config_name}.pt`
- Training results: `{save_dir}/{timestamp}_{data_config_name}_results.json`

Evaluation (`04-evaluate.py`):
- Evaluation results CSV: `{save_dir}/{experiment_name}_test_results.csv`

Reproduction (`05-reproduce.py`):
- Performance metrics (mean ± std across 3 models): PCC, RMSE, MAE
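The "mean ± std across 3 models" line can be aggregated as below; note that `statistics.stdev` is the sample standard deviation, and whether the script uses sample or population std is an assumption here.

```python
# Aggregating a metric across 3 trained models as mean ± std
# (statistics.stdev = sample standard deviation; illustrative values).
import statistics

pcc_runs = [0.83, 0.81, 0.85]  # hypothetical PCC from 3 trained models
mean, std = statistics.mean(pcc_runs), statistics.stdev(pcc_runs)
print(f"PCC: {mean:.3f} ± {std:.3f}")  # -> PCC: 0.830 ± 0.020
```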
- Coreset_crystal: standard benchmark dataset from PDBbind (crystal structures)
- Coreset_redocked: coreset with the ligand redocked into the native pocket
- Coreset_p2rank: coreset with the ligand redocked into the P2Rank-predicted pocket
TBD