Drug-Protein Interaction ML

Language: English | Español

Prediction of drug-protein interactions using Machine Learning on real BindingDB data.
Personal project combining knowledge from computer science, business administration, and biochemistry.

Description

This project implements a complete virtual screening pipeline to predict whether a small molecule (drug candidate) is likely to bind to a target protein. The model enables prioritization of compounds from libraries of thousands of molecules before performing costly experimental assays, significantly reducing both the time and cost of the drug discovery process.

Dataset: BindingDB — public database with over 2.8M affinity measurements between molecules and proteins.

Business Impact and Operational Efficiency

As a Business Administration student, this project was designed not only as a technical challenge but as an asset optimization and risk reduction solution in the pharmaceutical R&D value chain.

The "Funnel Filter": In drug discovery, physical High-Throughput Screening is extremely costly. This pipeline acts as a critical filter, processing a virtual library of 100,000 compounds to prioritize only the 500 candidates with the highest probability of success.
Cost Savings (OPEX): By identifying failures early (Fail Fast), it avoids investing millions of euros in in vitro laboratory assays destined to fail, optimizing the research budget.
Time-to-Market Reduction: Accelerating the initial screening stage allows promising molecules to reach clinical phases sooner, maximizing the net present value (NPV) of intellectual property.

Repository Structure

DrugProtein_ML/
├── Drug_Discovery_Training.ipynb   # Complete training pipeline
├── Drug_Discovery_Inference.ipynb  # Prediction on new library
├── app.py                          # Interactive virtual screening application
├── requirements.txt
├── xgboost.pkl                     # Trained model ready to use
├── examples/
│   └── mi_libreria.csv             # Example CSV for testing inference
└── images/                         # Plots generated during training

Interactive Application

The project includes an interactive web application built with Streamlit that allows direct use of the model on new molecules or complete chemical libraries.

Online Demo

The application is deployed and can be tested directly here:

👉 https://molecular-predictive-discovery.streamlit.app/

The application allows you to:

Analyze an individual molecule from its SMILES string
Obtain binding probability and Binder / Non-Binder classification
Visualize the molecular structure in 2D and 3D
Calculate physicochemical properties and drug-likeness
Evaluate the multivariate molecular profile
Perform automated screening of libraries in CSV format
Rank candidates and interactively analyze the best hits

This reproduces the real workflow of computational virtual screening prior to experimental screening.

Run the Application

pip install -r requirements.txt  
streamlit run app.py

Methodology

Data Preprocessing

Loading 500k rows from BindingDB with Ki, IC50, and Kd measurements
Chemical cleaning with RDKit: desalting, SMILES canonicalization, sanitization
Deduplication by canonical SMILES
Binary labeling: binder if affinity < 1000 nM (1 µM), non-binder otherwise
1:1 balancing via undersampling → 109,028 final molecules
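
The labeling and balancing steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the record layout, the helper names `label_binder` and `balance_one_to_one`, the fixed seed, and the toy affinity values are all assumptions. Only the 1000 nM threshold and the 1:1 undersampling come from the description.

```python
import random

BINDER_THRESHOLD_NM = 1000.0  # 1 µM cutoff from the labeling rule above


def label_binder(affinity_nm: float) -> int:
    """Return 1 (binder) if the affinity is below the threshold, else 0."""
    return 1 if affinity_nm < BINDER_THRESHOLD_NM else 0


def balance_one_to_one(records, seed=42):
    """Undersample the majority class so both classes have equal size.

    `records` is a list of (smiles, label) pairs; the seed is an assumption.
    """
    binders = [r for r in records if r[1] == 1]
    non_binders = [r for r in records if r[1] == 0]
    n = min(len(binders), len(non_binders))
    rng = random.Random(seed)
    balanced = rng.sample(binders, n) + rng.sample(non_binders, n)
    rng.shuffle(balanced)
    return balanced


# Toy measurements in nM (hypothetical values, for illustration only)
measurements = [("CCO", 50.0), ("CCN", 1000.0), ("CCC", 25000.0),
                ("CCCl", 3.2), ("CCBr", 800.0)]
labeled = [(smi, label_binder(aff)) for smi, aff in measurements]
balanced = balance_one_to_one(labeled)
```

Note that a measurement of exactly 1000 nM is labeled non-binder, because the rule uses a strict inequality.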

Molecular Featurization

Feature	Description	Dimensions
Morgan FP (ECFP4)	Circular fingerprint radius 2, gold standard in QSAR	2048 bits
Physicochemical Descriptors	MolWt, LogP, HBA, HBD, TPSA, RotBonds, Aromatic, QED, HeavyAtoms, FracCSP3	10

Rigorous Split — Scaffold Split

The dataset is split by Murcko scaffold: train and test sets contain entirely distinct chemical families, simulating the real-world scenario of predicting on novel compounds.
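
A scaffold-grouped split can be sketched as below. The sketch assumes the Murcko scaffold of each molecule has already been computed (for example with RDKit) and is passed in as a string. The function name `scaffold_split`, the 20% test fraction, and the "largest groups go to train first" heuristic are assumptions for illustration, not necessarily what the notebook does.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    `scaffolds[i]` is the precomputed Murcko scaffold SMILES of molecule i.
    Large scaffold groups are assigned to train first; groups that no longer
    fit go to test (a common heuristic, assumed here).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Sort by group size (descending); ties are broken by scaffold string
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n_test_target = round(test_fraction * len(scaffolds))
    train_cutoff = len(scaffolds) - n_test_target

    train_idx, test_idx = [], []
    for _, members in ordered:
        if len(train_idx) + len(members) <= train_cutoff:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return train_idx, test_idx


# Toy scaffolds: 5 benzene, 3 pyridine, 1 cyclohexane, 1 furan
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.2)
```

Because whole scaffold groups are assigned to one side, the actual test fraction can deviate from the requested one when groups are large.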

Split audit (4 checks):

Scaffold overlap: 0% (< 1% acceptable)
Exact SMILES in common: 0
Tanimoto NN >= 0.85: 0% (< 5% acceptable)
Class drift train/test: 0.6% (< 5% acceptable)
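
The four checks can be expressed compactly as shown below. This is a sketch under stated assumptions: fingerprints are represented as sets of on-bit indices, each molecule is a dict with hypothetical keys `smiles`, `scaffold`, `fp`, and `label`, and scaffold overlap is taken to mean the fraction of test molecules whose scaffold also occurs in train. The notebook may define these quantities differently.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def positive_rate(mols) -> float:
    """Fraction of molecules labeled as binders."""
    return sum(m["label"] for m in mols) / len(mols)


def audit_split(train, test, sim_threshold=0.85):
    """Return the four audit quantities for a train/test split."""
    train_scaffolds = {m["scaffold"] for m in train}
    train_smiles = {m["smiles"] for m in train}

    scaffold_overlap = sum(m["scaffold"] in train_scaffolds for m in test) / len(test)
    shared_smiles = sum(m["smiles"] in train_smiles for m in test)
    # Fraction of test molecules whose nearest train neighbour is too similar
    near_duplicates = sum(
        max(tanimoto(m["fp"], t["fp"]) for t in train) >= sim_threshold
        for m in test
    ) / len(test)
    class_drift = abs(positive_rate(train) - positive_rate(test))

    return {
        "scaffold_overlap": scaffold_overlap,
        "shared_smiles": shared_smiles,
        "tanimoto_nn_fraction": near_duplicates,
        "class_drift": class_drift,
    }


# Toy molecules with made-up fingerprints (illustration only)
train = [
    {"smiles": "c1ccccc1O", "scaffold": "c1ccccc1", "fp": {1, 2, 3, 4}, "label": 1},
    {"smiles": "c1ccccc1N", "scaffold": "c1ccccc1", "fp": {1, 2, 3, 5}, "label": 0},
]
test = [
    {"smiles": "c1ccncc1C", "scaffold": "c1ccncc1", "fp": {7, 8, 9}, "label": 1},
    {"smiles": "C1CCCCC1O", "scaffold": "C1CCCCC1", "fp": {3, 10, 11}, "label": 0},
]
report = audit_split(train, test)
```

The nearest-neighbour check is quadratic in the number of molecules, so on a dataset of this size it would normally be run with a vectorized or bulk similarity routine rather than a Python loop.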

Trained Models

1. XGBoost + Morgan FP + Descriptors — Best Model

Industry standard in pharma for QSAR (Bender et al., 2022). Gradient boosting on sparse Morgan FP features with L1/L2 regularization and early stopping.

2. ChemBERTa + XGBoost

768-dimensional embeddings from a transformer pretrained on 77M molecules (ZINC), used as input for XGBoost. Feeding the embeddings directly into XGBoost avoids the overfitting risk of training an additional MLP head on top of them.

3. Ensemble (simple average)

Combination of both models with fixed equal weights. Weights are not optimized on the test set to avoid data leakage.
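
The fixed-weight combination amounts to an element-wise mean of the two probability vectors. A minimal sketch, with the function name `ensemble_average` and the toy probabilities chosen purely for illustration:

```python
def ensemble_average(p_fingerprint, p_embedding):
    """Equal-weight average of two models' binding probabilities."""
    if len(p_fingerprint) != len(p_embedding):
        raise ValueError("Both models must score the same molecules")
    return [0.5 * a + 0.5 * b for a, b in zip(p_fingerprint, p_embedding)]


# Hypothetical probabilities for three molecules
p_xgb = [0.9, 0.2, 0.6]
p_chemberta_xgb = [0.7, 0.4, 0.2]
p_ensemble = ensemble_average(p_xgb, p_chemberta_xgb)
```

Since the weights are constants, nothing about the combination is fitted, so the test set is never consulted when building the ensemble.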

Results

Model	Train ROC-AUC	Test ROC-AUC	Gap	PR-AUC	F1	MCC	Brier
XGBoost	0.9333	0.8979	0.0354	0.8993	0.8199	0.6470	0.1291
ChemBERTa+XGB	0.8748	0.8357	0.0391	0.8283	0.7627	0.5238	0.1651
Ensemble	0.9211	0.8832	0.0379	0.8830	0.8071	0.6174	0.1404

XGBoost outperforms the ensemble, most likely because ChemBERTa was pretrained on ZINC, which has a different chemical distribution than BindingDB, limiting the quality of its embeddings in this domain.

All train/test gaps are below 0.05, which suggests no significant overfitting.
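
The Gap column is the train ROC-AUC minus the test ROC-AUC. The short check below recomputes it from the values in the results table; the variable names and the 0.05 limit expressed as `MAX_GAP` are just a convenient way to write the criterion down.

```python
# (train ROC-AUC, test ROC-AUC) copied from the results table
results = {
    "XGBoost": (0.9333, 0.8979),
    "ChemBERTa+XGB": (0.8748, 0.8357),
    "Ensemble": (0.9211, 0.8832),
}
MAX_GAP = 0.05  # overfitting criterion used in this README

gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}
all_within_limit = all(gap < MAX_GAP for gap in gaps.values())

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
```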

ROC and Precision-Recall Curves

Overfitting Diagnosis and Calibration

Methodological Guarantees

Scaffold split with 4 integrity checks
Early stopping in all XGBoost models
Ensemble with fixed weights (no optimization on test set)
Probability calibration reported (Brier score)
Train/test gap reported for each model
Feature interpretability (bitInfo + substructure visualization)
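
For reference, two of the reported metrics can be computed by hand as in the sketch below. The function names, the toy labels and probabilities, and the 0.5 decision threshold are assumptions; in practice the equivalent scikit-learn functions would normally be used.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (hypothetical values)
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # 0.5 threshold is an assumption

toy_brier = brier_score(y_true, y_prob)
toy_mcc = mcc(y_true, y_pred)
```

As a point of comparison, a model that always predicts 0.5 scores a Brier value of 0.25 regardless of the labels, so lower values indicate more informative probabilities.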

Interpretability

Feature Importance — XGBoost

Morgan Fingerprints dominate the model with 98.8% of total importance. The only physicochemical descriptor in the top 20 is NumHeavyAtoms, which makes biological sense: larger molecules have more surface contact area with the protein.
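
A share such as the one quoted above can be obtained by summing importances per feature group. The sketch below assumes the feature vector is laid out as 2048 fingerprint bits followed by the 10 descriptors; that ordering, the function name, and the toy importance values are assumptions, not taken from the notebook.

```python
N_FP_BITS = 2048
DESCRIPTOR_NAMES = ["MolWt", "LogP", "HBA", "HBD", "TPSA",
                    "RotBonds", "Aromatic", "QED", "HeavyAtoms", "FracCSP3"]


def importance_share(importances):
    """Return (fingerprint_share, descriptor_share) of the total importance.

    Assumes fingerprint bits come first, followed by the descriptors.
    """
    if len(importances) != N_FP_BITS + len(DESCRIPTOR_NAMES):
        raise ValueError("Unexpected feature vector length")
    total = sum(importances)
    fp_total = sum(importances[:N_FP_BITS])
    return fp_total / total, (total - fp_total) / total


# Toy importance vector: one fingerprint bit and one descriptor carry weight
importances = [0.0] * (N_FP_BITS + len(DESCRIPTOR_NAMES))
importances[10] = 3.0                                               # a Morgan bit
importances[N_FP_BITS + DESCRIPTOR_NAMES.index("HeavyAtoms")] = 1.0  # a descriptor
fp_share, desc_share = importance_share(importances)
```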

Activating Substructures for the Most Important Bits

Using RDKit's bitInfo, each Morgan bit can be mapped to the chemical substructure that activates it, visualized directly on the molecule.

Molecular Visualizations

Top 10 Binders with Highest Model Confidence

Typical Binder vs. Typical Non-Binder

Error Analysis — False Positives and False Negatives

FPs have very weak affinities (> 10,000 nM) and are structurally similar to known kinase inhibitors. FNs are ultra-potent (< 1 nM), structurally atypical, and underrepresented in the training data.

Chemical Space — t-SNE (3,000 molecules from the test set)

2D projection of Morgan FPs showing how binders and errors are distributed across structural space.

Requirements

Library	Recommended Version
Python	>= 3.9
RDKit	>= 2023.x
XGBoost	>= 1.7
scikit-learn	>= 1.2
transformers	>= 4.x (only for ChemBERTa)
PyTorch	>= 2.0 (only for ChemBERTa)

How to Run

Option A — Inference only (fast, no retraining)

A trained model is already provided in xgboost.pkl. You only need your compound library.

Prepare a CSV with a smiles column (and optionally a label column with 0/1 values)
Open Drug_Discovery_Inference.ipynb
Edit the configuration cell with the path to your CSV
Run all cells

Output: binding probability, BINDER/NON-BINDER classification, molecular visualizations, and a report.
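
Turning probabilities into a ranked, classified hit list can be sketched as follows. The function name `top_hits`, the 0.5 decision threshold, and the toy scores are assumptions; the threshold the notebook actually uses may differ.

```python
import heapq


def top_hits(scored, k=500, threshold=0.5):
    """Return the k highest-probability molecules with a class label.

    `scored` is a list of (smiles, probability) pairs.
    """
    best = heapq.nlargest(k, scored, key=lambda item: item[1])
    return [
        (smiles, prob, "BINDER" if prob >= threshold else "NON-BINDER")
        for smiles, prob in best
    ]


# Hypothetical model outputs for four molecules
scored = [("CCO", 0.12), ("c1ccccc1O", 0.91), ("CCN", 0.55), ("CCC", 0.47)]
hits = top_hits(scored, k=3)
```

`heapq.nlargest` avoids sorting the full library when only a small number of candidates is kept, which matters when `k` is much smaller than the library size.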

Option B — Full training (requires BindingDB)

Download BindingDB_All.tsv from bindingdb.org (~6 GB)
Place it in the project root alongside the notebook
Open Drug_Discovery_Training.ipynb
Run all cells (~45 min with GPU, ~90 min without GPU)

Recommended: run on Google Colab with GPU enabled to reduce ChemBERTa training time.

Input CSV Format (Inference)

smiles,name,label
CC(=O)Oc1ccccc1C(=O)O,Aspirin,0
Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1,Imatinib,1

Column	Required	Description
`smiles`	Yes	SMILES string of the molecule
`name`	No	Compound name
`label`	No	Ground truth label (0/1) — enables performance metrics
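
Before running inference, a library file can be validated against this format with the standard library alone. The sketch below is illustrative: `read_library` is a hypothetical helper, and how the notebook itself parses the file is not shown here.

```python
import csv
import io


def read_library(handle):
    """Parse a library CSV, requiring `smiles` and tolerating missing extras."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "smiles" not in reader.fieldnames:
        raise ValueError("Input CSV must contain a 'smiles' column")

    rows = []
    for row in reader:
        smiles = (row.get("smiles") or "").strip()
        if not smiles:
            continue  # skip rows without a structure
        label = (row.get("label") or "").strip()
        rows.append({
            "smiles": smiles,
            "name": (row.get("name") or "").strip() or None,
            "label": int(label) if label in ("0", "1") else None,
        })
    return rows


# The two example rows from the format description above
example_csv = io.StringIO(
    "smiles,name,label\n"
    "CC(=O)Oc1ccccc1C(=O)O,Aspirin,0\n"
    "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1,Imatinib,1\n"
)
library = read_library(example_csv)
```

With a real file, the same function can be called as `read_library(open(path, newline=""))`.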

Limitations

The model predicts binding in general, not against a specific target protein. It is useful as a structural pre-filter before experimental screening.
Trained with the publication bias of BindingDB (overrepresentation of actives and certain targets such as kinases).
The Enrichment Factor is limited by the 1:1 balancing of the training dataset.
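
The last point can be made concrete. The enrichment factor compares the hit rate among the top-ranked compounds with the hit rate in the whole set, so its maximum is the inverse of the overall binder fraction. The sketch below assumes the evaluation set is roughly balanced like the training data; the function name and toy label lists are illustrative.

```python
def enrichment_factor(labels_ranked, top_fraction):
    """EF = hit rate in the top fraction / hit rate in the whole set.

    `labels_ranked` holds true labels (0/1) sorted by descending
    predicted probability.
    """
    n_top = max(1, round(len(labels_ranked) * top_fraction))
    hit_rate_top = sum(labels_ranked[:n_top]) / n_top
    hit_rate_all = sum(labels_ranked) / len(labels_ranked)
    return hit_rate_top / hit_rate_all


# Perfect ranking on a balanced set: 50 binders, 50 non-binders
balanced_perfect = [1] * 50 + [0] * 50
ef_balanced = enrichment_factor(balanced_perfect, 0.10)

# Perfect ranking on an imbalanced set: 5 binders, 95 non-binders
imbalanced_perfect = [1] * 5 + [0] * 95
ef_imbalanced = enrichment_factor(imbalanced_perfect, 0.05)
```

Even a perfect ranking reaches only EF = 2 on a 50/50 set, while the same perfect ranking reaches EF = 20 when binders make up 5% of the set, as they might in a realistic screening library.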

Author

Personal project developed as a demonstration of the intersection between computer science, business administration, and computational biochemistry.

Stack: Python, RDKit, XGBoost, HuggingFace Transformers, scikit-learn
Data: BindingDB (public domain)
Environment: Google Colab / Jupyter Notebook

License

MIT License — free for academic and personal use.

María Campos Carneros, 2026
