Explainable Active Learning Framework for Ligand Binding Affinity Prediction

Active learning (AL) prioritises which compounds to measure next for protein–ligand affinity when assay or simulation budgets are limited. We present an explainable AL framework built on Gaussian process regression and assess how molecular representations, covariance kernels, and acquisition policies affect enrichment across four drug-relevant targets. Using recall of top active compounds, we find that dataset identity—the target’s chemical landscape—sets the performance ceiling, while method choices modulate outcomes rather than overturn them. Fingerprints with simple Gaussian process kernels provide robust, low-variance enrichment, whereas learned embeddings with nonlinear kernels can reach higher peaks but with greater variability. Uncertainty-guided acquisition consistently outperforms random selection, yet no single policy is universally optimal; the best choice follows structure-activity relationship (SAR) complexity. To enhance interpretability beyond blackbox selection, we integrate SHapley Additive exPlanations (SHAP) to link high-impact fingerprint bits to chemically meaningful fragments across AL cycles, illustrating how the model’s attention progressively concentrates on SAR-relevant motifs.

🔗 Paper: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00436e

📦 Zenodo (Data and results): https://doi.org/10.5281/zenodo.17935028

📁 Repository overview

explainable_al/ — core Python package (featurisers, GP surrogate, acquisition functions, experiment runner, plotting helpers).
apps/ — Streamlit apps:
- apps/al_app — interactive active learning GUI (uses explainable_al).
- apps/shap_app — SHAP analysis GUI (expects precomputed results from Zenodo or your own runs).
data/ — example CSV datasets used by notebooks and tests.
notebooks/ — cleaned tutorial notebook al_notebook.ipynb.
scripts/ — CLI helpers (e.g., scripts/run_al.py).
tests/ — unit tests for core utilities (keep for CI/regression checks).

🚀 Quickstart

Create a reproducible environment (conda recommended for RDKit):

conda create -n al python=3.10 -y
conda activate al
conda install -c conda-forge rdkit numpy pandas scikit-learn matplotlib seaborn pytorch jupyter -y
pip install -r requirements.txt

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the cleaned tutorial notebook:

jupyter lab notebooks/al_notebook.ipynb

Run an AL experiment from the terminal (CLI):

python scripts/run_al.py --dataset data/D2R.csv --representation ecfp \
	--protocol ucb --initial-size 60 --batch-size 30 --cycles 10 --output results/run1

Launch the demo apps:

# Active Learning GUI
streamlit run apps/al_app/app.py

# SHAP GUI (point to Zenodo or your outputs)
streamlit run apps/shap_app/shapey.py

Notes:

The SHAP app expects experiment outputs (CSV/NPZ). You can download the paper's results from Zenodo and point the app to that folder.

⚙️ How to run experiments and customise

Featurisers: choose ecfp (fingerprints), maccs, or chemberta via the CLI --representation flag or in the app UI.
Acquisition policies: supported policies include ucb, pi, ei, and random — select via --protocol or the app.
Surrogate model: experiments currently use Gaussian Process regression; kernel choices can be set in the app.

Examples:

# Run with MACCS keys and PI acquisition
python scripts/run_al.py --dataset data/D2R.csv --representation maccs --protocol PI --initial-size 60 --batch-size 20 --cycles 12 --output results/pi_run

# Use precomputed ChemBERTa embeddings (place dataset_chemberta.npz alongside the CSV)
python scripts/run_al.py --dataset data/D2R.csv --representation chemberta --output results/chem_run

📝 Citations

If you find the work useful in your research, please cite our paper:

@article{srivastava2025explainable,
  title={Explainable Active Learning Framework for Ligand Binding Affinity Prediction},
  author={Srivastava, Satya Pratik and Gorantla, Rohan and Chundru, Sharath Krishna and Winkelman, Claire JR and Mey, Antonia and Singh, Rajeev Kumar},
  journal={Digital Discovery},
  year={2025},
  publisher={Royal Society of Chemistry},
doi = {DOI	https://doi.org/10.1039/D5DD00436E}

📫 Contact us

Satya Pratik Srivastava (ss864@snu.edu.in)
Rohan Gorantla (rohan.gorantla@ed.ac.uk)
Antonia Mey (antonia.mey@ed.ac.uk)

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
apps		apps
data		data
explainable_al		explainable_al
notebooks		notebooks
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
workflow.png		workflow.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Explainable Active Learning Framework for Ligand Binding Affinity Prediction

📁 Repository overview

🚀 Quickstart

⚙️ How to run experiments and customise

📝 Citations

📫 Contact us

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

meyresearch/explainable_AL

Folders and files

Latest commit

History

Repository files navigation

Explainable Active Learning Framework for Ligand Binding Affinity Prediction

📁 Repository overview

🚀 Quickstart

⚙️ How to run experiments and customise

📝 Citations

📫 Contact us

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages