See below for sample notebooks covering computational and medicinal chemistry, machine learning, and AI research tools.
- Train a GPT and generate novel molecules
- Train an RNN and generate novel molecules
- Generative tools for hit expansion
- Grow Fragments in a Binding Site
- Basic machine learning and cleaning ChEMBL CSV files
- Regression and Classification with dense neural networks using TensorFlow
- MLP for Regression and Classification with PyTorch
- SciKitLearn Classifiers
- Gradient Boosting Models
- PCA, t-SNE, and Autoviz analysis for data
- HuggingFace classifier models
- ChemProp GNN MPNN training and inference
- Chemeleon GNN foundation model finetuning
- Build a dataset with active learning
- Autodock Vina for any protein / any ligand
- Docking with Autodock Vina, rescoring docking poses with Meta's UMA MLIP¹
- Boltz2 for co-folding proteins and ligands
- ODDT for molecule similarity and protein-ligand interactions
- PDBFixer for preparing proteins
- AlphaFold2 - Colabfold version
- ESMFold - Colabfold version
- Protein masking and embedding
- Protein GPT training and finetuning
- Finetune ESM protein models
- Pharmacophore feature testing
- Pharmacokinetic Properties
- Find bioactive molecules on ChEMBL
- Fingerprints, filters, distances
- Find targets for a disease
- QM calculations with PySCF
- QM calculations with the UMA MLIP¹
- DFT calculations using Microsoft's Skala
- DFT and SAPT calculations using Psi4
- AI Agent with Medicinal Chemistry Tools
- Embedding Models for Molecules
- Inference and finetuning with TxGemma
- Inference with Ether0
¹ Solvation (adding explicit waters and optimizing) is available in ReDock and QM_UMA.
git clone https://github.com/MauricioCafiero/CafChem.git
import CafChem.CafChemGPT as ccgpt
import CafChem.CafChemRNN as ccrnn
import CafChem.CafChemTxGemma as cctxg
import CafChem.CafChemSkipDense as ccsd
import CafChem.CafChemHFClassifier as cchf
import CafChem.CafChemBoltz as ccb
import CafChem.CafChemQM_UMA as ccqm
import CafChem.CafChemEleon as ccel
import CafChem.CafChemProp as ccp
import CafChem.CafChemSubs as ccs
import CafChem.CafChemReDock as ccr
import CafChem.CafChemBML as ccml
import CafChem.CafChemFragGrow as ccfg
import CafChem.CafChemMLPPyTorch as ccmlp
import CafChem.CafChemPsi4 as ccp4
- example notebook
- Train a GPT on a SMILES dataset. Use the tools provided to generate novel molecules.
- Using a provided foundation model, finetune with a specific dataset for targeted molecule generation.
- This also uses the CafChemGPTINF module for inference.
- example notebook
- Train an RNN on a SMILES dataset. Use the tools provided to generate novel molecules.
- Using a provided foundation model, finetune with a specific dataset for targeted molecule generation.
- example notebook
- generate analogues of a molecule (from SMILES strings) using generative mask-filling and/or substitutions on phenyl rings.
- Can also calculate some properties (QED, Lipinski properties) related to drug design.
- Calculate Tanimoto similarities based on Fingerprints between molecules in a list and molecules against a known active.
- visualize molecules.
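
The Tanimoto comparison described above can be illustrated with a minimal, standard-library sketch. It is not the CafChem implementation: the function name `tanimoto` and the on-bit index sets are hypothetical, and in practice the fingerprints would come from a cheminformatics toolkit rather than being typed by hand.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints, written as the indices of the bits that are set.
known_active = {1, 4, 7, 9, 15}
candidates = {
    "analogue_1": {1, 4, 7, 9, 20},
    "analogue_2": {2, 3, 5},
}

# Score every candidate against the known active.
scores = {name: tanimoto(bits, known_active) for name, bits in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same function can be applied pairwise across a list to build a similarity matrix.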
- example notebook
- Explore a binding site with chemical fragments.
- Various viewing options to probe the nature of the binding site.
- example notebook
- read ChEMBL CSV files and clean data.
- featurize data, remove outliers, scale, apply PCA and split into training and validation sets.
- perform analysis with tree-based methods, linear methods, SVR, and MLP.
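
As a rough illustration of the scaling and splitting steps, here is a standard-library sketch. All names (`train_val_split`, `fit_scaler`, `scale`) and the toy data are hypothetical, and the ordering shown (split first, then fit the scaler on the training set only) is one common way to avoid leaking validation statistics; the notebook may order these steps differently.

```python
import random
import statistics


def train_val_split(rows, val_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a validation set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def fit_scaler(values):
    """Return the mean and population standard deviation of the values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against constant features, which have zero spread.
    return mean, (std if std > 0 else 1.0)


def scale(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature column.
data = [float(x) for x in range(10)]

train, val = train_val_split(data)
mean, std = fit_scaler(train)           # statistics come from training data only
train_scaled = scale(train, mean, std)
val_scaled = scale(val, mean, std)      # validation reuses the training statistics
```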
- example notebook
- Create regression and classification models using skipdense neural networks.
- Train, save, load and evaluate models.
- example notebook
- Featurize a dataset and train an MLP using PyTorch.
- Evaluate, predict with, save and load models.
- example notebook
- Create a classifier model using a variety of SciKitLearn models.
- Load a CSV with quantitative data and create classes.
- Tree-based models, Logistic Regression, Support Vector Machines, Ridge, MLP.
- Analyze data with confusion matrices.
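
To make the class creation and confusion-matrix steps concrete, here is a small standard-library sketch. The threshold, the activity values, and the helper names `to_class` and `confusion_matrix` are all hypothetical; the notebook itself relies on SciKitLearn for this.

```python
from collections import Counter


def to_class(value, threshold):
    """Binarize a quantitative value: 1 if at or above the threshold, else 0."""
    return 1 if value >= threshold else 0


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical measured and predicted activities (for example, pIC50 values).
measured = [5.2, 6.8, 7.1, 4.9, 6.0]
predicted = [5.5, 6.1, 7.4, 6.3, 5.8]
threshold = 6.0  # assumed cutoff separating "inactive" (0) from "active" (1)

y_true = [to_class(v, threshold) for v in measured]
y_pred = [to_class(v, threshold) for v in predicted]
matrix = confusion_matrix(y_true, y_pred)

for row in matrix:
    print(row)
```

The diagonal holds the correctly classified molecules; the off-diagonal entries are the false positives (top right) and false negatives (bottom left).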
- example notebook
- Featurize SMILES data with RDKit, Mordred or Fingerprints
- Perform classification or regression.
- XGBoost, LightGBM, and CatBoost.
- Evaluate models.
- example notebook
- Calculate RDKit or Mordred features, or fingerprints for a set of molecules.
- Use PCA or t-SNE to reduce feature dimensionality to 2 and view in a plot.
- Perform autoviz analysis.
- example notebook
- Create a classifier model using HuggingFace.
- Analyze data with confusion matrices.
- Load datasets, add tokens, train, push all to the HuggingFace hub.
- example notebook
- Train the Chemprop GNN-based MPNN model.
- save and load trained models and analyze data.
- example notebook
- finetune the Chemeleon foundation model.
- save and load trained models and analyze data.
- example notebook
- Use active learning and a Gaussian process regressor to build up a dataset to a desired accuracy.
- export the dataset at the end.
- example notebook with metal
- example notebook no metal
- example notebook with Quickrun (requires a pre-prepared protein PDBQT for the quickrun)
- Provide a SMILES string or SDF file for a ligand and a PDB file for a protein, then perform docking.
- example notebook
- Dock molecules, given as SMILES strings, into a protein using DockString and save the poses.
- Calculate the interaction between a docking pose and a trimmed protein active site using Meta's UMA MLIP.
- visualize molecules.
- example notebook
- Input a protein sequence and a list of SMILES strings.
- Co-fold the protein/ligand pairs using Boltz2, extract the structures and predict IC50.
- example notebook
- Use various methods to compare molecules from SDFs
- find all interactions between a protein (PDB file) and a ligand (SDF file)
- example notebook
- use PDBFixer to prepare a PDB file for docking or MD
- treats both proteins and ligands
- use the output from this notebook to create PDBQT files with obabel.
- example notebook
- Colabfold version of Alphafold2, lightly adapted for CafChem.
- Citations to original work in the notebook.
- example notebook
- Colabfold version of ESMfold, lightly adapted for CafChem.
- Citations to original work in the notebook.
- example notebook
- use the ESM model to mask a protein and generate novel proteins via mask-filling.
- Calculate ESM embeddings and use them to find cosine similarity.
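
The cosine-similarity step can be sketched in plain Python. The short vectors below are hypothetical stand-ins for ESM embeddings, which are much longer in practice, and `cosine_similarity` is an illustrative helper rather than a CafChem function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings for two protein sequences.
embedding_a = [1.0, 2.0, 3.0]
embedding_b = [2.0, 4.0, 6.0]

similarity = cosine_similarity(embedding_a, embedding_b)
print(f"cosine similarity: {similarity:.3f}")
```

Values near 1 indicate embeddings pointing in nearly the same direction, values near 0 indicate unrelated directions.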
- example notebook
- Train or finetune a GPT on protein data.
- download specific protein data from Uniprot
- generate novel proteins with GPT models
- example notebook
- Finetune the ESM models on various tasks
- example notebook
- Run HF, DFT, MP2 and CCSD(T) calculations
- Implicit solvent, TDDFT, Molecular Dynamics
- example notebook
- Uses ASE to implement calculations using Meta's UMA MLIP.
- perform energy calculations, geometry optimizations, vibrational calculations, and thermodynamics calculations.
- Calculate the Gibbs energy, enthalpy, and entropy of a reaction.
- Perform simple dynamics. (Langevin works, Velocity Verlet seems a bit buggy)
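
The reaction thermodynamics step follows the standard relations: a reaction quantity is the stoichiometry-weighted sum over products minus that over reactants, and ΔG = ΔH − TΔS. The sketch below only illustrates that arithmetic; the species values, units, and function names are hypothetical and are not output from the UMA model or the CafChem module.

```python
def reaction_delta(reactants, products):
    """Products minus reactants, each given as (coefficient, value) pairs."""
    total_products = sum(n * x for n, x in products)
    total_reactants = sum(n * x for n, x in reactants)
    return total_products - total_reactants


def gibbs_energy(delta_h, delta_s, temperature):
    """Delta G = Delta H - T * Delta S (units must be consistent)."""
    return delta_h - temperature * delta_s


# Hypothetical reaction A + B -> C.
# Enthalpies in kJ/mol, entropies in kJ/(mol K); the numbers are made up.
enthalpy = {"A": -100.0, "B": -50.0, "C": -170.0}
entropy = {"A": 0.20, "B": 0.15, "C": 0.25}

delta_h = reaction_delta(
    reactants=[(1, enthalpy["A"]), (1, enthalpy["B"])],
    products=[(1, enthalpy["C"])],
)
delta_s = reaction_delta(
    reactants=[(1, entropy["A"]), (1, entropy["B"])],
    products=[(1, entropy["C"])],
)
delta_g = gibbs_energy(delta_h, delta_s, temperature=300.0)

print(f"dH = {delta_h:.2f} kJ/mol")
print(f"dS = {delta_s:.4f} kJ/(mol K)")
print(f"dG = {delta_g:.2f} kJ/mol")
```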
- example notebook
- Implements the Microsoft Skala DFT functional in ASE. Also includes LDA, PBE, and TPSS.
- Includes several def2 basis sets.
- Calculate energy, geometry, dipole, vibrational frequencies.
- example notebook
- Generate a defined number of conformers for a list of molecules.
- Test pharmacophore features of a single or multiple conformers against a known active.
- example notebook.
- predict human, monkey, dog and rat pharmacokinetic properties.
- example notebook
- query Uniprot for protein IDs
- query ChEMBL for bioactive molecules for the desired protein.
- example notebook
- generate 2D and 3D features/fingerprints for molecules.
- apply molecule filters
- perform distance calculations between molecules.
- example notebook
- Query Open Targets for protein targets for a disease.
- example notebook
- Use the Psi4 code to run DFT energy and geometry optimization calculations.
- Use SAPT on Psi4 to explore contributions to interaction energies.
- example notebook
- using Apertus
- A simple agent to test light-weight HuggingFace models on chemical tool use.
- example notebook
- Create a contrastive pairs dataset
- Train an embedding model
- Use embeddings for similarity calculations or features for regression
- example notebook
- Inference with TxGemma models.
- These models have been finetuned to answer many types of medicinal chemistry questions.
- Finetune a TxGemma model on your own medchem dataset
- example notebook
- Inference with the Ether0 model.
- This model has been finetuned to answer many types of medicinal chemistry questions. (see the notebook for use cases).
