This repository contains the code, data, and configurations for benchmarking our GSGE (Group-SELFIES Graph Embeddings) and sequence-based models against the CycPeptMP dataset from Li et al. (2024). The project leverages our DeepCROW (Deep Classification & Regression Optimization Workflow) Benchmark Pipeline for hyperparameter optimization (HPO), model training, and evaluation, with comparisons across various models including GCNs, standard Transformers, LSTMs, and xLSTMs.
Figure 1: Test set MAE leaderboard across all models. Orange bars denote Li et al. (2024) CycPeptMP baselines.
Figure 2: Test set R² leaderboard across all models.
Figure 3: Critical difference diagrams for pairwise model comparisons across CV folds.
Figure 4: Model comparison sets (MCS) — sets of models not significantly worse than the best.
Figure 5: Cross-endpoint MAE comparison across models and endpoints.
Figure 6: Direct MAE comparison of our best models vs. Li et al. (2024) per endpoint.
GSGE (Group-SELFIES Graph Embeddings) extends molecular fragment tokenization and node information with learned graph embeddings of molecular fragments. It is functional-group aware and preserves the structural information of each learned fragment via graph-based autoencoding.
GSGE enables:
- Compact molecular graph representations using molecular fragment nodes
- Embedding learned molecular fragment chemistry into continuous latent space
- Handling the more complex molecular structures of cyclic peptides, for which it was designed and tested
Figure 7: GSGE compound graph used in the GCNs in this study.
See https://github.com/JasperDurinck/GSGE-dev for more information.
- data/: Contains `peptide_used.csv`, the dataset used for benchmarking, sourced from Li et al. (2024) CycPeptMP.
  - split_idx/: Holds `.npy` files with train, validation, and test indices for cross-validation (CV) splits (0, 1, 2) and the holdout test set (`Test_index.npy`). Indices correspond to the `peptide_used.csv` index, not ID (refer to CycPeptMP for ID-based indexing). Example: `Valid_index_cv0.npy`. A test notebook (`test.ipynb`) in this directory validates the indexing.
- vocabs/:
  - `test_gsge_save_with_descriptors.pkl`: Fragment vocabulary for GSGE, used to construct compound graphs in our GSGE package.
  - SMILES_BPE_vocab/: Contains `custom_BPE_SMILES_v1_vocab_config.json` and `custom_BPE_SMILES_v1_vocab.json` for SMILES-based Byte Pair Encoding (BPE) tokenization, used in the DeepCROW package for ablation studies on Transformers, LSTMs, and xLSTMs.
- model_optimization/: Contains subdirectories for each model (e.g., `gcn_7desc_mlp`), each with:
  - `config_hpo.yaml`: Configuration for HPO, executable via the DeepCROW Benchmark Pipeline CLI (`dc_benchmark_pipeline path/config_hpo.yaml`).
  - `config_holdout_evaluation.yaml`: Configuration for evaluating the best HPO model (stored in `model_config.yaml`) on the holdout test set across multiple seeds.
  - hpo_results/: Nested directories for each seed (e.g., `cv012_hpo_seed_42`), containing:
    - CSV files like `fit_model_test_results_cv1.csv` and `fit_model_valid_results_metrics_cv2.csv` with non-rounded predictions (pred, known, ids; censored data: <-8, >-4).
    - weights/: Model weights for each fit.
- custom_code/: Custom code for models, data processing, descriptor calculations, and functions dynamically imported by the DeepCROW pipeline.
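As a rough sketch of how the split files might be consumed, the snippet below selects rows of `peptide_used.csv` using the indices stored in one of the `.npy` files. It assumes the indices are zero-based row positions (the description above only says they refer to the CSV index rather than the ID). The helper name `select_split` and the paths in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd


def select_split(csv_path, index_path):
    """Return the rows of the CSV selected by a saved index array.

    Assumption: the .npy file holds zero-based row positions into the CSV.
    """
    df = pd.read_csv(csv_path)
    idx = np.load(index_path)
    # Fail early if the indices cannot be row positions of this file.
    if idx.size and (idx.min() < 0 or idx.max() >= len(df)):
        raise IndexError("split indices fall outside the CSV row range")
    return df.iloc[idx]


# Hypothetical usage (paths are assumptions about the local layout):
# valid_cv0 = select_split("data/peptide_used.csv",
#                          "data/split_idx/Valid_index_cv0.npy")
```

The `test.ipynb` notebook in `split_idx/` remains the reference for how the indexing is meant to work.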
All analysis notebooks and outputs live under experiments/.
Main analysis hub comparing all graph-based and sequence models.
- `test_overview.ipynb`: Generates the holdout test set leaderboards, scatter panels, residual panels, error distributions, and uncertainty-vs-error plots. Outputs saved to `figures/test_overview/`.
- `val_overview.ipynb`: Same analysis for the cross-validation validation set. Outputs saved to `figures/val_overview/`.
- `model_cv_test_comparison.csv` / `model_cv_valid_comparison.csv`: Aggregated metrics (mean ± std over seeds) for all models on test and validation sets.
- `statistical_significance_analysis.ipynb`: Runs non-parametric and parametric significance tests across CV folds and generates:
  - Boxplots (parametric and non-parametric)
  - Confidence interval grids (ranked and unranked)
  - Critical difference diagrams
  - Model comparison sets (MCS) plots
  - Normality diagnostics
- `figures/`: All output figures (PDF, PNG, SVG).
- `model_comparison.py` / `model_labels.py`: Shared utilities for model filtering and display labels.
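Critical difference diagrams are commonly built from each model's average rank across folds. The sketch below computes those ranks from a fold-by-model table of MAE values. The table layout and the function name `average_ranks` are hypothetical, and the notebook may follow a different procedure.

```python
import pandas as pd


def average_ranks(scores, lower_is_better=True):
    """Rank models within each fold, then average the ranks per model.

    `scores` is a fold-by-model table (rows: CV folds, columns: models).
    Tied models share the mean of the ranks they span.
    """
    ranks = scores.rank(axis=1, method="average", ascending=lower_is_better)
    # Best (lowest) average rank first.
    return ranks.mean(axis=0).sort_values()
```

For a metric where higher is better, such as R², pass `lower_is_better=False` so that the largest value in each fold receives rank 1.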
- Notebooks plotting HPO training loss trajectories for selected models (`gcn_7desc_mlp`, `transformer_bpe`, `tlstm_bpe`).
- Data preparation notebooks (`make_df.ipynb`) that compile per-run metrics into dataframes for downstream significance testing, including a Li et al. sub-directory for baseline comparisons.
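A per-run metrics table of this kind could be assembled roughly as follows. The sketch computes MAE and R² for each run from prediction tables and then summarises them per model as mean and standard deviation. The `pred` and `known` column names follow the result CSVs described earlier; the `model` and `seed` columns, the function names, and the use of the sample standard deviation are assumptions. Censored values are not treated specially here.

```python
import numpy as np
import pandas as pd


def run_metrics(pred, known):
    """MAE and R² (coefficient of determination) for one run."""
    pred = np.asarray(pred, dtype=float)
    known = np.asarray(known, dtype=float)
    mae = np.mean(np.abs(pred - known))
    ss_res = np.sum((known - pred) ** 2)
    ss_tot = np.sum((known - known.mean()) ** 2)
    return {"mae": mae, "r2": 1.0 - ss_res / ss_tot}


def summarise_over_seeds(predictions):
    """Aggregate per-run metrics to mean and std per model.

    `predictions` needs columns: model, seed, pred, known (assumed layout).
    """
    rows = []
    for (model, seed), group in predictions.groupby(["model", "seed"]):
        metrics = run_metrics(group["pred"], group["known"])
        rows.append({"model": model, "seed": seed, **metrics})
    per_run = pd.DataFrame(rows)
    # Columns of the result are (metric, statistic) pairs.
    return per_run.groupby("model")[["mae", "r2"]].agg(["mean", "std"])
```

The intermediate `per_run` frame, with one row per model and seed, is the shape that paired significance tests typically expect.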
Same `test_overview.ipynb` and `val_overview.ipynb` structure as above, scoped to sequence-based models (Transformers BPE/SELFIES/SAT, LSTMs, xLSTMs).
Analysis of model generalization across all CycPeptMP endpoints (cell permeability, PAMPA, Caco-2, etc.).
- `cross_endpoint_metrics_by_run.csv`: Per-run metrics for every endpoint and model.
- `cross_endpoint_metrics_summary_*.csv`: Summary statistics (mean ± std) across seeds and CV folds.
- `cross_endpoint_metrics_li_et_al*.csv` / `.tex`: Li et al. baseline metrics and combined comparison tables (also exported as LaTeX).
- `figures/`: Output figures including:
  - Endpoint leaderboards ranked by MAE and Pearson r
  - Heatmaps of MAE across models × endpoints
  - Direct Li et al. vs. ours comparison plots
  - MAE + R² combined comparison panels
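The models × endpoints heatmaps need a grid of mean MAE values. One way to derive such a grid from a per-run table is sketched below. The column names `model`, `endpoint`, and `mae` and the function name `mae_matrix` are hypothetical; the actual columns of `cross_endpoint_metrics_by_run.csv` may differ.

```python
import pandas as pd


def mae_matrix(per_run, value="mae"):
    """Mean metric per (model, endpoint), laid out as a models x endpoints grid."""
    return per_run.pivot_table(
        index="model", columns="endpoint", values=value, aggfunc="mean"
    )
```

Combinations with no runs appear as NaN in the grid, which makes missing model/endpoint pairs easy to spot before plotting.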
Each model has its own subdirectory under experiments/ (e.g., gcn_7desc_mlp, gae_gcn, fps_mlp, 7desc_mlp, etc.) containing HPO configs, HPO results, and model weights mirroring the model_optimization/ layout.
- inference/: Contains a DeepCROW pipeline example for running inference with trained models.
- For this project we used Python 3.12:

      conda create -n gsge_env python==3.12

- `setup.sh`: Specifies required packages.

      # installs all required Python packages (except xLSTM)
      bash setup_env.sh

- Tests to check setup:

      GSGE_CLI run_test

- xLSTM: Installed from PyPI (`pip install xlstm==2.0.2`). See `environment_dev.yml` for the pinned version. The official xLSTM repository is available at https://github.com/NX-AI/xlstm
- `environment_dev.yml`: Provides a complete list of package versions used during experiments.
All models except xLSTM are trained deterministically and can be reproduced identically in the same environment. xLSTM results may vary slightly between runs due to implementation specifics.
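For readers writing their own scripts around the pipeline, a minimal illustration of seeding is given below. It covers only the Python and NumPy global generators. How the DeepCROW pipeline seeds its runs is not shown in this README, so the helper `seed_everything` is a hypothetical stand-in; a deep learning framework's own seeding and deterministic settings would be needed in addition.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)
```

Calling `seed_everything(42)` before any data shuffling makes those two generators repeat the same sequence across runs.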
- Dependencies can be found in `setup.sh`; see `environment_dev.yml` for specifics.
- Use the local training pipeline in `model_optimization/custom_code/` for model training and hyperparameter optimization.
- Use `inference/` for running predictions with trained models.
- Refer to `experiments/model_comparison/` for test/validation leaderboards and significance analysis.
- Refer to `experiments/cross_endpoint_transfer/` for cross-endpoint generalization analysis.
- The `split_idx/test.ipynb` notebook verifies the correctness of CV and test splits.
- GSGE package details are available in our developed GSGE repository.
- xLSTM models use the official xlstm package (v2.0.2) from PyPI.
- All figures are exported in PDF, PNG, and SVG formats.