CDDLeiden/GSGE-Benchmarking

GSGE-CycPeptMP-Benchmarking

This repository contains the code, data, and configurations for benchmarking our GSGE (Group-SELFIES Graph Embeddings) and sequence-based models on the CycPeptMP dataset from Li et al. (2024). The project uses our DeepCROW (Deep Classification & Regression Optimization Workflow) Benchmark Pipeline for hyperparameter optimization (HPO), model training, and evaluation, and compares several model families, including GCNs, standard Transformers, LSTMs, and xLSTMs.

Key Results

Test Set Performance

Figure 1: Test set MAE leaderboard across all models. Orange bars denote Li et al. (2024) CycPeptMP baselines.

Figure 2: Test set R² leaderboard across all models.

Statistical Significance

Figure 3: Critical difference diagrams for pairwise model comparisons across CV folds.

Figure 4: Model comparison sets (MCS), i.e. sets of models not significantly worse than the best.

Cross-Endpoint Transfer

Figure 5: Cross-endpoint MAE comparison across models and endpoints.

Figure 6: Direct MAE comparison of our best models vs. Li et al. (2024) per endpoint.


GSGE

GSGE (Group-SELFIES Graph Embeddings) extends molecular fragment tokens (or graph nodes) with learned graph embeddings of the fragments. The representation is functional-group aware and preserves the learned structural information of each fragment via graph-based autoencoding.

GSGE enables:

  • Compact molecular graph representations using molecular fragment nodes
  • Embedding of learned molecular fragment chemistry into a continuous latent space
  • Modeling of the more complex molecular structures of cyclic peptides, for which it was designed and tested

Figure 7: GSGE compound graph used in the GCNs in this study.

See https://github.com/JasperDurinck/GSGE-dev for more info


Repository Structure

Data

  • data/: Contains peptide_used.csv, the dataset used for benchmarking, sourced from Li et al. (2024) CycPeptMP.
  • split_idx/: Holds .npy files with the train, validation, and test indices for the cross-validation (CV) splits (0, 1, 2) and for the holdout test set (Test_index.npy), for example Valid_index_cv0.npy. The indices refer to the row index of peptide_used.csv, not to the peptide ID (refer to CycPeptMP for ID-based indexing). The notebook test.ipynb in this directory validates the indexing.
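
To make the index convention concrete, here is a minimal sketch of selecting the rows of one split. It assumes the .npy files store 0-based row positions counted from the first data row of the CSV (header excluded); this is an assumption of the sketch, so check it against split_idx/test.ipynb. The helper name load_split_rows is hypothetical and not part of the repository.

```python
import csv

import numpy as np


def load_split_rows(csv_path, index_path):
    """Return the CSV rows selected by the positions stored in a .npy file.

    Assumption: the .npy file holds 0-based row positions (header excluded),
    not peptide IDs.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    positions = np.load(index_path).astype(int).ravel()
    # Fail loudly instead of silently wrapping around on bad indices.
    if positions.size and (positions.min() < 0 or positions.max() >= len(rows)):
        raise IndexError("split index outside the range of the CSV rows")
    return [rows[i] for i in positions]
```

With the layout described above, a call could look like load_split_rows("data/peptide_used.csv", "split_idx/Valid_index_cv0.npy").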

Vocabularies

  • vocabs/:
    • test_gsge_save_with_descriptors.pkl: Fragment vocabulary for GSGE, used to construct compound graphs in our GSGE package.
    • SMILES_BPE_vocab/: Contains custom_BPE_SMILES_v1_vocab_config.json and custom_BPE_SMILES_v1_vocab.json for SMILES-based Byte Pair Encoding (BPE) tokenization, used in the DeepCROW package for ablation studies on Transformers, LSTMs, and xLSTMs.

Model Optimization

  • model_optimization/: Contains subdirectories for each model (e.g., gcn_7desc_mlp), each with:
    • config_hpo.yaml: Configuration for HPO, executable via the DeepCROW Benchmark Pipeline CLI (dc_benchmark_pipeline path/config_hpo.yaml).
    • config_holdout_evaluation.yaml: Configuration for evaluating the best HPO model (stored in model_config.yaml) on the holdout test set across multiple seeds.
    • hpo_results/: Nested directories for each seed (e.g., cv012_hpo_seed_42), containing:
      • CSV files like fit_model_test_results_cv1.csv and fit_model_valid_results_metrics_cv2.csv with non-rounded predictions (pred, known, ids; censored data: <-8, >-4).
      • weights/: Model weights for each fit.
    • custom_code/: Custom code for models, data processing, descriptor calculations, and functions dynamically imported by the DeepCROW pipeline.
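
As an illustration of the layout above, the following sketch collects one dc_benchmark_pipeline invocation per model directory that contains a config_hpo.yaml. Only the CLI name and the config file name come from this README; the helper hpo_commands is hypothetical, and the sketch only builds the commands without running them.

```python
from pathlib import Path


def hpo_commands(root="model_optimization", cli="dc_benchmark_pipeline"):
    """List one CLI invocation per model directory with a config_hpo.yaml."""
    commands = []
    # Only direct subdirectories of `root` are scanned, one per model.
    for config in sorted(Path(root).glob("*/config_hpo.yaml")):
        commands.append([cli, str(config)])
    return commands
```

Each returned command is a list that can be passed to subprocess.run(command, check=True) once the DeepCROW pipeline is installed.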

Experiments & Analysis

All analysis notebooks and outputs live under experiments/.

experiments/model_comparison/

Main analysis hub comparing all graph-based and sequence models.

  • test_overview.ipynb: Generates the holdout test set leaderboards, scatter panels, residual panels, error distributions, and uncertainty-vs-error plots. Outputs saved to figures/test_overview/.
  • val_overview.ipynb: Same analysis for the cross-validation validation set. Outputs saved to figures/val_overview/.
  • model_cv_test_comparison.csv / model_cv_valid_comparison.csv: Aggregated metrics (mean ± std over seeds) for all models on test and validation sets.
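
The "mean ± std over seeds" aggregation can be sketched as follows. This is not the code that produced the CSV files; in particular, the use of the sample standard deviation is an assumption, and the function name summarize_over_seeds is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_over_seeds(runs):
    """Aggregate per-seed metric values into (mean, std) per model.

    `runs` is an iterable of (model_name, metric_value) pairs, one per seed.
    The sample standard deviation is used; a single run yields std 0.0.
    """
    by_model = defaultdict(list)
    for model, value in runs:
        by_model[model].append(float(value))
    return {
        model: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for model, values in by_model.items()
    }
```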

statistical_significance/

  • statistical_significance_analysis.ipynb: Runs non-parametric and parametric significance tests across CV folds, generates:
    • Boxplots (parametric and non-parametric)
    • Confidence interval grids (ranked and unranked)
    • Critical difference diagrams
    • Model comparison sets (MCS) plots
    • Normality diagnostics
  • figures/: All output figures (PDF, PNG, SVG).
  • model_comparison.py / model_labels.py: Shared utilities for model filtering and display labels.
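
Critical difference diagrams are commonly drawn from the average rank of each model across folds. The sketch below computes those average ranks from per-fold errors; it illustrates the general technique under that assumption and is not taken from statistical_significance_analysis.ipynb. The function name average_ranks is hypothetical.

```python
def average_ranks(scores_by_model):
    """Average rank of each model across folds (rank 1 = lowest error).

    `scores_by_model` maps model name -> list of per-fold errors (e.g. MAE);
    all lists must have the same length. Tied models share the mean rank.
    """
    models = list(scores_by_model)
    n_folds = len(scores_by_model[models[0]])
    if any(len(scores_by_model[m]) != n_folds for m in models):
        raise ValueError("every model needs one score per fold")
    totals = {m: 0.0 for m in models}
    for fold in range(n_folds):
        ordered = sorted(models, key=lambda m: scores_by_model[m][fold])
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of models tied with ordered[i] on this fold.
            while (j + 1 < len(ordered)
                   and scores_by_model[ordered[j + 1]][fold]
                   == scores_by_model[ordered[i]][fold]):
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
            for m in ordered[i:j + 1]:
                totals[m] += shared
            i = j + 1
    return {m: totals[m] / n_folds for m in models}
```

A lower average rank means a model was more often among the best across folds; a significance test on these ranks is a separate step not shown here.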

hpo_mae_loss_trajectory/

  • Notebooks plotting HPO training loss trajectories for selected models (gcn_7desc_mlp, transformer_bpe, tlstm_bpe).

statistical_comparison/

  • Data preparation notebooks (make_df.ipynb) that compile per-run metrics into dataframes for downstream significance testing, including a Li et al. sub-directory for baseline comparisons.

experiments/sequence_models/model_comparison/

Same test_overview.ipynb and val_overview.ipynb structure as above, scoped to sequence-based models (Transformers BPE/SELFIES/SAT, LSTMs, xLSTMs).

experiments/cross_endpoint_transfer/

Analysis of model generalization across all CycPeptMP endpoints (cell permeability, PAMPA, Caco-2, etc.).

  • cross_endpoint_metrics_by_run.csv: Per-run metrics for every endpoint and model.
  • cross_endpoint_metrics_summary_*.csv: Summary statistics (mean ± std) across seeds and CV folds.
  • cross_endpoint_metrics_li_et_al*.csv / .tex: Li et al. baseline metrics and combined comparison tables (also exported as LaTeX).
  • figures/: Output figures including:
    • Endpoint leaderboards ranked by MAE and Pearson r
    • Heatmaps of MAE across models × endpoints
    • Direct Li et al. vs. ours comparison plots
    • MAE + R² combined comparison panels
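
A models × endpoints heatmap needs the per-run metrics pivoted into a matrix first. The sketch below shows one way to do that in plain Python. The column names model, endpoint, and mae are hypothetical placeholders, since the header of cross_endpoint_metrics_by_run.csv is not documented here; adjust them to the actual file.

```python
from collections import defaultdict
from statistics import mean


def mae_matrix(rows, model_key="model", endpoint_key="endpoint", mae_key="mae"):
    """Pivot per-run rows into {model: {endpoint: mean MAE}}.

    The three column names are hypothetical; adjust them to the CSV header.
    """
    cells = defaultdict(list)
    for row in rows:
        cells[(row[model_key], row[endpoint_key])].append(float(row[mae_key]))
    matrix = defaultdict(dict)
    for (model, endpoint), values in cells.items():
        matrix[model][endpoint] = mean(values)
    return {model: dict(endpoints) for model, endpoints in matrix.items()}
```

The rows argument can be any iterable of dictionaries, for example the output of csv.DictReader opened on the per-run CSV.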

Individual model experiment directories

Each model has its own subdirectory under experiments/ (e.g., gcn_7desc_mlp, gae_gcn, fps_mlp, 7desc_mlp, etc.) containing HPO configs, HPO results, and model weights mirroring the model_optimization/ layout.

Inference

  • inference/: Contains a DeepCROW pipeline example for running inference with trained models.

Environment Setup

  • For this project we used Python 3.12:

    conda create -n gsge_env python==3.12
  • setup.sh: Specifies required packages.

    # installs all required python packages (except xLSTM)
    bash setup_env.sh
  • Run the tests to verify the setup:

    GSGE_CLI run_test
  • xLSTM: Installed from PyPI (pip install xlstm==2.0.2). See environment_dev.yml for the pinned version. The official xLSTM repository is available at https://github.com/NX-AI/xlstm

  • environment_dev.yml: Provides a complete list of package versions used during experiments.
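
To confirm that an installed package matches a pinned version such as xlstm==2.0.2, a small check with the standard library is enough. The helper name matches_pin is hypothetical; importlib.metadata is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def matches_pin(package, expected):
    """Return True/False if `package` is installed, or None if it is missing."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return None
```

For example, matches_pin("xlstm", "2.0.2") returns True only when exactly that version is installed in the active environment.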

Reproducibility

All models except xLSTM are trained deterministically and can be reproduced identically in the same environment. xLSTM results may vary slightly between runs due to implementation specifics.

Usage

  1. Install the dependencies listed in setup.sh; see environment_dev.yml for the exact package versions.
  2. Use the local training pipeline in model_optimization/custom_code/ for model training and hyperparameter optimization.
  3. Use inference/ for running predictions with trained models.
  4. Refer to experiments/model_comparison/ for test/validation leaderboards and significance analysis.
  5. Refer to experiments/cross_endpoint_transfer/ for cross-endpoint generalization analysis.

Notes

  • The split_idx/test.ipynb notebook verifies the correctness of CV and test splits.
  • Details of the GSGE package are available in our GSGE repository.
  • xLSTM models use the official xlstm package (v2.0.2) from PyPI.
  • All figures are exported in PDF, PNG, and SVG formats.

About

Benchmarking repository accompanying the CDDLeiden/GSGE package
