
Native Rust hyperparameter optimization — Optuna-style without reloading data #83

@eprifti

Description


Context

Issue #77 proposed running Optuna in Python (gpredomicspy). But Python-level optimization reloads the data from disk for every trial, which is wasteful when the dataset is large (e.g., the wetlab 1981×918 matrix).

A native Rust implementation would:

  1. Load data once
  2. Run hundreds of parameter trials in-memory
  3. Use the same feature selection cache across trials
  4. Be orders of magnitude faster than Python subprocess per trial

Design

Core: optimize() function in lib.rs

```rust
pub fn optimize(
    data: &Data,
    base_param: &Param,       // fixed settings not covered by the search space
    search_space: &SearchSpace,
    n_trials: usize,
    metric: OptMetric,        // TestAUC, TestSpearman, CVMeanAUC, etc.
    mut sampler: Sampler,     // TPE (default), Random, Grid
) -> OptResult {
    // Data is loaded ONCE and borrowed by every trial
    let mut history = TrialHistory::new();
    for trial in 0..n_trials {
        let param = sampler.suggest(search_space, &history);
        let result = run_trial(data, &param);  // no disk I/O
        history.push(trial, param, result);
    }
    let (best_params, best_value) = history.best(metric);
    OptResult { best_params, best_value, history }
}
```
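
As a rough illustration of the load-once structure, here is a self-contained random-search sketch. `TrialParams`, `run_trial`, and the inline LCG sampler are stand-ins for the real `Data`/`Param` types and samplers, not the actual gpredomics API:

```rust
// Toy version of the trial loop: data is borrowed by every trial, no disk I/O.
#[derive(Clone, Debug)]
struct TrialParams {
    k_penalty: f64,
    population_size: usize,
}

// Stand-in objective; the real run_trial would train and score a model.
fn run_trial(data: &[f64], p: &TrialParams) -> f64 {
    data.iter().sum::<f64>() / data.len() as f64
        - p.k_penalty * p.population_size as f64 * 1e-6
}

fn optimize(data: &[f64], n_trials: usize) -> (TrialParams, f64) {
    let mut history: Vec<(TrialParams, f64)> = Vec::new();
    let mut seed: u64 = 42;
    for _ in 0..n_trials {
        // Tiny LCG in place of a real sampler, so the sketch has no dependencies.
        seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        let u = (seed >> 11) as f64 / (1u64 << 53) as f64; // uniform in [0, 1)
        let p = TrialParams {
            // log-uniform in [1e-5, 0.01], matching the k_penalty spec below
            k_penalty: 1e-5 * (0.01f64 / 1e-5).powf(u),
            population_size: 500 + (u * 9500.0) as usize,
        };
        let score = run_trial(data, &p);
        history.push((p, score));
    }
    history
        .into_iter()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

fn main() {
    let data = vec![0.2, 0.4, 0.6, 0.8]; // loaded once
    let (best, value) = optimize(&data, 100);
    println!("best = {:?}, value = {:.4}", best, value);
}
```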

Sampler options

  1. Random — uniform random sampling (baseline)
  2. TPE (Tree-structured Parzen Estimator) — Optuna's default, Bayesian
  3. Grid — exhaustive grid search for small spaces
  4. CMA-ES — covariance matrix adaptation for continuous params
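
One way to keep samplers pluggable is a small trait. The `Sampler` trait below, with `Random` and `Grid` implementations over a single `cooling_rate` parameter, is a hypothetical sketch, not the actual gpredomics interface:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Params {
    cooling_rate: f64,
}

// Each strategy proposes the next trial's parameters given the history so far.
trait Sampler {
    fn suggest(&mut self, history: &[(Params, f64)]) -> Params;
}

/// Uniform random over cooling_rate in [0.99, 0.9999) (baseline).
struct Random {
    seed: u64,
}
impl Sampler for Random {
    fn suggest(&mut self, _history: &[(Params, f64)]) -> Params {
        self.seed = self.seed.wrapping_mul(6364136223846793005).wrapping_add(1);
        let u = (self.seed >> 11) as f64 / (1u64 << 53) as f64;
        Params { cooling_rate: 0.99 + u * (0.9999 - 0.99) }
    }
}

/// Exhaustive grid: cycles through a fixed list of candidate values.
struct Grid {
    values: Vec<f64>,
    next: usize,
}
impl Sampler for Grid {
    fn suggest(&mut self, _history: &[(Params, f64)]) -> Params {
        let v = self.values[self.next % self.values.len()];
        self.next += 1;
        Params { cooling_rate: v }
    }
}

fn main() {
    let mut g = Grid { values: vec![0.99, 0.995, 0.9999], next: 0 };
    println!("{:?}", g.suggest(&[])); // first grid point
}
```

A TPE or CMA-ES sampler would slot in as another `impl Sampler`, using `history` to bias its suggestions.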

Search space definition (in param.yaml)

```yaml
optimize:
  n_trials: 100
  metric: test_auc           # or cv_mean_auc, spearman, etc.
  sampler: tpe
  search_space:
    algo: [ga, beam, sa, ils, lasso]
    k_penalty: {log_uniform: [1e-5, 0.01]}
    language: [ter, "bin,ter", "bin,ter,ratio"]
    data_type: [prev, raw, "raw,prev"]
    population_size: {int: [500, 10000]}
    cooling_rate: {uniform: [0.99, 0.9999]}
    feature_minimal_prevalence_pct: {int: [5, 30]}
```
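
The distribution specs above (categorical lists, `log_uniform`, `uniform`, `int`) could map to a small Rust enum. The `Dist` type and its `sample` method below are illustrative assumptions about that mapping, not existing code:

```rust
// One variant per distribution kind appearing in the param.yaml sketch.
enum Dist {
    Categorical(Vec<String>),        // algo: [ga, beam, ...]
    LogUniform { lo: f64, hi: f64 }, // k_penalty: {log_uniform: [1e-5, 0.01]}
    Uniform { lo: f64, hi: f64 },    // cooling_rate: {uniform: [0.99, 0.9999]}
    Int { lo: i64, hi: i64 },        // population_size: {int: [500, 10000]}
}

impl Dist {
    /// Map a uniform draw u in [0, 1) into this distribution.
    fn sample(&self, u: f64) -> String {
        match self {
            Dist::Categorical(opts) => opts[(u * opts.len() as f64) as usize].clone(),
            Dist::LogUniform { lo, hi } => format!("{:e}", lo * (hi / lo).powf(u)),
            Dist::Uniform { lo, hi } => format!("{}", lo + u * (hi - lo)),
            Dist::Int { lo, hi } => format!("{}", lo + (u * (hi - lo + 1) as f64) as i64),
        }
    }
}

fn main() {
    let k = Dist::LogUniform { lo: 1e-5, hi: 0.01 };
    println!("{}", k.sample(0.5)); // geometric midpoint, ~3.16e-4
}
```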

Key advantages over Python Optuna

| | Python Optuna (#77) | Native Rust |
|---|---|---|
| Data loading | Once per trial (subprocess) | Once total |
| Feature selection | Recomputed per trial | Cached |
| Overhead per trial | ~2 s (process spawn + data I/O) | ~0 ms |
| 100 trials on Qin2014 | ~200 s + algo time | algo time only |
| Parallelism | Limited by the Python GIL | Full rayon parallelism |

Implementation phases

  1. Random sampler + grid — simplest, proves the architecture
  2. TPE sampler — port the core algorithm (kernel density estimation)
  3. Pruning — early stopping of unpromising trials (median pruner)
  4. CLI integration — `gpredomics --optimize param.yaml`
  5. Web app — "Tune" button that calls optimize() via gpredomicspy
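
Phase 3's median pruner can be sketched in a few lines: stop a trial whose intermediate value falls below the median of what completed trials reported at the same step, assuming higher is better. `should_prune` is a hypothetical helper, not an existing function:

```rust
// Median pruning: compare a trial's intermediate value at some step
// against the median of values that completed trials had at that step.
fn should_prune(intermediate: f64, completed_at_step: &mut Vec<f64>) -> bool {
    if completed_at_step.is_empty() {
        return false; // nothing to compare against yet
    }
    completed_at_step.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = completed_at_step[completed_at_step.len() / 2];
    intermediate < median
}

fn main() {
    // AUCs that other trials had reached at this step.
    let mut history = vec![0.70, 0.75, 0.80];
    println!("{}", should_prune(0.60, &mut history)); // prints "true"
    println!("{}", should_prune(0.78, &mut history)); // prints "false"
}
```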

References

  • Akiba et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD.
  • Bergstra et al. (2011). Algorithms for Hyper-Parameter Optimization. NeurIPS.
