Context
Issue #77 proposed Optuna in Python (gpredomicspy). But Python-level optimization reloads data from disk for every trial — wasteful when the dataset is large (e.g., wetlab 1981×918 matrix).
A native Rust implementation would:
- Load data once
- Run hundreds of parameter trials in-memory
- Use the same feature selection cache across trials
- Be orders of magnitude faster than Python subprocess per trial
Design
Core: `optimize()` function in `lib.rs`
```rust
pub fn optimize(
    data: &Data,
    base_param: &Param,
    search_space: &SearchSpace,
    n_trials: usize,
    metric: OptMetric, // TestAUC, TestSpearman, CVMeanAUC, etc.
    sampler: Sampler,  // TPE (default), Random, Grid
) -> OptResult {
    // Data is loaded ONCE, shared across all trials
    let mut history = History::new();
    for trial in 0..n_trials {
        let param = sampler.suggest(&search_space, &history);
        let result = run_trial(data, &param); // no disk I/O
        history.push(trial, param, result);
    }
    OptResult { best_params, best_value, history }
}
```
Sampler options
- Random — uniform random sampling (baseline)
- TPE (Tree-structured Parzen Estimator) — Optuna's default, Bayesian
- Grid — exhaustive grid search for small spaces
- CMA-ES — covariance matrix adaptation for continuous params
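The Random baseline is the simplest sampler to reason about: draw each parameter independently from its range, with `log_uniform` sampled uniformly in log space. A minimal sketch follows; the `SearchSpace`, `ParamRange`, and `ParamValue` types (and the tiny built-in LCG used to avoid a `rand` dependency) are illustrative, not the crate's actual API.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ParamValue {
    Categorical(String),
    Float(f64),
    Int(i64),
}

pub enum ParamRange {
    Categorical(Vec<String>),
    Uniform(f64, f64),
    LogUniform(f64, f64),
    Int(i64, i64),
}

pub struct SearchSpace {
    pub params: Vec<(String, ParamRange)>,
}

/// Minimal deterministic PRNG (an LCG) so the sketch needs no external crates.
pub struct Lcg(pub u64);
impl Lcg {
    pub fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    }
}

/// Draw one independent value per parameter (the Random baseline).
pub fn random_suggest(space: &SearchSpace, rng: &mut Lcg) -> Vec<(String, ParamValue)> {
    space.params.iter().map(|(name, range)| {
        let v = match range {
            ParamRange::Categorical(opts) => {
                let i = (rng.next_f64() * opts.len() as f64) as usize;
                ParamValue::Categorical(opts[i.min(opts.len() - 1)].clone())
            }
            ParamRange::Uniform(lo, hi) => ParamValue::Float(lo + rng.next_f64() * (hi - lo)),
            ParamRange::LogUniform(lo, hi) => {
                // sample uniformly between ln(lo) and ln(hi), then exponentiate
                ParamValue::Float((lo.ln() + rng.next_f64() * (hi.ln() - lo.ln())).exp())
            }
            ParamRange::Int(lo, hi) => {
                ParamValue::Int(lo + (rng.next_f64() * ((hi - lo + 1) as f64)) as i64)
            }
        };
        (name.clone(), v)
    }).collect()
}
```

TPE and CMA-ES would replace `random_suggest` with a history-aware `suggest`, which is why the `Sampler` signature in `optimize()` takes the trial history.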
Search space definition (in param.yaml)
```yaml
optimize:
  n_trials: 100
  metric: test_auc   # or cv_mean_auc, spearman, etc.
  sampler: tpe
  search_space:
    algo: [ga, beam, sa, ils, lasso]
    k_penalty: {log_uniform: [1e-5, 0.01]}
    language: [ter, "bin,ter", "bin,ter,ratio"]
    data_type: [prev, raw, "raw,prev"]
    population_size: {int: [500, 10000]}
    cooling_rate: {uniform: [0.99, 0.9999]}
    feature_minimal_prevalence_pct: {int: [5, 30]}
```
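Parsed entries should be validated before the first trial, since a bad bound (e.g. `log_uniform` with a non-positive lower bound) would otherwise fail mid-run. A hedged sketch of such a check; `SpaceEntry` and the rule set are assumptions mirroring the YAML above, not existing gpredomics types.

```rust
// Hypothetical parsed form of one search_space entry from param.yaml.
pub enum SpaceEntry {
    Choice(Vec<String>),            // e.g. algo: [ga, beam, sa, ils, lasso]
    Uniform { lo: f64, hi: f64 },   // {uniform: [lo, hi]}
    LogUniform { lo: f64, hi: f64 },// {log_uniform: [lo, hi]}
    Int { lo: i64, hi: i64 },       // {int: [lo, hi]}
}

/// Reject ill-formed ranges up front, before any trial runs.
pub fn validate(name: &str, e: &SpaceEntry) -> Result<(), String> {
    match e {
        SpaceEntry::Choice(opts) if opts.is_empty() =>
            Err(format!("{name}: empty choice list")),
        SpaceEntry::Uniform { lo, hi } if lo >= hi =>
            Err(format!("{name}: uniform needs lo < hi")),
        SpaceEntry::LogUniform { lo, hi } if *lo <= 0.0 || lo >= hi =>
            Err(format!("{name}: log_uniform needs 0 < lo < hi")),
        SpaceEntry::Int { lo, hi } if lo > hi =>
            Err(format!("{name}: int needs lo <= hi")),
        _ => Ok(()),
    }
}
```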
Key advantages over Python Optuna
| | Python Optuna (#77) | Native Rust |
|---|---|---|
| Data loading | Once per trial (subprocess) | Once total |
| Feature selection | Recomputed per trial | Cached |
| Overhead per trial | ~2s (process spawn + data I/O) | ~0ms |
| 100 trials on Qin2014 | ~200s + algo time | ~algo time only |
| Parallelism | Python GIL limited | Full rayon parallelism |
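The parallelism row is the key structural win: trials share one borrowed `Data`, so threads add no I/O. A sketch of that sharing pattern using `std::thread::scope` (the real implementation would use rayon; `Data` and the objective here are placeholders, not the crate's types):

```rust
use std::sync::Mutex;
use std::thread;

// Placeholder dataset; in gpredomics this would be the full loaded matrix.
struct Data(Vec<f64>);

// Placeholder objective: a fixed "score" minus the sparsity penalty.
fn run_trial(data: &Data, k_penalty: f64) -> f64 {
    data.0.iter().sum::<f64>() / data.0.len() as f64 - k_penalty
}

/// Evaluate candidate k_penalty values across threads. `data` is borrowed,
/// so it is loaded once and shared by every worker (no per-trial I/O).
fn parallel_random_search(data: &Data, candidates: &[f64], n_threads: usize) -> (f64, f64) {
    let history = Mutex::new(Vec::new());
    let history_ref = &history;
    let chunk_size = (candidates.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for chunk in candidates.chunks(chunk_size.max(1)) {
            s.spawn(move || {
                for &k in chunk {
                    let score = run_trial(data, k); // shared data, no disk access
                    history_ref.lock().unwrap().push((k, score));
                }
            });
        }
    });
    // Return the (param, score) pair with the best score.
    history.into_inner().unwrap().into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .expect("at least one candidate")
}
```

With rayon this collapses to a `par_iter().map(...)` over suggested parameter sets, but the ownership story is identical: one `&Data`, many trials.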
Implementation phases
- Random sampler + grid — simplest, proves the architecture
- TPE sampler — port the core algorithm (kernel density estimation)
- Pruning — early stopping of unpromising trials (median pruner)
- CLI integration — `gpredomics --optimize param.yaml`
- Web app — "Tune" button that calls optimize() via gpredomicspy
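For phase 3, the median pruner's rule is simple: at each reporting step, stop a trial whose intermediate score falls below the median of what completed trials had reached at the same step. A sketch of that logic, mirroring Optuna's MedianPruner; the struct and its fields are illustrative, not an existing gpredomics API.

```rust
pub struct MedianPruner {
    /// Intermediate scores of completed trials, indexed [trial][step].
    completed: Vec<Vec<f64>>,
    /// Minimum number of completed trials before pruning kicks in.
    n_startup_trials: usize,
}

impl MedianPruner {
    /// True if `value` at `step` is below the median of completed trials
    /// at that same step (higher scores assumed better, e.g. AUC).
    pub fn should_prune(&self, step: usize, value: f64) -> bool {
        if self.completed.len() < self.n_startup_trials {
            return false; // not enough evidence yet
        }
        let mut at_step: Vec<f64> = self.completed.iter()
            .filter_map(|t| t.get(step).copied())
            .collect();
        if at_step.is_empty() {
            return false; // no prior trial reached this step
        }
        at_step.sort_by(|a, b| a.total_cmp(b));
        let median = at_step[at_step.len() / 2];
        value < median
    }
}
```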
References
- Akiba et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD.
- Bergstra et al. (2011). Algorithms for Hyper-Parameter Optimization. NeurIPS.