| 1 | +# DynamicAlgorithmSelection2 |
| 2 | + |
| 3 | +RL-based Dynamic Algorithm Selection (DAS) on the [BBOB benchmark](https://numbbo.github.io/coco/). A controller learns to switch between a portfolio of black-box optimizers at runtime — allocating function-evaluation budget to whichever optimizer is most promising at each checkpoint. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Agents |
| 8 | + |
| 9 | +Three agent families share the same BBOB problem set and evaluation protocol: |
| 10 | + |
| 11 | +| Agent | Description | Key reference | |
| 12 | +|---|---|---| |
| 13 | +| **PPO** | Stable Baselines 3 PPO with VecNormalize; multi-dimensional; ELA-based observations | — | |
| 14 | +| **RL-DAS** | Custom single-dimension PyTorch PPO; DE-only portfolio; population-state features with local sampling | Guo et al., 2024 | |
| 15 | +| **Exp-DAS** | Custom PyTorch PPO with exponential checkpoint spacing; flexible portfolio | — | |
| 16 | + |
| 17 | +### PPO |
| 18 | +Uses `DASEnv` — a Gymnasium environment that wraps a warm-started portfolio of arbitrary optimizers. Observations are 22-dimensional ELA landscape features plus per-optimizer movement history. Trains across multiple dimensions simultaneously. |
| 19 | + |
| 20 | +### RL-DAS |
| 21 | +Faithful port of [Guo et al. 2024](https://doi.org/10.1145/3638529.3654223) with BBOB adaptations: |
| 22 | +- Fixed DE portfolio: **NL_SHADE_RSP**, **MadDE**, **JDE21** (all share a single `Population` object as mutable warm-started state). |
| 23 | +- 9-dimensional population-state features computed via local sampling (2 independent forward passes on a population deepcopy). |
| 24 | +- Movement embedder networks compress D-dim displacement vectors to scalars; the backbone is dimension-specific (one model per `--dim`). |
| 25 | +- Hand-rolled PPO training loop (no SB3 dependency for this agent). |
| 26 | + |
| 27 | +### Exp-DAS |
| 28 | +Evolution of the original DAS `policy-gradient` agent. Uses `DASEnv` (same as PPO) but replaces uniform checkpoint spacing with an **exponential schedule** controlled by the Checkpoint Division Base (`--cdb`): |
| 29 | + |
| 30 | +- **`cdb = 1.0` (uniform):** every checkpoint covers the same number of function evaluations — consistent monitoring throughout the run. |
| 31 | +- **`cdb > 1.0` (exponential):** early checkpoints are short (frequent switching during initial exploration) and later checkpoints are long (uninterrupted convergence during exploitation). |
| 32 | + |
| 33 | +The agent uses separate actor and critic learning rates and a configurable number of PPO gradient epochs per update. Like PPO, it supports multiple dimensions simultaneously and an arbitrary optimizer portfolio. |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## Installation |
| 38 | + |
| 39 | +Requires Python 3.11. Dependency management via [uv](https://docs.astral.sh/uv/). |
| 40 | + |
| 41 | +```bash |
| 42 | +uv sync |
| 43 | +``` |
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +## Quick start |
| 48 | + |
| 49 | +`run_local.sh` runs a single agent with tiny settings (fast smoke test): |
| 50 | + |
| 51 | +```bash |
| 52 | +bash run_local.sh [seed] [agent] [portfolio...] |
| 53 | + |
| 54 | +# Examples |
| 55 | +bash run_local.sh 42 ppo CPSO NM TDE |
| 56 | +bash run_local.sh 42 ppo-cv CPSO NM TDE |
| 57 | +bash run_local.sh 42 rl-das # default DE portfolio; no portfolio arguments needed |
| 58 | +bash run_local.sh 42 rl-das-cv |
| 59 | +bash run_local.sh 42 exp-das CPSO NM TDE |
| 60 | +bash run_local.sh 42 exp-das-cv CPSO NM TDE |
| 61 | +bash run_local.sh 42 baselines CPSO NM TDE |
| 62 | +``` |
| 63 | + |
| 64 | +Run the full smoke-test suite (all agent types): |
| 65 | + |
| 66 | +```bash |
| 67 | +bash smoke_test.sh |
| 68 | +# or selectively |
| 69 | +bash smoke_test.sh rl-das rl-das-cv |
| 70 | +``` |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Training |
| 75 | + |
| 76 | +```bash |
| 77 | +python train.py {ppo,rl-das,exp-das} <name> [options] |
| 78 | +``` |
| 79 | + |
| 80 | +### PPO |
| 81 | + |
| 82 | +```bash |
| 83 | +python train.py ppo MY_PPO \ |
| 84 | + -p CPSO NM TDE \ |
| 85 | + -d 2 5 10 \ |
| 86 | + -E 20 \ |
| 87 | + --fe-multiplier 10000 \ |
| 88 | + --n-checkpoints 10 \ |
| 89 | + --seed 42 |
| 90 | +``` |
| 91 | + |
| 92 | +Key options: |
| 93 | + |
| 94 | +| Flag | Default | Description | |
| 95 | +|---|---|---| |
| 96 | +| `-p / --portfolio` | `SPSO IPSO SPSOL` | Sub-optimizer names | |
| 97 | +| `-d / --dims` | all | Problem dimensions | |
| 98 | +| `-E / --n-epochs` | 20 | Passes over the training set | |
| 99 | +| `--fe-multiplier` | 10 000 | Budget = multiplier × dimension | |
| 100 | +| `--n-checkpoints` | 10 | Optimizer-selection steps per episode | |
| 101 | +| `-x / --cdb` | 1.0 | Checkpoint division base (1 = uniform) | |
| 102 | +| `-O / --reward-option` | 1 | Reward shaping (1–4) | |
| 103 | +| `--wandb` | off | Log to Weights & Biases | |
| 104 | + |
| 105 | +Outputs: `models/<name>.zip`, `models/<name>_vecnorm.pkl` |
| 106 | + |
| 107 | +### RL-DAS |
| 108 | + |
| 109 | +```bash |
| 110 | +python train.py rl-das MY_RLDAS \ |
| 111 | + --dim 10 \ |
| 112 | + --n-epochs 20 \ |
| 113 | + --fe-multiplier 10000 \ |
| 114 | + --seed 42 |
| 115 | +``` |
| 116 | + |
| 117 | +The portfolio defaults to `NL_SHADE_RSP MADDE JDE21`, and `--n-individuals` defaults to 170 (matching the original paper). Pass `--portfolio` to override the default optimizers. |
| 118 | + |
| 119 | +Key options: |
| 120 | + |
| 121 | +| Flag | Default | Description | |
| 122 | +|---|---|---| |
| 123 | +| `--dim` | 10 | Problem dimension (one model per dim) | |
| 124 | +| `--n-epochs` | 20 | Training epochs | |
| 125 | +| `--lr` | 1e-5 | Learning rate | |
| 126 | +| `--k-epoch` | `0.3 × n_checkpoints` | PPO gradient steps per episode | |
| 127 | +| `--device` | cpu | PyTorch device | |
| 128 | + |
| 129 | +Outputs: `models/<name>_final.pt`, `models/<name>_epoch<N>.pt`, `models/<name>_train_log.jsonl` |
| 130 | + |
| 131 | +### Exp-DAS |
| 132 | + |
| 133 | +```bash |
| 134 | +python train.py exp-das MY_EXPDAS \ |
| 135 | + -p CPSO NM TDE \ |
| 136 | + --dims 2 5 10 \ |
| 137 | + -E 3 \ |
| 138 | + --cdb 2.0 \ |
| 139 | + --reward-option 1 \ |
| 140 | + --seed 42 |
| 141 | +``` |
| 142 | + |
| 143 | +Key options: |
| 144 | + |
| 145 | +| Flag | Default | Description | |
| 146 | +|---|---|---| |
| 147 | +| `--dims` | `2 5 10` | Problem dimensions | |
| 148 | +| `--cdb` | 2.0 | Checkpoint Division Base (see below) | |
| 149 | +| `-E / --n-epochs` | 3 | Passes over the training set | |
| 150 | +| `--actor-lr` | 3e-5 | Actor learning rate | |
| 151 | +| `--critic-lr` | 1e-5 | Critic learning rate | |
| 152 | +| `--ppo-epochs` | 6 | PPO gradient epochs per update | |
| 153 | +| `--buffer-capacity` | `16 × n_checkpoints` | PPO rollout buffer size in steps | |
| 154 | +| `-O / --reward-option` | 1 | Reward shaping strategy (1–4, see below) | |
| 155 | +| `--save-interval` | 500 | Save a checkpoint every N episodes | |
| 156 | +| `--device` | cpu | PyTorch device | |
| 157 | + |
| 158 | +Outputs: `models/<name>_best.pt`, `models/<name>_final.pt`, `models/<name>_ep<N>.pt`, `models/<name>_train_log.jsonl` |
| 159 | + |
| 160 | +--- |
| 161 | + |
| 162 | +## Checkpoint Division Base (CDB) |
| 163 | + |
| 164 | +The `--cdb` argument controls how the total FE budget is distributed across the `n_checkpoints` decision points in each episode. |
| 165 | + |
| 166 | +With `cdb = 1.0` every checkpoint covers the same number of FEs (uniform). With `cdb > 1.0` checkpoint durations grow exponentially: the first checkpoints are short (fast switching during early exploration) and the last are long (uninterrupted convergence during exploitation). |
| 167 | + |
| 168 | +``` |
| 169 | +cdb = 1.0 → [───][───][───][───][───] uniform |
| 170 | +cdb = 2.0 → [─][──][────][────────] exponential |
| 171 | +``` |
| 172 | + |
| 173 | +**When to use each value:** |
| 174 | + |
| 175 | +| Value | Effect | Use case | |
| 176 | +|---|---|---| |
| 177 | +| `1.0` | Equal-length checkpoints | Consistent monitoring; PPO default | |
| 178 | +| `2.0` | Moderate exponential growth | Exp-DAS default; balances exploration and exploitation | |
| 179 | +| `> 2.0` | Aggressive early switching | Portfolios where early optimizer choice is decisive | |
| 180 | + |
| 181 | +All three agents accept the `--cdb` flag: `ppo` and `exp-das` use it to space checkpoints, while `rl-das` accepts it but ignores it. |
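
The schedule described above can be sketched in a few lines of Python. This is only an illustration: the README does not state the exact formula, so the sketch assumes that checkpoint `i` receives a share of the budget proportional to `cdb**i`. The function name `checkpoint_lengths` is hypothetical and is not part of the repository.

```python
def checkpoint_lengths(budget: int, n_checkpoints: int, cdb: float) -> list[int]:
    """Split `budget` function evaluations into `n_checkpoints` segments.

    Assumption (not confirmed by the README): segment i gets a share
    of the budget proportional to cdb**i.
    """
    if n_checkpoints < 1 or cdb < 1.0:
        raise ValueError("need n_checkpoints >= 1 and cdb >= 1.0")
    weights = [cdb ** i for i in range(n_checkpoints)]
    total = sum(weights)
    # Round the cumulative boundaries (not the individual lengths) so the
    # segments always add up to exactly `budget`.
    bounds, running = [], 0.0
    for w in weights:
        running += w
        bounds.append(round(budget * running / total))
    return [b - a for a, b in zip([0] + bounds, bounds)]


# Budget = fe_multiplier x dimension, e.g. 10 000 x 10.
uniform = checkpoint_lengths(100_000, 10, cdb=1.0)
exponential = checkpoint_lengths(100_000, 10, cdb=2.0)
print(uniform)
print(exponential)
```

Under this assumption, `cdb = 1.0` gives ten equal segments, while `cdb = 2.0` gives segments that roughly double in length from one checkpoint to the next.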
| 182 | + |
| 183 | +--- |
| 184 | + |
| 185 | +## Reward options |
| 186 | + |
| 187 | +The `-O / --reward-option` flag selects the reward signal used at each checkpoint. All options measure improvement in the best objective value found so far and scale it by the initial value range. |
| 188 | + |
| 189 | +| Option | Name | Description | |
| 190 | +|---|---|---| |
| 191 | +| `1` | Log-scaled improvement | Scaled improvement `r` between consecutive checkpoints, clipped to `[0, 1]`, then transformed as `log(r + 1e-5)`. Compresses large variance. **Default.** |
| 192 | +| `2` | Linear clipped improvement | Same as option 1 but without the log transform: `clip(improvement, 0, 1)`. | |
| 193 | +| `3` | Sparse total improvement | Returns `0` at every intermediate checkpoint; at the final checkpoint returns the log-scaled total improvement from episode start. Focuses the agent on end-of-run quality. | |
| 194 | +| `4` | Binary threshold | Returns `1` if scaled improvement ≥ `1e-3`, else `0`. Simple binary feedback. | |
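
The four options in the table can be sketched as a single Python function. This is an illustration only: it assumes minimization, assumes that "scaled improvement" means the drop in best-so-far value divided by the initial value range, and assumes that option 3 applies the same clip and log transform as option 1. The names `scaled_improvement` and `reward` and their parameters are hypothetical; the repository's own logic lives in `das/env/reward.py`.

```python
import math

EPS = 1e-5        # offset inside the log, as given in the table
THRESHOLD = 1e-3  # binary threshold of option 4


def scaled_improvement(prev_best: float, new_best: float, initial_range: float) -> float:
    """Drop in the best-so-far value, scaled by the initial value range."""
    return (prev_best - new_best) / initial_range


def clip01(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def reward(option: int, *, prev_best: float, new_best: float,
           start_best: float, initial_range: float, is_final: bool) -> float:
    step = scaled_improvement(prev_best, new_best, initial_range)
    if option == 1:   # log-scaled improvement since the last checkpoint
        return math.log(clip01(step) + EPS)
    if option == 2:   # linear clipped improvement
        return clip01(step)
    if option == 3:   # sparse: only the final checkpoint is rewarded
        if not is_final:
            return 0.0
        total = scaled_improvement(start_best, new_best, initial_range)
        return math.log(clip01(total) + EPS)
    if option == 4:   # binary threshold
        return 1.0 if step >= THRESHOLD else 0.0
    raise ValueError(f"unknown reward option: {option}")
```

Note that options 1 and 3 return negative values for any improvement below the full initial range, since the log is applied to a number in `[1e-5, 1 + 1e-5]`.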
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +## Cross-validation |
| 199 | + |
| 200 | +```bash |
| 201 | +python cv.py {ppo,rl-das,exp-das} <name> [options] |
| 202 | +``` |
| 203 | + |
| 204 | +Two CV modes: |
| 205 | + |
| 206 | +- **LOIO** (Leave-One-Instance-Out): hold out a subset of BBOB instances per fold. |
| 207 | +- **LOPO** (Leave-One-Problem-Out): hold out a subset of BBOB functions per fold. |
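
The two modes differ only in which identifier is partitioned into folds. The sketch below illustrates this under stated assumptions: instance IDs are taken to be 1 to 15, function IDs 1 to 24, and folds are assigned round-robin. The helper `make_folds` is hypothetical; the splits actually used by `cv.py` are defined in `das/env/bbob_splits.py` and may differ.

```python
def make_folds(ids, n_folds: int):
    """Return a list of (train_ids, test_ids) pairs, one per fold.

    Round-robin assignment is an assumption made for illustration.
    """
    ids = sorted(ids)
    folds = []
    for k in range(n_folds):
        test = ids[k::n_folds]                 # every n_folds-th ID is held out
        held_out = set(test)
        train = [i for i in ids if i not in held_out]
        folds.append((train, test))
    return folds


# LOIO: partition the 15 instances; LOPO: partition the 24 functions.
loio_folds = make_folds(range(1, 16), n_folds=3)
lopo_folds = make_folds(range(1, 25), n_folds=3)

for fold, (train, test) in enumerate(lopo_folds):
    print(f"LOPO fold {fold}: train on {len(train)} functions, hold out {test}")
```

With three folds, each LOIO fold holds out 5 instances and each LOPO fold holds out 8 functions, and every ID is held out exactly once.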
| 208 | + |
| 209 | +```bash |
| 210 | +# PPO – 3-fold LOIO |
| 211 | +python cv.py ppo MY_PPO_CV \ |
| 212 | + -p CPSO NM TDE -d 5 10 \ |
| 213 | + --cv-mode LOIO --n-folds 3 --n-epochs 10 --seed 42 |
| 214 | + |
| 215 | +# RL-DAS – 3-fold LOPO, dim 10 only |
| 216 | +python cv.py rl-das MY_RLDAS_CV \ |
| 217 | + --dim 10 --cv-mode LOPO --n-folds 3 --n-epochs 20 --seed 42 |
| 218 | + |
| 219 | +# Run only folds 0 and 2 |
| 220 | +python cv.py exp-das MY_EXPDAS_CV \ |
| 221 | + -p CPSO NM TDE --dims 5 10 \ |
| 222 | + --cv-mode LOIO --folds 0 2 --n-epochs 3 --seed 42 |
| 223 | +``` |
| 224 | + |
| 225 | +Outputs per fold: `results/<name>_cv_<fold_tag>.jsonl` |
| 226 | +Aggregated: `results/<name>_cv_summary.jsonl` |
| 227 | + |
| 228 | +--- |
| 229 | + |
| 230 | +## Baselines |
| 231 | + |
| 232 | +```bash |
| 233 | +python baselines.py <name> --agent <agent_type> [options] |
| 234 | +``` |
| 235 | + |
| 236 | +Agent types: |
| 237 | + |
| 238 | +| Type | Description | |
| 239 | +|---|---| |
| 240 | +| `random` | Uniform random selection at each checkpoint | |
| 241 | +| `fixed:<name>` | Always pick one optimizer, e.g. `fixed:CPSO` | |
| 242 | +| `single:<name>` | One optimizer runs the full budget (no checkpointing) | |
| 243 | +| `all` | All of the above; derives oracle-best / oracle-worst | |
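
One plausible reading of the oracle baselines is a per-problem best and worst over the individual optimizer runs. The sketch below shows that reading only; the result values, the problem labels, and the helper `oracle` are made up for illustration, and `baselines.py` may compute these quantities differently.

```python
# Hypothetical final best objective values (lower is better), keyed by
# optimizer name and then by a made-up problem label.
single_results = {
    "CPSO": {"f1": 1e-8, "f15": 3.2, "f21": 0.9},
    "NM":   {"f1": 1e-6, "f15": 7.5, "f21": 0.1},
    "TDE":  {"f1": 1e-9, "f15": 2.1, "f21": 1.4},
}


def oracle(results, pick):
    """Apply `pick` (min or max) across optimizers, separately per problem."""
    problems = next(iter(results.values())).keys()
    return {p: pick(run[p] for run in results.values()) for p in problems}


oracle_best = oracle(single_results, min)    # best optimizer chosen per problem
oracle_worst = oracle(single_results, max)   # worst optimizer chosen per problem
print(oracle_best)
print(oracle_worst)
```

Read this way, the two oracles bracket what any policy that commits to a single optimizer per problem could achieve.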
| 244 | + |
| 245 | +```bash |
| 246 | +python baselines.py MY_BASELINES --agent all \ |
| 247 | + -p CPSO NM TDE -d 2 5 10 --seed 42 |
| 248 | +``` |
| 249 | + |
| 250 | +--- |
| 251 | + |
| 252 | +## Evaluation |
| 253 | + |
| 254 | +Load a trained PPO model and evaluate it on the BBOB test set: |
| 255 | + |
| 256 | +```bash |
| 257 | +python evaluate.py MY_PPO \ |
| 258 | + -p CPSO NM TDE -d 5 10 --seed 42 |
| 259 | +``` |
| 260 | + |
| 261 | +Add `--coco-observer` to write COCO-compatible data for `cocopp` post-processing. |
| 262 | + |
| 263 | +--- |
| 264 | + |
| 265 | +## Problem set |
| 266 | + |
| 267 | +The BBOB benchmark provides 24 functions × 15 instances × 6 dimensions = 2 160 problems in total, i.e. 360 problems per dimension. |
| 268 | + |
| 269 | +**Dimensions:** `2, 3, 5, 10, 20, 40` |
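
The size of the problem grid can be checked with a short enumeration. This is a minimal sketch that assumes function IDs 1 to 24 and instance IDs 1 to 15; the constant names are hypothetical and do not come from `bbob_splits.py`.

```python
from itertools import product

FUNCTIONS = range(1, 25)           # 24 BBOB functions
INSTANCES = range(1, 16)           # 15 instances (IDs assumed to be 1..15)
DIMENSIONS = (2, 3, 5, 10, 20, 40)

# Every (function, instance, dimension) triple is one problem.
problems = list(product(FUNCTIONS, INSTANCES, DIMENSIONS))
per_dimension = {d: sum(1 for _, _, dim in problems if dim == d) for d in DIMENSIONS}

print(len(problems))        # total number of problems
print(per_dimension[10])    # problems available at dimension 10
```

The total is 24 × 15 × 6 = 2 160 problems, and each dimension contributes 24 × 15 = 360 of them.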
| 270 | + |
| 271 | +**Default train/test split** (`--mode easy`): trains on 16 structurally simpler functions and tests on the remaining 8 harder functions. |
| 272 | + |
| 273 | +| Mode | Train | Test | |
| 274 | +|---|---|---| |
| 275 | +| `easy` | functions {4,6–14,18–20,22–24} | remaining 8 functions | |
| 276 | +| `hard` | inverse of easy | — | |
| 277 | +| `random` | 2/3 of all problems | 1/3 | |
| 278 | + |
| 279 | +--- |
| 280 | + |
| 281 | +## Optimizer portfolio |
| 282 | + |
| 283 | +Available sub-optimizers (pass names via `-p / --portfolio`): |
| 284 | + |
| 285 | +| Family | Names | |
| 286 | +|---|---| |
| 287 | +| PSO | `SPSO`, `IPSO`, `SPSOL`, `CPSO` | |
| 288 | +| DE | `NL_SHADE_RSP`, `MADDE`, `JDE21`, `TDE` | |
| 289 | +| ES | `NM` (Nelder-Mead) | |
| 290 | +| BO | `BO` | |
| 291 | +| DS | `DS` (Direct Search) | |
| 292 | + |
| 293 | +RL-DAS uses the DE trio `NL_SHADE_RSP / MADDE / JDE21` by default; override it with `--portfolio`. |
| 294 | + |
| 295 | +--- |
| 296 | + |
| 297 | +## HPC / SLURM |
| 298 | + |
| 299 | +Submit all agents for a given seed and portfolio: |
| 300 | + |
| 301 | +```bash |
| 302 | +bash runner.sh |
| 303 | +``` |
| 304 | + |
| 305 | +Individual SLURM scripts: |
| 306 | + |
| 307 | +| Script | Agent | |
| 308 | +|---|---| |
| 309 | +| `ppo_study.slurm` | PPO | |
| 310 | +| `rl_das_study.slurm` | RL-DAS | |
| 311 | +| `exp_das_study.slurm` | Exp-DAS | |
| 312 | +| `baselines.slurm` | Baselines | |
| 313 | + |
| 314 | +--- |
| 315 | + |
| 316 | +## Project structure |
| 317 | + |
| 318 | +``` |
| 319 | +DynamicAlgorithmSelection2/ |
| 320 | +├── train.py # Unified training entry point |
| 321 | +├── cv.py # Cross-validation entry point |
| 322 | +├── baselines.py # Baseline agents |
| 323 | +├── evaluate.py # Model evaluation |
| 324 | +├── run_local.sh # Local smoke-test runner |
| 325 | +├── smoke_test.sh # Full smoke-test suite |
| 326 | +├── runner.sh # SLURM batch submission |
| 327 | +│ |
| 328 | +├── agents/ |
| 329 | +│ ├── rl_das/ # RL-DAS (Guo et al. 2024 port) |
| 330 | +│ │ ├── env.py # RLDASEnv: Population-based Gymnasium env |
| 331 | +│ │ ├── optimizers.py # NL_SHADE_RSP, JDE21, MadDE (BBOB-adapted) |
| 332 | +│ │ ├── population.py # Shared mutable Population state (NLPSR) |
| 333 | +│ │ ├── agent.py # PPOAgent (actor-critic) |
| 334 | +│ │ ├── network.py # Movement embedder + backbone |
| 335 | +│ │ └── trainer.py # train() / evaluate() loops |
| 336 | +│ └── exponential_das/ # Exp-DAS agent |
| 337 | +│ |
| 338 | +├── das/ |
| 339 | +│ ├── env/ |
| 340 | +│ │ ├── das_env.py # DASEnv: Gymnasium env for PPO / Exp-DAS |
| 341 | +│ │ ├── bbob_splits.py# BBOB problem IDs, train/test/CV splits |
| 342 | +│ │ ├── observation.py# ELA feature extraction (22-dim) |
| 343 | +│ │ └── reward.py # Reward shaping options |
| 344 | +│ ├── optimizers/ |
| 345 | +│ │ ├── portfolio.py # get_portfolio() factory |
| 346 | +│ │ └── {PSO,DE,ES,BO,DS}/ # Sub-optimizer implementations |
| 347 | +│ └── training/ |
| 348 | +│ ├── ppo.py # run_ppo() / run_cv_ppo() |
| 349 | +│ ├── rldas.py # run_rl_das() / run_cv_rl_das() |
| 350 | +│ ├── expdas.py # run_exp_das() / run_cv_exp_das() |
| 351 | +│ └── common.py # Shared utilities (JSONL writer, etc.) |
| 352 | +│ |
| 353 | +├── tests/ # pytest test suite |
| 354 | +└── pyproject.toml |
| 355 | +``` |
| 356 | + |
| 357 | +--- |
| 358 | + |
| 359 | +## References |
| 360 | + |
| 361 | +- Guo, Y. et al. (2024). *Deep Reinforcement Learning for Dynamic Algorithm Selection: A Proof-of-Principle Study on Differential Evolution*. GECCO 2024. https://doi.org/10.1145/3638529.3654223 |
| 362 | +- Hansen, N. et al. (2021). *COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting*. Optimization Methods and Software. |