Commit bd686a8

add README
1 parent fd680a5 commit bd686a8

1 file changed

Lines changed: 362 additions & 0 deletions

README.md

@@ -0,0 +1,362 @@
# DynamicAlgorithmSelection2

RL-based Dynamic Algorithm Selection (DAS) on the [BBOB benchmark](https://numbbo.github.io/coco/). A controller learns to switch between a portfolio of black-box optimizers at runtime, allocating function-evaluation budget to whichever optimizer is most promising at each checkpoint.

---

## Agents

Three agent families share the same BBOB problem set and evaluation protocol:

| Agent | Description | Key reference |
|---|---|---|
| **PPO** | Stable Baselines 3 PPO with VecNormalize; multi-dimensional; ELA-based observations | |
| **RL-DAS** | Custom single-dimension PyTorch PPO; DE-only portfolio; population-state features with local sampling | Guo et al., 2024 |
| **Exp-DAS** | Custom PyTorch PPO with exponential checkpoint spacing; flexible portfolio | |

### PPO
Uses `DASEnv`, a Gymnasium environment that wraps a warm-started portfolio of arbitrary optimizers. Observations are 22-dimensional ELA (exploratory landscape analysis) features plus per-optimizer movement history. Trains across multiple dimensions simultaneously.

### RL-DAS
Faithful port of [Guo et al. 2024](https://doi.org/10.1145/3638529.3654223) with BBOB adaptations:
- DE portfolio: **NL_SHADE_RSP**, **MadDE**, **JDE21** (all share a single `Population` object as mutable warm-started state).
- 9-dimensional population-state features computed via local sampling (2 independent forward passes on a population deepcopy).
- Movement embedder networks compress D-dim displacement vectors to scalars; the backbone is dimension-specific (one model per `--dim`).
- Hand-rolled PPO training loop (no SB3 dependency for this agent).

### Exp-DAS
Evolution of the original DAS `policy-gradient` agent. Uses `DASEnv` (same as PPO) but replaces uniform checkpoint spacing with an **exponential schedule** controlled by the Checkpoint Division Base (`--cdb`):

- **`cdb = 1.0` (uniform):** every checkpoint covers the same number of function evaluations, which gives consistent monitoring throughout the run.
- **`cdb > 1.0` (exponential):** early checkpoints are short (frequent switching during initial exploration) and later checkpoints are long (uninterrupted convergence during exploitation).

The agent uses separate actor and critic learning rates and a configurable number of PPO gradient epochs per update. Like PPO, it supports multiple dimensions simultaneously and an arbitrary optimizer portfolio.

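
The checkpoint-based switching that all three agents build on can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the repository's `DASEnv`: the objective (`sphere`), the two toy optimizers, and the hand-written `choose` rule standing in for a learned policy are all hypothetical. What it shows is the warm start, where the incumbent solution is handed from one optimizer to the next at every checkpoint.

```python
import random


def sphere(x):
    """Toy objective (hypothetical stand-in for a BBOB function): minimum 0 at the origin."""
    return sum(v * v for v in x)


def random_step_optimizer(step):
    """Build a toy optimizer: Gaussian perturbation of the incumbent, kept only if better."""
    def run(x, fx, n_evals, rng):
        for _ in range(n_evals):
            cand = [v + rng.gauss(0.0, step) for v in x]
            f_cand = sphere(cand)
            if f_cand < fx:
                x, fx = cand, f_cand
        return x, fx
    return run


def run_episode(portfolio, choose, dim=5, budget=2000, n_checkpoints=10, seed=0):
    """Run one episode; `choose(t, history)` returns the index of the optimizer to use."""
    rng = random.Random(seed)
    x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    fx = sphere(x)
    history = [fx]
    per_checkpoint = budget // n_checkpoints  # uniform spacing for simplicity
    for t in range(n_checkpoints):
        action = choose(t, history)
        # Warm start: the incumbent (x, fx) is carried over to the selected optimizer.
        x, fx = portfolio[action](x, fx, per_checkpoint, rng)
        history.append(fx)
    return history


portfolio = [random_step_optimizer(1.0), random_step_optimizer(0.05)]
# Hand-written rule in place of a trained controller: coarse steps first, fine steps later.
history = run_episode(portfolio, choose=lambda t, h: 0 if t < 3 else 1)
print(f"initial best: {history[0]:.3f}, final best: {history[-1]:.6f}")
```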

---

## Installation

Requires Python 3.11. Dependencies are managed with [uv](https://docs.astral.sh/uv/).

```bash
uv sync
```

---

## Quick start

`run_local.sh` runs a single agent with tiny settings (fast smoke test):

```bash
bash run_local.sh [seed] [agent] [portfolio...]

# Examples
bash run_local.sh 42 ppo CPSO NM TDE
bash run_local.sh 42 ppo-cv CPSO NM TDE
bash run_local.sh 42 rl-das # DE portfolio fixed; no -p needed
bash run_local.sh 42 rl-das-cv
bash run_local.sh 42 exp-das CPSO NM TDE
bash run_local.sh 42 exp-das-cv CPSO NM TDE
bash run_local.sh 42 baselines CPSO NM TDE
```

Run the full smoke-test suite (all agent types):

```bash
bash smoke_test.sh
# or selectively
bash smoke_test.sh rl-das rl-das-cv
```

---

## Training

```bash
python train.py {ppo,rl-das,exp-das} <name> [options]
```

### PPO

```bash
python train.py ppo MY_PPO \
    -p CPSO NM TDE \
    -d 2 5 10 \
    -E 20 \
    --fe-multiplier 10000 \
    --n-checkpoints 10 \
    --seed 42
```

Key options:

| Flag | Default | Description |
|---|---|---|
| `-p / --portfolio` | `SPSO IPSO SPSOL` | Sub-optimizer names |
| `-d / --dims` | all | Problem dimensions |
| `-E / --n-epochs` | 20 | Passes over the training set |
| `--fe-multiplier` | 10 000 | FE budget = multiplier × dimension |
| `--n-checkpoints` | 10 | Optimizer-selection steps per episode |
| `-x / --cdb` | 1.0 | Checkpoint division base (1 = uniform) |
| `-O / --reward-option` | 1 | Reward shaping (1–4) |
| `--wandb` | off | Log to Weights & Biases |

Outputs: `models/<name>.zip`, `models/<name>_vecnorm.pkl`


### RL-DAS

```bash
python train.py rl-das MY_RLDAS \
    --dim 10 \
    --n-epochs 20 \
    --fe-multiplier 10000 \
    --seed 42
```

The portfolio defaults to `NL_SHADE_RSP MADDE JDE21`, and `--n-individuals` defaults to 170 (matching the original paper). Use `--portfolio` to override the portfolio.

Key options:

| Flag | Default | Description |
|---|---|---|
| `--dim` | 10 | Problem dimension (one model per dim) |
| `--n-epochs` | 20 | Training epochs |
| `--lr` | 1e-5 | Learning rate |
| `--k-epoch` | `0.3 × n_checkpoints` | PPO gradient steps per episode |
| `--device` | cpu | PyTorch device |

Outputs: `models/<name>_final.pt`, `models/<name>_epoch<N>.pt`, `models/<name>_train_log.jsonl`


### Exp-DAS

```bash
python train.py exp-das MY_EXPDAS \
    -p CPSO NM TDE \
    --dims 2 5 10 \
    -E 3 \
    --cdb 2.0 \
    --reward-option 1 \
    --seed 42
```

Key options:

| Flag | Default | Description |
|---|---|---|
| `--dims` | `2 5 10` | Problem dimensions |
| `--cdb` | 2.0 | Checkpoint Division Base (see below) |
| `-E / --n-epochs` | 3 | Passes over the training set |
| `--actor-lr` | 3e-5 | Actor learning rate |
| `--critic-lr` | 1e-5 | Critic learning rate |
| `--ppo-epochs` | 6 | PPO gradient epochs per update |
| `--buffer-capacity` | `16 × n_checkpoints` | PPO rollout buffer size in steps |
| `-O / --reward-option` | 1 | Reward shaping strategy (1–4, see below) |
| `--save-interval` | 500 | Save a checkpoint every N episodes |
| `--device` | cpu | PyTorch device |

Outputs: `models/<name>_best.pt`, `models/<name>_final.pt`, `models/<name>_ep<N>.pt`, `models/<name>_train_log.jsonl`

---

## Checkpoint Division Base (CDB)

The `--cdb` argument controls how the total FE budget is distributed across the `n_checkpoints` decision points in each episode.

With `cdb = 1.0`, every checkpoint covers the same number of FEs (uniform). With `cdb > 1.0`, checkpoint durations grow exponentially: the first checkpoints are short (fast switching during early exploration) and the last are long (uninterrupted convergence during exploitation).

```
cdb = 1.0 → [───][───][───][───][───] uniform
cdb = 2.0 → [─][──][────][────────] exponential
```

**When to use each value:**

| Value | Effect | Use case |
|---|---|---|
| `1.0` | Equal-length checkpoints | Consistent monitoring; PPO default |
| `2.0` | Moderate exponential growth | Exp-DAS default; balances exploration and exploitation |
| `> 2.0` | Aggressive early switching | Portfolios where early optimizer choice is decisive |

The `--cdb` flag is accepted by all three agents, but only `ppo` and `exp-das` use it; `rl-das` ignores it.

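
One way to picture the schedule is to give checkpoint `i` a weight of `cdb ** i` and split the FE budget in proportion to those weights. The sketch below does exactly that; the geometric weighting and the helper name `checkpoint_lengths` are assumptions made for illustration, since this README does not state the exact formula the code uses. The budget follows the `--fe-multiplier` rule (multiplier × dimension).

```python
def checkpoint_lengths(budget, n_checkpoints, cdb):
    """Split `budget` FEs into `n_checkpoints` segments whose lengths grow by a factor `cdb`.

    Assumption: segment i gets a share proportional to cdb ** i.
    """
    weights = [cdb ** i for i in range(n_checkpoints)]
    total = sum(weights)
    # Round the cumulative boundaries so that the segment lengths sum exactly to `budget`.
    bounds = [0]
    acc = 0.0
    for w in weights:
        acc += w
        bounds.append(round(budget * acc / total))
    return [b - a for a, b in zip(bounds, bounds[1:])]


dim = 10
budget = 10_000 * dim  # --fe-multiplier 10000 with dimension 10
uniform = checkpoint_lengths(budget, 10, 1.0)
exponential = checkpoint_lengths(budget, 10, 2.0)
print(uniform)
print(exponential)
```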
182+
183+
---
184+
185+
## Reward options
186+
187+
The `-O / --reward-option` flag selects the reward signal used at each checkpoint. All options measure improvement in the best objective value found so far and scale it by the initial value range.
188+
189+
| Option | Name | Description |
190+
|---|---|---|
191+
| `1` | Log-scaled improvement | `improvement` between consecutive checkpoints, clipped to `[0, 1]`, then `log(r + 1e-5)`. Smooths large variance. **Default.** |
192+
| `2` | Linear clipped improvement | Same as option 1 but without the log transform: `clip(improvement, 0, 1)`. |
193+
| `3` | Sparse total improvement | Returns `0` at every intermediate checkpoint; at the final checkpoint returns the log-scaled total improvement from episode start. Focuses the agent on end-of-run quality. |
194+
| `4` | Binary threshold | Returns `1` if scaled improvement ≥ `1e-3`, else `0`. Simple binary feedback. |
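
The four options in the table can be written down compactly. The following is a minimal sketch of the table, not the contents of `reward.py`: the function names and arguments are hypothetical, minimisation is assumed, and the clipping applied to the total improvement in option 3 is an assumption carried over from option 1.

```python
import math


def scaled_improvement(prev_best, curr_best, initial_range):
    """Improvement in the best-so-far value (minimisation), scaled by the initial value range."""
    return (prev_best - curr_best) / initial_range


def reward(option, prev_best, curr_best, start_best, initial_range, is_final):
    """Reward for one checkpoint under reward options 1-4 as described in the table."""
    step = scaled_improvement(prev_best, curr_best, initial_range)
    clipped = min(max(step, 0.0), 1.0)
    if option == 1:  # log-scaled improvement
        return math.log(clipped + 1e-5)
    if option == 2:  # linear clipped improvement
        return clipped
    if option == 3:  # sparse: only the final checkpoint is rewarded
        if not is_final:
            return 0.0
        total = scaled_improvement(start_best, curr_best, initial_range)
        # Assumption: the total improvement is clipped like the per-step one.
        return math.log(min(max(total, 0.0), 1.0) + 1e-5)
    if option == 4:  # binary threshold
        return 1.0 if step >= 1e-3 else 0.0
    raise ValueError(f"unknown reward option: {option}")


# Best value drops from 10 to 8 on a problem whose initial value range is 4.
print(reward(2, prev_best=10.0, curr_best=8.0, start_best=10.0, initial_range=4.0, is_final=False))
```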

---

## Cross-validation

```bash
python cv.py {ppo,rl-das,exp-das} <name> [options]
```

Two CV modes:

- **LOIO** (Leave-One-Instance-Out): hold out a subset of BBOB instances per fold.
- **LOPO** (Leave-One-Problem-Out): hold out a subset of BBOB functions per fold.

```bash
# PPO – 3-fold LOIO
python cv.py ppo MY_PPO_CV \
    -p CPSO NM TDE -d 5 10 \
    --cv-mode LOIO --n-folds 3 --n-epochs 10 --seed 42

# RL-DAS – 3-fold LOPO, dim 10 only
python cv.py rl-das MY_RLDAS_CV \
    --dim 10 --cv-mode LOPO --n-folds 3 --n-epochs 20 --seed 42

# Run only folds 0 and 2
python cv.py exp-das MY_EXPDAS_CV \
    -p CPSO NM TDE --dims 5 10 \
    --cv-mode LOIO --folds 0 2 --n-epochs 3 --seed 42
```

Outputs:

- Per fold: `results/<name>_cv_<fold_tag>.jsonl`
- Aggregated: `results/<name>_cv_summary.jsonl`

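
The difference between the two CV modes is only which identifier is held out: the instance ID for LOIO, the function ID for LOPO. The sketch below makes that concrete for a single dimension. It is not the logic of `bbob_splits.py`: the helper names, the round-robin fold assignment, and the instance IDs `1..15` are all assumptions chosen for illustration.

```python
def make_folds(ids, n_folds):
    """Partition `ids` into `n_folds` disjoint held-out groups (round-robin assignment)."""
    return [ids[k::n_folds] for k in range(n_folds)]


def cv_splits(mode, n_folds, functions=range(1, 25), instances=range(1, 16)):
    """Yield (train, test) lists of (function_id, instance_id) pairs, one pair per fold."""
    functions, instances = list(functions), list(instances)
    problems = [(f, i) for f in functions for i in instances]
    if mode == "LOIO":    # hold out a subset of instances
        held_groups, key = make_folds(instances, n_folds), 1
    elif mode == "LOPO":  # hold out a subset of functions
        held_groups, key = make_folds(functions, n_folds), 0
    else:
        raise ValueError(f"unknown CV mode: {mode}")
    for group in held_groups:
        held = set(group)
        test = [p for p in problems if p[key] in held]
        train = [p for p in problems if p[key] not in held]
        yield train, test


for fold, (train, test) in enumerate(cv_splits("LOPO", n_folds=3)):
    held_functions = sorted({f for f, _ in test})
    print(fold, len(train), len(test), held_functions)
```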

---

## Baselines

```bash
python baselines.py <name> --agent <agent_type> [options]
```

Agent types:

| Type | Description |
|---|---|
| `random` | Uniform random selection at each checkpoint |
| `fixed:<name>` | Always pick one optimizer, e.g. `fixed:CPSO` |
| `single:<name>` | One optimizer runs the full budget (no checkpointing) |
| `all` | All of the above; derives oracle-best / oracle-worst |

```bash
python baselines.py MY_BASELINES --agent all \
    -p CPSO NM TDE -d 2 5 10 --seed 42
```
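
The `random` and `fixed:<name>` baselines are selection policies evaluated at each checkpoint, which a few lines of Python can express. This is a hedged sketch rather than the code in `baselines.py`: `make_baseline` and `oracle_bounds` are hypothetical names, and taking the per-problem minimum and maximum over baseline results is only one plausible reading of "oracle-best / oracle-worst". `single:<name>` is left out because it runs without checkpoints and so is not a per-checkpoint policy.

```python
import random


def make_baseline(agent, portfolio, seed=0):
    """Return a policy mapping a checkpoint index to an optimizer name."""
    if agent == "random":
        rng = random.Random(seed)
        return lambda checkpoint: rng.choice(portfolio)
    if agent.startswith("fixed:"):
        name = agent.split(":", 1)[1]
        if name not in portfolio:
            raise ValueError(f"{name} is not in the portfolio")
        return lambda checkpoint: name
    raise ValueError(f"unsupported agent type: {agent}")


def oracle_bounds(final_values):
    """Given {baseline: final best value} for one problem (minimisation),
    return (oracle-best, oracle-worst). Assumed definition, see the text above."""
    return min(final_values.values()), max(final_values.values())


portfolio = ["CPSO", "NM", "TDE"]
fixed = make_baseline("fixed:CPSO", portfolio)
rand = make_baseline("random", portfolio, seed=42)
print([fixed(t) for t in range(3)])
print([rand(t) for t in range(3)])
```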

---

## Evaluation

Load a trained PPO model and evaluate it on the BBOB test set:

```bash
python evaluate.py MY_PPO \
    -p CPSO NM TDE -d 5 10 --seed 42
```

Add `--coco-observer` to write COCO-compatible data for `cocopp` post-processing.

---

## Problem set

The BBOB benchmark provides 24 functions × 15 instances × 6 dimensions = 2 160 problems in total, i.e. 360 problems per dimension.

**Dimensions:** `2, 3, 5, 10, 20, 40`


**Default train/test split** (`--mode easy`): trains on the 16 structurally simpler functions listed below and tests on the remaining 8 harder functions.

| Mode | Train | Test |
|---|---|---|
| `easy` | functions {4,6–14,18–20,22–24} | remaining 8 functions |
| `hard` | inverse of `easy` | |
| `random` | 2/3 of all problems | 1/3 |

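
The problem counts and the `easy` split can be checked by enumeration. The snippet below is a small sketch that uses only the numbers given in this section; the constant names are hypothetical and the instances are indexed `0..14` purely for counting.

```python
from itertools import product

FUNCTIONS = range(1, 25)            # 24 BBOB functions
N_INSTANCES = 15
DIMENSIONS = (2, 3, 5, 10, 20, 40)

problems = list(product(FUNCTIONS, range(N_INSTANCES), DIMENSIONS))
per_dimension = sum(1 for _, _, d in problems if d == 10)

# Function IDs from the `easy` row of the table: {4, 6-14, 18-20, 22-24}.
EASY_TRAIN = {4, *range(6, 15), *range(18, 21), *range(22, 25)}
easy_test = sorted(set(FUNCTIONS) - EASY_TRAIN)

print(len(problems), per_dimension)
print(len(EASY_TRAIN), easy_test)
```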

---

## Optimizer portfolio

Available sub-optimizers (pass names via `-p / --portfolio`):

| Family | Names |
|---|---|
| PSO | `SPSO`, `IPSO`, `SPSOL`, `CPSO` |
| DE | `NL_SHADE_RSP`, `MADDE`, `JDE21`, `TDE` |
| ES | `NM` (Nelder-Mead) |
| BO | `BO` |
| DS | `DS` (Direct Search) |

RL-DAS uses the DE trio `NL_SHADE_RSP / MADDE / JDE21` by default; override it with `--portfolio`.

---

## HPC / SLURM

Submit all agents for a given seed and portfolio:

```bash
bash runner.sh
```

Individual SLURM scripts:

| Script | Agent |
|---|---|
| `ppo_study.slurm` | PPO |
| `rl_das_study.slurm` | RL-DAS |
| `exp_das_study.slurm` | Exp-DAS |
| `baselines.slurm` | Baselines |

---

## Project structure

```
DynamicAlgorithmSelection2/
├── train.py                   # Unified training entry point
├── cv.py                      # Cross-validation entry point
├── baselines.py               # Baseline agents
├── evaluate.py                # Model evaluation
├── run_local.sh               # Local smoke-test runner
├── smoke_test.sh              # Full smoke-test suite
├── runner.sh                  # SLURM batch submission

├── agents/
│   ├── rl_das/                # RL-DAS (Guo et al. 2024 port)
│   │   ├── env.py             # RLDASEnv: Population-based Gymnasium env
│   │   ├── optimizers.py      # NL_SHADE_RSP, JDE21, MadDE (BBOB-adapted)
│   │   ├── population.py      # Shared mutable Population state (NLPSR)
│   │   ├── agent.py           # PPOAgent (actor-critic)
│   │   ├── network.py         # Movement embedder + backbone
│   │   └── trainer.py         # train() / evaluate() loops
│   └── exponential_das/       # Exp-DAS agent

├── das/
│   ├── env/
│   │   ├── das_env.py         # DASEnv: Gymnasium env for PPO / Exp-DAS
│   │   ├── bbob_splits.py     # BBOB problem IDs, train/test/CV splits
│   │   ├── observation.py     # ELA feature extraction (22-dim)
│   │   └── reward.py          # Reward shaping options
│   ├── optimizers/
│   │   ├── portfolio.py       # get_portfolio() factory
│   │   └── {PSO,DE,ES,BO,DS}/ # Sub-optimizer implementations
│   └── training/
│       ├── ppo.py             # run_ppo() / run_cv_ppo()
│       ├── rldas.py           # run_rl_das() / run_cv_rl_das()
│       ├── expdas.py          # run_exp_das() / run_cv_exp_das()
│       └── common.py          # Shared utilities (JSONL writer, etc.)

├── tests/                     # pytest test suite
└── pyproject.toml
```

---

## References

- Guo, Y. et al. (2024). *Deep Reinforcement Learning for Dynamic Algorithm Selection: A Proof-of-Principle Study on Differential Evolution*. GECCO 2024. https://doi.org/10.1145/3638529.3654223
- Hansen, N. et al. (2021). *COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting*. Optimization Methods and Software.
