rl-toybox is a compact reinforcement-learning playground of short arcade-style games that share one composition path, common runtime/rendering/training infrastructure, and small, inspectable environments. The repo is organized so each game can stand on its own while still reusing common configuration, evaluation, algorithm, and runtime code.
- `core/value_discrete/` contains the shared value-based stack used by `snake` and `bang`.
- `core/actor_critic/` contains the shared PPO/SAC stack plus centralized-critic support used by `jump`, `vroom`, and `kick`.
- `core/search_play/` contains the compact MCTS, policy/value, and self-play stack used by `osero`.
- `core/algorithms/` contains the shared algorithm factory and thin common interfaces used by the composition layer.
- `core/shared_config.py` contains the shared runtime/window defaults used across the active games.
- `core/game.py` owns the active game registry, compatibility checks, config composition, and shared run preparation.
- `games/<name>/` contains each game's environment, configuration, and game-specific README.
- Repo guide: docs/repo-guide.md
- RL and environment design guide: docs/rl-design-guide.md
With package install:

```
pip install -e .
rl-toybox-train --game bang
rl-toybox-play-ai --game bang --model best --render
rl-toybox-play-user --game bang
```

Without installation, from the repo root:

```
python -m scripts.train --game bang
python -m scripts.play_ai --game bang --model best --render
python -m scripts.play_user --game bang
```

`play_ai` loads `best` by default, so `--model best` is shown only to make the artifact choice explicit. Curriculum-based games now use a shared L1 to L5 ladder: training defaults to L1, while play/eval/capture default to L5. `osero` remains the temporary exception: its board size is selected through `OSERO_BOARD_SIZE`, and 6x6 is the default.
| Game ID | Role | Family | Summary | Docs |
|---|---|---|---|---|
| `snake` | Intro grid-control game | value-based | Classic Snake with obstacle curriculum, compact egocentric observations, and lightweight shaping rewards | `games/snake/README.md` |
| `bang` | Flagship discrete-control arena game | value-based | Top-down arena shooter focused on movement, aiming, line of sight, and shot timing under pressure | `games/bang/README.md` |
| `jump` | Traversal platformer | actor-critic | Compact side-view micro-platformer built around short procedural runs, timing windows, and simple left/right/jump control | `games/jump/README.md` |
| `vroom` | Continuous-control racing game | actor-critic | One-lap top-down racer with procedural tracks, compact vector observations, and SAC-oriented defaults | `games/vroom/README.md` |
| `osero` | Planning + self-play capstone | search + self-play | Compact Osero/Reversi implementation using MCTS, self-play, and a small policy/value network | `games/osero/README.md` |
| `kick` | Multi-agent football / CTDE showcase | actor-critic / CTDE | Shared-policy top-down 7v7 football environment with centralized-critic PPO training | `games/kick/README.md` |
- Arcade / egocentric control: `SELF -> SENS -> TGT/LAND/OPP -> HAZ -> FLAG`
- Team / CTDE control: `SELF -> TGT -> LAND -> ALLY -> OPP -> MAP -> FLAG`
- Board self-play / search: `BOARD` only; legal moves stay outside the observation via action masking
- Blocks can be omitted when they do not apply. Compact canonical prefixes are `self_`, `sens_`, `tgt_`, `land_`, `ally_`, `opp_`, `map_`, `haz_`, `flag_`, and `board_`.

Current active examples:

- snake: `self_*`, `sens_*`, `tgt_*`
- bang: `self_*`, `sens_*`, `opp*_*`, `haz_*`
- jump: `self_*`, `sens_*`, `land_*`, `opp*_*`, `flag_*`
- vroom: `self_*`, `sens_*`, `flag_*`
- kick: `self_*`, `tgt_*`, `land_*`, `ally*_*`, `opp*_*`, `map_*`, `flag_*`
- osero: `board_r*_c*`
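As an illustration of the prefix convention, here is a small hypothetical validator. The names `CANONICAL_PREFIXES` and `check_obs_names` are ours, not part of the repo, and the sketch only covers the plain (non-numbered) prefixes:

```python
# Hypothetical checker for the canonical observation-name prefixes.
# CANONICAL_PREFIXES mirrors the documented list; the ordering check
# ensures blocks appear in the canonical order, with unused blocks
# simply omitted (as in vroom's self_*, sens_*, flag_* layout).

CANONICAL_PREFIXES = [
    "self_", "sens_", "tgt_", "land_", "ally_",
    "opp_", "map_", "haz_", "flag_", "board_",
]

def check_obs_names(names):
    """Return True if every name uses a canonical prefix and blocks are ordered."""
    last_block = -1
    for name in names:
        block = next(
            (i for i, p in enumerate(CANONICAL_PREFIXES) if name.startswith(p)),
            None,
        )
        if block is None:
            return False          # unknown prefix
        if block < last_block:
            return False          # block out of canonical order
        last_block = block
    return True

# vroom-style layout: blocks between sens_ and flag_ are omitted
assert check_obs_names(["self_x", "self_y", "sens_ray0", "flag_progress"])
```

A real checker would also need to accept numbered variants such as `opp1_*` and `ally2_*`; this sketch only shows the ordering idea.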
Per-game `config.py` owns the exact observation/action names, order, dimensions, model defaults, and training stop budget. The standard active-game template is `DEFAULT_ALGO`, `DEFAULT_MODEL_CONFIG`, `ALGO_CONFIG_OVERRIDES`, and `DEFAULT_TRAIN_CONFIG`.

- Change `DEFAULT_MODEL_CONFIG["hidden_sizes"]` to set one game-wide network size across supported models, and use `DEFAULT_MODEL_CONFIG["critic_hidden_sizes"]` when a game has a separate critic shape.
- Use `ALGO_CONFIG_OVERRIDES[algo_id]` only for true algo-specific deltas such as PPO entropy, DQN replay settings, or search-play simulations.
- Change `DEFAULT_TRAIN_CONFIG["budget"]` to change when a game's training run stops, including when you launch that game with a non-default compatible algo. The budget unit is total environment steps for the value-based and actor-critic families, and self-play games for `search_play`.
- Runner-specific extras such as `rollout_steps` still only apply to runners that use them.

The root docs and game READMEs should mirror that config truth.
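A `config.py` following that template might look like the sketch below. All concrete values here are illustrative placeholders (loosely modeled on an actor-critic game), not the contents of any real game's file:

```python
# Illustrative games/<name>/config.py following the standard template.
# Every value below is a placeholder; the real file also defines the
# exact observation/action names, order, and dimensions.

DEFAULT_ALGO = "ppo"

DEFAULT_MODEL_CONFIG = {
    "hidden_sizes": [32, 32],          # one game-wide size across supported models
    # "critic_hidden_sizes": [64, 64], # only when the critic shape differs
}

ALGO_CONFIG_OVERRIDES = {
    # True algo-specific deltas only, keyed by algo id.
    "ppo": {"entropy_coef": 0.01},     # placeholder PPO-only knob
}

DEFAULT_TRAIN_CONFIG = {
    "budget": 200_000,      # total env steps for value-based / actor-critic games
    "rollout_steps": 512,   # runner-specific extra; ignored by runners without it
}
```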
- snake -> `qlearn`, obs=12, act=3, Q-network `12 -> 32 -> 3`
- bang -> `dqn`, obs=28, act=8, Q-network `28 -> 64 -> 64 -> 8` with double-Q, a dueling head, and prioritized replay
- jump -> `ppo`, obs=32, act=4, actor `32 -> 32 -> 32 -> 4`, critic `32 -> 32 -> 32 -> 1`
- vroom -> `sac`, obs=20, act=3, actor `20 -> 64 -> 64 -> 3`, twin critics `(20 + 3) -> 64 -> 64 -> 1`
- osero -> `search_play`, default 6x6, obs=36, act=37, policy/value net `36 -> 64 -> 64 -> (37 + 1)`; 4x4 uses `16 -> 48 -> 48`, 8x8 uses `64 -> 96 -> 96`
- kick -> `ppo`, obs=56/player, act=12, shared actor `56 -> 96 -> 96 -> 12`, centralized critic `405 -> 192 -> 192 -> 1`
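The arrow notation above just lists layer widths. A helper like this hypothetical `mlp_layer_dims` (ours, for illustration only; the repo's model builders may construct layers differently) makes the mapping to `(in, out)` weight shapes explicit:

```python
def mlp_layer_dims(obs_dim, hidden_sizes, act_dim):
    """Expand 'obs -> h1 -> ... -> act' notation into (in, out) layer pairs."""
    dims = [obs_dim, *hidden_sizes, act_dim]
    return list(zip(dims[:-1], dims[1:]))

# snake's Q-network 12 -> 32 -> 3:
assert mlp_layer_dims(12, [32], 3) == [(12, 32), (32, 3)]

# vroom's twin critics consume state + action: (20 + 3) -> 64 -> 64 -> 1
assert mlp_layer_dims(20 + 3, [64, 64], 1) == [(23, 64), (64, 64), (64, 1)]
```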
There is no post-config pair-override layer for the active games. Shared algorithm defaults provide the family baseline, and each active game's config.py is the final default source before explicit user overrides.
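That layering can be sketched as a plain dict merge. The function and game names here (`compose_config`, the sample configs) are illustrative stand-ins, not the actual API in `core/game.py`:

```python
# Hypothetical sketch of the config precedence described above:
# shared family defaults < per-game config.py < explicit user overrides.

def compose_config(family_defaults, game_config, user_overrides=None):
    """Later layers win; there is no extra pair-override layer in between."""
    merged = dict(family_defaults)
    merged.update(game_config)          # game's config.py is the final default source
    merged.update(user_overrides or {}) # explicit user overrides win over everything
    return merged

family_defaults = {"algo": "qlearn", "gamma": 0.99}   # placeholder family baseline
game_config = {"algo": "dqn", "obs": 28, "act": 8}    # placeholder per-game defaults

cfg = compose_config(family_defaults, game_config, {"gamma": 0.98})
# cfg["algo"]  == "dqn"  (game config overrides the family baseline)
# cfg["gamma"] == 0.98   (user override wins over both)
```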