A Python-based puzzle game environment for generating datasets and training RL agents.
Fruit Box is a puzzle game played on a 10×17 grid:
- Each cell contains a digit from 1-9
- Goal: Clear as many cells as possible
- Valid Move: Select any rectangular region that sums to exactly 10
- Action: Selected cells are cleared (set to 0)
- Game Over: No more valid moves exist
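
To make the rule concrete, here is a minimal sketch (Python/NumPy, not the environment's actual API) of validating and applying a single move; the function name and signature are illustrative:

```python
import numpy as np

def apply_move(grid: np.ndarray, r0: int, c0: int, r1: int, c1: int) -> bool:
    """Clear the inclusive rectangle (r0, c0)-(r1, c1) if its values sum to exactly 10.

    Returns True if the move was legal and applied, False otherwise.
    Illustrative only; the real environment interface may differ.
    """
    region = grid[r0:r1 + 1, c0:c1 + 1]
    if region.sum() != 10:
        return False                      # not a valid move
    grid[r0:r1 + 1, c0:c1 + 1] = 0        # cleared cells are set to 0
    return True

# Example: a random 10x17 board of digits 1-9
rng = np.random.default_rng(0)
board = rng.integers(1, 10, size=(10, 17))
print(apply_move(board, 0, 0, 0, 1))      # True only if those two cells sum to 10
```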
Install:

```
pip install -e .
# or with uv: uv pip install -e .
```

Generate a dataset by rolling out a policy:

```
python policies/generate_dataset.py --policy <policy_name> --episodes 1000 --out_dir out_data/<name>
```

Available policies: `greedy_area`, `minimal_area`, `high_value_pairs`, `random_legal`, `look_ahead:3:20:0.95`.

Output format: JSONL (default) or Parquet (`--format parquet`). Pre-generated data is available on HuggingFace.
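
As an illustration of how a scripted policy such as `greedy_area` might behave, the brute-force sketch below enumerates rectangles that sum to exactly 10 and returns the one clearing the most cells. This is an assumption based on the policy's name, not the code in `policies/generate_dataset.py`:

```python
import numpy as np

def greedy_area_move(grid: np.ndarray):
    """Return the (r0, c0, r1, c1) rectangle summing to 10 with the largest area,
    or None if no valid move exists. Brute-force sketch; the real policy may differ."""
    rows, cols = grid.shape
    best, best_area = None, 0
    for r0 in range(rows):
        for c0 in range(cols):
            for r1 in range(r0, rows):
                for c1 in range(c0, cols):
                    region = grid[r0:r1 + 1, c0:c1 + 1]
                    s = region.sum()
                    if s == 10 and region.size > best_area:
                        best, best_area = (r0, c0, r1, c1), region.size
                    elif s > 10:
                        # Cells are non-negative, so widening this rectangle
                        # can never bring the sum back down to 10.
                        break
    return best
```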
Compare policies and visualize performance:
```
python policies/policy_analysis.py
```

This generates `out_data/analysis/policy_comparisons.png`.
Complete RL pipeline achieving 94.4% performance (top benchmark result). Two-stage training:
Stage 1 (SFT): Trains a CNN policy network on expert demonstrations to learn action legality. It uses set-based losses to penalize all illegal actions simultaneously and includes negative examples for robust legality learning.
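
One plausible reading of "set-based losses" is a multi-label objective over the full action set, so every illegal action is pushed down in a single step rather than one sampled negative at a time. A hedged PyTorch sketch (the actual loss in `rl/train/train_sft.py` may differ):

```python
import torch
import torch.nn.functional as F

def set_based_legality_loss(action_logits: torch.Tensor,
                            legality_mask: torch.Tensor) -> torch.Tensor:
    """Assumed form of a set-based legality loss, not the repo's exact implementation.

    action_logits: (batch, num_actions) raw scores for every candidate action.
    legality_mask: (batch, num_actions) with 1 for legal actions and 0 for illegal ones.

    Binary cross-entropy against the whole mask penalizes *all* illegal actions
    simultaneously instead of only a single sampled negative.
    """
    return F.binary_cross_entropy_with_logits(action_logits, legality_mask.float())
```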
```
python rl/train/train_sft.py --dataset-name djdumpling/fruit-box
```

Stage 2 (RL fine-tuning): Fine-tunes the SFT policy with RL using a two-phase approach:
- Phase 0 (PPO): Stabilizes the policy with conservative updates
- Phase 1 (GRPO): Learns optimal strategy by comparing groups of actions sampled from the same anchor point, using relative advantages rather than absolute rewards
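
The group-relative advantage that GRPO uses can be sketched as below; this assumes standard per-group mean/std normalization and is not taken from `rl/algo/`:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Advantages for a group of actions sampled from the same anchor point.

    rewards: (group_size,) reward for each sampled action in the group.
    Each advantage is the reward's deviation from the group mean, scaled by the
    group standard deviation, so only *relative* quality within the group matters.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four actions sampled from the same anchor
rewards = torch.tensor([3.0, 10.0, 0.0, 7.0])
print(group_relative_advantages(rewards))
```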
```
python rl/train/train_grpo.py --load-checkpoint artifacts/sft-checkpoint-epoch-120:v3/policy_sft_epoch120.pt
```

Key features:
- CNN policy network with sum prediction head
- Custom Gym environment wrapper with two-phase action space (anchor selection + extent selection)
- Curriculum learning and exploration scheduling
- Wandb integration for experiment tracking
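
For orientation, here is a minimal sketch of a CNN policy with an auxiliary sum prediction head; layer sizes, head structure, and names are assumptions, not the network in `rl/models/`:

```python
import torch
import torch.nn as nn

class FruitBoxPolicy(nn.Module):
    """Illustrative CNN policy: a shared conv trunk feeding per-cell anchor logits
    and an auxiliary head that predicts the sum of a candidate rectangle."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.anchor_head = nn.Conv2d(channels, 1, kernel_size=1)          # one logit per cell
        self.sum_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(channels, 1))             # sum regression

    def forward(self, board: torch.Tensor):
        # board: (batch, 1, 10, 17) with values 0-9
        features = self.trunk(board)
        anchor_logits = self.anchor_head(features).flatten(1)   # (batch, 170)
        predicted_sum = self.sum_head(features).squeeze(-1)     # (batch,)
        return anchor_logits, predicted_sum

policy = FruitBoxPolicy()
logits, sums = policy(torch.rand(2, 1, 10, 17))
print(logits.shape, sums.shape)  # torch.Size([2, 170]) torch.Size([2])
```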
Model and policy performance comparison:
Combine policy datasets into HuggingFace format:
```
python policies/merge_to_hf.py
```

Outputs `out_data/hf_dataset/train/train.parquet` and `out_data/hf_dataset/episodes.jsonl`, built from all `trajectories.parquet` files under `out_data/`.
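
Once merged, the Parquet split can be loaded like any local HuggingFace-style dataset; a brief usage sketch (the column schema is not specified here, so inspect it before training):

```python
from datasets import load_dataset

# Load the merged training split produced by merge_to_hf.py
ds = load_dataset("parquet",
                  data_files="out_data/hf_dataset/train/train.parquet",
                  split="train")
print(ds)               # number of rows and columns
print(ds.column_names)  # inspect the schema before training
```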
```
fruit-box-env/
├── policies/         # Data generation and analysis
│   ├── generate_dataset.py
│   ├── policy_analysis.py
│   └── merge_to_hf.py
├── rl/               # RL training pipeline
│   ├── train/        # SFT, GRPO, Q-learning
│   ├── algo/         # PPO, GRPO algorithms
│   ├── models/       # Policy network
│   ├── envs/         # Gym wrappers
│   └── eval.py
├── out_data/         # Generated datasets and analysis
└── README.md
```


