Developing as part of Prime Intellect's RL Residency


Fruit Box Game Environment + Data Generation

A Python-based puzzle game environment for generating datasets and training RL agents.

The Game

Fruit Box is a puzzle game played on a 10×17 grid:

Fruit Box Game

  • Each cell contains a digit from 1-9
  • Goal: Clear as many cells as possible
  • Valid Move: Select any rectangular region that sums to exactly 10
  • Action: Selected cells are cleared (set to 0)
  • Game Over: No more valid moves exist
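
The rules above can be sketched in a few lines. This is an illustrative implementation (not the repo's environment code): scan for the first rectangle summing to exactly 10 and clear it.

```python
import numpy as np

def find_valid_move(grid: np.ndarray):
    """Return (r1, c1, r2, c2) of the first rectangle summing to exactly 10,
    or None if no valid move exists (game over)."""
    rows, cols = grid.shape
    for r1 in range(rows):
        for c1 in range(cols):
            for r2 in range(r1, rows):
                for c2 in range(c1, cols):
                    s = grid[r1:r2 + 1, c1:c2 + 1].sum()
                    if s == 10:
                        return (r1, c1, r2, c2)
                    if s > 10:
                        break  # widening the column range only adds non-negative cells
    return None

def apply_move(grid: np.ndarray, move):
    """Clear the selected rectangle (cells are set to 0)."""
    r1, c1, r2, c2 = move
    grid[r1:r2 + 1, c1:c2 + 1] = 0
```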

Quick Start

Installation

pip install -e .
# or with uv: uv pip install -e .

Generate Data

python policies/generate_dataset.py --policy <policy_name> --episodes 1000 --out_dir out_data/<name>

Available policies: greedy_area, minimal_area, high_value_pairs, random_legal, look_ahead:3:20:0.95. Output format: JSONL (default) or Parquet (--format parquet). Pre-generated data is available on HuggingFace.
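
A generated JSONL file holds one record per line and can be loaded as below (a minimal sketch; the exact record schema is defined by generate_dataset.py, so the field names in any given file may differ).

```python
import json

def load_episodes(path):
    """Load a JSONL episode file: one JSON record per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```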

📈 Policy Analysis

Compare policies and visualize performance:

python policies/policy_analysis.py

Generates out_data/analysis/policy_comparisons.png:

Policy Comparisons

🤖 Reinforcement Learning

A complete RL pipeline achieving 94.4% performance (the top benchmark result). Training proceeds in two stages:

Stage 1: Supervised Fine-Tuning (SFT)

Trains a CNN policy network on expert demonstrations to learn action legality. Uses set-based losses to penalize all illegal actions simultaneously and includes negative examples for robust legality learning.

python rl/train/train_sft.py --dataset-name djdumpling/fruit-box
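
One plausible form of the set-based legality loss described above (an illustrative sketch, not the repo's exact objective): minimize the probability mass the policy assigns to the whole set of illegal actions at once, rather than penalizing one action at a time.

```python
import numpy as np

def set_legality_loss(logits: np.ndarray, legal_mask: np.ndarray) -> float:
    """Set-based legality loss sketch: -log of the total probability the
    policy places on legal actions, which pushes mass off every illegal
    action simultaneously."""
    z = logits - logits.max()              # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    legal_mass = probs[legal_mask].sum()
    return float(-np.log(legal_mass + 1e-12))
```

Negative examples (boards paired with known-illegal actions) fit naturally here: they simply flip entries of the legality mask to False.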

Stage 2: Group Relative Policy Optimization (GRPO)

Fine-tunes the SFT policy with RL using a two-phase approach:

  • Phase 0 (PPO): Stabilizes the policy with conservative updates
  • Phase 1 (GRPO): Learns optimal strategy by comparing groups of actions sampled from the same anchor point, using relative advantages rather than absolute rewards

python rl/train/train_grpo.py --load-checkpoint artifacts/sft-checkpoint-epoch-120:v3/policy_sft_epoch120.pt
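
The core GRPO idea of relative advantages can be sketched as follows (illustrative, not the repo's exact implementation): rewards for a group of actions sampled from the same anchor are standardized within the group, so each action is scored against its siblings rather than against an absolute baseline.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within one sampled group: above-average actions
    get positive advantage, below-average get negative."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```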

Key features:

  • CNN policy network with sum prediction head
  • Custom Gym environment wrapper with two-phase action space (anchor selection + extent selection)
  • Curriculum learning and exploration scheduling
  • Wandb integration for experiment tracking
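
The two-phase action space can be pictured with a small encoding sketch, assuming the 10×17 board from the game description (the repo's actual encoding may differ): phase 0 picks an anchor cell, phase 1 picks the rectangle's extent from that anchor.

```python
ROWS, COLS = 10, 17  # board shape from the game description

def encode_action(anchor_r, anchor_c, height, width):
    """Flatten a move into (anchor index, extent index): the anchor is the
    rectangle's top-left cell, the extent its height and width."""
    return anchor_r * COLS + anchor_c, (height - 1) * COLS + (width - 1)

def decode_action(anchor, extent):
    """Invert encode_action back to (row, col, height, width)."""
    return (anchor // COLS, anchor % COLS, extent // COLS + 1, extent % COLS + 1)
```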

Benchmark Results

Model and policy performance comparison:

Benchmark Results

Dataset Merging

Combine policy datasets into HuggingFace format:

python policies/merge_to_hf.py

Outputs out_data/hf_dataset/train/train.parquet and out_data/hf_dataset/episodes.jsonl from all trajectories.parquet files in out_data/.
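
The JSONL side of that merge can be sketched as below (illustrative only; merge_to_hf.py also builds the Parquet training table): concatenate every .jsonl file found under a root directory into a single episodes file.

```python
from pathlib import Path

def merge_episode_files(root: str, out_path: str) -> int:
    """Concatenate every .jsonl file under root into one JSONL file.
    Returns the number of records written. out_path should live outside
    root so the output is not re-read as an input."""
    n = 0
    with open(out_path, "w") as out:
        for p in sorted(Path(root).rglob("*.jsonl")):
            with open(p) as f:
                for line in f:
                    if line.strip():
                        out.write(line if line.endswith("\n") else line + "\n")
                        n += 1
    return n
```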

📁 Project Structure

fruit-box-env/
├── policies/               # Data generation and analysis
│   ├── generate_dataset.py
│   ├── policy_analysis.py
│   └── merge_to_hf.py
├── rl/                     # RL training pipeline
│   ├── train/              # SFT, GRPO, Q-learning
│   ├── algo/               # PPO, GRPO algorithms
│   ├── models/             # Policy network
│   ├── envs/               # Gym wrappers
│   └── eval.py
├── out_data/               # Generated datasets and analysis
└── README.md
