
Conversation

@yixinhuang48 (Collaborator) commented Oct 30, 2025

Contributing To NeMo-Gym (GRL Sokoban Resource Server)

1) Necessary information

i. Corresponding dataset on the spreadsheet

  • N/A

ii. Description of the prompt (source + domain)

  • Domain: Sokoban (grid-based puzzle; tool-use agent).
  • Source: Synthetic prompts generated programmatically. Prompts instruct the agent to use the step tool to push boxes onto goals efficiently.

iii. Description of the environment

  • A self-contained Sokoban environment under resources_servers/grl_sokoban/sokoban_env, adapted from GRL.
  • Configurable map sizes and generation parameters.
  • Observation: ASCII grid encoding walls, goals, boxes, and the player (see the sketch after this list).
  • Actions: Up, Down, Left, Right.
  • FastAPI resource server following NeMo Gym conventions.
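
For orientation, here is a minimal sketch of the observation encoding and action mapping described above. The glyphs and the label-to-index mapping are illustrative assumptions, not necessarily the server's actual encoding.

# Hypothetical ASCII observation for a small map; glyph choices are assumed for illustration.
observation = (
    "######\n"
    "#_O__#\n"   # '_' floor, 'O' goal
    "#_X__#\n"   # 'X' box
    "#_P__#\n"   # 'P' player
    "######"
)

# Actions are Up/Down/Left/Right; an index mapping like this is assumed for the example.
ACTIONS = {"Up": 0, "Down": 1, "Left": 2, "Right": 3}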

iv. Description of the verifier

  • The verifier is the environment itself: success=true when all boxes are placed on goals; the reward comes from the environment's logic and is zero on failure.
  • /verify computes the final reward and cleans up session state (sketched below).
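
As a rough illustration of that verification rule, a minimal sketch; the session/env attribute names below are assumptions, not the server's actual internals.

# Sketch of the verifier rule: success iff every box sits on a goal.
# Attribute and helper names are hypothetical.
def verify(session) -> dict:
    env = session.env
    success = env.num_boxes_on_goals == env.num_boxes
    reward = env.episode_reward if success else 0.0   # zero on failure, per the description above
    session.cleanup()                                  # /verify also tears down session state
    return {"success": success, "reward": reward}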

v. Legal approval status

  • Code: Apache 2.0.
  • Data: Synthetic, programmatically generated (Apache 2.0).

2) Simple correctness check

i. Commands used to run the server

# Start NeMo Gym servers (agent + Sokoban)
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
  +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl

ii. Resulting rollout and judges (5 examples)

  • See resources_servers/grl_sokoban/data/example_rollouts.jsonl (a hypothetical record layout is sketched after this list)
  • Expected behavior:
    • All boxes on goals → positive reward, success=true
    • Otherwise penalties or zero reward, success=false
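
For orientation, a hypothetical shape of one line in that JSONL file; the real schema is not reproduced here, so every field name below is an assumption.

# Hypothetical rollout record (one JSON line); field names are assumed.
rollout = {
    "prompt": "Use the step tool to push the box onto the goal.",
    "tool_calls": ["Right", "Up", "Up"],
    "reward": 10.8,      # positive when all boxes end on goals
    "success": True,     # false, with zero or negative reward, otherwise
}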

iii. Additional notes

  • Must call /seed_session before /step (see the client sketch after this list).
  • Actions accepted as labels or indices.
  • Session cookies are maintained by middleware; agent path propagates cookies.
  • Large rollout artifacts are gitignored; do not commit them.
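
A minimal client sketch of that flow, assuming default endpoint paths and simple JSON payloads (both are assumptions; the real request and response bodies may differ):

import requests

BASE = "http://localhost:8000"    # assumed server address
client = requests.Session()       # reuses the session cookie set by the middleware

# 1) Seed a session before stepping (required).
obs = client.post(f"{BASE}/seed_session", json={"seed": 0}).json()

# 2) Step with an action label; indices are also accepted per the notes above.
obs = client.post(f"{BASE}/step", json={"action": "Up"}).json()

# 3) Verify computes the final reward and cleans up the session state.
result = client.post(f"{BASE}/verify", json={}).json()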

3) Tests

Test files / command

pytest resources_servers/grl_sokoban/tests -q

Coverage notes

  • Sokoban server tests cover the seed/step flow, invalid actions, done handling, verify success/failure, and cleanup (an illustrative test sketch follows).
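
As an illustration of the kind of check those tests perform (a sketch only; the app import path and payloads are assumptions, and the repo's actual fixtures may differ):

# Sketch of a seed -> step -> verify test using FastAPI's TestClient.
from fastapi.testclient import TestClient
from resources_servers.grl_sokoban.app import app   # hypothetical import path

def test_seed_step_verify():
    client = TestClient(app)                         # cookies persist across requests
    assert client.post("/seed_session", json={"seed": 0}).status_code == 200
    assert client.post("/step", json={"action": "Up"}).status_code == 200
    body = client.post("/verify", json={}).json()
    assert "reward" in body and "success" in body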

4) Reward profiling

Models

  • Qwen3-4B (results reported below).

Method

  • 200 prompts × 16 rollouts each (3,200 total).
  • Tool calling enabled; the agent loops until the episode is done or max_steps is reached (schematic loop sketched below).
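
Schematically, each rollout during profiling follows a loop like the one below; max_steps, the model interface, and the client helpers are placeholders, not the actual game_agent implementation.

# Sketch of the per-rollout loop: ask the model for an action, execute the step
# tool call, and stop when the episode is done or max_steps is reached.
MAX_STEPS = 10   # assumed cap for illustration

def run_rollout(model, env_client, prompt):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(MAX_STEPS):
        action = model.choose_action(messages)        # hypothetical model interface
        obs = env_client.step(action)                 # POST /step under the hood
        messages.append({"role": "tool", "content": obs["observation"]})
        if obs.get("done"):
            break
    return env_client.verify()                        # final reward + session cleanup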

Commands

cd resources_servers/grl_sokoban
./run_qwen3_4b_eval_loop.sh  # or ./run_qwen3_4b_eval.sh

# Manual analysis
python analyze_rewards.py \
  --rollouts-path resources_servers/grl_sokoban/data/qwen3_4b_eval/rollouts.jsonl \
  --model-name "Qwen3-4B" \
  --output resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md

Results

  • Report file: resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md
  • The report includes success rate, mean/median reward, and tool-call statistics (a sketch of how such metrics can be computed follows).
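
For reference, the headline numbers below could be reproduced from the rollouts file with something like this; the field names are assumptions, and analyze_rewards.py's actual implementation is not shown here.

import json
import statistics

rewards, successes, tool_calls = [], [], []
with open("rollouts.jsonl") as f:                     # path and field names assumed
    for line in f:
        record = json.loads(line)
        rewards.append(record["reward"])
        successes.append(bool(record["success"]))
        tool_calls.append(len(record["tool_calls"]))

print("success rate:", sum(successes) / len(successes))
print("mean reward:", statistics.mean(rewards))
print("median reward:", statistics.median(rewards))
# Pearson correlation between tool-call count and reward (Python 3.10+)
print("correlation:", statistics.correlation(tool_calls, rewards))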

Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):

Overall Metrics

  • Total Rollouts: 3,200
  • Success Rate: 13.47% (431 / 3,200)
  • Mean Reward: 0.9305
  • Median Reward: 0.0000
  • Min Reward: -8.9000
  • Max Reward: 10.9000

Tool Call Statistics

  • Average Tool Calls: 2.64 per rollout
  • Min Tool Calls: 1
  • Max Tool Calls: 11
  • Correlation (tool calls ↔ reward): -0.2338 (negative correlation)

Reward Distribution

  • 0.0 reward: 2,134 occurrences (66.7%) - immediate failures
  • 10.8 reward: 206 occurrences (6.4%)
  • 10.9 reward: 72 occurrences (2.2%)
  • 10.7 reward: 51 occurrences (1.6%)
  • Negative rewards: ~800 occurrences (25%) - invalid moves/failures

Performance by Tool Call Count

Tool Calls | Mean Reward | Rollout Count | Notes
1          |  0.0000     | 2,112         | Immediate failures (66%)
2          |  7.0948     | 174           | Quick successes
3          |  8.0076     | 314           | Best average performance
4          |  4.9391     | 87            | Moderate attempts
5          |  3.0453     | 53            | Declining performance
10         | -3.5120     | 409           | Getting stuck in loops

Key Observations

  1. High Early Failure Rate: about 66% of rollouts fail immediately with a single tool call, suggesting the model often doesn't properly engage with the task.
  2. Negative Correlation: more tool calls correlate with worse outcomes (-0.2338), indicating the model gets stuck in invalid-move patterns.
  3. Sweet Spot: rollouts with 2-3 tool calls perform best (mean rewards ~7-8), suggesting successful puzzles are solved quickly.
  4. Success Pattern: when successful, the model typically completes a puzzle in 2-3 moves, but this happens in only ~13.5% of cases (431/3,200).

5) Training

Trained with GRPO on Qwen3-4B-Instruct-2507 for 1 epoch, using 1,600 training examples and 400 validation examples with the 6x6 Sokoban configuration.

Results

We initially tried training Qwen3-4B (thinking mode) and the results were not good. For Qwen3-4B-Instruct-2507, the training/validation curves show a slight but consistent upward trend (#261 (comment)); the improvement could become more significant with more data or more epochs.


Appendix

  • Reuses shared responses_api_agents/game_agent from the Tetris PR.
  • Added as a new resource server; existing servers unchanged.

…rtifacts; keep repo clean

Signed-off-by: yixin <yixinhuang48@gmail.com>
copy-pr-bot (bot) commented Oct 30, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@yixinhuang48 force-pushed the feature/grl-sokoban-final branch from 2cf6b1b to b66677e on November 4, 2025 21:41
@yixinhuang48 requested a review from cmunley1 on November 7, 2025 03:06
+num_repeats=16 \
+num_samples_in_parallel=16 \
+responses_create_params.temperature=0.8 \
+responses_create_params.max_output_tokens=4096
Contributor (review comment on the diff above):

4k max output tokens seems short but it is fine.

@@ -0,0 +1,296 @@
# GRL Sokoban Resource Server

Single-box Sokoban puzzle environment served via FastAPI with NeMo Gym conventions. The environment is implemented locally under `resources_servers/grl_sokoban/env`, mirroring GRL’s behaviour without requiring the external repository.
Contributor (review comment on the README above):
I think it's worth pointing out that this integrates the environment from https://github.com/mpSchrader/gym-sokoban, which implements DeepMind's paper Imagination-Augmented Agents for Deep Reinforcement Learning and follows the Gymnasium standard (https://gymnasium.farama.org/).

Collaborator Author (reply):

Yes, good catch!

@yixinhuang48 requested a review from cmunley1 on November 7, 2025 19:44
@cwing-nvidia added the resource-server label on Nov 15, 2025
@yixinhuang48 (Collaborator Author) commented:

[Four screenshots omitted: training and validation curves.]

Training and validation curves (1,600 training examples, 400 validation examples, 6x6 Sokoban configuration, Qwen3-4B-Instruct-2507).

@yixinhuang48 (Collaborator Author) commented:

@cwing-nvidia @pjin-nvidia @cmunley1 — would anyone have time to help review this PR? Thanks!

@cmunley1 (Contributor) commented Jan 9, 2026:

Hi @yixinhuang48, sorry for the delay and thanks so much for the contribution. To keep the codebase standard as we scale up to more environments, can we remove the reward profiling scripts and keep mainly the core environment logic? I suggest providing a summary of the reward profiling results in the PR comments here or in the README. We can create a general recipe for reward profiling / SDG / rollout collection that will work for this and other environments.

I would suggest deleting the following and updating the README and/or this PR with a summary of the reward profiling results.

resources_servers/grl_sokoban/analyze_rewards.py
resources_servers/grl_sokoban/checkpoint_resume_rollouts.py
resources_servers/grl_sokoban/data/qwen3_30b_eval/
resources_servers/grl_sokoban/data/qwen3_4b_eval/
resources_servers/grl_sokoban/run_qwen3_4b_eval.sh
resources_servers/grl_sokoban/run_qwen3_30b_eval.sh
resources_servers/grl_sokoban/run_qwen3_4b_eval_loop.sh
resources_servers/grl_sokoban/run_qwen3_30b_eval_loop.sh

@yixinhuang48 (Collaborator Author) commented Jan 9, 2026:

Hi @cmunley1, no worries, and this is a good idea. I checked out your branch (https://github.com/NVIDIA-NeMo/Gym/tree/cmunley1/grl-sokoban), saw that you had already removed all the files mentioned above plus some additional ones, and merged it.

@cmunley1 (Contributor) commented Jan 9, 2026:

Thanks. I also removed game_agent and added support for the required features in simple_agent, but we need to make sure this doesn't affect other environments.

cc @bxyu-nvidia

@yixinhuang48 (Collaborator Author) commented:

Cool, thanks! One issue I noticed is that the new commits on my branch are not reflected in this PR. Will this be fixed once the automated CI/CD pipeline runs?

@cmunley1 (Contributor) commented Jan 9, 2026:

It should update; can you double-check on your side, or create a new PR if you have to?

@yixinhuang48 (Collaborator Author) commented:

@cmunley1 I just created a new PR: #564.
