Add GRL Sokoban Resource Server #564

yixinhuang48 · 2026-01-09T10:04:08Z

Contributing To NeMo-Gym (GRL Sokoban Resource Server)

1) Necessary information

i. Corresponding dataset on the spreadsheet

N/A

ii. Description of the prompt (source + domain)

Domain: Sokoban (grid-based puzzle; tool-use agent).
Source: Synthetic prompts generated programmatically. Prompts instruct the agent to use the step tool to push boxes onto goals efficiently.

iii. Description of the environment

A self-contained Sokoban environment under resources_servers/grl_sokoban/sokoban_env, adapted from GRL.
Configurable map sizes and generation parameters.
Observation: ASCII grid with walls, goals, boxes, and player encoded.
Actions: Up, Down, Left, Right.
FastAPI resource server following NeMo Gym conventions.
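As a rough illustration of the observation and action model described above, here is a minimal sketch of one Sokoban move on an ASCII grid. The symbols follow common Sokoban notation and are an assumption; the encoding used in sokoban_env may differ, and `step` here is a hypothetical helper, not the server's endpoint. The grid is assumed to be fully enclosed by walls.

```python
# Hypothetical symbols (common Sokoban notation); the PR's encoding may differ.
WALL, FLOOR, GOAL = "#", " ", "."
BOX, BOX_ON_GOAL = "$", "*"
PLAYER, PLAYER_ON_GOAL = "@", "+"

# Row/column deltas for the four actions named in the description.
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


def step(grid, action):
    """Apply one action to a list-of-strings grid; return (new_grid, moved)."""
    cells = [list(row) for row in grid]
    dr, dc = MOVES[action]
    # Locate the player (either standing on floor or on a goal).
    r, c = next(
        (i, j)
        for i, row in enumerate(cells)
        for j, ch in enumerate(row)
        if ch in (PLAYER, PLAYER_ON_GOAL)
    )
    nr, nc = r + dr, c + dc
    target = cells[nr][nc]
    if target == WALL:
        return grid, False  # blocked by a wall
    if target in (BOX, BOX_ON_GOAL):
        br, bc = nr + dr, nc + dc
        beyond = cells[br][bc]
        if beyond not in (FLOOR, GOAL):
            return grid, False  # box is blocked by a wall or another box
        cells[br][bc] = BOX_ON_GOAL if beyond == GOAL else BOX
        # The cell the box leaves behind is a goal or plain floor.
        target = GOAL if target == BOX_ON_GOAL else FLOOR
    cells[nr][nc] = PLAYER_ON_GOAL if target == GOAL else PLAYER
    cells[r][c] = GOAL if cells[r][c] == PLAYER_ON_GOAL else FLOOR
    return ["".join(row) for row in cells], True
```

For example, pushing right on the one-row puzzle `#@$.#` moves the box onto the goal; a second push is rejected because the box is then against a wall.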

iv. Description of the verifier

Verifier is the environment: success=true when all boxes are placed on goals; reward from env logic; zero on failure.
/verify computes final reward and cleans up session state.
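The verify-and-cleanup behaviour can be sketched as follows. This is an illustration under stated assumptions, not the code in app.py: the in-memory `sessions` dict, the `boxes`/`goals`/`total_reward` fields, and the fallback for an unknown session are all hypothetical.

```python
def verify(sessions, session_id):
    """Compute the final result for a session and remove its state.

    `sessions` is a hypothetical in-memory store mapping a session id to a
    dict with box positions, goal positions, and the accumulated reward.
    """
    # pop() both reads the state and cleans it up in one step.
    state = sessions.pop(session_id, None)
    if state is None:
        # Unknown or already-verified session: assumed to score as a failure.
        return {"success": False, "reward": 0.0}
    # Success means every box sits on a goal cell.
    success = set(state["boxes"]) <= set(state["goals"])
    # The reward itself is whatever the environment accumulated.
    return {"success": success, "reward": float(state["total_reward"])}
```

Popping the state inside the verifier means a second call for the same session sees no state, which is one simple way to guarantee cleanup.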

v. Legal approval status

Code: Apache 2.0.
Data: Synthetic, programmatically generated (Apache 2.0).

2) Simple correctness check

i. Commands used to run the server

# Start NeMo Gym servers (agent + Sokoban)
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
  +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl

ii. Resulting rollout and judges (5 examples)

See resources_servers/grl_sokoban/data/example_rollouts.jsonl
Expected behavior:
- All boxes on goals → positive reward, success=true
- Otherwise penalties or zero reward, success=false

iii. Additional notes

Must call /seed_session before /step.
Actions accepted as labels or indices.
Session cookies are maintained by middleware; agent path propagates cookies.
Large rollout artifacts are gitignored; do not commit them.
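A minimal sketch of the first two notes, assuming a hypothetical index order for the actions (the real mapping lives in the environment and may differ). `Session` and `normalize_action` are illustrative names, not the server's API.

```python
# Hypothetical index order; check the environment for the real mapping.
ACTION_LABELS = ["Up", "Down", "Left", "Right"]


def normalize_action(action):
    """Turn a label ("right", "Up") or an index (0..3, "2") into a label."""
    if isinstance(action, bool):
        raise ValueError(f"invalid action: {action!r}")
    if isinstance(action, int):
        if 0 <= action < len(ACTION_LABELS):
            return ACTION_LABELS[action]
        raise ValueError(f"action index out of range: {action}")
    if isinstance(action, str):
        label = action.strip().capitalize()
        if label in ACTION_LABELS:
            return label
        if label.isdigit():
            return normalize_action(int(label))
    raise ValueError(f"invalid action: {action!r}")


class Session:
    """Tracks whether a session was seeded before any step is taken."""

    def __init__(self):
        self.seeded = False
        self.history = []

    def seed(self, seed):
        self.seeded = True
        self.seed_value = seed

    def step(self, actions):
        if not self.seeded:
            raise RuntimeError("call seed_session before step")
        labels = [normalize_action(a) for a in actions]
        self.history.extend(labels)
        return labels
```

Rejecting a step on an unseeded session makes the ordering requirement explicit instead of failing later on missing state.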

3) Tests

Test files / command

pytest resources_servers/grl_sokoban/tests -q

Coverage notes

Sokoban server tests: seed/step flow, invalid actions, done handling, verify success/failure, cleanup.

4) Reward profiling

Models

Qwen3-4B (per discussion with @banghuaz-nvidia, this model could be used).

Method

200 prompts × 16 rollouts each (3,200 total).
Tool calling enabled; agent loops until done/max_steps.

Commands

cd resources_servers/grl_sokoban
./run_qwen3_4b_eval_loop.sh  # or ./run_qwen3_4b_eval.sh

# Manual analysis
python analyze_rewards.py \
  --rollouts-path resources_servers/grl_sokoban/data/qwen3_4b_eval/rollouts.jsonl \
  --model-name "Qwen3-4B" \
  --output resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md

Results

Report file: resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md
Include success rate, mean/median reward, tool-call statistics.

Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):

Overall Metrics

Total Rollouts: 3,200
Success Rate: 13.47% (431 / 3,200)
Mean Reward: 0.9305
Median Reward: 0.0000
Min Reward: -8.9000
Max Reward: 10.9000

Tool Call Statistics

Average Tool Calls: 2.64 per rollout
Min Tool Calls: 1
Max Tool Calls: 11
Correlation (tool calls ↔ reward): -0.2338 (negative correlation)

Reward Distribution

0.0 reward: 2,134 occurrences (66.7%) - immediate failures
10.8 reward: 206 occurrences (6.4%)
10.9 reward: 72 occurrences (2.2%)
10.7 reward: 51 occurrences (1.6%)
Negative rewards: ~800 occurrences (25%) - invalid moves/failures

Performance by Tool Call Count

| Tool Calls | Mean Reward | Rollout Count | Notes |
| --- | --- | --- | --- |
| 1 | 0.0000 | 2,112 | Immediate failures (66%) |
| 2 | 7.0948 | 174 | Quick successes |
| 3 | 8.0076 | 314 | Best average performance |
| 4 | 4.9391 | 87 | Moderate attempts |
| 5 | 3.0453 | 53 | Declining performance |
| 10 | -3.5120 | 409 | Getting stuck in loops |
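The metrics reported above could be computed along these lines. This is a sketch, not analyze_rewards.py itself: the record fields `reward`, `success`, and `num_tool_calls` are assumed names, and the real rollout schema may differ.

```python
from collections import defaultdict
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def summarize(rollouts):
    """Aggregate per-rollout records (hypothetical field names) into metrics."""
    rewards = [r["reward"] for r in rollouts]
    calls = [r["num_tool_calls"] for r in rollouts]
    # Group rewards by the number of tool calls made in the rollout.
    by_calls = defaultdict(list)
    for r in rollouts:
        by_calls[r["num_tool_calls"]].append(r["reward"])
    return {
        "success_rate": sum(1 for r in rollouts if r["success"]) / len(rollouts),
        "mean_reward": mean(rewards),
        "median_reward": median(rewards),
        "correlation": pearson(calls, rewards),
        "mean_reward_by_calls": {k: mean(v) for k, v in sorted(by_calls.items())},
    }
```

With this kind of grouping, a median of 0.0000 alongside a positive mean is expected whenever more than half of the rollouts end with zero reward.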

Key Observations

High Early Failure Rate: About 66% of rollouts (2,112 of 3,200) end after a single tool call with zero reward, suggesting the model often doesn't properly engage with the task
Negative Correlation: More tool calls correlate with worse outcomes (r = -0.2338); rollouts with 10 tool calls average -3.5120, indicating the model gets stuck in invalid move patterns
Sweet Spot: Rollouts with 2-3 tool calls perform best (mean rewards ~7-8), suggesting successful puzzles are solved quickly
Success Pattern: When successful, the model typically completes puzzles in 2-3 tool calls, but success occurs in only ~13% of rollouts (431 / 3,200)

5) Training

Trained Qwen3-4B-Instruct-2507 with GRPO for 1 epoch on 1,600 training examples and 400 validation examples, using the 6x6 Sokoban configuration.
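For readers unfamiliar with GRPO, the core idea is that each rollout's reward is scored relative to the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalization in its commonly described form; it is not taken from this PR's training code, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices; trainers may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One consequence worth noting: a group in which every rollout receives the same reward yields all-zero advantages, so prompts where every sample fails identically contribute no learning signal under this scheme.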

Results

We first tried training Qwen3-4B (thinking mode), but the results were poor. For Qwen3-4B-Instruct-2507, the training/validation curves trend upward, showing a slight improvement (#261 (comment)); the gain might be larger with more data or more epochs.

Appendix

Reuses shared responses_api_agents/game_agent from the Tetris PR.
Added as a new resource server; existing servers unchanged.

copy-pr-bot · 2026-01-09T10:04:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yixinhuang48 · 2026-01-09T10:15:33Z

@cmunley1 @bxyu-nvidia this is the updated PR for the one that I closed (#261).

cmunley1 · 2026-01-09T16:43:22Z

can you reupload the dataset generation script or point to a huggingface dataset in the readme? I removed this by accident on my branch.

otherwise this looks pretty good to me

yixinhuang48 · 2026-01-09T21:53:43Z

Just uploaded the dataset generation script. It covers several room-size and box-count configurations for reward profiling, but for training and validation I used only the 6x6 room configuration with 1 box.

cmunley1 · 2026-01-09T22:33:57Z

can you regenerate example.jsonl so it includes this instruction from your latest dataset generation script? "IMPORTANT: First call the step tool with an empty array [] to see the initial puzzle state. Example: step({"actions": []})"

I run the existing example and get a response like
"text": "I cannot solve the Sokoban puzzle without specific tool observations or the initial state of the puzzle. Please provide the details of the puzzle (e.g., layout, box positions, player position) so I can assist further.",

Also, with the updated example, I can run rollouts that look okay, but I do get lots of these messages in the ng_run logs. Do you see this too, and is it expected?


INFO:     127.0.0.1:56362 - "POST /step HTTP/1.1" 422 Unprocessable Entity
Hit validation exception! Errors: [
    {
        "type": "model_attributes_type",
        "loc": [
            "body"
        ],
        "msg": "Input should be a valid dictionary or object to extract fields from",
        "input": [
            "Right"
        ]
    }
]
Full body: [
    "Right"
]

cmunley1 · 2026-01-09T22:35:33Z

@bxyu-nvidia can u please look at simple_agent changes? I tested offline rollouts with other resources servers, seems unaffected, but want to make sure with you.

…ep endpoint docs
- Regenerated example.jsonl to include instruction about calling step with empty array first
- Added docstring to step endpoint explaining 422 validation errors
- The 422 errors occur when model sends array instead of object format
Signed-off-by: yixin <yixinhuang48@gmail.com>

- Updated /step endpoint to accept both {"actions": [...]} and [...] formats
- This fixes 422 Unprocessable Entity errors when model sends array directly
- Regenerated 5 example rollouts with the fix applied
Signed-off-by: yixin <yixinhuang48@gmail.com>

yixinhuang48 · 2026-01-10T00:13:42Z

@cmunley1 I've updated the example.jsonl and example_rollouts.jsonl files, and fixed the rollout message issue we encountered.

Problem

The /step endpoint returned 422 Unprocessable Entity errors whenever the model sent actions as a bare array, ["Right"], instead of the expected object, {"actions": ["Right"]}. The affected tool calls failed request validation.

Solution

Updated the /step endpoint in app.py to handle both formats:

{"actions": ["Right"]} (expected format)
["Right"] (array-only format that was causing errors)

The endpoint now detects which format it received and normalizes it to the expected structure, so it tolerates the model sending the array directly. This should eliminate the 422 errors you were seeing in the ng_run logs.
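The normalization described above can be sketched as a small pure function. This is an illustration only: `normalize_step_body` is a hypothetical name, and the actual handling in app.py may be structured differently.

```python
def normalize_step_body(body):
    """Accept either {"actions": [...]} or a bare [...] and return the former.

    `body` is the already-parsed JSON payload of a /step request.
    """
    if isinstance(body, dict):
        actions = body.get("actions")
    elif isinstance(body, list):
        # Bare array sent by the model: wrap it in the expected object.
        actions = body
    else:
        raise ValueError("body must be an object or an array of actions")
    if not isinstance(actions, list):
        raise ValueError('"actions" must be an array')
    return {"actions": actions}
```

Keeping the normalization separate from the request handler makes both accepted shapes easy to unit test, including the empty array used to fetch the initial puzzle state.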

cmunley1

looks alright to me, tested offline rollouts, seems ok.

lets wait for @bxyu-nvidia to review simple_agent changes, at least

yixinhuang48 and others added 7 commits October 30, 2025 21:13

GRL Sokoban: sync shared game_agent to tetris-final; remove rollout a…

e92d6f9

…rtifacts; keep repo clean Signed-off-by: yixin <yixinhuang48@gmail.com>

removed some unncessary parts

c157ced

Signed-off-by: yixin <yixinhuang48@gmail.com>

modified qwen 30b model inference pipeline for Sokoban

b66677e

Signed-off-by: yixin <yixinhuang48@gmail.com>

updated the README with reward info

0df76ed

Signed-off-by: yixin <yixinhuang48@gmail.com>

updated the README for sokoban

3ed2b0f

Signed-off-by: yixin <yixinhuang48@gmail.com>

Merge branch 'main' into feature/grl-sokoban-final

076bae1

metrics for training/validation

b31a31e

Signed-off-by: yixin <yixinhuang48@gmail.com>

cmunley1 and others added 3 commits January 9, 2026 10:07

remove some things

ec2a6ea

Signed-off-by: cmunley1 <cmunley@nvidia.com>

keep compatible

840f5b5

Signed-off-by: cmunley1 <cmunley@nvidia.com>

change verified status

5491f8d

Signed-off-by: yixin <yixinhuang48@gmail.com>

yixinhuang48 force-pushed the feature/grl-sokoban-final branch from 297a7f9 to 5491f8d Compare January 9, 2026 10:08

yixinhuang48 and others added 2 commits January 9, 2026 10:10

fix: update agent name in grl_sokoban config

3359d1a

Changed grl_sokoban_game_agent to grl_sokoban_simple_agent Signed-off-by: yixin <yixinhuang48@gmail.com>

Merge branch 'main' into feature/grl-sokoban-final

d9348ce

yixinhuang48 mentioned this pull request Jan 9, 2026

Add GRL Sokoban Resource Server #261

Closed

yixinhuang48 added 3 commits January 9, 2026 10:17

fix: add copyright headers to sokoban_env files

b09408d

Added Apache 2.0 copyright headers to generation.py and sokoban_env.py Signed-off-by: yixin <yixinhuang48@gmail.com>

Merge branch 'feature/grl-sokoban-final' of https://github.com/yixinh…

79b0507

…uang48/Gym-1 into feature/grl-sokoban-final

docs: add attribution to lmgame-org/lmenv

bb4242c

Added proper attribution to https://github.com/lmgame-org/lmenv in docstrings and README, acknowledging the collaborative development with NVIDIA. Signed-off-by: yixin <yixinhuang48@gmail.com>

feat: add generate_test_examples.py script for Sokoban

61cd791

Add script to generate diverse test examples with varying seeds, room dimensions, and number of boxes for the GRL Sokoban environment. Signed-off-by: yixin <yixinhuang48@gmail.com>

cmunley1 requested a review from bxyu-nvidia January 9, 2026 22:34

yixinhuang48 force-pushed the feature/grl-sokoban-final branch from d13ee95 to f1b3306 Compare January 10, 2026 00:05

yixinhuang48 force-pushed the feature/grl-sokoban-final branch from f1b3306 to a19c376 Compare January 10, 2026 00:07

cmunley1 approved these changes Jan 10, 2026

cmunley1 added the resource-server Resource servers (math, code, etc.) label Jan 10, 2026

Merge branch 'main' into feature/grl-sokoban-final

5717f38

yixinhuang48 mentioned this pull request Jan 12, 2026

Add GRL Tetris resource server #578

Open
