@yixinhuang48
Collaborator

Contributing To NeMo-Gym (GRL Sokoban Resource Server)

1) Necessary information

i. Corresponding dataset on the spreadsheet

  • N/A

ii. Description of the prompt (source + domain)

  • Domain: Sokoban (grid-based puzzle; tool-use agent).
  • Source: Synthetic prompts generated programmatically. Prompts instruct the agent to use the step tool to push boxes onto goals efficiently.

iii. Description of the environment

  • A self-contained Sokoban environment under resources_servers/grl_sokoban/sokoban_env, adapted from GRL.
  • Configurable map sizes and generation parameters.
  • Observation: ASCII grid with walls, goals, boxes, and player encoded.
  • Actions: Up, Down, Left, Right.
  • FastAPI resource server following NeMo Gym conventions.
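As a rough illustration of the environment described above, the sketch below parses an ASCII grid and applies one of the four actions. The symbols follow the common XSB-style Sokoban notation (`#` wall, `.` goal, `$` box, `*` box on goal, `@` player, `+` player on goal), and the names `parse` and `apply_action` are hypothetical; the actual encoding and API in sokoban_env may differ.

```python
# Hypothetical sketch: symbols and function names are illustrative assumptions,
# not the actual sokoban_env API.
DELTAS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


def parse(grid_text):
    """Split an ASCII grid into walls, goals, boxes, and the player position."""
    walls, goals, boxes, player = set(), set(), set(), None
    for r, row in enumerate(grid_text.splitlines()):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in ".*+":
                goals.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in "@+":
                player = (r, c)
    return walls, goals, boxes, player


def apply_action(walls, boxes, player, action):
    """Return (boxes, player) after one move; blocked moves leave the state unchanged."""
    dr, dc = DELTAS[action]
    target = (player[0] + dr, player[1] + dc)
    if target in walls:
        return boxes, player
    if target in boxes:
        beyond = (target[0] + dr, target[1] + dc)
        if beyond in walls or beyond in boxes:
            return boxes, player  # the box cannot be pushed
        boxes = (boxes - {target}) | {beyond}
    return boxes, target


grid = "#####\n#@$.#\n#####"
walls, goals, boxes, player = parse(grid)
boxes, player = apply_action(walls, boxes, player, "Right")
solved = boxes == goals  # every box sits on a goal
```

In this toy grid a single Right pushes the only box onto the only goal, so `solved` becomes true after one move.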

iv. Description of the verifier

  • Verifier is the environment: success=true when all boxes are placed on goals; reward from env logic; zero on failure.
  • /verify computes final reward and cleans up session state.
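The verifier behaviour listed above can be sketched as follows. This follows the wording literally (reward from the environment on success, zero on failure, session state removed on verify). The `SESSIONS` store, its keys, and the function name are hypothetical and are not taken from the server code.

```python
# Hypothetical sketch: the session store layout is an assumption for illustration.
SESSIONS = {}  # session_id -> {"boxes": set, "goals": set, "reward": float}


def verify(session_id):
    """Compute the final result for a session and drop its state."""
    state = SESSIONS.pop(session_id, None)  # cleanup happens as part of verify
    if state is None:
        return {"success": False, "reward": 0.0}
    success = bool(state["goals"]) and state["boxes"] == state["goals"]
    reward = float(state["reward"]) if success else 0.0
    return {"success": success, "reward": reward}


SESSIONS["solved"] = {"boxes": {(1, 3)}, "goals": {(1, 3)}, "reward": 10.9}
SESSIONS["stuck"] = {"boxes": {(1, 2)}, "goals": {(1, 3)}, "reward": -0.3}
solved_result = verify("solved")
stuck_result = verify("stuck")
```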

v. Legal approval status

  • Code: Apache 2.0.
  • Data: Synthetic, programmatically generated (Apache 2.0).

2) Simple correctness check

i. Commands used to run the server

# Start NeMo Gym servers (agent + Sokoban)
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
  +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl

ii. Resulting rollout and judges (5 examples)

  • See resources_servers/grl_sokoban/data/example_rollouts.jsonl
  • Expected behavior:
    • All boxes on goals → positive reward, success=true
    • Otherwise penalties or zero reward, success=false

iii. Additional notes

  • Must call /seed_session before /step.
  • Actions accepted as labels or indices.
  • Session cookies are maintained by middleware; agent path propagates cookies.
  • Large rollout artifacts are gitignored; do not commit them.
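The ordering constraint in the notes above (seed first, then step, with one cookie-keeping session throughout) can be sketched with a small client wrapper. `RecordingHTTP` is a stand-in for a real HTTP session such as `requests.Session()`. The base URL, the seed payload, and the class names are hypothetical.

```python
# Hypothetical sketch: URL, payload fields, and class names are assumptions.
class RecordingHTTP:
    """Stand-in for an HTTP session (e.g. requests.Session) that only records calls."""

    def __init__(self):
        self.calls = []

    def post(self, url, json=None):
        self.calls.append((url, json))
        return {"ok": True}


class SokobanClient:
    """Refuses to call /step until /seed_session has been called on the same session."""

    def __init__(self, http, base_url):
        self.http = http  # reuse one session object so cookies carry over
        self.base_url = base_url.rstrip("/")
        self.seeded = False

    def seed_session(self, payload=None):
        response = self.http.post(f"{self.base_url}/seed_session", json=payload or {})
        self.seeded = True
        return response

    def step(self, actions):
        if not self.seeded:
            raise RuntimeError("call /seed_session before /step")
        return self.http.post(f"{self.base_url}/step", json={"actions": list(actions)})


http = RecordingHTTP()
client = SokobanClient(http, "http://localhost:8000/")
try:
    client.step(["Right"])
    rejected = False
except RuntimeError:
    rejected = True
client.seed_session({"seed": 0})
client.step(["Right"])
```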

3) Tests

Test files / command

pytest resources_servers/grl_sokoban/tests -q

Coverage notes

  • Sokoban server tests: seed/step flow, invalid actions, done handling, verify success/failure, cleanup.

4) Reward profiling

Models

  • Qwen3-4B

Method

  • 200 prompts × 16 rollouts each (3,200 total).
  • Tool calling enabled; agent loops until done/max_steps.
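The "loops until done/max_steps" behaviour above amounts to a bounded episode loop. The sketch below uses a toy environment and a fixed policy. `run_episode`, `make_toy_env`, and the `(observation, reward, done)` return shape are hypothetical and are not the game agent's actual interface.

```python
# Hypothetical sketch: function names and the step return shape are assumptions.
def run_episode(policy, env_step, max_steps):
    """Alternate policy and environment calls until done or max_steps is reached."""
    observation, total_reward, steps = None, 0.0, 0
    while steps < max_steps:
        actions = policy(observation)
        observation, reward, done = env_step(actions)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def make_toy_env(solve_after):
    """Toy environment that reports done on the solve_after-th call."""
    calls = {"n": 0}

    def env_step(actions):
        calls["n"] += 1
        done = calls["n"] >= solve_after
        return f"state-{calls['n']}", (1.0 if done else 0.0), done

    return env_step


reward, steps = run_episode(lambda obs: ["Right"], make_toy_env(3), max_steps=10)
capped_reward, capped_steps = run_episode(lambda obs: ["Right"], make_toy_env(50), max_steps=5)
```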

Commands

cd resources_servers/grl_sokoban
./run_qwen3_4b_eval_loop.sh  # or ./run_qwen3_4b_eval.sh

# Manual analysis
python analyze_rewards.py \
  --rollouts-path resources_servers/grl_sokoban/data/qwen3_4b_eval/rollouts.jsonl \
  --model-name "Qwen3-4B" \
  --output resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md

Results

  • Report file: resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md
  • The report includes success rate, mean/median reward, and tool-call statistics.

Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):

Overall Metrics

  • Total Rollouts: 3,200
  • Success Rate: 13.47% (431 / 3,200)
  • Mean Reward: 0.9305
  • Median Reward: 0.0000
  • Min Reward: -8.9000
  • Max Reward: 10.9000
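For readers who want to recompute these figures, a minimal sketch of the aggregation is shown below. The record fields `reward` and `success` are assumed names; the actual rollout JSONL schema and the internals of analyze_rewards.py are not shown in this PR.

```python
import statistics


# Hypothetical sketch: the "reward" and "success" field names are assumptions.
def summarize(rollouts):
    """Compute the overall metrics from a list of rollout records."""
    rewards = [r["reward"] for r in rollouts]
    successes = sum(1 for r in rollouts if r["success"])
    return {
        "total": len(rollouts),
        "success_rate": successes / len(rollouts),
        "mean_reward": statistics.mean(rewards),
        "median_reward": statistics.median(rewards),
        "min_reward": min(rewards),
        "max_reward": max(rewards),
    }


sample = [
    {"reward": 10.9, "success": True},
    {"reward": 0.0, "success": False},
    {"reward": 0.0, "success": False},
    {"reward": -0.5, "success": False},
]
summary = summarize(sample)
```

The toy sample also shows how a positive mean can coexist with a zero median when most rollouts score 0 and a few score around 10.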

Tool Call Statistics

  • Average Tool Calls: 2.64 per rollout
  • Min Tool Calls: 1
  • Max Tool Calls: 11
  • Correlation (tool calls ↔ reward): -0.2338 (weak negative correlation)
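Assuming the correlation above is a Pearson coefficient (the PR does not say which measure was used), it could be computed along these lines. The input lists are toy values, not data from the evaluation.

```python
import math


# Hypothetical sketch: assumes a Pearson coefficient; inputs are toy values.
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


tool_calls = [1, 2, 3, 10]
rewards = [0.0, 10.8, 10.9, -3.5]
r = pearson(tool_calls, rewards)
```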

Reward Distribution

  • 0.0 reward: 2,134 occurrences (66.7%) - immediate failures
  • 10.8 reward: 206 occurrences (6.4%)
  • 10.9 reward: 72 occurrences (2.2%)
  • 10.7 reward: 51 occurrences (1.6%)
  • Negative rewards: ~800 occurrences (25%) - invalid moves/failures

Performance by Tool Call Count

Tool Calls | Mean Reward | Rollout Count | Notes
1          | 0.0000      | 2,112         | Immediate failures (66%)
2          | 7.0948      | 174           | Quick successes
3          | 8.0076      | 314           | Best average performance
4          | 4.9391      | 87            | Moderate attempts
5          | 3.0453      | 53            | Declining performance
10         | -3.5120     | 409           | Getting stuck in loops
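A breakdown like the one above can be produced by bucketing rollouts on their tool-call count. This is a minimal sketch; the field names `tool_calls` and `reward` are assumptions, and the sample records are toy values.

```python
from collections import defaultdict


# Hypothetical sketch: field names are assumptions; sample data is illustrative.
def mean_reward_by_tool_calls(rollouts):
    """Map each tool-call count to (mean reward, rollout count)."""
    buckets = defaultdict(list)
    for record in rollouts:
        buckets[record["tool_calls"]].append(record["reward"])
    return {
        count: (sum(values) / len(values), len(values))
        for count, values in sorted(buckets.items())
    }


sample = [
    {"tool_calls": 1, "reward": 0.0},
    {"tool_calls": 1, "reward": 0.0},
    {"tool_calls": 3, "reward": 10.8},
    {"tool_calls": 3, "reward": -0.2},
]
breakdown = mean_reward_by_tool_calls(sample)
```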

Key Observations

  1. High Early Failure Rate: 2,112 rollouts (66.0%) end after a single tool call with a mean reward of 0.0000, suggesting the model often fails to engage with the task
  2. Negative Correlation: More tool calls correlate weakly with worse outcomes (-0.2338), which is consistent with the model getting stuck in invalid move patterns
  3. Sweet Spot: Rollouts with 2-3 tool calls perform best (mean rewards ~7-8), suggesting successful puzzles are solved quickly
  4. Success Pattern: When successful, the model typically finishes in 2-3 tool calls, but only about 15% of rollouts (488 / 3,200) fall in that range, and the overall success rate is 13.47%

5) Training

Trained with GRPO on Qwen3-4B-Instruct-2507 for 1 epoch on the 6x6 Sokoban configuration, with 1,600 training examples and 400 validation examples.

Results

We first tried training Qwen3-4B in thinking mode, but the results were poor. For Qwen3-4B-Instruct-2507, the training and validation curves trend upward and show a slight improvement (#261 (comment)); the gain might be larger with more data or more epochs.


Appendix

  • Reuses shared responses_api_agents/game_agent from the Tetris PR.
  • Added as a new resource server; existing servers unchanged.

yixinhuang48 and others added 7 commits October 30, 2025 21:13
…rtifacts; keep repo clean

Signed-off-by: yixin <yixinhuang48@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cmunley1 and others added 3 commits January 9, 2026 10:07
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: yixin <yixinhuang48@gmail.com>
@yixinhuang48 yixinhuang48 force-pushed the feature/grl-sokoban-final branch from 297a7f9 to 5491f8d Compare January 9, 2026 10:08
yixinhuang48 and others added 2 commits January 9, 2026 10:10
Changed grl_sokoban_game_agent to grl_sokoban_simple_agent

Signed-off-by: yixin <yixinhuang48@gmail.com>
@yixinhuang48
Collaborator Author

yixinhuang48 commented Jan 9, 2026

@cmunley1 @bxyu-nvidia this is the updated PR for the one that I closed (#261).

Added Apache 2.0 copyright headers to generation.py and sokoban_env.py

Signed-off-by: yixin <yixinhuang48@gmail.com>
Added proper attribution to https://github.com/lmgame-org/lmenv in
docstrings and README, acknowledging the collaborative development
with NVIDIA.

Signed-off-by: yixin <yixinhuang48@gmail.com>
@cmunley1
Contributor

cmunley1 commented Jan 9, 2026

can you reupload the dataset generation script or point to a huggingface dataset in the readme? I removed this by accident on my branch.

otherwise this looks pretty good to me

Add script to generate diverse test examples with varying seeds,
room dimensions, and number of boxes for the GRL Sokoban environment.

Signed-off-by: yixin <yixinhuang48@gmail.com>
@yixinhuang48
Collaborator Author

yixinhuang48 commented Jan 9, 2026

Just uploaded the dataset generation script. It covers several room-size and box-count configurations for reward profiling, but for training and validation I used only the 6x6 room configuration with 1 box.

@cmunley1
Contributor

cmunley1 commented Jan 9, 2026

can you regenerate example.jsonl so it includes this instruction from your latest dataset generation script? "IMPORTANT: First call the step tool with an empty array [] to see the initial puzzle state. Example: step({"actions": []})"

I run the existing example and get a response like
"text": "I cannot solve the Sokoban puzzle without specific tool observations or the initial state of the puzzle. Please provide the details of the puzzle (e.g., layout, box positions, player position) so I can assist further.",

Also, with the updated example I can run rollouts that look okay, but I do get lots of these messages in the ng_run logs. Do you see this too, and is it expected?


INFO:     127.0.0.1:56362 - "POST /step HTTP/1.1" 422 Unprocessable Entity
Hit validation exception! Errors: [
    {
        "type": "model_attributes_type",
        "loc": [
            "body"
        ],
        "msg": "Input should be a valid dictionary or object to extract fields from",
        "input": [
            "Right"
        ]
    }
]
Full body: [
    "Right"
]

@cmunley1 cmunley1 requested a review from bxyu-nvidia January 9, 2026 22:34
@cmunley1
Contributor

cmunley1 commented Jan 9, 2026

@bxyu-nvidia can u please look at simple_agent changes? I tested offline rollouts with other resources servers, seems unaffected, but want to make sure with you.

…ep endpoint docs

- Regenerated example.jsonl to include instruction about calling step with empty array first
- Added docstring to step endpoint explaining 422 validation errors
- The 422 errors occur when model sends array instead of object format

Signed-off-by: yixin <yixinhuang48@gmail.com>
@yixinhuang48 yixinhuang48 force-pushed the feature/grl-sokoban-final branch from d13ee95 to f1b3306 Compare January 10, 2026 00:05
- Updated /step endpoint to accept both {"actions": [...]} and [...] formats
- This fixes 422 Unprocessable Entity errors when model sends array directly
- Regenerated 5 example rollouts with the fix applied

Signed-off-by: yixin <yixinhuang48@gmail.com>
@yixinhuang48 yixinhuang48 force-pushed the feature/grl-sokoban-final branch from f1b3306 to a19c376 Compare January 10, 2026 00:07
@yixinhuang48
Collaborator Author

@cmunley1 I've updated the example.jsonl and example_rollouts.jsonl files, and fixed the rollout message issue we encountered.

Problem

The /step endpoint was returning 422 Unprocessable Entity errors whenever the model sent actions as a bare array such as ["Right"] instead of the expected {"actions": ["Right"]} object. Those requests failed validation, so the corresponding steps were rejected.

Solution

Updated the /step endpoint in app.py to handle both formats:

  • {"actions": ["Right"]} (expected format)
  • ["Right"] (array-only format that was causing errors)

The endpoint now detects which format it received and normalizes it to the expected structure, so it tolerates a model that sends just the array. This should eliminate the 422 errors you were seeing in the ng_run logs.
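For reference, the normalization described above could look roughly like the sketch below. This is not the code in app.py; the function name `normalize_step_body` and the error message are hypothetical.

```python
# Hypothetical sketch: not the actual app.py implementation.
def normalize_step_body(body):
    """Accept {"actions": [...]} or a bare [...] and return {"actions": [...]}."""
    if isinstance(body, list):
        return {"actions": body}  # bare array sent directly by the model
    if isinstance(body, dict) and isinstance(body.get("actions"), list):
        return body  # already in the expected format
    raise ValueError('expected {"actions": [...]} or a bare list of actions')


from_object = normalize_step_body({"actions": ["Right"]})
from_array = normalize_step_body(["Right"])
```

Both calls yield the same `{"actions": ["Right"]}` structure, while anything else (for example a bare string) is still rejected.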

Contributor

@cmunley1 cmunley1 left a comment

looks alright to me, tested offline rollouts, seems ok.

lets wait for @bxyu-nvidia to review simple_agent changes, at least

@cmunley1 cmunley1 added the resource-server Resource servers (math, code, etc.) label Jan 10, 2026