
Conversation

@yixinhuang48 (Collaborator) commented Oct 30, 2025

Contributing To NeMo-Gym (GRL Sokoban Resource Server)

1) Necessary information

i. Corresponding dataset on the spreadsheet

  • N/A

ii. Description of the prompt (source + domain)

  • Domain: Sokoban (grid-based puzzle; tool-use agent).
  • Source: Synthetic prompts generated programmatically. Prompts instruct the agent to use the step tool to push boxes onto goals efficiently.

iii. Description of the environment

  • A self-contained Sokoban environment under resources_servers/grl_sokoban/sokoban_env, adapted from GRL.
  • Configurable map sizes and generation parameters.
  • Observation: ASCII grid encoding walls, goals, boxes, and the player (see the sketch after this list).
  • Actions: Up, Down, Left, Right.
  • FastAPI resource server following NeMo Gym conventions.
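
For orientation, here is a minimal sketch of the observation encoding and action mapping described above. The glyphs and the label-to-index mapping are illustrative assumptions, not necessarily the server's actual encoding.

# Hypothetical ASCII observation for a small map; glyph choices are assumed for illustration.
observation = (
    "######\n"
    "#_O__#\n"   # '_' floor, 'O' goal
    "#_X__#\n"   # 'X' box
    "#_P__#\n"   # 'P' player
    "######"
)

# Actions are Up/Down/Left/Right; an index mapping like this is assumed for the example.
ACTIONS = {"Up": 0, "Down": 1, "Left": 2, "Right": 3}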

iv. Description of the verifier

  • The verifier is the environment itself: success=true when all boxes are placed on goals; the reward comes from the environment's logic and is zero on failure.
  • /verify computes the final reward and cleans up session state (sketched below).
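
As a rough illustration of that verification rule, a minimal sketch; the session/env attribute names below are assumptions, not the server's actual internals.

# Sketch of the verifier rule: success iff every box sits on a goal.
# Attribute and helper names are hypothetical.
def verify(session) -> dict:
    env = session.env
    success = env.num_boxes_on_goals == env.num_boxes
    reward = env.episode_reward if success else 0.0   # zero on failure, per the description above
    session.cleanup()                                  # /verify also tears down session state
    return {"success": success, "reward": reward}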

v. Legal approval status

  • Code: Apache 2.0.
  • Data: Synthetic, programmatically generated (Apache 2.0).

2) Simple correctness check

i. Commands used to run the server

# Start NeMo Gym servers (agent + Sokoban)
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
  +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl

ii. Resulting rollout and judges (5 examples)

  • See resources_servers/grl_sokoban/data/example_rollouts.jsonl (a hypothetical record layout is sketched after this list)
  • Expected behavior:
    • All boxes on goals → positive reward, success=true
    • Otherwise penalties or zero reward, success=false
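
For orientation, a hypothetical shape of one line in that JSONL file; the real schema is not reproduced here, so every field name below is an assumption.

# Hypothetical rollout record (one JSON line); field names are assumed.
rollout = {
    "prompt": "Use the step tool to push the box onto the goal.",
    "tool_calls": ["Right", "Up", "Up"],
    "reward": 10.8,      # positive when all boxes end on goals
    "success": True,     # false, with zero or negative reward, otherwise
}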

iii. Additional notes

  • Must call /seed_session before /step (see the client sketch after this list).
  • Actions accepted as labels or indices.
  • Session cookies are maintained by middleware; agent path propagates cookies.
  • Large rollout artifacts are gitignored; do not commit them.
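
A minimal client sketch of that flow, assuming default endpoint paths and simple JSON payloads (both are assumptions; the real request and response bodies may differ):

import requests

BASE = "http://localhost:8000"    # assumed server address
client = requests.Session()       # reuses the session cookie set by the middleware

# 1) Seed a session before stepping (required).
obs = client.post(f"{BASE}/seed_session", json={"seed": 0}).json()

# 2) Step with an action label; indices are also accepted per the notes above.
obs = client.post(f"{BASE}/step", json={"action": "Up"}).json()

# 3) Verify computes the final reward and cleans up the session state.
result = client.post(f"{BASE}/verify", json={}).json()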

3) Tests

Test files / command

pytest resources_servers/grl_sokoban/tests -q

Coverage notes

  • Sokoban server tests cover the seed/step flow, invalid actions, done handling, verify success/failure, and cleanup (an illustrative test sketch follows).
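
As an illustration of the kind of check those tests perform (a sketch only; the app import path and payloads are assumptions, and the repo's actual fixtures may differ):

# Sketch of a seed -> step -> verify test using FastAPI's TestClient.
from fastapi.testclient import TestClient
from resources_servers.grl_sokoban.app import app   # hypothetical import path

def test_seed_step_verify():
    client = TestClient(app)                         # cookies persist across requests
    assert client.post("/seed_session", json={"seed": 0}).status_code == 200
    assert client.post("/step", json={"action": "Up"}).status_code == 200
    body = client.post("/verify", json={}).json()
    assert "reward" in body and "success" in body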

4) Reward profiling

Models

  • Qwen3-4B (results reported below).

Method

  • 200 prompts × 16 rollouts each (3,200 total).
  • Tool calling enabled; the agent loops until the episode is done or max_steps is reached (schematic loop sketched below).
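
Schematically, each rollout during profiling follows a loop like the one below; max_steps, the model interface, and the client helpers are placeholders, not the actual game_agent implementation.

# Sketch of the per-rollout loop: ask the model for an action, execute the step
# tool call, and stop when the episode is done or max_steps is reached.
MAX_STEPS = 10   # assumed cap for illustration

def run_rollout(model, env_client, prompt):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(MAX_STEPS):
        action = model.choose_action(messages)        # hypothetical model interface
        obs = env_client.step(action)                 # POST /step under the hood
        messages.append({"role": "tool", "content": obs["observation"]})
        if obs.get("done"):
            break
    return env_client.verify()                        # final reward + session cleanup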

Commands

cd resources_servers/grl_sokoban
./run_qwen3_4b_eval_loop.sh  # or ./run_qwen3_4b_eval.sh

# Manual analysis
python analyze_rewards.py \
  --rollouts-path resources_servers/grl_sokoban/data/qwen3_4b_eval/rollouts.jsonl \
  --model-name "Qwen3-4B" \
  --output resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md

Results

  • Report file: resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md
  • The report includes success rate, mean/median reward, and tool-call statistics (a sketch of how such metrics can be computed follows).
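
For reference, the headline numbers below could be reproduced from the rollouts file with something like this; the field names are assumptions, and analyze_rewards.py's actual implementation is not shown here.

import json
import statistics

rewards, successes, tool_calls = [], [], []
with open("rollouts.jsonl") as f:                     # path and field names assumed
    for line in f:
        record = json.loads(line)
        rewards.append(record["reward"])
        successes.append(bool(record["success"]))
        tool_calls.append(len(record["tool_calls"]))

print("success rate:", sum(successes) / len(successes))
print("mean reward:", statistics.mean(rewards))
print("median reward:", statistics.median(rewards))
# Pearson correlation between tool-call count and reward (Python 3.10+)
print("correlation:", statistics.correlation(tool_calls, rewards))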

Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):

Overall Metrics

  • Total Rollouts: 3,200
  • Success Rate: 13.47% (431 / 3,200)
  • Mean Reward: 0.9305
  • Median Reward: 0.0000
  • Min Reward: -8.9000
  • Max Reward: 10.9000

Tool Call Statistics

  • Average Tool Calls: 2.64 per rollout
  • Min Tool Calls: 1
  • Max Tool Calls: 11
  • Correlation (tool calls ↔ reward): -0.2338 (negative correlation)

Reward Distribution

  • 0.0 reward: 2,134 occurrences (66.7%) - immediate failures
  • 10.8 reward: 206 occurrences (6.4%)
  • 10.9 reward: 72 occurrences (2.2%)
  • 10.7 reward: 51 occurrences (1.6%)
  • Negative rewards: ~800 occurrences (25%) - invalid moves/failures

Performance by Tool Call Count

Tool Calls | Mean Reward | Rollout Count | Notes
1          |  0.0000     | 2,112         | Immediate failures (66%)
2          |  7.0948     | 174           | Quick successes
3          |  8.0076     | 314           | Best average performance
4          |  4.9391     | 87            | Moderate attempts
5          |  3.0453     | 53            | Declining performance
10         | -3.5120     | 409           | Getting stuck in loops

Key Observations

  1. High Early Failure Rate: about 66% of rollouts fail immediately with a single tool call, suggesting the model often doesn't properly engage with the task.
  2. Negative Correlation: more tool calls correlate with worse outcomes (-0.2338), indicating the model gets stuck in invalid-move patterns.
  3. Sweet Spot: rollouts with 2-3 tool calls perform best (mean rewards ~7-8), suggesting successful puzzles are solved quickly.
  4. Success Pattern: when successful, the model typically completes a puzzle in 2-3 moves, but this happens in only ~13.5% of cases (431/3,200).

5) Training

Trained with GRPO on Qwen3-4B-Instruct-2507 for 1 epoch, using 1,600 training examples and 400 validation examples with the 6x6 Sokoban configuration.

Results

We initially tried training Qwen3-4B (thinking mode) and the results were not good. For Qwen3-4B-Instruct-2507, the training/validation curves show a slight but consistent upward trend (#261 (comment)); the improvement could become more significant with more data or more epochs.


Appendix

  • Reuses shared responses_api_agents/game_agent from the Tetris PR.
  • Added as a new resource server; existing servers unchanged.

…rtifacts; keep repo clean

Signed-off-by: yixin <yixinhuang48@gmail.com>
copy-pr-bot (bot) commented Oct 30, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@yixinhuang48 force-pushed the feature/grl-sokoban-final branch from 2cf6b1b to b66677e on November 4, 2025 21:41
@yixinhuang48 requested a review from cmunley1 on November 7, 2025 03:06
+num_repeats=16 \
+num_samples_in_parallel=16 \
+responses_create_params.temperature=0.8 \
+responses_create_params.max_output_tokens=4096
Contributor (review comment on the diff above):

4k max output tokens seems short but it is fine.

@@ -0,0 +1,296 @@
# GRL Sokoban Resource Server

Single-box Sokoban puzzle environment served via FastAPI with NeMo Gym conventions. The environment is implemented locally under `resources_servers/grl_sokoban/env`, mirroring GRL’s behaviour without requiring the external repository.
Contributor (review comment on the README above):
I think it's worth pointing out that this integrates the environment from https://github.com/mpSchrader/gym-sokoban, which implements DeepMind's paper Imagination-Augmented Agents for Deep Reinforcement Learning and follows the Gymnasium standard (https://gymnasium.farama.org/).

Collaborator Author (reply):

Yes, good catch!

@yixinhuang48 requested a review from cmunley1 on November 7, 2025 19:44
@cwing-nvidia added the resource-server label on Nov 15, 2025
@yixinhuang48 (Collaborator Author) commented:

[Four screenshots omitted: training and validation curves.]

Training and validation curves (1,600 training examples, 400 validation examples, 6x6 Sokoban configuration, Qwen3-4B-Instruct-2507).

@yixinhuang48 (Collaborator Author) commented:

@cwing-nvidia @pjin-nvidia @cmunley1 — would anyone have time to help review this PR? Thanks!

@cmunley1 (Contributor) commented Jan 9, 2026:

Hi @yixinhuang48, sorry for the delay and thanks so much for the contribution. To keep the codebase standard as we scale up to more environments, can we remove the reward profiling scripts and keep mainly the core environment logic? I suggest providing a summary of the reward profiling results in the PR comments here or in the README. We can create a general recipe for reward profiling / SDG / rollout collection that will work for this and other environments.

I would suggest deleting the following and updating the README and/or this PR with a summary of the reward profiling results.

resources_servers/grl_sokoban/analyze_rewards.py
resources_servers/grl_sokoban/checkpoint_resume_rollouts.py
resources_servers/grl_sokoban/data/qwen3_30b_eval/
resources_servers/grl_sokoban/data/qwen3_4b_eval/
resources_servers/grl_sokoban/run_qwen3_4b_eval.sh
resources_servers/grl_sokoban/run_qwen3_30b_eval.sh
resources_servers/grl_sokoban/run_qwen3_4b_eval_loop.sh
resources_servers/grl_sokoban/run_qwen3_30b_eval_loop.sh

@yixinhuang48 (Collaborator Author) commented Jan 9, 2026:

Hi @cmunley1, no worries, and this is a good idea. I checked out your branch (https://github.com/NVIDIA-NeMo/Gym/tree/cmunley1/grl-sokoban), saw that you had already removed all the files mentioned above plus some additional ones, and merged it.

@cmunley1 (Contributor) commented Jan 9, 2026:

Thanks. I also removed game_agent and added support for the required features in simple_agent, but we need to make sure this doesn't affect other environments.

cc @bxyu-nvidia

@yixinhuang48 (Collaborator Author) commented:

Cool, thanks! One issue I noticed is that the new commits on my branch are not reflected in this PR. Will this be fixed once the automated CI/CD pipeline runs?

@cmunley1 (Contributor) commented Jan 9, 2026:

It should update; can you double-check on your side, or create a new PR if you have to?

@yixinhuang48 (Collaborator Author) commented:

@cmunley1 I just created a new PR: #564.
