Add GRL Sokoban Resource Server #261
Conversation
…rtifacts; keep repo clean Signed-off-by: yixin <yixinhuang48@gmail.com>
Force-pushed from 2cf6b1b to b66677e.
```diff
+num_repeats=16 \
+num_samples_in_parallel=16 \
+responses_create_params.temperature=0.8 \
+responses_create_params.max_output_tokens=4096
```
4k max output tokens seems short but it is fine.
```diff
@@ -0,0 +1,296 @@
+# GRL Sokoban Resource Server
+
+Single-box Sokoban puzzle environment served via FastAPI with NeMo Gym conventions. The environment is implemented locally under `resources_servers/grl_sokoban/env`, mirroring GRL's behaviour without requiring the external repository.
```
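The seed/step/verify session lifecycle this server exposes (the endpoint names `/seed_session`, `/step`, and `/verify` appear later in this PR) can be sketched as plain functions. In the real resource server each would be a FastAPI POST route; the state and payload shapes here are illustrative assumptions, not the actual implementation.

```python
# Sketch of the session lifecycle; in the actual server each function
# would be a FastAPI POST route (/seed_session, /step, /verify).
# State shapes are assumptions for illustration.
import uuid

sessions: dict = {}  # session_id -> mutable episode state

def seed_session(seed: int) -> str:
    """Create an episode; returns an opaque session id."""
    session_id = str(uuid.uuid4())
    sessions[session_id] = {"seed": seed, "steps": 0, "solved": False}
    return session_id

def step(session_id: str, action: str) -> dict:
    """Apply one action; real box-pushing logic would live here."""
    state = sessions[session_id]
    state["steps"] += 1
    return {"done": state["solved"], "steps": state["steps"]}

def verify(session_id: str) -> dict:
    """Compute the final reward and clean up the session state."""
    state = sessions.pop(session_id)
    reward = 1.0 if state["solved"] else 0.0
    return {"success": state["solved"], "reward": reward}
```

Keeping session state keyed by an opaque id is what lets the server host many concurrent rollouts against one process.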
I think it's worth pointing out that this integrates the environment https://github.com/mpSchrader/gym-sokoban, which implements DeepMind's "Imagination-Augmented Agents for Deep Reinforcement Learning" paper and follows the Gymnasium standard: https://gymnasium.farama.org/
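For reference, the Gymnasium standard mentioned here defines `reset() -> (obs, info)` and `step(action) -> (obs, reward, terminated, truncated, info)`. A toy one-dimensional box-pushing environment following that contract (not the actual gym-sokoban implementation) might look like:

```python
# Toy env illustrating the Gymnasium step/reset contract; the agent
# pushes a box along a 1-D line onto the goal at the far end.
# This is a sketch, not gym-sokoban itself.
class ToyBoxPushEnv:
    def __init__(self, size: int = 5, max_steps: int = 20):
        self.size, self.max_steps = size, max_steps

    def reset(self, seed=None):
        self.agent, self.box, self.goal = 0, 1, self.size - 1
        self.steps = 0
        return (self.agent, self.box), {}  # (observation, info)

    def step(self, action: int):  # action: +1 (right) or -1 (left)
        self.steps += 1
        nxt = self.agent + action
        if nxt == self.box:  # moving into the box pushes it
            box_nxt = self.box + action
            if 0 <= box_nxt < self.size:
                self.box, self.agent = box_nxt, nxt
        elif 0 <= nxt < self.size:
            self.agent = nxt
        terminated = self.box == self.goal
        truncated = self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return (self.agent, self.box), reward, terminated, truncated, {}
```

The terminated/truncated split mirrors the resource server's distinction between solving the puzzle and hitting the step budget.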
Yes, good catch!
@cwing-nvidia @pjin-nvidia @cmunley1 — would anyone have time to help review this PR? Thanks!
Hi @yixinhuang48, sorry for the delay and thanks so much for the contribution. In order to keep the codebase standard as we scale up to more environments, can we remove the reward profiling scripts and mainly keep the core environment logic? I suggest providing a summary of the reward profiling results in the PR comments here or in the readme. We can create a general recipe for reward profiling / SDG / rollout collection that will work for this and other environments. I would suggest deleting the following and updating the readme and/or this PR with a summary of the reward profiling results:
- resources_servers/grl_sokoban/analyze_rewards.py
Hi @cmunley1, no worries. Yeah, this is a good idea. I just checked out your branch (https://github.com/NVIDIA-NeMo/Gym/tree/cmunley1/grl-sokoban) and saw that you had already removed all the files mentioned above plus some additional ones, and I just merged it.
Thanks! I also removed game_agent and added support for the required features in simple_agent, but we need to make sure this doesn't affect other environments. cc @bxyu-nvidia
Cool, thanks! One issue I noticed is that the new commits on my branch are not reflected in this PR. I'm wondering whether this will be fixed when the automated CI/CD pipeline runs?
It should update; can you double-check on your side, or create a new PR if you have to?




Contributing To NeMo-Gym (GRL Sokoban Resource Server)
1) Necessary information
i. Corresponding dataset on the spreadsheet
ii. Description of the prompt (source + domain)
- `step` tool to push boxes onto goals efficiently.

iii. Description of the environment
- `resources_servers/grl_sokoban/sokoban_env`, adapted from GRL.

iv. Description of the verifier
- `success=true` when all boxes are placed on goals; reward from env logic; zero on failure.
- `/verify` computes the final reward and cleans up session state.

v. Legal approval status
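The verifier described in iv. above (`success=true` when every box sits on a goal, zero reward on failure) can be sketched as a pure function; the coordinate-set representation is an assumption for illustration:

```python
# Success check sketch: success iff every box position coincides with a
# goal position. Positions as sets of (row, col) tuples is an assumed
# representation, not necessarily the server's internal one.
def all_boxes_on_goals(boxes: set, goals: set) -> bool:
    # With one box and one goal this reduces to "the box is on the goal".
    return boxes == goals

def final_reward(boxes: set, goals: set) -> float:
    # Zero on failure; env-defined success reward (here 1.0) otherwise.
    return 1.0 if all_boxes_on_goals(boxes, goals) else 0.0
```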
2) Simple correctness check
i. Commands used to run the server
ii. Resulting rollout and judges (5 examples)
- `resources_servers/grl_sokoban/data/example_rollouts.jsonl` (`success=true` / `success=false`)

iii. Additional notes
- `/seed_session` before `/step`.

3) Tests
Test files / command
Coverage notes
4) Reward profiling
Models
Method
- `done`/`max_steps`.

Commands
Results
- `resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md`

Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):
Overall Metrics
Tool Call Statistics
Reward Distribution
Performance by Tool Call Count
Key Observations
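The summaries above (overall metrics, reward distribution, performance by tool-call count) can be derived from a rollouts JSONL file with a small script; the field names `reward` and `num_tool_calls` are assumptions about the rollout schema, not the actual analysis code from this PR.

```python
# Sketch: aggregate per-rollout records into the kinds of summaries
# listed above. Field names "reward" and "num_tool_calls" are assumed.
import json
from collections import defaultdict
from statistics import mean

def profile_rewards(jsonl_lines: list) -> dict:
    rollouts = [json.loads(line) for line in jsonl_lines]
    by_calls = defaultdict(list)
    for r in rollouts:
        by_calls[r["num_tool_calls"]].append(r["reward"])
    return {
        "mean_reward": mean(r["reward"] for r in rollouts),
        "success_rate": mean(1.0 if r["reward"] > 0 else 0.0 for r in rollouts),
        # mean reward grouped by how many tool calls the rollout used
        "reward_by_tool_calls": {k: mean(v) for k, v in sorted(by_calls.items())},
    }
```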
5) Training
Trained with GRPO on Qwen3-4B-Instruct-2507, using 1,600 training examples and 400 validation examples for 1 epoch on the 6x6 Sokoban configuration.
Results
Initially tried training Qwen3-4B (thinking mode), but results weren't good. For Qwen3-4B-Instruct-2507, the training/validation curves show an overall upward trend with slight improvement (#261 (comment)), which could become more significant with more data or more epochs.
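For context, GRPO scores each rollout against the other rollouts sampled for the same prompt (`num_repeats=16` in this PR's config) by normalizing rewards within the group. A minimal sketch of that advantage computation, not the trainer actually used here:

```python
# GRPO group-relative advantage sketch: for the N rollouts sampled from
# one prompt, the advantage is the reward standardized against the
# group mean and standard deviation.
from statistics import mean, stdev

def grpo_advantages(group_rewards: list, eps: float = 1e-6) -> list:
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

With binary success rewards like Sokoban's, this pushes probability mass toward the solving rollouts in each group and away from the failing ones.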
Appendix
- `responses_api_agents/game_agent` from the Tetris PR.