Add GRL Sokoban Resource Server #564
…rtifacts; keep repo clean Signed-off-by: yixin <yixinhuang48@gmail.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
force-pushed from 297a7f9 to 5491f8d
Changed grl_sokoban_game_agent to grl_sokoban_simple_agent Signed-off-by: yixin <yixinhuang48@gmail.com>
@cmunley1 @bxyu-nvidia this is the updated PR for the one that I closed (#261).
Added Apache 2.0 copyright headers to generation.py and sokoban_env.py Signed-off-by: yixin <yixinhuang48@gmail.com>
…uang48/Gym-1 into feature/grl-sokoban-final
Added proper attribution to https://github.com/lmgame-org/lmenv in docstrings and README, acknowledging the collaborative development with NVIDIA. Signed-off-by: yixin <yixinhuang48@gmail.com>
Can you reupload the dataset generation script or point to a Hugging Face dataset in the README? I removed this by accident on my branch. Otherwise this looks pretty good to me.
Add script to generate diverse test examples with varying seeds, room dimensions, and number of boxes for the GRL Sokoban environment. Signed-off-by: yixin <yixinhuang48@gmail.com>
Just uploaded the dataset generation script. It covers several room-size and box-count configurations for reward profiling, but for the training validation I used only the 6x6 room configuration with 1 box.
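A configuration sweep like the one described above could be sketched roughly as follows. This is only an illustration: the helper name build_configs, the field names (seed, dim_room, num_boxes), and the 8x8 and 2-box values are assumptions, not taken from the actual generation script. Only the 6x6 room with 1 box comes from this thread.

```python
import itertools
import json
import random


def build_configs(room_sizes, box_counts, seeds_per_config, base_seed=0):
    """Return one config dict per (room size, box count, seed) combination."""
    rng = random.Random(base_seed)  # fixed base seed keeps the sweep reproducible
    configs = []
    for (height, width), num_boxes in itertools.product(room_sizes, box_counts):
        for _ in range(seeds_per_config):
            configs.append(
                {
                    "seed": rng.randrange(2**31),  # hypothetical field name
                    "dim_room": [height, width],   # hypothetical field name
                    "num_boxes": num_boxes,        # hypothetical field name
                }
            )
    return configs


# Hypothetical sweep for reward profiling: several room sizes and box counts.
profiling = build_configs([(6, 6), (8, 8)], [1, 2], seeds_per_config=3)

# Setup mentioned in the thread: 6x6 rooms with a single box.
training = build_configs([(6, 6)], [1], seeds_per_config=4)

for cfg in training:
    print(json.dumps(cfg))  # one JSON object per line, as in a .jsonl file
```

Drawing every seed from a single seeded generator means rerunning the sketch reproduces the same set of puzzles, which helps when comparing reward profiles across models.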
Can you regenerate example.jsonl so it includes this instruction from your latest dataset generation script? "IMPORTANT: First call the …"
I ran the existing example and got a response like …
Also, with the updated example, I can run rollouts that look okay, but I do get lots of these messages in the ng_run logs: … Do you see this too? Is it expected?
@bxyu-nvidia can you please look at the simple_agent changes? I tested offline rollouts with other resource servers and they seem unaffected, but I want to make sure with you.
…ep endpoint docs
- Regenerated example.jsonl to include instruction about calling step with empty array first
- Added docstring to step endpoint explaining 422 validation errors
- The 422 errors occur when model sends array instead of object format
Signed-off-by: yixin <yixinhuang48@gmail.com>
force-pushed from d13ee95 to f1b3306
- Updated /step endpoint to accept both {"actions": [...]} and [...] formats
- This fixes 422 Unprocessable Entity errors when model sends array directly
- Regenerated 5 example rollouts with the fix applied
Signed-off-by: yixin <yixinhuang48@gmail.com>
force-pushed from f1b3306 to a19c376
@cmunley1 I've updated the …
Problem
The Solution
Updated the …
The endpoint now automatically detects the format and normalizes it to the expected structure, making it more robust when the model sends just the array directly. Hopefully this eliminates the 422 errors you were seeing in the …
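For readers following along, the format detection described here could look something like the sketch below. It is a standalone illustration, not the code in this PR: the function name normalize_step_payload and the action strings are hypothetical, and the real endpoint may do this inside its request model instead.

```python
from typing import Any, Dict, List, Union


def normalize_step_payload(
    body: Union[Dict[str, Any], List[Any]],
) -> Dict[str, List[Any]]:
    """Coerce a /step request body into the {"actions": [...]} shape."""
    if isinstance(body, list):
        # Model sent the bare array: wrap it in the expected object.
        return {"actions": list(body)}
    if isinstance(body, dict) and isinstance(body.get("actions"), list):
        # Already in the expected object format: copy it through.
        return {"actions": list(body["actions"])}
    # Anything else is still rejected, so malformed calls stay visible.
    raise ValueError('expected {"actions": [...]} or a bare list')


# Both request shapes end up identical (action names are hypothetical).
print(normalize_step_payload({"actions": ["Up", "Right"]}))
print(normalize_step_payload(["Up", "Right"]))
```

Normalizing before validation keeps the strict {"actions": [...]} contract for everything downstream while tolerating the one deviation the model was actually producing.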
cmunley1 left a comment
Looks alright to me. Tested offline rollouts, seems ok.
Let's at least wait for @bxyu-nvidia to review the simple_agent changes.
Contributing To NeMo-Gym (GRL Sokoban Resource Server)
1) Necessary information
i. Corresponding dataset on the spreadsheet
ii. Description of the prompt (source + domain)
… step tool to push boxes onto goals efficiently.
iii. Description of the environment
resources_servers/grl_sokoban/sokoban_env, adapted from GRL.
iv. Description of the verifier
success=true when all boxes are placed on goals; reward from env logic; zero on failure.
/verify computes final reward and cleans up session state.
v. Legal approval status
2) Simple correctness check
i. Commands used to run the server
ii. Resulting rollout and judges (5 examples)
resources_servers/grl_sokoban/data/example_rollouts.jsonl
success=true
success=false
iii. Additional notes
… /seed_session before /step.
3) Tests
Test files / command
Coverage notes
4) Reward profiling
Models
Method
… done/max_steps.
Commands
Results
resources_servers/grl_sokoban/data/qwen3_4b_eval/reward-analysis.md
Results from running Qwen3-4B on 3,200 rollouts (200 prompts × 16 rollouts):
Overall Metrics
Tool Call Statistics
Reward Distribution
Performance by Tool Call Count
Key Observations
5) Training
Trained Qwen3-4B-Instruct-2507 with GRPO for 1 epoch on 1,600 training examples and 400 validation examples, using the 6x6 Sokoban configuration.
Results
We initially tried training Qwen3-4B (thinking mode), but the results weren't good. For Qwen3-4B-Instruct-2507, the training/validation curves trend upward overall, showing slight improvement (#261 (comment)); the gain could be more significant with more data or more epochs.
Appendix
… responses_api_agents/game_agent from the Tetris PR.