feat: add FrozenLake multi-turn tool-call GRPO training example#168
Merged
feat: add FrozenLake multi-turn tool-call GRPO training example#168
Conversation
xiaoyifan
approved these changes
Mar 5, 2026
Add a complete GRPO training example for FrozenLake with multi-turn tool calls using the eval-protocol SDK for rollouts and Fireworks hosted trainer/deployment infrastructure. Key changes: - New frozen_lake example: train_frozen_lake.py, verify_rollout.py, seeds.jsonl - Per-position loss_mask in GRPO loss for multi-turn episodes (only model-generated completion tokens receive gradients, environment/tool tokens are masked) - Training shape support in infra.py (pass training_shape to server, clear manual accelerator settings to let server auto-configure) - Signal handling and robust cleanup (always delete deployment + trainer jobs on exit, capture job IDs even on partial failure) - Log rollout metrics to WandB even when all prompt groups are filtered - Accept UPDATING deployment state in setup_deployment - Compatibility with SDK versions that lack disable_speculative_decoding Made-with: Cursor
…raining Move domain-specific Frozen Lake modules (env, schema, rollout processor) from eval-protocol into the cookbook, so eval-protocol stays generic and the example is self-contained. Key improvements: - FrozenLake rollout processor now uses the generic FireworksV1CompletionsClient with a pluggable tool_call_parser callback - GRPO loss applies per-position loss_mask and reports granular metrics (active_tokens, mask_ratio, mean_adv_loss, mean_kl_penalty, inf_kld) - Training script logs detailed step summaries and uses monotonic WandB step counter to avoid step conflicts on filtered/skipped steps - Filtered steps still push rollout metrics to WandB for visibility Made-with: Cursor
…ompletion With enable_thinking unset, the Qwen3 template doesn't include <think>\n\n</think>\n\n in the generation prompt, so the model generates those tokens as part of its completion. This caused them to receive loss_mask=1.0 and gradients during training. Setting enable_thinking=False makes the template include the empty thinking block in the prompt. The model's completion_ids then start after </think>, correctly excluding template tokens from the loss. Made-with: Cursor
The frozen lake GRPO example imports from eval_protocol for the generic /v1/completions client and rollout processor types. Made-with: Cursor
Both training (loss_mask) and visualization (UI mask) now derive from the same compute_model_output_spans() function, eliminating duplicated turn-boundary logic. Tests verify the two masks agree on model-generated positions after accounting for the logprob coordinate shift. Made-with: Cursor
Contributor
Author
|
Pushed follow-up commit \ with the rebase fixes and validated a full run on \ (). |
febbaa3 to
3f7adf6
Compare
Hecate0821
pushed a commit
that referenced
this pull request
Mar 6, 2026
PR #168 (FrozenLake example) incorrectly overwrote several shared utilities with older/incompatible versions during merge: - Restore TrainStepFns interface in train.py (reverts MinibatchTrainFns rewrite that broke rl_loop.py recipe) - Remove _install_tinker_future_retrieve_compat() from client.py (workaround no longer needed, fixed server-side) - Restore direct disable_speculative_decoding= in config.py (removes unnecessary inspect guard) - Remove grad_accum param and restore apply_shape() in infra.py - Update frozen_lake example to use TrainStepFns (1:1 loop) Made-with: Cursor
This was referenced Mar 6, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
training/examples/frozen_lake/— eval-protocol stays genericFireworksV1CompletionsClientwith a pluggabletool_call_parsercallbackactive_tokens,mask_ratio,mean_adv_loss,mean_kl_penalty,inf_kldFiles
New: Frozen Lake example
training/examples/frozen_lake/train_frozen_lake.py— main GRPO training scripttraining/examples/frozen_lake/verify_rollout.py— single-rollout verification with eval-protocol UItraining/examples/frozen_lake/frozen_lake_env.py— deterministic FrozenLake environmenttraining/examples/frozen_lake/frozen_lake_schema.py— tool schema, action defs, XML parsingtraining/examples/frozen_lake/frozen_lake_rollout.py— rollout processor wiring generic client with FrozenLake envtraining/examples/frozen_lake/seeds.jsonl— reproducible seed contextsModified: training utilities
training/utils/rl/common.py— _get_loss_mask helper for per-position maskingtraining/utils/rl/grpo.py— apply loss_mask, emit granular metricstraining/utils/rl/train.py— emit rollout metrics on filtered/skipped stepsDependencies
Test plan