Implements "toggle" from Kimi K2.5 by finbarrtimbers · Pull Request #1676 · allenai/open-instruct

finbarrtimbers · 2026-05-11T14:21:48Z

No description provided.

gemini-code-assist

Code Review

This pull request implements the Toggle reward-shaping heuristic from the Kimi K2.5 paper, which alternates between standard scaling and length-penalty phases to control response lengths during training. The implementation includes a new ToggleBudgetTracker class, configuration parameters in StreamingDataLoaderConfig, and integration into the DataPreparationActor. Feedback focuses on critical performance issues in the ToggleBudgetTracker, specifically the inefficient and redundant calculation of percentiles using np.percentile on growing lists within batch loops. It is recommended to cache these values or use streaming percentile estimators to avoid significant training degradation.

gemini-code-assist · 2026-05-11T14:26:39Z

+    def budget(self, dataset) -> float | None:
+        lengths = self.lengths_per_dataset.get(self._key(dataset))
+        if not lengths:
+            return None
+        return float(np.percentile(lengths, self.percentile))


The budget method calls np.percentile on a list that grows indefinitely as training progresses. This operation is $O(N \log N)$ (or $O(N)$ with selection) and is currently called for every sample in every batch via the list comprehension in maybe_apply. This will cause significant performance degradation in the DataPreparationActor thread as the history of correct response lengths grows. Consider caching the budget values once per training step or using a more efficient streaming percentile estimator.
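A minimal sketch of the caching idea in this comment, not the PR's implementation. `CachedBudgetTracker`, `record`, and `recomputations` are hypothetical names introduced for illustration; only `lengths_per_dataset` and `percentile` come from the diff. The percentile helper uses linear interpolation between closest ranks in pure Python so the example stays dependency-free. The cached value is dropped whenever a new length is recorded for that key, so repeated reads within a step cost one dictionary lookup.

```python
from collections import defaultdict


def linear_percentile(values, q):
    """Percentile of `values` at q in [0, 100], linearly interpolated."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


class CachedBudgetTracker:
    """Hypothetical stand-in for ToggleBudgetTracker; only caching is shown."""

    def __init__(self, percentile):
        self.percentile = percentile
        self.lengths_per_dataset = defaultdict(list)
        self._cache = {}         # key -> budget computed since the last update
        self.recomputations = 0  # instrumentation for this example only

    def record(self, key, length):
        """Store a correct-response length and invalidate that key's budget."""
        self.lengths_per_dataset[key].append(length)
        self._cache.pop(key, None)

    def budget(self, key):
        """Return the cached budget, recomputing only after new data arrived."""
        lengths = self.lengths_per_dataset.get(key)
        if not lengths:
            return None
        if key not in self._cache:
            self._cache[key] = float(linear_percentile(lengths, self.percentile))
            self.recomputations += 1
        return self._cache[key]


tracker = CachedBudgetTracker(percentile=50)
for n in (10, 20, 30, 40):
    tracker.record(("math",), n)

# Many reads between updates trigger a single percentile computation.
reads = [tracker.budget(("math",)) for _ in range(100)]
```

This still sorts the full history on every recomputation. Bounding the history (for example with a fixed-size window) or using a streaming quantile estimator would be a separate design choice.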

gemini-code-assist · 2026-05-11T14:26:39Z

+        for dataset in datasets:
+            key = self._key(dataset)
+            if key in seen_keys:
+                continue
+            seen_keys.add(key)
+            budget_value = self.budget(dataset)
+            if budget_value is not None:
+                metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value


This loop iterates over all samples in the batch to populate metrics, calling self.budget(dataset) for each one. Since self.budget performs an expensive percentile calculation, this is highly inefficient when many samples belong to the same dataset (which is typical in GRPO). You should iterate over unique dataset keys in the batch instead.

Suggested change

-        for dataset in datasets:
-            key = self._key(dataset)
-            if key in seen_keys:
-                continue
-            seen_keys.add(key)
-            budget_value = self.budget(dataset)
-            if budget_value is not None:
-                metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value
+        unique_keys = {self._key(d) for d in datasets}
+        budget_map = {key: self.budget_from_key(key) for key in unique_keys}
+        for key, budget_value in budget_map.items():
+            if budget_value is not None:
+                metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value
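The suggestion relies on a `budget_from_key` method that does not appear in the quoted diff. The sketch below shows one way the per-key lookup could be wired up as standalone functions. `dataset_key` and `budget_metrics` are hypothetical names, and the key normalization (a string or a list of strings becomes a tuple of strings) is an assumption inferred from the `'|'.join(key)` call, not confirmed by the PR. Note that the original loop already skips repeated keys through `seen_keys`, so the practical gain here is mostly that the per-key map can be reused elsewhere in the step.

```python
def dataset_key(dataset):
    """Assumed normalization: a str or a list of str becomes a tuple of str."""
    if isinstance(dataset, str):
        return (dataset,)
    return tuple(dataset)


def budget_metrics(datasets, budget_from_key):
    """Build toggle/budget/* metrics, querying each unique key exactly once."""
    metrics = {}
    for key in {dataset_key(d) for d in datasets}:
        value = budget_from_key(key)
        if value is not None:
            metrics[f"toggle/budget/{'|'.join(key)}"] = value
    return metrics


# Example with a stub lookup that records how often it is called.
known_budgets = {("math",): 100.0, ("math", "code"): 250.0}
calls = []


def lookup(key):
    calls.append(key)
    return known_budgets.get(key)


batch = ["math"] * 4 + [["math", "code"]] * 2 + ["empty"]
metrics = budget_metrics(batch, lookup)
```

With seven samples spanning three distinct keys, the stub is queried three times, and the key without a budget produces no metric.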

gemini-code-assist · 2026-05-11T14:26:39Z

+        budgets = np.array(
+            [self.budget(d) if self.budget(d) is not None else np.inf for d in datasets], dtype=np.float64
+        )


Redundant and expensive calls to self.budget(d) inside a list comprehension. This should be optimized by using a pre-calculated map of budgets for the unique datasets present in the current batch to avoid re-computing the same percentile hundreds of times per step.

Suggested change

-        budgets = np.array(
-            [self.budget(d) if self.budget(d) is not None else np.inf for d in datasets], dtype=np.float64
-        )
+        budgets = np.array(
+            [budget_map[self._key(d)] if budget_map[self._key(d)] is not None else np.inf for d in datasets], dtype=np.float64
+        )
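The suggested comprehension still computes the key and indexes `budget_map` twice per sample. A hedged variant that performs one lookup per sample is sketched below. `budgets_array` is a hypothetical helper, and it takes keys that have already been normalized, which is an assumption about how the surrounding code would be arranged.

```python
import numpy as np


def budgets_array(keys, budget_map):
    """Per-sample budgets as float64; missing or None budgets become +inf."""
    values = (budget_map.get(key) for key in keys)
    return np.array(
        [np.inf if value is None else value for value in values],
        dtype=np.float64,
    )


budget_map = {("a",): 120.0, ("b",): None}
keys = [("a",), ("b",), ("a",), ("c",)]
budgets = budgets_array(keys, budget_map)
```

Using `np.inf` for unknown budgets means any comparison of the form `length > budget` is false for those samples, which matches the fallback in the original line.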

finbarrtimbers added 10 commits May 8, 2026 09:16

fd39ef7 Add Toggle reward-shaping heuristic from Kimi K2.5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ecd60c1 Add Toggle hyperparameter sweep launcher for qwen3_4b_dapo_math_oc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
670f9a9 Toggle: handle list-valued dataset field (multi-verifier samples). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ea0b3de Skip p50/m50 in toggle sweep (already running as smoke test). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
72eea8c Use grpo_fast.py in toggle sweep so mason auto-saves checkpoints to weka. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f15c6fc Run toggle sweep with M=250 for p50/p80 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2605152 Add launch_toggle_evals.sh for AIME evals over sweep checkpoints Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
e26b70f Run toggle sweep with M=750 for p50/p80 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
51ff637 Rename toggle phases (normal/length_penalty) and start training in normal phase Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9c3f162 Merge remote-tracking branch 'origin/main' into finbarr/toggle

finbarrtimbers changed the title ~~Implements "toggle~~ Implements "toggle" from Kimi K2.5 May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026

finbarrtimbers wants to merge 10 commits into main from finbarr/toggle
