Implements "toggle" from Kimi K2.5#1676
Conversation
…laude Opus 4.7 <noreply@anthropic.com>
…o-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eka. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….7 <noreply@anthropic.com>
…uthored-By: Claude Opus 4.7 <noreply@anthropic.com>
….7 <noreply@anthropic.com>
…rmal phase Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements the Toggle reward-shaping heuristic from the Kimi K2.5 paper, which alternates between standard scaling and length-penalty phases to control response lengths during training. The implementation includes a new ToggleBudgetTracker class, configuration parameters in StreamingDataLoaderConfig, and integration into the DataPreparationActor. Feedback focuses on critical performance issues in the ToggleBudgetTracker, specifically the inefficient and redundant calculation of percentiles using np.percentile on growing lists within batch loops. It is recommended to cache these values or use streaming percentile estimators to avoid significant training degradation.
| def budget(self, dataset) -> float | None: | ||
| lengths = self.lengths_per_dataset.get(self._key(dataset)) | ||
| if not lengths: | ||
| return None | ||
| return float(np.percentile(lengths, self.percentile)) |
There was a problem hiding this comment.
The budget method calls np.percentile on a list that grows indefinitely as training progresses. This operation is maybe_apply. This will cause significant performance degradation in the DataPreparationActor thread as the history of correct response lengths grows. Consider caching the budget values once per training step or using a more efficient streaming percentile estimator.
| for dataset in datasets: | ||
| key = self._key(dataset) | ||
| if key in seen_keys: | ||
| continue | ||
| seen_keys.add(key) | ||
| budget_value = self.budget(dataset) | ||
| if budget_value is not None: | ||
| metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value |
There was a problem hiding this comment.
This loop iterates over all samples in the batch to populate metrics, calling self.budget(dataset) for each one. Since self.budget performs an expensive percentile calculation, this is highly inefficient when many samples belong to the same dataset (which is typical in GRPO). You should iterate over unique dataset keys in the batch instead.
| for dataset in datasets: | |
| key = self._key(dataset) | |
| if key in seen_keys: | |
| continue | |
| seen_keys.add(key) | |
| budget_value = self.budget(dataset) | |
| if budget_value is not None: | |
| metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value | |
| unique_keys = {self._key(d) for d in datasets} | |
| budget_map = {key: self.budget_from_key(key) for key in unique_keys} | |
| for key, budget_value in budget_map.items(): | |
| if budget_value is not None: | |
| metrics[f"toggle/budget/{'|'.join(key)}"] = budget_value |
| budgets = np.array( | ||
| [self.budget(d) if self.budget(d) is not None else np.inf for d in datasets], dtype=np.float64 | ||
| ) |
There was a problem hiding this comment.
Redundant and expensive calls to self.budget(d) inside a list comprehension. This should be optimized by using a pre-calculated map of budgets for the unique datasets present in the current batch to avoid re-computing the same percentile hundreds of times per step.
| budgets = np.array( | |
| [self.budget(d) if self.budget(d) is not None else np.inf for d in datasets], dtype=np.float64 | |
| ) | |
| budgets = np.array( | |
| [budget_map[self._key(d)] if budget_map[self._key(d)] is not None else np.inf for d in datasets], dtype=np.float64 | |
| ) |
No description provided.