fix(replay): draw -1 padding fillers from per-row complement set (re-apply to main) #12
Open
DavidBellamy wants to merge 1 commit into main from
Conversation
The previous routing-replay padding-replacement code:

```python
top_indices[padding_mask] = (
    torch.arange(padding_mask.sum(), ...) % scores.shape[1]
)
```
walks a flat `arange` mod `num_experts` to fill `-1` slots, ignoring the row
structure. For any row with one or more `-1`s, the cyclic filler can land
on an expert id that is already present in that same row's existing
top-k picks, producing within-row duplicates.
Downstream, the router converts `top_indices` into a `[num_tokens, num_experts]`
`routing_map` via one-hot scatter, where duplicates within a row silently
collapse. As a result, `routing_map.sum() < num_tokens * topk`. The
`MoEAlltoAllTokenDispatcher` then computes `input_splits` from
`routing_map.sum(dim=0)` but uses `num_out_tokens = num_tokens * topk` for
the permuted buffer, so `sum(input_splits) < permuted_tokens.shape[0]`.
The subsequent `all_to_all_single` call raises:

```
RuntimeError: Split sizes doesn't match total dim 0 size
```
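A torch-free sketch of the shortfall (illustrative only, not the dispatcher's actual code): a within-row duplicate collapses under one-hot scatter, so the routing map undercounts tokens relative to the permuted buffer size.

```python
num_experts = 4
topk = 3

# Rows with within-row duplicates, as produced by the cyclic filler.
top_indices = [[0, 2, 0], [1, 1, 2]]
num_tokens = len(top_indices)

# One-hot scatter: duplicate ids in a row collapse to a single 1.
routing_map = [
    [1 if e in row else 0 for e in range(num_experts)]
    for row in top_indices
]

total_routed = sum(sum(row) for row in routing_map)  # sums input_splits
num_out_tokens = num_tokens * topk                   # permuted buffer rows

print(total_routed, num_out_tokens)  # -> 4 6
```

Here `total_routed` (4) is what `input_splits` sums to, while the permuted buffer holds `num_out_tokens` (6) rows, which is exactly the split-size mismatch `all_to_all_single` rejects.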
This bug is intermittent: it depends on (a) which rows have any `-1`s,
which is a function of rollout-engine truncation/abort luck, and (b)
the `topk` / `num_experts` ratio (collisions are likelier when `topk`
approaches `num_experts`).
This change replaces the cyclic filler with a per-row complement-set
draw: for each row, fill the `-1` slots with the highest-scoring experts
NOT already used in that row, in score-rank order. The result is
deterministic and, by construction, never contains a within-row duplicate.
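A simplified sketch of the complement-set idea (scores, names, and list shapes are illustrative; the actual fix is vectorized over tensors):

```python
num_experts = 4

# Hypothetical per-row expert scores, higher is better.
scores = [
    [0.9, 0.1, 0.7, 0.3],
    [0.2, 0.8, 0.6, 0.4],
]
top_indices = [
    [0, 2, -1],
    [1, -1, -1],
]

for row_scores, row in zip(scores, top_indices):
    used = {e for e in row if e != -1}
    # Experts not yet used in this row, in descending score order.
    complement = sorted(
        (e for e in range(num_experts) if e not in used),
        key=lambda e: -row_scores[e],
    )
    fillers = iter(complement)
    for j, e in enumerate(row):
        if e == -1:
            row[j] = next(fillers)

# Every row now holds distinct expert ids.
print(top_indices)  # -> [[0, 2, 3], [1, 2, 3]]
```

Because each filler is drawn from the row's own complement set, duplicates are impossible regardless of which rows were padded or how close `topk` is to `num_experts`.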
Closes radixark#1002.
Cherry-picks the fix from #11 onto main.

#11 was merged into `deploy` on 2026-04-28. The nightly `deploy` rebuild at 09:00 UTC reconstructs `deploy` from `radixark/miles:main` plus open LLM360 PRs, so closed-as-merged PRs get dropped. As of today, the fix is no longer present on `main`, `deploy`, or `deploy-promoted`: confirmed by `git merge-base --is-ancestor 87c99777 origin/{main,deploy,deploy-promoted}` returning false for all three, and by direct file inspection still showing `arange(padding_mask.sum()) % scores.shape[1]`. Landing on `main` makes the fix permanent against future rebuilds.

Closes radixark#1002 (again).
Original PR: #11
Fix commit: 3c88d7d