fix(replay): draw -1 padding fillers from per-row complement set (re-apply to main) #12

Open

DavidBellamy wants to merge 1 commit into main from fix/r3-padding-cycle-collision-to-main

Conversation

@DavidBellamy Collaborator

Cherry-picks the fix from #11 onto main.

#11 was merged into deploy on 2026-04-28. The nightly deploy rebuild at 09:00 UTC reconstructs deploy from radixark/miles:main plus open LLM360 PRs, so PRs closed as merged get dropped. As of today the fix is no longer present on main, deploy, or deploy-promoted: git merge-base --is-ancestor 87c99777 origin/{main,deploy,deploy-promoted} returns false for all three, and direct file inspection still shows arange(padding_mask.sum()) % scores.shape[1].

Landing on main makes the fix permanent against future rebuilds.

Closes radixark#1002 (again).

Original PR: #11
Fix commit: 3c88d7d

The previous routing-replay padding-replacement code:

    # Flat arange over every -1 slot, cycled mod num_experts:
    top_indices[padding_mask] = (
        torch.arange(padding_mask.sum(), ...) % scores.shape[1]
    )

walks a flat arange mod num_experts to fill -1 slots, ignoring the row
structure. For any row with one or more -1s, the cyclic filler can land
on an expert id that is already present in that same row's existing
topk picks, producing within-row duplicates.
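A minimal repro of the collision (hypothetical shapes and values, not the production code):

    import torch

    num_experts = 4
    top_indices = torch.tensor([[0, 1, -1],   # row 0 has one -1 slot
                                [2, 3, -1]])  # row 1 has one -1 slot

    padding_mask = top_indices == -1
    # The old flat-arange filler, cycled mod num_experts:
    top_indices[padding_mask] = (
        torch.arange(padding_mask.sum()) % num_experts
    )
    print(top_indices)
    # tensor([[0, 1, 0],   <- expert 0 duplicated within row 0
    #         [2, 3, 1]])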

Downstream, the router converts top_indices into a [num_tokens, num_experts]
routing_map via one-hot scatter, where duplicates within a row silently
collapse. As a result, routing_map.sum() < num_tokens * topk. The
MoEAlltoAllTokenDispatcher then computes input_splits from
routing_map.sum(dim=0) but uses num_out_tokens = num_tokens * topk for
the permuted buffer, so sum(input_splits) < permuted_tokens.shape[0].
The subsequent all_to_all_single call raises:

    RuntimeError: Split sizes doesn't match total dim 0 size
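The collapse is easy to see in isolation (a sketch with illustrative shapes; the scatter below stands in for the one-hot scatter described above, not the actual Megatron code):

    import torch

    num_tokens, topk, num_experts = 2, 3, 4
    top_indices = torch.tensor([[0, 1, 0],   # within-row duplicate from the filler
                                [2, 3, 1]])

    routing_map = torch.zeros(num_tokens, num_experts, dtype=torch.long)
    routing_map.scatter_(1, top_indices, 1)  # duplicate indices collapse to one entry

    print(routing_map.sum().item())  # 5, not num_tokens * topk == 6
    # input_splits derived from routing_map.sum(dim=0) total 5, while the
    # permuted buffer is sized for 6 rows, hence the split-size RuntimeError.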

This bug is intermittent: it depends on (a) which rows have any -1s,
which is a function of rollout-engine truncation/abort luck, and (b)
the topk / num_experts ratio (collisions are likelier when topk
approaches num_experts).

This change replaces the cyclic filler with a per-row complement-set
draw: for each row, pick the highest-scoring experts NOT already used
in that row, deterministically distinct, in score-rank order. By
construction, no within-row duplicate is ever produced.
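A vectorized sketch of the complement-set draw (assuming scores has shape [num_tokens, num_experts]; names are illustrative, not the exact committed code):

    import torch

    def fill_padding_from_complement(top_indices, scores):
        padding_mask = top_indices == -1
        valid = top_indices.clamp(min=0)

        # Count genuine picks per (row, expert). Padding slots contribute 0 at
        # index 0, so scatter_add_ stays correct under duplicate indices.
        used = torch.zeros_like(scores)
        used.scatter_add_(1, valid, (~padding_mask).to(scores.dtype))

        # Knock already-used experts out, then rank each row's complement by score.
        masked = scores.masked_fill(used > 0, torch.finfo(scores.dtype).min)
        comp_order = masked.argsort(dim=1, descending=True)

        # The k-th -1 slot in a row takes the k-th best unused expert.
        pad_rank = (padding_mask.long().cumsum(dim=1) - 1).clamp(min=0)
        fillers = comp_order.gather(1, pad_rank)
        return torch.where(padding_mask, fillers, top_indices)

Since topk <= num_experts, every row has at least as many unused experts as -1 slots, so the draw always succeeds and never repeats an expert within a row.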

Closes radixark#1002.


Development

Successfully merging this pull request may close these issues.

R3 replay: RuntimeError 'Split sizes doesn't match total dim 0 size' in Megatron all_to_all_single on MoE compute_log_prob
