fix(replay): draw -1 padding fillers from per-row complement set (re-apply to main) #12

Open

DavidBellamy wants to merge 1 commit into main from fix/r3-padding-cycle-collision-to-main

Conversation

@DavidBellamy Collaborator

Cherry-picks the fix from #11 onto main.

#11 was merged into deploy on 2026-04-28. The nightly deploy rebuild at 09:00 UTC reconstructs deploy from radixark/miles:main plus open LLM360 PRs, so PRs closed as merged get dropped. As of today the fix is no longer present on main, deploy, or deploy-promoted: git merge-base --is-ancestor 87c99777 origin/{main,deploy,deploy-promoted} returns false for all three, and direct file inspection still shows arange(padding_mask.sum()) % scores.shape[1].

Landing on main makes the fix permanent against future rebuilds.

Closes radixark#1002 (again).

Original PR: #11
Fix commit: 3c88d7d

The previous routing-replay padding-replacement code:

    # Flat arange over every -1 slot, cycled mod num_experts:
    top_indices[padding_mask] = (
        torch.arange(padding_mask.sum(), ...) % scores.shape[1]
    )

walks a flat arange mod num_experts to fill -1 slots, ignoring the row
structure. For any row with one or more -1s, the cyclic filler can land
on an expert id that is already present in that same row's existing
topk picks, producing within-row duplicates.
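A minimal repro of the collision (hypothetical shapes and values, not the production code):

    import torch

    num_experts = 4
    top_indices = torch.tensor([[0, 1, -1],   # row 0 has one -1 slot
                                [2, 3, -1]])  # row 1 has one -1 slot

    padding_mask = top_indices == -1
    # The old flat-arange filler, cycled mod num_experts:
    top_indices[padding_mask] = (
        torch.arange(padding_mask.sum()) % num_experts
    )
    print(top_indices)
    # tensor([[0, 1, 0],   <- expert 0 duplicated within row 0
    #         [2, 3, 1]])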

Downstream, the router converts top_indices into a [num_tokens, num_experts]
routing_map via one-hot scatter, where duplicates within a row silently
collapse. As a result, routing_map.sum() < num_tokens * topk. The
MoEAlltoAllTokenDispatcher then computes input_splits from
routing_map.sum(dim=0) but uses num_out_tokens = num_tokens * topk for
the permuted buffer, so sum(input_splits) < permuted_tokens.shape[0].
The subsequent all_to_all_single call raises:

    RuntimeError: Split sizes doesn't match total dim 0 size
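The collapse is easy to see in isolation (a sketch with illustrative shapes; the scatter below stands in for the one-hot scatter described above, not the actual Megatron code):

    import torch

    num_tokens, topk, num_experts = 2, 3, 4
    top_indices = torch.tensor([[0, 1, 0],   # within-row duplicate from the filler
                                [2, 3, 1]])

    routing_map = torch.zeros(num_tokens, num_experts, dtype=torch.long)
    routing_map.scatter_(1, top_indices, 1)  # duplicate indices collapse to one entry

    print(routing_map.sum().item())  # 5, not num_tokens * topk == 6
    # input_splits derived from routing_map.sum(dim=0) total 5, while the
    # permuted buffer is sized for 6 rows, hence the split-size RuntimeError.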

This bug is intermittent: it depends on (a) which rows have any -1s,
which is a function of rollout-engine truncation/abort luck, and (b)
the topk / num_experts ratio (collisions are likelier when topk
approaches num_experts).

This change replaces the cyclic filler with a per-row complement-set
draw: for each row, pick the highest-scoring experts NOT already used
in that row, deterministically distinct, in score-rank order. By
construction, no within-row duplicate is ever produced.
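A vectorized sketch of the complement-set draw (assuming scores has shape [num_tokens, num_experts]; names are illustrative, not the exact committed code):

    import torch

    def fill_padding_from_complement(top_indices, scores):
        padding_mask = top_indices == -1
        valid = top_indices.clamp(min=0)

        # Count genuine picks per (row, expert). Padding slots contribute 0 at
        # index 0, so scatter_add_ stays correct under duplicate indices.
        used = torch.zeros_like(scores)
        used.scatter_add_(1, valid, (~padding_mask).to(scores.dtype))

        # Knock already-used experts out, then rank each row's complement by score.
        masked = scores.masked_fill(used > 0, torch.finfo(scores.dtype).min)
        comp_order = masked.argsort(dim=1, descending=True)

        # The k-th -1 slot in a row takes the k-th best unused expert.
        pad_rank = (padding_mask.long().cumsum(dim=1) - 1).clamp(min=0)
        fillers = comp_order.gather(1, pad_rank)
        return torch.where(padding_mask, fillers, top_indices)

Since topk <= num_experts, every row has at least as many unused experts as -1 slots, so the draw always succeeds and never repeats an expert within a row.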

Closes radixark#1002.


Development

Successfully merging this pull request may close these issues.

R3 replay: RuntimeError 'Split sizes doesn't match total dim 0 size' in Megatron all_to_all_single on MoE compute_log_prob
