Routing Replay for MoEs #42638

@qgallouedec

Description

Feature request

Recent RL approaches for training MoE models increasingly rely on Routing Replay, as described in several recent papers.

Without going into the training details, Routing Replay requires the ability to override the router during the forward pass, that is, to force the model to use a predefined set of router logits rather than computing new ones. This enables deterministic reproduction of expert selection.
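To make the requirement concrete, here is a minimal self-contained sketch (not Transformers code; the `ReplayableRouter` class and its `router_logits` argument are illustrative assumptions) of a top-k router that can either compute fresh logits or replay injected ones, reproducing the original expert selection:

```python
import torch

class ReplayableRouter(torch.nn.Module):
    """Toy top-k MoE router; accepts precomputed logits for routing replay."""

    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, hidden_states, router_logits=None):
        # If replay logits are injected, skip the gate and reuse them verbatim.
        if router_logits is None:
            router_logits = self.gate(hidden_states)
        weights = torch.softmax(router_logits, dim=-1)
        top_weights, top_idx = torch.topk(weights, self.top_k, dim=-1)
        return router_logits, top_idx, top_weights

torch.manual_seed(0)
router = ReplayableRouter(hidden_size=8, num_experts=4)

logits, idx, _ = router(torch.randn(3, 8))
# Replaying the saved logits reproduces the same expert selection, even
# though the hidden states differ (e.g. after a policy update in RL).
_, idx_replayed, _ = router(torch.randn(3, 8), router_logits=logits)
assert torch.equal(idx, idx_replayed)
```

The only change relative to a standard router is the optional `router_logits` argument, which is what this feature request asks Transformers to thread through the forward pass.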

AFAICT, Transformers currently does not expose a way to override router logits or manually control expert selection at inference/training time.

I imagine something along the following lines (minimal example):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507", device_map="auto", dtype="auto"
)

input_ids = torch.tensor([[1, 2, 3, 4]], device="cuda")

# Standard forward pass, retrieving router logits
outputs = model(input_ids, output_router_logits=True)

# Forward pass with router logits injected (enabling Routing Replay)
model(input_ids, router_logits=outputs.router_logits)
```

Alternative

If we decide not to implement this feature, it would be nice to provide an example showing how to patch a MoE to enable this.
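As a rough illustration of that alternative, the following sketch monkey-patches a gate to return cached logits. The `TinyMoeBlock` class is a hypothetical stand-in for a real sparse MoE block (in actual models the gate submodule's name and signature vary), so the attribute names here are assumptions, not the Transformers API:

```python
import torch

class TinyMoeBlock(torch.nn.Module):
    """Hypothetical stand-in for a sparse MoE block with a `gate` submodule."""

    def __init__(self, hidden_size=8, num_experts=4):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states):
        router_logits = self.gate(hidden_states)
        # ... expert dispatch would happen here in a real block ...
        return router_logits

def enable_routing_replay(block, cached_logits):
    """Patch the block's gate to return cached logits instead of computing new ones."""
    original_forward = block.gate.forward

    def replayed_forward(hidden_states):
        return cached_logits

    # Instance attribute shadows the bound method, so block.gate(x) now replays.
    block.gate.forward = replayed_forward
    return original_forward  # returned so the patch can be undone

torch.manual_seed(0)
block = TinyMoeBlock()
first_logits = block(torch.randn(3, 8))

restore = enable_routing_replay(block, first_logits)
replayed = block(torch.randn(3, 8))  # gate is bypassed, logits are replayed
assert torch.equal(replayed, first_logits)

block.gate.forward = restore  # undo the patch
```

A documented example along these lines would let users replay routing today, though a first-class `router_logits` argument would be far less fragile than patching per-model internals.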

Motivation

See above.

Your contribution

I think I can do it.
