Feature request
Recent RL approaches for training MoE models increasingly rely on Routing Replay, as described in the following papers:
- https://huggingface.co/papers/2507.18071
- https://huggingface.co/papers/2510.11370
- https://huggingface.co/papers/2512.01374
Without going into the training details, Routing Replay requires the ability to override the router during the forward pass, that is, to force the model to use a predefined set of router logits rather than computing new ones. This enables deterministic reproduction of expert selection.
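For concreteness, here is a rough sketch of what "overriding the router" means in a typical top-k MoE block. This is illustrative only: `gate`, `top_k`, and `forced_logits` are placeholder names, not the actual Transformers internals.

```python
import torch

def route(hidden_states, gate, top_k, forced_logits=None):
    # Compute fresh router logits unless a replayed set is supplied.
    router_logits = gate(hidden_states) if forced_logits is None else forced_logits
    routing_weights = torch.softmax(router_logits, dim=-1)
    # Expert selection is fully determined by the (possibly replayed) logits,
    # so injecting the same logits reproduces the same expert assignment.
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    return routing_weights, selected_experts
```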
AFAICT, Transformers currently does not expose a way to override router logits or manually control expert selection at inference/training time.
I imagine something along the following lines (minimal example):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507", device_map="auto", dtype="auto"
)
input_ids = torch.tensor([[1, 2, 3, 4]], device="cuda")

# Standard forward pass, retrieving router logits
outputs = model(input_ids, output_router_logits=True)

# Forward pass with router logits injected (enabling Routing Replay)
model(input_ids, router_logits=outputs.router_logits)
```

Alternative
If we decide not to implement this feature, it would be nice to provide an example showing how to patch a MoE model to enable this, for instance along the lines of the sketch below.
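A minimal, untested monkey-patching sketch, assuming a Mixtral-style MoE implementation where each sparse MoE block computes router logits via a `gate` nn.Linear submodule (as in the Qwen MoE models). `ReplayGate` and `patch_moe_gates` are hypothetical helpers, not part of Transformers:

```python
import torch
import torch.nn as nn

class ReplayGate(nn.Module):
    """Wraps a router `gate` Linear so it can record logits on one pass
    and replay them on a later pass."""

    def __init__(self, gate: nn.Linear):
        super().__init__()
        self.gate = gate
        self.recorded = []   # logits captured during the reference pass
        self.replay = None   # iterator over logits to force during replay

    def forward(self, hidden_states):
        if self.replay is not None:
            # Replay mode: ignore hidden_states and return the stored logits.
            return next(self.replay)
        logits = self.gate(hidden_states)  # normal routing
        self.recorded.append(logits.detach())
        return logits

def patch_moe_gates(model):
    """Wrap every `gate` Linear found on the model's modules.

    The `gate` attribute check is a heuristic for MoE router layers;
    collect targets first so we don't re-wrap our own wrappers while
    iterating over model.modules().
    """
    targets = [
        m for m in model.modules()
        if isinstance(getattr(m, "gate", None), nn.Linear)
    ]
    wrappers = []
    for m in targets:
        wrapper = ReplayGate(m.gate)
        m.gate = wrapper
        wrappers.append(wrapper)
    return wrappers
```

Usage would then look something like this (the replayed pass must see the same number of tokens as the recorded one; set `replay = None` to restore normal routing):

```python
wrappers = patch_moe_gates(model)
model(input_ids)                     # records router logits per MoE layer
for w in wrappers:
    w.replay = iter(w.recorded)      # arm replay
model(input_ids)                     # reuses the recorded routing decisions
```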
Motivation
See above.
Your contribution
I think I can do it.