Feature request
Recent RL approaches for training MoE models increasingly rely on Routing Replay, as described in the following papers:
- https://huggingface.co/papers/2507.18071
- https://huggingface.co/papers/2510.11370
- https://huggingface.co/papers/2512.01374
Without going into the training details, Routing Replay requires the ability to override the router during the forward pass, that is, to force the model to use a predefined set of router logits rather than computing new ones. This enables deterministic reproduction of expert selection.
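For concreteness, here is a rough sketch of what "overriding the router" means in a typical top-k MoE block. This is illustrative only: `gate`, `top_k`, and `forced_logits` are placeholder names, not the actual Transformers internals.

```python
import torch

def route(hidden_states, gate, top_k, forced_logits=None):
    # Compute fresh router logits unless a replayed set is supplied.
    router_logits = gate(hidden_states) if forced_logits is None else forced_logits
    routing_weights = torch.softmax(router_logits, dim=-1)
    # Expert selection is fully determined by the (possibly replayed) logits,
    # so injecting the same logits reproduces the same expert assignment.
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    return routing_weights, selected_experts
```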
AFAICT, Transformers currently does not expose a way to override router logits or manually control expert selection at inference/training time.
I imagine something along the following lines (minimal example):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Instruct-2507", device_map="auto", dtype="auto"
)
input_ids = torch.tensor([[1, 2, 3, 4]], device="cuda")

# Standard forward pass, retrieving router logits
outputs = model(input_ids, output_router_logits=True)

# Forward pass with router logits injected (enabling Routing Replay)
model(input_ids, router_logits=outputs.router_logits)
```

Alternative
If we decide not to implement this feature, it would be nice to provide an example showing how to patch a MoE model to enable this, for instance along the lines of the sketch below.
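A minimal, untested monkey-patching sketch, assuming a Mixtral-style MoE implementation where each sparse MoE block computes router logits via a `gate` nn.Linear submodule (as in the Qwen MoE models). `ReplayGate` and `patch_moe_gates` are hypothetical helpers, not part of Transformers:

```python
import torch
import torch.nn as nn

class ReplayGate(nn.Module):
    """Wraps a router `gate` Linear so it can record logits on one pass
    and replay them on a later pass."""

    def __init__(self, gate: nn.Linear):
        super().__init__()
        self.gate = gate
        self.recorded = []   # logits captured during the reference pass
        self.replay = None   # iterator over logits to force during replay

    def forward(self, hidden_states):
        if self.replay is not None:
            # Replay mode: ignore hidden_states and return the stored logits.
            return next(self.replay)
        logits = self.gate(hidden_states)  # normal routing
        self.recorded.append(logits.detach())
        return logits

def patch_moe_gates(model):
    """Wrap every `gate` Linear found on the model's modules.

    The `gate` attribute check is a heuristic for MoE router layers;
    collect targets first so we don't re-wrap our own wrappers while
    iterating over model.modules().
    """
    targets = [
        m for m in model.modules()
        if isinstance(getattr(m, "gate", None), nn.Linear)
    ]
    wrappers = []
    for m in targets:
        wrapper = ReplayGate(m.gate)
        m.gate = wrapper
        wrappers.append(wrapper)
    return wrappers
```

Usage would then look something like this (the replayed pass must see the same number of tokens as the recorded one; set `replay = None` to restore normal routing):

```python
wrappers = patch_moe_gates(model)
model(input_ids)                     # records router logits per MoE layer
for w in wrappers:
    w.replay = iter(w.recorded)      # arm replay
model(input_ids)                     # reuses the recorded routing decisions
```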
Motivation
See above.
Your contribution
I think I can do it.