The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Zanlin Ni1 Shenzhi Wang1 Yang Yue1 Tianyu Yu2 Weilin Zhao2 Yeguo Hua3
Tianyi Chen3 Jun Song4 Cheng Yu4 Bo Zheng4 Gao Huang1✉
1LeapLab, Tsinghua University 2NLPLab, Tsinghua University 3Tsinghua University 4Alibaba Group
No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.
Just GRPO.
- [2026.03] 🎉 Training code, evaluation scripts, and model checkpoints for the MATH-500, HumanEval, and MBPP datasets released!
- [2026.01] 📄 Paper available on arXiv!
- [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!
Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which in theory offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities that are inaccessible to standard AR models?
We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
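Here, Pass@k measures reasoning potential: the probability that at least one of k sampled solutions is correct. Below is a minimal sketch of the standard unbiased estimator (Chen et al., 2021), which we assume is the definition used; the function name is illustrative, not from this repo:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: completions sampled per problem
    c: number of correct completions among them
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct completion
    # Compute the complement as a numerically stable product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```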
Our solution is simple: since AR order better preserves reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.
JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:
| Benchmark | Gen Length 128 | Gen Length 256 | Gen Length 512 |
|---|---|---|---|
| GSM8K | 83.8 | 89.1 | 89.8 |
| MATH-500 | 39.0 | 45.1 | 45.2 |
| HumanEval | 37.8 | 49.4 | 48.7 |
| MBPP | 50.6 | 52.4 | 49.0 |
Existing RL methods for dLLMs typically have to grapple with the complexity of arbitrary-order generation:
| Challenge | Description |
|---|---|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths |
| Intractable likelihoods | ELBO-based surrogates instead of true objectives |
| Sampler-learner mismatch | Confidence-based samplers vs. original diffusion prior |
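The second row is the crux: for masked-diffusion models, the sequence log-likelihood admits only an ELBO-style bound (an upper bound on the NLL, left), whereas decoding in AR order factorizes it exactly (right). A sketch of the contrast, with notation ours; the bound follows the standard masked-diffusion form, where $\tilde{y}$ masks each token of $y$ independently with probability $t$:

$$
-\log p_\theta(y \mid x) \;\le\; \mathbb{E}_{t \sim U(0,1),\; \tilde{y}} \left[ \frac{1}{t} \sum_{i} \mathbf{1}\!\left[\tilde{y}_i = \texttt{[MASK]}\right] \bigl(-\log p_\theta(y_i \mid x, \tilde{y})\bigr) \right]
\qquad \text{vs.} \qquad
\log p_\theta(y \mid x) \;=\; \sum_{i=1}^{T} \log p_\theta(y_i \mid x, y_{<i}).
$$

RL objectives need $\log p_\theta(y \mid x)$ for each sampled response; the right-hand factorization computes it exactly, with no surrogate.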
- JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
- The core logic of JustGRPO (`grpo.py`) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it. (See the sketch after this list.)
💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
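To make this concrete, here is a minimal sketch of a GRPO update on AR-mode rollouts: rewards are standardized within a group of sampled completions, and the loss is built from exact token log-probabilities. Names and signature are ours for illustration; this is not the repo's `grpo.py`:

```python
import torch
import torch.nn.functional as F

def grpo_loss(logits, response_ids, response_mask, rewards,
              old_logp=None, clip_eps=0.2):
    """Illustrative GRPO step for a group of AR-mode rollouts.

    logits:        (G, T, V) next-token logits for G rollouts of one prompt
    response_ids:  (G, T)    sampled response tokens
    response_mask: (G, T)    1 on response tokens, 0 on padding
    rewards:       (G,)      scalar reward per rollout
    """
    # AR order gives the exact sequence likelihood,
    # log p(y|x) = sum_i log p(y_i | x, y_<i), so no ELBO surrogate is needed.
    token_logp = F.log_softmax(logits, dim=-1).gather(
        -1, response_ids.unsqueeze(-1)).squeeze(-1)            # (G, T)

    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # (G,)

    # PPO-style clipped surrogate; on the first update old_logp equals token_logp.
    old = token_logp.detach() if old_logp is None else old_logp
    ratio = torch.exp(token_logp - old)
    obj = torch.minimum(ratio * adv[:, None],
                        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv[:, None])

    # Negate and average over response tokens only.
    return -(obj * response_mask).sum() / response_mask.sum().clamp(min=1)
```

Because rollouts are sampled left-to-right, `token_logp` is the true log-likelihood under the policy, which is what lets vanilla GRPO apply without diffusion-specific corrections.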
JustGRPO is designed to be lightweight and dependency-minimal.
```bash
git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt
```

Dependencies:
- `accelerate`, `transformers`, `datasets`
- Standard evaluation utilities (`sympy`, `latex2sympy2`, etc.)
We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.
Model checkpoints:
- LLaDA-Instruct-JustGRPO-GSM8K (GSM8K)
- LLaDA-Instruct-JustGRPO-Math500 (MATH-500)
- LLaDA-Instruct-JustGRPO-Code (HumanEval & MBPP)
```bash
# --task options: gsm8k / math500 / humaneval / mbpp
torchrun --nproc-per-node=8 eval.py \
    --task gsm8k \
    --ckpt_path /path/to/ckpt \
    --gen_length 256 --steps 256 --block_length 32
```

Math (GSM8K / MATH-500):
```bash
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
    --dataset gsm8k \
    --grad_accum 8
```

```bash
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
    --dataset math \
    --grad_accum 8
```

Code (MBPP / HumanEval):
Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at `datasets/acecode_hard.jsonl`.
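A quick sanity check that the file is in place and parses as JSON Lines (a hypothetical snippet; nothing is assumed about the schema beyond one JSON object per line):

```python
import json

# Count examples in the downloaded AceCode-Hard file (one JSON object per line).
with open("datasets/acecode_hard.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]
print(f"Loaded {len(examples)} AceCode-Hard examples")
```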
```bash
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
    --dataset code \
    --code_data_path datasets/acecode_hard.jsonl \
    --grad_accum 8
```

Note: Keep the global batch size (`num_gpus` × `grad_accum`) at 64; e.g., 8 GPUs with `--grad_accum 8`, or 4 GPUs with `--grad_accum 16`.
If you find this work useful, please consider citing our paper.
```bibtex
@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}
```

This project builds upon the following excellent works:
We sincerely thank the authors for making their work open source.

