JustGRPO

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni¹, Shenzhi Wang¹, Yang Yue¹, Tianyu Yu², Weilin Zhao², Yeguo Hua³, Tianyi Chen³, Jun Song⁴, Cheng Yu⁴, Bo Zheng⁴, Gao Huang¹✉

¹LeapLab, Tsinghua University · ²NLPLab, Tsinghua University · ³Tsinghua University · ⁴Alibaba Group

No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

  • [2026.03] 🎉 Training code, evaluation scripts, and model checkpoints for MATH-500, HumanEval and MBPP datasets released!
  • [2026.01] 📄 Paper available on arXiv!
  • [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!

Why JustGRPO?

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which theoretically offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities inaccessible to standard AR models?

*(Figure: from the bypass mechanism to Pass@k)*

We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
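To make the bypass concrete, here is a toy sketch (not the paper's code; the distributions and "model" are invented for illustration) of confidence-based arbitrary-order decoding: the lowest-entropy positions are filled first, so a high-uncertainty branching token is deferred and the model commits to surrounding context before resolving it.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical per-position token distributions over a tiny vocabulary.
# Position 1 plays the role of a connective like "Therefore"/"Since":
# several continuations are plausible, so it is a branching point.
position_probs = {
    0: [0.97, 0.02, 0.01],  # near-deterministic token
    1: [0.40, 0.35, 0.25],  # high-uncertainty branching token
    2: [0.90, 0.06, 0.04],  # fairly confident token
}

# Confidence-based (arbitrary-order) fill order: lowest entropy first.
order = sorted(position_probs, key=lambda i: entropy(position_probs[i]))
print(order)  # the branching position is decoded last: [0, 2, 1]
```

Once positions 0 and 2 are committed, the conditional distribution at the branching position narrows, which is exactly the solution-space collapse described above.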

Our solution is simple: Since AR order preserves better reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.

Results

JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:

Accuracy (%) across generation lengths:

| Benchmark | Gen Length 128 | Gen Length 256 | Gen Length 512 |
|-----------|----------------|----------------|----------------|
| GSM8K     | 83.8           | 89.1           | 89.8           |
| MATH-500  | 39.0           | 45.1           | 45.2           |
| HumanEval | 37.8           | 49.4           | 48.7           |
| MBPP      | 50.6           | 52.4           | 49.0           |

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

| Challenge | Description |
|-----------|-------------|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths |
| Intractable likelihoods | ELBO-based surrogates instead of true objectives |
| Sampler–learner mismatch | Confidence-based samplers vs. the original diffusion prior |
  • JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
  • The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.

💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
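As a rough illustration of why the core fits in so few lines, here is a hedged sketch of a standard GRPO objective over one group of rollouts (a minimal standalone version, not the repo's `grpo.py`; the function names and example numbers are ours): group-normalize scalar rewards into advantages, then weight the exact AR sequence log-probabilities by them.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def grpo_loss(logprobs, rewards):
    """Negative advantage-weighted log-likelihood over one rollout group."""
    advs = grpo_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / len(rewards)

# Example group of 4 rollouts: two correct (reward 1), two wrong (reward 0).
# AR-mode decoding makes these sequence log-probs exact, no ELBO needed.
rewards = [1.0, 0.0, 1.0, 0.0]
logprobs = [-12.3, -15.1, -11.8, -14.0]
print(round(grpo_loss(logprobs, rewards), 3))  # -1.25
```

Minimizing this loss pushes probability mass toward above-average-reward rollouts; everything else (rollout sampling, batching) is plumbing.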

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

  • accelerate
  • transformers
  • datasets
  • Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.

Evaluation

Model checkpoints:

# --task: one of gsm8k / math500 / humaneval / mbpp
torchrun --nproc-per-node=8 eval.py \
  --task gsm8k \
  --ckpt_path /path/to/ckpt \
  --gen_length 256 --steps 256 --block_length 32

Training

Math (GSM8K / MATH-500):

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset gsm8k \
  --grad_accum 8
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset math \
  --grad_accum 8

Code (MBPP / HumanEval):

Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at datasets/acecode_hard.jsonl.

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset code \
  --code_data_path datasets/acecode_hard.jsonl \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.

Citation

If you find this work useful, please consider citing our paper.

@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We sincerely appreciate the authors for making their work open source.

About

Minimalist RL for Diffusion LLMs with SOTA reasoning performance (89.1% GSM8K). Official implementation of "The Flexibility Trap".
