JustGRPO

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni¹, Shenzhi Wang¹, Yang Yue¹, Tianyu Yu², Weilin Zhao², Yeguo Hua³, Tianyi Chen³, Jun Song⁴, Cheng Yu⁴, Bo Zheng⁴, Gao Huang¹✉

¹LeapLab, Tsinghua University · ²NLPLab, Tsinghua University · ³Tsinghua University · ⁴Alibaba Group

No combinatorial trajectories. No ELBO approximations. No diffusion-specific adaptations.

Just GRPO.

📢 News

  • [2026.03] 🎉 Training code, evaluation scripts, and model checkpoints for MATH-500, HumanEval and MBPP datasets released!
  • [2026.01] 📄 Paper available on arXiv!
  • [2026.01] 🎉 Training code, evaluation scripts, and model checkpoint on GSM8K released!

Why JustGRPO?

Diffusion LLMs (dLLMs) can generate tokens in arbitrary order, which theoretically offers more flexibility than standard left-to-right generation. But does this flexibility actually unlock unique reasoning capabilities inaccessible to standard AR models?

*(Figure: from the bypass mechanism to Pass@k)*

We found the opposite. Arbitrary-order generation allows models to bypass high-uncertainty tokens (e.g., "Therefore", "Since") — the very tokens that create branching points in reasoning. This premature bypass collapses the solution space, leading to lower reasoning potential (Pass@k).
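To make the bypass concrete, here is a toy sketch (not the paper's code; the distributions and "model" are invented for illustration) of confidence-based arbitrary-order decoding: the lowest-entropy positions are filled first, so a high-uncertainty branching token is deferred and the model commits to surrounding context before resolving it.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical per-position token distributions over a tiny vocabulary.
# Position 1 plays the role of a connective like "Therefore"/"Since":
# several continuations are plausible, so it is a branching point.
position_probs = {
    0: [0.97, 0.02, 0.01],  # near-deterministic token
    1: [0.40, 0.35, 0.25],  # high-uncertainty branching token
    2: [0.90, 0.06, 0.04],  # fairly confident token
}

# Confidence-based (arbitrary-order) fill order: lowest entropy first.
order = sorted(position_probs, key=lambda i: entropy(position_probs[i]))
print(order)  # the branching position is decoded last: [0, 2, 1]
```

Once positions 0 and 2 are committed, the conditional distribution at the branching position narrows, which is exactly the solution-space collapse described above.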

Our solution is simple: Since AR order preserves better reasoning potential, we just train dLLMs with standard GRPO in AR mode. No bells and whistles.

Results

JustGRPO achieves state-of-the-art performance across reasoning and coding benchmarks:

Accuracy (%) across generation lengths:

| Benchmark | Gen Length 128 | Gen Length 256 | Gen Length 512 |
|-----------|----------------|----------------|----------------|
| GSM8K     | 83.8           | 89.1           | 89.8           |
| MATH-500  | 39.0           | 45.1           | 45.2           |
| HumanEval | 37.8           | 49.4           | 48.7           |
| MBPP      | 50.6           | 52.4           | 49.0           |

Simplicity

Existing RL methods for dLLMs often require handling the complexity of arbitrary-order generation:

| Challenge | Description |
|-----------|-------------|
| Combinatorial trajectories | Optimizing over factorial-sized denoising paths |
| Intractable likelihoods | ELBO-based surrogates instead of true objectives |
| Sampler–learner mismatch | Confidence-based samplers vs. the original diffusion prior |
  • JustGRPO sidesteps all of this by treating dLLMs as autoregressive models during RL training. The result? Standard GRPO, directly applicable, with exact likelihood computation.
  • The core logic of JustGRPO (grpo.py) fits in ~60 lines: rollout sampling and log-probability loss computation. That's it.

💡 The model still retains parallel decoding at inference time — we only use AR order during training. See our paper for more details.
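As a rough illustration of why the core fits in so few lines, here is a hedged sketch of a standard GRPO objective over one group of rollouts (a minimal standalone version, not the repo's `grpo.py`; the function names and example numbers are ours): group-normalize scalar rewards into advantages, then weight the exact AR sequence log-probabilities by them.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def grpo_loss(logprobs, rewards):
    """Negative advantage-weighted log-likelihood over one rollout group."""
    advs = grpo_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / len(rewards)

# Example group of 4 rollouts: two correct (reward 1), two wrong (reward 0).
# AR-mode decoding makes these sequence log-probs exact, no ELBO needed.
rewards = [1.0, 0.0, 1.0, 0.0]
logprobs = [-12.3, -15.1, -11.8, -14.0]
print(round(grpo_loss(logprobs, rewards), 3))  # -1.25
```

Minimizing this loss pushes probability mass toward above-average-reward rollouts; everything else (rollout sampling, batching) is plumbing.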

Installation

JustGRPO is designed to be lightweight and dependency-minimal.

git clone https://github.com/LeapLabTHU/JustGRPO.git
cd JustGRPO
pip install -r requirements.txt

Dependencies:

  • accelerate
  • transformers
  • datasets
  • Standard evaluation utilities (sympy, latex2sympy2, etc.)

Usage

We provide evaluation and training code for GSM8K, MATH-500, HumanEval, and MBPP.

Evaluation

Model checkpoints:

# --task: one of gsm8k / math500 / humaneval / mbpp
torchrun --nproc-per-node=8 eval.py \
  --task gsm8k \
  --ckpt_path /path/to/ckpt \
  --gen_length 256 --steps 256 --block_length 32

Training

Math (GSM8K / MATH-500):

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset gsm8k \
  --grad_accum 8
accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset math \
  --grad_accum 8

Code (MBPP / HumanEval):

Code training uses the AceCode-Hard subset, following ml-diffucoder. You can download the dataset here: AceCode-Hard (Google Drive). Place the downloaded file at datasets/acecode_hard.jsonl.

accelerate launch --num_processes 8 --config_file configs/fsdp.yaml train.py \
  --dataset code \
  --code_data_path datasets/acecode_hard.jsonl \
  --grad_accum 8

Note: Keep global batch size = num_gpus × grad_accum = 64.

Citation

If you find this work useful, please consider citing our paper.

@article{ni2026flexibility,
  title={The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models},
  author={Ni, Zanlin and Wang, Shenzhi and Yue, Yang and Yu, Tianyu and Zhao, Weilin and Hua, Yeguo and Chen, Tianyi and Song, Jun and Yu, Cheng and Zheng, Bo and Huang, Gao},
  journal={arXiv preprint arXiv:2601.15165},
  year={2026}
}

Acknowledgments

This project builds upon the following excellent works:

We sincerely appreciate the authors for making their work open source.

About

Minimalist RL for Diffusion LLMs with SOTA reasoning performance (89.1% GSM8K). Official implementation of "The Flexibility Trap".
