This repository contains the official code for the ACL 2026 Findings paper "Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR."
This implementation is built on top of the verl-0.5.0 release.
Dynamic Hybrid Policy Optimization (DHPO) bridges GRPO and GSPO within a single clipped surrogate objective. It combines token-level and sequence-level importance ratios through dynamic weighting, while using branch-specific clipping to stabilize optimization.
Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models on reasoning tasks. However, existing RLVR algorithms operate at different granularities, and each comes with complementary strengths and limitations. Group Relative Policy Optimization (GRPO) uses token-level importance ratios, preserving fine-grained credit assignment but often suffering from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio to all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment.
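To make the two granularities concrete, the following minimal sketch computes both kinds of importance ratio from per-token log-probabilities. The function names are illustrative and do not come from this repository. The length-normalized (geometric-mean) form of the sequence-level ratio is an assumption based on how GSPO is usually formulated.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """GSPO-style: a single ratio shared by every token in the response.

    Assumption: length-normalized, i.e. exp(mean of per-token log-ratios).
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical per-token log-probabilities for a 3-token response.
new_logps = [-1.0, -0.5, -2.0]
old_logps = [-1.2, -0.4, -2.0]

tok = token_level_ratios(new_logps, old_logps)   # one value per token
seq = sequence_level_ratio(new_logps, old_logps)  # one value per response
print(tok, seq)
```

Under this assumption the sequence-level ratio is the geometric mean of the token-level ratios, which is why it varies less from token to token but cannot distinguish individual tokens.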
In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a unified clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We study two mixing variants: averaged mixing and entropy-guided mixing. To further stabilize training, we introduce branch-specific clipping, which constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update.
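As a rough illustration of branch-specific clipping followed by mixing, the sketch below clips the token-level and sequence-level ratios in separate ranges and then combines them with a weight w. It is a simplified reading of the description above, not the paper's exact objective. The function names, the convex-combination form, and the interpretation of averaged mixing as w = 0.5 are assumptions, and the pessimistic min against the unclipped term used in PPO-style surrogates is omitted for brevity. The default clip ranges mirror the example settings in the training scripts.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def dhpo_mixed_ratios(token_ratios, seq_ratio, w=0.5,
                      eps_low=0.2, eps_high=0.28,
                      eps_low_seq=0.2, eps_high_seq=0.28):
    """Clip each branch in its own trust region, then mix (illustrative only).

    w is the weight on the token-level branch; w=0.5 is assumed to
    correspond to averaged mixing.
    """
    # Sequence-level branch: one ratio per response, clipped once.
    seq_clipped = clip(seq_ratio, 1.0 - eps_low_seq, 1.0 + eps_high_seq)
    mixed = []
    for r in token_ratios:
        # Token-level branch: clipped per token with its own range.
        tok_clipped = clip(r, 1.0 - eps_low, 1.0 + eps_high)
        mixed.append(w * tok_clipped + (1.0 - w) * seq_clipped)
    return mixed


# Hypothetical ratios: the first and last token ratios are outliers.
token_ratios = [0.5, 1.0, 3.0]
seq_ratio = 1.1
mixed = dhpo_mixed_ratios(token_ratios, seq_ratio)
print(mixed)
```

Because each branch is clipped before mixing, the outlier token ratio of 3.0 enters the mix as at most 1.28, so neither branch can dominate the combined ratio.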
Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO.
This repository currently provides DHPO training recipes in verl, including:
- DHPO with averaged mixing
- DHPO with entropy-guided mixing
- Branch-specific clipping for token-level and sequence-level ratio branches
- Example training scripts for Qwen3-based math reasoning experiments
We use the SimpleRL-Zoo-Data dataset from Hugging Face:
- Dataset page: hkust-nlp/SimpleRL-Zoo-Data
- Download reference: hkust-nlp/simpleRL-reason
For the MATH level 3-5 split used here, you can download the files with:
mkdir -p ./data/math35
cd ./data/math35
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/train.parquet
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/test.parquet
Or use the following paths:
TRAIN_FILE=./data/math35/train.parquet
TEST_FILE=./data/math35/test.parquet
Edit the following placeholders in your training script:
WORKING_DIR=your_path
MODEL_PATH=/your_model_path/Qwen3-1.7B-base
CKPTS_DIR=/your_path/${exp_name}
LOG_PATH=/your_log_path
# dataset paths: TRAIN_FILE, TEST_FILE
# optionally set WANDB_API_KEY=... (or disable wandb)
Example:
bash /verl-0.5.0/recipe/dhpo/qwen3_1.7b_base/train_math35_token-mean_dhpo_entropy.sh
You can also use the corresponding dhpo_avg scripts under recipe/dhpo/ for the averaged-mixing variant.
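Before launching a run, it can be worth confirming that the downloaded dataset files are real Parquet files rather than, for example, an HTML error page saved by wget. The following stdlib-only sketch checks for the 4-byte PAR1 magic at the start and end of each file. The helper name is illustrative, and the paths assume the ./data/math35 layout from the download step.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    p = Path(path)
    # A valid Parquet file holds at least two magics plus a footer length.
    if not p.is_file() or p.stat().st_size < 12:
        return False
    with p.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 == os.SEEK_END: last four bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    for name in ("train.parquet", "test.parquet"):
        path = Path("./data/math35") / name
        status = "ok" if looks_like_parquet(path) else "missing or not a Parquet file"
        print(f"{path}: {status}")
```

This only checks the file framing, not the schema or contents, so it catches truncated or failed downloads but nothing subtler.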
DHPO is activated in the training scripts with settings such as:
actor_rollout_ref.actor.policy_loss.loss_mode=dhpo_avg
actor_rollout_ref.actor.entropy_weight_type=minmax_sigmoid
+actor_rollout_ref.actor.clip_ratio_low_seq=0.2
+actor_rollout_ref.actor.clip_ratio_high_seq=0.28
actor_rollout_ref.actor.clip_ratio_low=0.2
actor_rollout_ref.actor.clip_ratio_high=0.28
Parameter meanings:
- loss_mode=dhpo_entropy vs loss_mode=dhpo_avg: Selects DHPO with either entropy-guided mixing or averaged mixing between token-level and sequence-level importance ratios.
- entropy_weight_type=minmax_sigmoid: Specifies the weight normalization strategy used in the entropy-guided mixing variant.
- clip_ratio_low/high vs clip_ratio_low_seq/high_seq: Enables branch-specific clipping by defining separate clipping ranges for:
  - the token-level branch (clip_ratio_low/high)
  - the sequence-level branch (clip_ratio_low_seq/high_seq)
Separate trust regions are important for stabilizing DHPO updates when either branch produces ratio outliers.
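The name minmax_sigmoid suggests min-max normalization followed by a sigmoid. The sketch below shows one plausible way such a mixing weight could be computed from per-token entropies. It is a guess at the idea, not this repository's implementation: the function name, the temperature parameter, and the choice to give higher-entropy tokens a larger weight are all assumptions.

```python
import math


def minmax_sigmoid_weights(entropies, temperature=1.0, eps=1e-8):
    """Map per-token entropies to mixing weights in (0, 1) (hypothetical)."""
    lo, hi = min(entropies), max(entropies)
    weights = []
    for h in entropies:
        # Min-max normalize to [0, 1]; eps guards against hi == lo.
        z = (h - lo) / (hi - lo + eps)
        # Center at 0.5 so the mid-range entropy maps to a weight near 0.5.
        centered = (z - 0.5) / temperature
        weights.append(1.0 / (1.0 + math.exp(-centered)))
    return weights


# Hypothetical per-token entropies for a 3-token response.
entropies = [0.1, 0.9, 0.5]
weights = minmax_sigmoid_weights(entropies)
print(weights)
```

A weight produced this way could serve as the per-token mixing coefficient between the two clipped branches, with a smaller temperature pushing the weights closer to 0 and 1.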
If you find this repository useful, please cite:
@misc{min2026dhpo,
title={Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR},
author={Zijun Min and Bingshuai Liu and Ante Wang and Long Zhang and Anxiang Zeng and Haibo Zhang and Jinsong Su},
year={2026},
eprint={2601.05607},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.05607},
}