Dynamic Hybrid Policy Optimization

This repository contains the official code for the ACL 2026 Findings paper "Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR."

This implementation is built on top of the verl-0.5.0 release.

Dynamic Hybrid Policy Optimization (DHPO) bridges GRPO and GSPO within a single clipped surrogate objective. It combines token-level and sequence-level importance ratios through dynamic weighting, while using branch-specific clipping to stabilize optimization.
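
The mixing idea can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: the function name `dhpo_mixed_ratios`, the hard-clipping form, and the per-token `weights` argument are assumptions made for illustration. The sequence-level ratio follows the GSPO convention of a length-normalized ratio, and each branch is clipped inside its own trust region before the two are mixed, as the abstract below describes.

```python
import math


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def dhpo_mixed_ratios(logp_new, logp_old, weights,
                      clip_low=0.2, clip_high=0.28,
                      clip_low_seq=0.2, clip_high_seq=0.28):
    """Return one mixed importance ratio per token of a single response.

    Hypothetical helper: `weights[t]` is the share given to the
    token-level branch at position t (0.5 everywhere is used here as a
    stand-in for averaged mixing).
    """
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]

    # Token-level branch (GRPO-style): one ratio per token.
    token_ratios = [math.exp(d) for d in log_ratios]

    # Sequence-level branch (GSPO-style): a single length-normalized
    # ratio shared by every token in the response.
    seq_ratio = math.exp(sum(log_ratios) / len(log_ratios))

    # Branch-specific clipping: each branch gets its own range,
    # applied before mixing.
    seq_clipped = clip(seq_ratio, 1 - clip_low_seq, 1 + clip_high_seq)
    mixed = []
    for r, w in zip(token_ratios, weights):
        tok_clipped = clip(r, 1 - clip_low, 1 + clip_high)
        mixed.append(w * tok_clipped + (1 - w) * seq_clipped)
    return mixed


# Toy per-token log-probabilities for a three-token response.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -1.0, -0.6]
ratios = dhpo_mixed_ratios(logp_new, logp_old, weights=[0.5, 0.5, 0.5])
```

In this toy input the second token's raw ratio is e^1 (about 2.72), far outside the token-level range, so it is clipped to 1.28 before mixing. Without the per-branch clip, that single outlier would dominate the mixed ratio.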

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models on reasoning tasks. However, existing RLVR algorithms operate at different granularities, and each comes with complementary strengths and limitations. Group Relative Policy Optimization (GRPO) uses token-level importance ratios, preserving fine-grained credit assignment but often suffering from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio to all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment.

In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a unified clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We study two mixing variants: averaged mixing and entropy-guided mixing. To further stabilize training, we introduce branch-specific clipping, which constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update.

Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO.

What is included

This repository currently provides DHPO training recipes in verl, including:

- DHPO with averaged mixing
- DHPO with entropy-guided mixing
- Branch-specific clipping for token-level and sequence-level ratio branches
- Example training scripts for Qwen3-based math reasoning experiments

Data Preparation

We use the SimpleRL-Zoo-Data dataset from Hugging Face:

Dataset page: hkust-nlp/SimpleRL-Zoo-Data
Download reference: hkust-nlp/simpleRL-reason

For the MATH level 3-5 split used here, you can download the files with:

mkdir -p ./data/math35
cd ./data/math35
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/train.parquet
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/test.parquet

After downloading, set the dataset paths in your training script:

TRAIN_FILE=./data/math35/train.parquet
TEST_FILE=./data/math35/test.parquet

Quick Start

Step 1: Configure paths in the script

Edit the following placeholders in your training script:

WORKING_DIR=your_path
MODEL_PATH=/your_model_path/Qwen3-1.7B-base
CKPTS_DIR=/your_path/${exp_name}
LOG_PATH=/your_log_path
# dataset paths: TRAIN_FILE, TEST_FILE
# optionally set WANDB_API_KEY=... (or disable wandb)

Step 2: Launch training

Example:

bash /verl-0.5.0/recipe/dhpo/qwen3_1.7b_base/train_math35_token-mean_dhpo_entropy.sh

You can also use the corresponding dhpo_avg scripts under recipe/dhpo/ for the averaged-mixing variant.

Key DHPO Parameters

DHPO is activated in the training scripts with settings such as:

actor_rollout_ref.actor.policy_loss.loss_mode=dhpo_avg
actor_rollout_ref.actor.entropy_weight_type=minmax_sigmoid
+actor_rollout_ref.actor.clip_ratio_low_seq=0.2
+actor_rollout_ref.actor.clip_ratio_high_seq=0.28
actor_rollout_ref.actor.clip_ratio_low=0.2
actor_rollout_ref.actor.clip_ratio_high=0.28

Parameter meanings:

loss_mode=dhpo_entropy vs loss_mode=dhpo_avg: Selects DHPO with either entropy-guided mixing or averaged mixing between token-level and sequence-level importance ratios.
entropy_weight_type=minmax_sigmoid: Specifies the weight normalization strategy used in the entropy-guided mixing variant.
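clip_ratio_low/high vs clip_ratio_low_seq/high_seq: Enables branch-specific clipping by defining separate clipping ranges for:
- the token-level branch (clip_ratio_low/high)
- the sequence-level branch (clip_ratio_low_seq/high_seq)

In the example above, the leading + on the sequence-level keys is Hydra override syntax for adding a key that the base config does not already define.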

Separate trust regions are important for stabilizing DHPO updates when either branch produces ratio outliers.
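
To give a feel for what an entropy-guided weight might look like, here is a minimal sketch of one plausible reading of the name minmax_sigmoid: min-max normalize the per-token entropies of a response, then squash them with a sigmoid. The function `minmax_sigmoid_weights`, its `scale` argument, and the choice that higher entropy gives more weight to the token-level branch are all assumptions for illustration. Check the recipe code under recipe/dhpo/ for the actual definition.

```python
import math


def minmax_sigmoid_weights(entropies, scale=1.0):
    """Hypothetical entropy-guided weights in (0, 1), one per token.

    Assumed reading of `minmax_sigmoid`, not the repository's code:
    min-max normalize entropies to [0, 1], center at 0.5, apply sigmoid.
    """
    lo, hi = min(entropies), max(entropies)
    span = hi - lo
    weights = []
    for h in entropies:
        # Min-max normalization; if all entropies are equal, fall back
        # to the midpoint so the weight becomes exactly 0.5.
        z = (h - lo) / span if span > 0 else 0.5
        # Sigmoid keeps the weight strictly between 0 and 1.
        weights.append(1.0 / (1.0 + math.exp(-scale * (z - 0.5))))
    return weights


# Toy per-token entropies for a three-token response.
weights = minmax_sigmoid_weights([0.1, 0.5, 2.0])
```

Under these assumptions, a response with uniform entropy gets a weight of 0.5 at every token, which reduces to equal mixing of the two branches.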

Citation

If you find this repository useful, please cite:

@misc{min2026dhpo,
  title={Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR},
  author={Zijun Min and Bingshuai Liu and Ante Wang and Long Zhang and Anxiang Zeng and Haibo Zhang and Jinsong Su},
  year={2026},
  eprint={2601.05607},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.05607},
}
