XMUDeepLIT/DHPO

Dynamic Hybrid Policy Optimization

This repository contains the official code for the ACL 2026 Findings paper "Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR."

This implementation is built on top of the verl-0.5.0 release.

Dynamic Hybrid Policy Optimization (DHPO) bridges GRPO and GSPO within a single clipped surrogate objective. It combines token-level and sequence-level importance ratios through dynamic weighting, while using branch-specific clipping to stabilize optimization.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models on reasoning tasks. However, existing RLVR algorithms operate at different granularities, and each comes with complementary strengths and limitations. Group Relative Policy Optimization (GRPO) uses token-level importance ratios, preserving fine-grained credit assignment but often suffering from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio to all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment.

In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a unified clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We study two mixing variants: averaged mixing and entropy-guided mixing. To further stabilize training, we introduce branch-specific clipping, which constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update.

Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO.

What is included

This repository currently provides DHPO training recipes in verl, including:

  • DHPO with averaged mixing
  • DHPO with entropy-guided mixing
  • Branch-specific clipping for token-level and sequence-level ratio branches
  • Example training scripts for Qwen3-based math reasoning experiments

Data Preparation

We use the SimpleRL-Zoo-Data dataset (hkust-nlp/SimpleRL-Zoo-Data) from Hugging Face.

For the MATH level 3-5 split used here, you can download the files with:

mkdir -p ./data/math35
cd ./data/math35
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/train.parquet
wget https://huggingface.co/datasets/hkust-nlp/SimpleRL-Zoo-Data/resolve/main/simplelr_qwen_level3to5/test.parquet

Then set the dataset paths in your training script (relative to the directory from which you ran the commands above):

TRAIN_FILE=./data/math35/train.parquet
TEST_FILE=./data/math35/test.parquet

Quick Start

Step 1: Configure paths in the script

Edit the following placeholders in your training script:

WORKING_DIR=your_path
MODEL_PATH=/your_model_path/Qwen3-1.7B-base
CKPTS_DIR=/your_path/${exp_name}
LOG_PATH=/your_log_path
# dataset paths: TRAIN_FILE, TEST_FILE
# optionally set WANDB_API_KEY=... (or disable wandb)
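Because a mistyped path usually surfaces only after the job has started, it can help to check the configured paths before launching. The following is a minimal sketch, not part of the repository: `check_paths` is a hypothetical helper, and the variable names in the example call are the placeholders listed above.

```shell
#!/usr/bin/env bash
# Pre-flight check (illustrative sketch, not part of the DHPO recipes):
# confirm that every configured path exists before starting a long run.

# Hypothetical helper.
# Usage: check_paths NAME=PATH [NAME=PATH ...]
# Prints each missing path to stderr and returns 1 if any is missing.
check_paths() {
  local missing=0 entry name path
  for entry in "$@"; do
    name=${entry%%=*}   # text before the first '='
    path=${entry#*=}    # text after the first '='
    if [ ! -e "$path" ]; then
      echo "missing: $name -> $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, placed after the placeholders have been filled in:
# check_paths MODEL_PATH="$MODEL_PATH" TRAIN_FILE="$TRAIN_FILE" TEST_FILE="$TEST_FILE" || exit 1
```

The helper only tests for existence; it does not verify that the model directory or the parquet files are complete.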

Step 2: Launch training

Example:

bash /verl-0.5.0/recipe/dhpo/qwen3_1.7b_base/train_math35_token-mean_dhpo_entropy.sh

You can also use the corresponding dhpo_avg scripts under recipe/dhpo/ for the averaged-mixing variant.

Key DHPO Parameters

DHPO is activated in the training scripts with settings such as:

actor_rollout_ref.actor.policy_loss.loss_mode=dhpo_avg
actor_rollout_ref.actor.entropy_weight_type=minmax_sigmoid
+actor_rollout_ref.actor.clip_ratio_low_seq=0.2
+actor_rollout_ref.actor.clip_ratio_high_seq=0.28
actor_rollout_ref.actor.clip_ratio_low=0.2
actor_rollout_ref.actor.clip_ratio_high=0.28
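When switching between the two variants, it may be convenient to generate these overrides from a single argument. The sketch below is illustrative only: `dhpo_overrides` is a hypothetical helper, the keys and values are copied from the settings above, and the `+` prefix is Hydra's syntax for adding a key that is not already present in the base config.

```shell
#!/usr/bin/env bash
# Illustrative sketch: print the DHPO overrides for one mixing variant.

# Hypothetical helper, not part of the repository.
# Usage: dhpo_overrides dhpo_avg|dhpo_entropy
dhpo_overrides() {
  local mode=$1
  case "$mode" in
    dhpo_avg|dhpo_entropy) ;;
    *) echo "unknown loss_mode: $mode" >&2; return 1 ;;
  esac
  # One override per line; values mirror the example settings in this README.
  printf '%s\n' \
    "actor_rollout_ref.actor.policy_loss.loss_mode=${mode}" \
    "actor_rollout_ref.actor.entropy_weight_type=minmax_sigmoid" \
    "actor_rollout_ref.actor.clip_ratio_low=0.2" \
    "actor_rollout_ref.actor.clip_ratio_high=0.28" \
    "+actor_rollout_ref.actor.clip_ratio_low_seq=0.2" \
    "+actor_rollout_ref.actor.clip_ratio_high_seq=0.28"
}

# Show the overrides for the entropy-guided variant.
dhpo_overrides dhpo_entropy
```

The printed lines can then be appended to the launch command used inside the recipe scripts.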

Parameter meanings:

  • loss_mode=dhpo_entropy vs loss_mode=dhpo_avg: Selects DHPO with either entropy-guided mixing or averaged mixing between token-level and sequence-level importance ratios.

  • entropy_weight_type=minmax_sigmoid: Specifies the weight normalization strategy used in the entropy-guided mixing variant.

  • clip_ratio_low/high vs clip_ratio_low_seq/high_seq: Enables branch-specific clipping by defining separate clipping ranges for:

    • the token-level branch (clip_ratio_low/high)
    • the sequence-level branch (clip_ratio_low_seq/high_seq)

Keeping the two trust regions separate stabilizes DHPO updates: ratio outliers in one branch are clipped before mixing, so they cannot dominate the update.
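As a rough sketch of how the four clipping parameters relate to the two branches, the block below writes the token-level and sequence-level ratios in the standard GRPO and GSPO forms. It assumes verl's usual convention that a low/high pair defines the interval [1 - low, 1 + high]. The symbols, the mixing weight w_{i,t}, and the form of the mixing step are illustrative assumptions; the paper gives the exact objective.

```latex
% Token-level (GRPO-style) and sequence-level (GSPO-style) importance ratios
r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})},
\qquad
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}

% Branch-specific clipping: each ratio is constrained to its own trust region
% (clip_ratio_low/high for tokens, clip_ratio_low_seq/high_seq for sequences)
\tilde r_{i,t} = \mathrm{clip}\big(r_{i,t},\, 1 - \epsilon_{\mathrm{low}},\, 1 + \epsilon_{\mathrm{high}}\big),
\qquad
\tilde s_i = \mathrm{clip}\big(s_i,\, 1 - \epsilon^{\mathrm{seq}}_{\mathrm{low}},\, 1 + \epsilon^{\mathrm{seq}}_{\mathrm{high}}\big)

% Mixing after clipping (assumed form; w_{i,t} = 1/2 would correspond to averaged mixing)
m_{i,t} = w_{i,t}\, \tilde r_{i,t} + (1 - w_{i,t})\, \tilde s_i,
\qquad w_{i,t} \in [0, 1]
```

Under these assumptions, the example values (0.2 and 0.28 for both branches) keep each clipped ratio in [0.8, 1.28], so any convex combination of the two also stays in that interval.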

Citation

If you find this repository useful, please cite:

@misc{min2026dhpo,
  title={Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR},
  author={Zijun Min and Bingshuai Liu and Ante Wang and Long Zhang and Anxiang Zeng and Haibo Zhang and Jinsong Su},
  year={2026},
  eprint={2601.05607},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.05607},
}
