Add HPSv2 reward model support by MikukuOvO · Pull Request #8 · Rockdu/miles

MikukuOvO · 2026-04-26T15:32:15Z

1. Change Summary

Files changed

miles/rollout/rm_hub/__init__.py
miles/rollout/rm_hub/hps.py
miles/utils/arguments.py
requirements.txt
scripts/run-diffusion-grpo-hps-smoke.sh

What changed

miles/rollout/rm_hub/__init__.py
- Registers rm_type=hps in the reward model dispatch path.
- Uses the batched HPS path when every sample in a batch requests HPS.
- Keeps the OCR reward path lazily imported, so non-OCR reward paths still import and run when PaddleOCR itself fails to import.
miles/rollout/rm_hub/hps.py
- Adds an HPS / HPSv2.1 reward scorer for diffusion rollout samples.
- Converts rollout tensors from [C, F, H, W] into RGB uint8 HWC images before scoring.
- Loads the HPS ViT-H-14 model through hpsv2.src.open_clip.create_model_and_transforms, then loads the HPSv2 checkpoint weights.
- Computes the reward as the diagonal of image_features @ text_features.T, matching the DanceGRPO HPSv2 reward formula.
- Runs scoring through a Ray actor pool so rollout reward inference can be batched and isolated from the training process.
miles/utils/arguments.py
- Adds HPS runtime knobs: number of workers, GPU resources per worker, batch size, HPS version, and optional local checkpoint path.
requirements.txt
- Adds the hpsv2 runtime dependency.
scripts/run-diffusion-grpo-hps-smoke.sh
- Adds a focused diffusion GRPO smoke script that selects --rm-type hps and wires the HPS-specific runtime arguments.
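
The tensor-to-image step listed for hps.py can be illustrated roughly as follows. This is only a sketch under assumptions the PR does not state: pixel values are floats in [0, 1], a single frame (the first by default) is scored, and NumPy stands in for the rollout tensor type. The helper name `cfhw_to_rgb_uint8` is hypothetical and is not taken from the PR.

```python
import numpy as np


def cfhw_to_rgb_uint8(video: np.ndarray, frame: int = 0) -> np.ndarray:
    """Convert a [C, F, H, W] float array into one HWC uint8 RGB image.

    Assumptions (not confirmed by the PR): values lie in [0, 1] and one
    frame, selected by `frame`, is the image that gets scored.
    """
    if video.ndim != 4 or video.shape[0] != 3:
        raise ValueError(f"expected shape [3, F, H, W], got {video.shape}")
    chw = video[:, frame]                    # select one frame -> [C, H, W]
    hwc = np.transpose(chw, (1, 2, 0))       # channels last   -> [H, W, C]
    scaled = np.clip(hwc, 0.0, 1.0) * 255.0  # clamp, then map to 0..255
    return np.round(scaled).astype(np.uint8)
```

If the rollout tensors were instead in [-1, 1], the values would need to be shifted and rescaled before the clip.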

2. Validation

I cloned the official DanceGRPO repo and used its HPSv2.1 reward implementation as the reference:

- repo: https://github.com/XueZeyue/DanceGRPO
- reference code path: fastvideo/train_grpo_qwenimage.py
- reference formula: preprocess image, tokenize prompt, run HPS ViT-H-14, then use torch.diagonal(image_features @ text_features.T) as reward.
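
The diagonal in the reference formula picks out the score of each image against its own prompt. A small sketch with toy features, using NumPy in place of torch and the hypothetical helper name `paired_rewards`, shows the idea:

```python
import numpy as np


def paired_rewards(image_features: np.ndarray, text_features: np.ndarray) -> np.ndarray:
    """Return one score per (image_i, text_i) pair.

    Both inputs have shape [N, D]. The full product has shape [N, N];
    only entry (i, i) pairs image i with its own prompt.
    """
    similarity = image_features @ text_features.T  # [N, N]
    return np.diagonal(similarity)


# Toy unit-length feature vectors (illustrative values, not HPS outputs).
img = np.array([[1.0, 0.0], [0.6, 0.8]])
txt = np.array([[0.0, 1.0], [0.6, 0.8]])

rewards = paired_rewards(img, txt)
# The same numbers as a row-wise dot product, without the off-diagonal entries.
rowwise = np.sum(img * txt, axis=1)
```

With a batch size of 1 the product is a 1x1 matrix, so the diagonal is simply that single similarity.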

I then ran one focused reward-alignment test on GPU 0 in /root/miniconda3/envs/miles-rollout-test.

The test feeds the same three fixed prompt/image pairs to each of three paths and compares the resulting scores:

- DanceGRPO HPSv2.1 reference implementation.
- MILES direct HPSScorer.
- MILES Ray-backed hps_rm path with hps_batch_size=1, matching DanceGRPO's per-sample reward call granularity.

3. Experiment Report

idx	image	dancegrpo	miles_scorer	miles_rm	scorer_diff	rm_diff	aligned
0	`cat.png`	0.3012695312	0.3012695312	0.3012695312	0.000e+00	0.000e+00	yes
1	`test.jpg`	0.1445312500	0.1445312500	0.1445312500	0.000e+00	0.000e+00	yes
2	`flow_grpo_fast.png`	0.1787109375	0.1787109375	0.1787109375	0.000e+00	0.000e+00	yes

Result:

raw_max_abs_diff_direct=0.000e+00
raw_max_abs_diff_rm=0.000e+00

The MILES direct scorer and Ray-backed reward path match the DanceGRPO HPSv2.1 reward exactly for these fixed inputs when using the same per-sample reward granularity.
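
For reference, the two summary numbers can be recomputed from the table rows. This is a sketch of the arithmetic only, not the script used in the PR; the helper name `max_abs_diff` is hypothetical.

```python
# Scores copied from the experiment table above.
dancegrpo = [0.3012695312, 0.1445312500, 0.1787109375]
miles_scorer = [0.3012695312, 0.1445312500, 0.1787109375]
miles_rm = [0.3012695312, 0.1445312500, 0.1787109375]


def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two score lists."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


raw_max_abs_diff_direct = max_abs_diff(dancegrpo, miles_scorer)
raw_max_abs_diff_rm = max_abs_diff(dancegrpo, miles_rm)

print(f"raw_max_abs_diff_direct={raw_max_abs_diff_direct:.3e}")
print(f"raw_max_abs_diff_rm={raw_max_abs_diff_rm:.3e}")
```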

Note: when multiple samples are scored in one AMP batch, HPS can show tiny BF16-level differences around 1.221e-04 versus DanceGRPO's per-sample path. The validation above matches DanceGRPO's actual per-sample scoring path.
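
To put the 1.221e-04 figure in scale, the sketch below computes the gap between adjacent representable values near one of the table scores for two 16-bit float formats. It is plain arithmetic on the format definitions (10 explicit mantissa bits for float16, 7 for bfloat16) and ignores subnormals; it does not determine which dtype the AMP path actually used. The helper name `spacing` is hypothetical.

```python
import math


def spacing(x: float, mantissa_bits: int) -> float:
    """Gap between adjacent normal floats near x for a binary format
    with the given number of explicit mantissa bits."""
    _, exp = math.frexp(abs(x))  # abs(x) == m * 2**exp with 0.5 <= m < 1
    return math.ldexp(1.0, exp - 1 - mantissa_bits)


score = 0.14453125             # second row of the experiment table
fp16_gap = spacing(score, 10)  # 2**-13
bf16_gap = spacing(score, 7)   # 2**-10

print(f"float16 gap near {score}:  {fp16_gap:.3e}")
print(f"bfloat16 gap near {score}: {bf16_gap:.3e}")
```

The reported difference is numerically close to 2**-13, that is, a single rounding step at this magnitude rather than an accumulated error.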

4. Test Plan

- Add a lightweight unit test for rollout tensor-to-RGB conversion and HPS argument dispatch that does not require downloading HPS weights.
- Add an optional GPU/nightly HPS alignment test using cached HPSv2.1 weights, comparing DanceGRPO reference output, MILES direct HPSScorer, and Ray-backed hps_rm with max_abs_diff <= 1e-6.
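
A weight-free dispatch test could look roughly like the sketch below. Everything here is a hypothetical stand-in: the `Sample` class, the `use_batched_hps` helper, the per-sample `rm_type` override, and the `"ocr"` type name are assumptions made for illustration, not code from the PR. It only encodes the stated rule that the batched HPS path is used when every sample in a batch requests HPS.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """Hypothetical stand-in for a rollout sample; only rm_type matters here."""
    rm_type: Optional[str] = None  # None means "fall back to the global default"


def use_batched_hps(samples: Sequence[Sample], default_rm_type: str) -> bool:
    """True only if the batch is non-empty and every sample resolves to 'hps'."""
    if not samples:
        return False
    return all((s.rm_type or default_rm_type) == "hps" for s in samples)


def test_all_hps_uses_batched_path():
    batch = [Sample("hps"), Sample()]  # second sample inherits the default
    assert use_batched_hps(batch, default_rm_type="hps")


def test_mixed_batch_falls_back():
    batch = [Sample("hps"), Sample("ocr")]
    assert not use_batched_hps(batch, default_rm_type="hps")
```

Because nothing here loads a model, a test of this shape can run in CI without network access or a GPU.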

Wires in HPS / HPSv2.1 (ViT-H-14 + xswu/HPSv2 checkpoint) as a second reward model alongside PickScore. A training run selects it via --rm-type hps and --hps-version v2.1 (default). Includes a smoke script and a standalone compare_reward_models.py harness for PickScore vs HPS correlation checks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
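
As an illustration of how such flags might be declared, here is a minimal argparse sketch. Only `--rm-type`, `--hps-version`, and its `v2.1` default come from the text above; the `--hps-batch-size` spelling (inferred from `hps_batch_size`), its default of 1, and the function name `add_hps_arguments` are assumptions, and the remaining HPS knobs are omitted because their names are not given.

```python
import argparse


def add_hps_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register a subset of HPS-related flags on an existing parser."""
    parser.add_argument("--rm-type", type=str, default=None,
                        help="Reward model type, e.g. 'hps'.")
    parser.add_argument("--hps-version", type=str, default="v2.1",
                        help="HPS checkpoint version.")
    # Assumed flag spelling and default; the PR only names hps_batch_size.
    parser.add_argument("--hps-batch-size", type=int, default=1,
                        help="Samples scored per HPS forward pass.")
    return parser


parser = add_hps_arguments(argparse.ArgumentParser())
args = parser.parse_args(["--rm-type", "hps"])
print(args.rm_type, args.hps_version, args.hps_batch_size)
```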

MikukuOvO and others added 11 commits April 20, 2026 17:04

add pickscore reward model support

03680b2

remove pickscore dtype option

e47acf0

simplify pickscore reward adapter

f96f818

Align PickScore smoke with rollout API

3ae403b

Scope PickScore smoke validation changes

23f038d

Merge remote-tracking branch 'origin/diffusion_RL_v0.1' into reward

bfa6075

Align PickScore smoke with latest diffusion args

7519869

Match FlowGRPO reward std normalization

c1dd521

Make HPS reward branch standalone

47e1e6d

Lazy import OCR reward path

9800a99

Add HPSv2 reward model support #8
MikukuOvO wants to merge 11 commits into Rockdu:diffusion_RL_v0.1 from voidreaming:feat/hps-reward

2 participants