Add HPSv2 reward model support #8

Draft
MikukuOvO wants to merge 11 commits into Rockdu:diffusion_RL_v0.1 from voidreaming:feat/hps-reward

Conversation

MikukuOvO (Collaborator) commented Apr 26, 2026

1. Change Summary

Files changed

  • miles/rollout/rm_hub/__init__.py
  • miles/rollout/rm_hub/hps.py
  • miles/utils/arguments.py
  • requirements.txt
  • scripts/run-diffusion-grpo-hps-smoke.sh

What changed

  • miles/rollout/rm_hub/__init__.py

    • Registers rm_type=hps in the reward model dispatch path.
    • Uses the batched HPS path when every sample in a batch requests HPS.
    • Keeps OCR lazy-imported so importing non-OCR reward paths does not require PaddleOCR to import successfully.
  • miles/rollout/rm_hub/hps.py

    • Adds an HPS / HPSv2.1 reward scorer for diffusion rollout samples.
    • Converts rollout tensors from [C, F, H, W] into RGB uint8 HWC images before scoring.
    • Loads the HPS ViT-H-14 model through hpsv2.src.open_clip.create_model_and_transforms, then loads the HPSv2 checkpoint weights.
    • Computes the reward as the diagonal of image_features @ text_features.T, matching the DanceGRPO HPSv2 reward formula.
    • Runs scoring through a Ray actor pool so rollout reward inference can be batched and isolated from the training process.
  • miles/utils/arguments.py

    • Adds HPS runtime knobs: number of workers, GPU resources per worker, batch size, HPS version, and optional local checkpoint path.
  • requirements.txt

    • Adds the hpsv2 runtime dependency.
  • scripts/run-diffusion-grpo-hps-smoke.sh

    • Adds a focused diffusion GRPO smoke script that selects --rm-type hps and wires the HPS-specific runtime arguments.
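
The tensor-to-image step listed under miles/rollout/rm_hub/hps.py can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name is hypothetical, and it assumes float inputs already scaled to [0, 1] and that a single frame (index 0 by default) is scored. The PR does not state the value range or the frame selection, so both are assumptions. NumPy stands in for the tensor library here.

```python
import numpy as np


def cfhw_to_rgb_uint8(video, frame_index=0):
    """Convert a [C, F, H, W] float array into one RGB uint8 HWC image.

    Assumptions (not confirmed by the PR): values lie in [0, 1], the
    channel axis is RGB, and one frame is selected for scoring.
    """
    video = np.asarray(video, dtype=np.float32)
    if video.ndim != 4 or video.shape[0] != 3:
        raise ValueError(f"expected [3, F, H, W], got shape {video.shape}")

    frame = video[:, frame_index]          # [C, H, W]
    frame = np.clip(frame, 0.0, 1.0)       # guard against small overshoot
    hwc = np.transpose(frame, (1, 2, 0))   # [H, W, C]
    return np.round(hwc * 255.0).astype(np.uint8)


# Example: a 2-frame, 4x5 clip whose red channel is fully saturated.
clip = np.zeros((3, 2, 4, 5), dtype=np.float32)
clip[0] = 1.0
image = cfhw_to_rgb_uint8(clip)
print(image.shape, image.dtype)
```

If the rollout tensors are actually in [-1, 1], the clip and scale step would need a shift of `(x + 1) / 2` first.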

2. Validation

I cloned the official DanceGRPO repo and used its HPSv2.1 reward implementation as the reference:

  • repo: https://github.com/XueZeyue/DanceGRPO
  • reference code path: fastvideo/train_grpo_qwenimage.py
  • reference formula: preprocess image, tokenize prompt, run HPS ViT-H-14, then use torch.diagonal(image_features @ text_features.T) as reward.
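
To make the reference formula concrete, here is a small sketch of the diagonal step alone. NumPy's `np.diagonal` stands in for `torch.diagonal`, the function name is hypothetical, and the feature vectors are toy values, not HPS outputs. The point is that the diagonal of `image_features @ text_features.T` keeps only the score of image i against its own prompt i.

```python
import numpy as np


def paired_rewards(image_features, text_features):
    """Return the score of each image against its own prompt.

    Both inputs are [N, D]; the result is [N]. Illustrative only.
    """
    image_features = np.asarray(image_features, dtype=np.float64)
    text_features = np.asarray(text_features, dtype=np.float64)
    if image_features.shape != text_features.shape:
        raise ValueError("image and text features must have the same shape")
    # Full [N, N] similarity matrix; only the N matched pairs are kept.
    return np.diagonal(image_features @ text_features.T)


# Toy features for two image/prompt pairs.
img = np.array([[1.0, 0.0], [0.0, 1.0]])
txt = np.array([[0.6, 0.8], [0.8, 0.6]])
rewards = paired_rewards(img, txt)

# The same numbers come from a row-wise dot product.
rowwise = (img * txt).sum(axis=1)
```

With a per-sample call (N = 1), the similarity matrix is 1x1 and the diagonal is simply that single score.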

I then ran one focused reward-alignment test on GPU 0 in /root/miniconda3/envs/miles-rollout-test.

The test uses the same 3 fixed prompt/image pairs and compares three paths:

  • DanceGRPO HPSv2.1 reference implementation.
  • MILES direct HPSScorer.
  • MILES Ray-backed hps_rm path with hps_batch_size=1, matching DanceGRPO's per-sample reward call granularity.

3. Experiment Report

idx  image               dancegrpo     miles_scorer  miles_rm      scorer_diff  rm_diff    aligned
0    cat.png             0.3012695312  0.3012695312  0.3012695312  0.000e+00    0.000e+00  yes
1    test.jpg            0.1445312500  0.1445312500  0.1445312500  0.000e+00    0.000e+00  yes
2    flow_grpo_fast.png  0.1787109375  0.1787109375  0.1787109375  0.000e+00    0.000e+00  yes

Result:

  • raw_max_abs_diff_direct=0.000e+00
  • raw_max_abs_diff_rm=0.000e+00
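
The two summary numbers can be recomputed from the table rows. The sketch below is not the test script from this PR. It only re-derives the maxima from the scores as printed above, using the same output labels.

```python
# (image, dancegrpo, miles_scorer, miles_rm) copied from the report table.
rows = [
    ("cat.png", 0.3012695312, 0.3012695312, 0.3012695312),
    ("test.jpg", 0.1445312500, 0.1445312500, 0.1445312500),
    ("flow_grpo_fast.png", 0.1787109375, 0.1787109375, 0.1787109375),
]

# Largest absolute gap between each MILES path and the reference.
max_diff_direct = max(abs(scorer - ref) for _, ref, scorer, _ in rows)
max_diff_rm = max(abs(rm - ref) for _, ref, _, rm in rows)

print(f"raw_max_abs_diff_direct={max_diff_direct:.3e}")
print(f"raw_max_abs_diff_rm={max_diff_rm:.3e}")
```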

The MILES direct scorer and Ray-backed reward path match the DanceGRPO HPSv2.1 reward exactly for these fixed inputs when using the same per-sample reward granularity.

Note: when multiple samples are scored in one AMP batch, HPS can show tiny BF16-level differences around 1.221e-04 versus DanceGRPO's per-sample path. The validation above matches DanceGRPO's actual per-sample scoring path.
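
As a sanity check on the magnitude quoted in the note, the sketch below computes the gap between adjacent representable values near one of the reported scores for a binary floating-point format with a given number of significand bits. It is plain arithmetic on the reported number: 1.221e-04 is 2^-13, which is the float16 spacing in [0.125, 0.25), while bfloat16 spacing in that range is 2^-10. The sketch makes no claim about which autocast dtype the scorer actually uses, and the helper name is hypothetical.

```python
import math


def spacing(x, significand_bits):
    """Gap between adjacent representable values around x.

    significand_bits counts the implicit leading bit (11 for float16,
    8 for bfloat16). Subnormals are ignored, which is fine for the
    score magnitudes in this report.
    """
    _, exponent = math.frexp(x)  # x = m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, exponent - significand_bits)


score = 0.1787109375  # flow_grpo_fast.png row from the table
fp16_gap = spacing(score, 11)
bf16_gap = spacing(score, 8)

print(f"float16 spacing:  {fp16_gap:.3e}")
print(f"bfloat16 spacing: {bf16_gap:.3e}")
```

Either way, a difference of this size is one rounding step in a reduced-precision format, which is consistent with the note's explanation that batched and per-sample calls can round differently.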

4. Test Plan

  • Add a lightweight unit test for rollout tensor-to-RGB conversion and HPS argument dispatch that does not require downloading HPS weights.
  • Add an optional GPU/nightly HPS alignment test using cached HPSv2.1 weights, comparing DanceGRPO reference output, MILES direct HPSScorer, and Ray-backed hps_rm with max_abs_diff <= 1e-6.
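
For the first item, a weight-free test could target logic like the following. This is a hypothetical sketch of the "all samples request HPS" check and of splitting a batch by batch size. The helper names and the empty-batch behavior are assumptions, not the functions in miles/rollout/rm_hub/__init__.py.

```python
from typing import List, Sequence


def use_batched_hps(rm_types: Sequence[str]) -> bool:
    """True only when the batch is non-empty and every sample requests HPS.

    Treating an empty batch as "not batched" is an assumption.
    """
    return len(rm_types) > 0 and all(t == "hps" for t in rm_types)


def chunk(items: Sequence, batch_size: int) -> List[list]:
    """Split items into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# No model weights are needed to exercise either helper.
all_hps = use_batched_hps(["hps", "hps", "hps"])
mixed = use_batched_hps(["hps", "ocr"])
per_sample = chunk(["a", "b", "c"], batch_size=1)
pairs = chunk(["a", "b", "c"], batch_size=2)
```

With `batch_size=1` every sample is scored alone, which corresponds to the per-sample granularity used in the validation above.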

MikukuOvO and others added 11 commits April 20, 2026 17:04
Wires in HPS / HPSv2.1 (ViT-H-14 + xswu/HPSv2 checkpoint) as a second
reward model alongside PickScore. A training run selects it via
--rm-type hps and --hps-version v2.1 (default). Includes a smoke
script and a standalone compare_reward_models.py harness for
PickScore vs HPS correlation checks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2 participants