Request for rollout data used in Figure 7 and Figure 10 #5

@qbetterk

Hi ThinkPRM authors,

Thank you for releasing this interesting work. I am currently using ThinkPRM as a baseline in my research, and I would like to add a new comparison using our own reward model / verifier.

For a fair comparison, I would like to evaluate our reward model on the same rollout candidates used in the paper, so that the only variable is the reward model used to select among candidate solutions.

Specifically, I am interested in the rollout data used for:

  1. Figure 7: the Best-of-N evaluation on MATH-500.
    • I noticed that the paper appears to use Qwen2.5-14B for MATH-500 in Figure 7, while Qwen2.5-32B-Instruct is used for AIME ’24. Could you confirm the exact generator model used for the MATH-500 rollouts?
    • If available, could you share the sampled candidate solutions used for the MATH-500 Best-of-N evaluation?
  2. Figure 10: the Best-of-N evaluation on GPQA-Physics.
    • My understanding is that this uses Qwen2.5-32B-Instruct as the generator model.
    • If available, could you share the sampled candidate solutions used for the GPQA-Physics evaluation?

Having access to the exact same rollouts would make the comparison much fairer, since it avoids differences caused by sampling randomness from the generator model and isolates the effect of the reward model / verifier.
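
To make the intended comparison concrete, here is a minimal sketch of Best-of-N selection over a fixed set of rollouts, where only the scoring function changes between runs. The record layout (`question`, `candidates`, `solution`, `is_correct`) and the names `best_of_n_accuracy` and `score_fn` are my own assumptions for illustration, not the format or API used in the ThinkPRM codebase.

```python
from typing import Callable, Dict, List


def best_of_n_accuracy(
    rollouts: List[Dict],
    score_fn: Callable[[str, str], float],
    n: int,
) -> float:
    """Best-of-N accuracy on pre-sampled candidates.

    Each record is assumed (hypothetically) to look like:
      {"question": str,
       "candidates": [{"solution": str, "is_correct": bool}, ...]}
    The candidates are fixed, so swapping score_fn is the only variable.
    """
    if not rollouts:
        raise ValueError("rollouts must not be empty")
    correct = 0
    for problem in rollouts:
        # Use the first n stored candidates so every verifier sees the same pool.
        pool = problem["candidates"][:n]
        # Select the candidate the verifier scores highest.
        best = max(pool, key=lambda c: score_fn(problem["question"], c["solution"]))
        correct += int(best["is_correct"])
    return correct / len(rollouts)


# Toy data standing in for the real rollouts.
toy_rollouts = [
    {"question": "q1", "candidates": [
        {"solution": "a", "is_correct": False},
        {"solution": "bbb", "is_correct": True},
    ]},
    {"question": "q2", "candidates": [
        {"solution": "cc", "is_correct": False},
        {"solution": "dddd", "is_correct": True},
    ]},
]


def prefers_long(question: str, solution: str) -> float:
    # Stand-in verifier: longer solutions score higher.
    return float(len(solution))


def prefers_short(question: str, solution: str) -> float:
    # Stand-in verifier: shorter solutions score higher.
    return -float(len(solution))
```

With shared rollouts, calling `best_of_n_accuracy(rollouts, score_fn, n)` once per verifier would give numbers that differ only because of the verifier, which is the property I am trying to preserve.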

If the full rollout data cannot be released publicly, I would also greatly appreciate any guidance on how to reproduce the evaluation setup as closely as possible.

Thanks again for the great work!

cc @zmykevin
