Hi ThinkPRM authors,
Thank you for releasing this interesting work. I am currently using ThinkPRM as a baseline in my research, and I would like to add a new comparison using our own reward model / verifier.
For a fair comparison, I would like to evaluate our reward model on the same rollout candidates used in the paper, so that the only variable is the reward model used to select among candidate solutions.
Specifically, I am interested in the rollout data used for:
- Figure 7: the Best-of-N evaluation on MATH-500.
  - I noticed that the paper appears to use Qwen2.5-14B for MATH-500 in Figure 7, while Qwen2.5-32B-Instruct is used for AIME ’24. Could you confirm the exact generator model used for the MATH-500 rollouts?
  - If available, could you share the sampled candidate solutions used for the MATH-500 Best-of-N evaluation?
- Figure 10: the Best-of-N evaluation on GPQA-Physics.
  - My understanding is that this uses Qwen2.5-32B-Instruct as the generator model.
  - If available, could you share the sampled candidate solutions used for the GPQA-Physics evaluation?
Having access to the exact same rollouts would make the comparison much fairer, since it avoids differences caused by sampling randomness from the generator model and isolates the effect of the reward model / verifier.
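For concreteness, here is a minimal sketch of how I would consume the rollouts on my side. The record layout (`problem`, `candidates`, `correct`) and the `score_fn` interface are my own assumptions for illustration, not the format used in your codebase; any format that lets me recover the candidate solutions and their correctness labels per problem would work.

```python
from typing import Callable, Dict, List


def best_of_n_accuracy(
    rollouts: List[Dict],
    score_fn: Callable[[str, str], float],
    n: int,
) -> float:
    """Best-of-N accuracy over a fixed set of rollouts.

    Each record is assumed (hypothetical layout) to look like:
      {"problem": str, "candidates": [str, ...], "correct": [bool, ...]}
    `score_fn(problem, candidate)` stands in for any reward model / verifier.
    """
    if not rollouts:
        raise ValueError("rollouts must be non-empty")
    hits = 0
    for item in rollouts:
        # Use only the first n candidates so every verifier sees the same pool.
        cands = item["candidates"][:n]
        labels = item["correct"][:n]
        scores = [score_fn(item["problem"], c) for c in cands]
        # Select the highest-scoring candidate (first one wins on ties).
        best = max(range(len(cands)), key=scores.__getitem__)
        hits += bool(labels[best])
    return hits / len(rollouts)


# Toy data, purely illustrative.
toy_rollouts = [
    {"problem": "1+1", "candidates": ["2", "3 maybe"], "correct": [True, False]},
    {"problem": "2+2", "candidates": ["5", "4 exactly"], "correct": [False, True]},
]
```

Because the candidate pool is fixed, swapping `score_fn` between ThinkPRM and our verifier would be the only change between runs.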
If the full rollout data cannot be released publicly, I would also greatly appreciate any guidance on how to reproduce the evaluation setup as closely as possible.
Thanks again for the great work!
cc @zmykevin