Hi WritingBench authors,
Thank you for open-sourcing WritingBench and the evaluation scripts.
We are trying to reproduce the length subset result for Qwen2.5-7B-Instruct, and we would like to confirm whether our current setting is comparable to the leaderboard / paper setting.
What we have done:
- We aligned the generation configuration with the official recommendation as closely as possible:
temperature = 0.7
top_p = 0.8
top_k = 20
max_tokens = 8192 (limited by our serving platform)
- We also verified the scoring / aggregation logic carefully, using the official
generate_response.py + evaluate_benchmark.py + calculate_scores.py pipeline
All of these runs consistently give us a result around:
Overall ≈ 5.50
length_R ≈ 5.50
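For concreteness, the generation settings above can be written out as a plain config dict. This is only a sketch of what we pass to our serving stack (the dict name and the way it is wired into generate_response.py are our own, not from the official scripts):

```python
# Sampling parameters we aligned with the official recommendation.
# GEN_CONFIG is our own name; how it is passed to the serving backend
# (e.g. as kwargs to an OpenAI-compatible endpoint) is deployment-specific.
GEN_CONFIG = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 8192,  # capped by our serving platform
}
```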
The only unavoidable difference we found is in the Claude judge call:
- we use claude_sonnet-4-5-20250929
- for Claude Sonnet 4.5, the provider does not allow
temperature and top_p to be specified at the same time
- so for evaluation we used the closest compatible setting:
top_p = 0.95
max_tokens = 2048
temperature omitted, letting the provider default apply
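The judge-side settings we ended up with, written the same way as a sketch (the model string is our deployment's identifier, and the comment records the provider constraint we hit):

```python
# Evaluation (judge) parameters we used for the Claude call.
# JUDGE_CONFIG is our own name for illustration only.
JUDGE_CONFIG = {
    "model": "claude_sonnet-4-5-20250929",
    "top_p": 0.95,
    "max_tokens": 2048,
    # "temperature" intentionally omitted: for Claude Sonnet 4.5 the provider
    # rejects temperature and top_p specified together, so the provider
    # default temperature applies.
}
```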
Given that the rest of the pipeline seems aligned, we would like to ask:
- Is the public leaderboard / paper result for
Qwen2.5-7B-Instruct directly comparable to results produced by the current open-source evaluation scripts?
- Was the reported result produced with the released Claude-based evaluation pipeline, or with the WritingBench critic model?
- Could the difference come from a different judge deployment environment or parameter handling?
We would really appreciate any clarification. Thank you very much!