Questions about your benchmark result in the paper #10

Thanks for your excellent work and for open‑sourcing the code to the community!

I have a few questions regarding the AIME24/25 and AMC23 benchmark results in **Table 2** of your paper **“Parallel-R1: Towards Parallel Thinking via Reinforcement Learning.”**

1. **Regarding Qwen3‑4B (released 2025‑04)**: This model provides both a *thinking* mode and a *non‑thinking* mode. Which mode did you use during your evaluation and training?

2. It seems that your reported results differ significantly from those in the Qwen3 technical report (https://arxiv.org/pdf/2505.09388). Their reported AIME24 and AIME25 scores are **25.0%** and **19.1%**, which are much higher than the values in your table (**2.9%** and **1.3%**). Could this discrepancy be due to your evaluation settings (for example, a limited generation context length) or another factor?

3. Based on the above and considering that your code uses `max_token = 3000`, I suspect that you may have evaluated and trained **Qwen3‑4B** in thinking mode, but with a generation length that is too short for its reasoning traces. Could this be the cause of the large performance gap?
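One way to test the hypothesis in point 3 is to measure how many generations stop at the token cap, and how many open a `<think>` block without ever closing it. The snippet below is only a minimal sketch: the `Generation` record, the `truncation_report` helper, and the assumption that thinking-mode outputs contain literal `<think>` / `</think>` tags in the decoded text are illustrative and not taken from the Parallel-R1 code. The default cap of 3000 mirrors the `max_token = 3000` value mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: thinking-mode output wraps its reasoning in <think>...</think>.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class Generation:
    """Hypothetical record for one sampled completion."""
    text: str        # decoded model output
    num_tokens: int  # number of generated tokens


def truncation_report(gens: List[Generation], max_tokens: int = 3000) -> Dict[str, float]:
    """Return the fraction of generations that hit the cap or left <think> unclosed."""
    total = len(gens)
    if total == 0:
        return {"total": 0, "hit_cap": 0.0, "unclosed_think": 0.0}

    # A generation that used the whole budget was most likely cut off.
    hit_cap = sum(g.num_tokens >= max_tokens for g in gens)

    # An opened but never closed reasoning block means the final answer
    # was never produced, so the sample would be scored as wrong.
    unclosed = sum(
        (THINK_OPEN in g.text) and (THINK_CLOSE not in g.text) for g in gens
    )

    return {
        "total": total,
        "hit_cap": hit_cap / total,
        "unclosed_think": unclosed / total,
    }


if __name__ == "__main__":
    samples = [
        Generation("<think>long reasoning that never finishes", 3000),
        Generation("<think>short reasoning</think> The answer is 42.", 512),
        Generation("The answer is 7.", 20),
        Generation("<think>reasoning</think> partial final answ", 3000),
    ]
    print(truncation_report(samples))
```

If most AIME samples fall into either bucket, the gap to the Qwen3 technical report would be explained largely by the generation budget rather than by the model itself.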

Thanks again for your contributions!
