Thanks for your excellent work and for open‑sourcing the code to the community!
I have a few questions regarding the AIME24/25 and AMC23 benchmark results in Table 2 of your paper “Parallel-R1: Towards Parallel Thinking via Reinforcement Learning.”
- Regarding Qwen3‑4B (released 2025‑04): This model provides both a thinking mode and a non‑thinking mode. Which mode did you use during your evaluation and training?
- It seems that your reported results differ significantly from those in the Qwen3 technical report (https://arxiv.org/pdf/2505.09388). Their reported AIME24 and AIME25 scores are 25.0% and 19.1%, which are much higher than the values in your table (2.9% and 1.3%). Could this discrepancy be due to your evaluation settings (for example, a limited generation length) or another factor?
- Based on the above, and considering that your code uses `max_token = 3000`, I suspect that you may have evaluated and trained Qwen3‑4B in thinking mode, but with a generation length that is too short for its reasoning traces. Could this be the cause of the large performance gap?
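
In case it helps to check the truncation hypothesis, here is a minimal sketch of how one might count cut-off generations. The helper name `truncation_report`, its input format, and the `\boxed{}` answer convention are my own assumptions and are not taken from your code; the `</think>` check is only meaningful for thinking-mode outputs.

```python
import re

THINK_CLOSE = "</think>"            # closing tag of a Qwen3 thinking-mode trace
BOXED = re.compile(r"\\boxed\{")    # assumed final-answer convention


def truncation_report(responses, token_counts, max_tokens=3000):
    """Count generations that look cut off by the length limit.

    responses:    decoded model outputs (list of str)
    token_counts: number of generated tokens for each output (list of int)
    max_tokens:   generation limit used at evaluation time
    """
    if len(responses) != len(token_counts):
        raise ValueError("responses and token_counts must have equal length")

    report = {"total": len(responses), "hit_limit": 0,
              "unclosed_think": 0, "no_boxed_answer": 0}
    for text, n_tokens in zip(responses, token_counts):
        if n_tokens >= max_tokens:        # generation used the whole budget
            report["hit_limit"] += 1
        if THINK_CLOSE not in text:       # reasoning trace was never closed
            report["unclosed_think"] += 1
        if not BOXED.search(text):        # no final answer was emitted
            report["no_boxed_answer"] += 1
    return report


if __name__ == "__main__":
    demo = ["<think>a long reasoning trace that never finishes",
            "<think>short reasoning</think> The answer is \\boxed{42}."]
    print(truncation_report(demo, [3000, 120]))
```

If most AIME responses hit the limit without a closing `</think>`, that would suggest the low scores reflect the length budget more than the model's ability.
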
Thanks again for your contributions!