Prerequisites
Feature Description
PR #20297 (merged ~2025-03-11) added --reasoning-budget-message, which injects
a string at the hard token cutoff to signal the model to wrap up. However, the
injection still happens at exactly token N, with zero warning: the model gets no
time to finish its current thought before the </think> boundary.
Graceful termination before the cutoff is expected to improve reasoning
performance. See Motivation and Possible Implementation for details.
Motivation
Raw truncation of the thinking trace measurably reduces answer quality compared
to graceful termination. Muennighoff et al. (s1: Simple test-time scaling,
arXiv:2501.19393) showed that appending an end-of-thinking delimiter with
"Final Answer:" before the hard cut outperforms naïve truncation. This was
confirmed by follow-up work (arXiv:2505.05315), which states directly that
"the S1 approach performs better than directly truncating the full reasoning
trajectory, underscoring the importance of preserving the solution segment."
The current --reasoning-budget-message implementation injects a wrap-up message
at exactly token N, leaving the model zero tokens to actually act on it. This is
functionally equivalent to truncation with a cosmetic suffix — the model cannot
produce a meaningful conclusion within 0 remaining tokens.
Possible Implementation
Three approaches, in increasing order of complexity:
Option A — message offset
Inject the budget message at budget - offset instead of budget, giving the
model offset tokens to act on the wrap-up signal. The s1 paper
(https://arxiv.org/abs/2501.19393) shows this outperforms raw truncation.
Change in the COUNTING state of reasoning-budget.cpp:
if (ctx->remaining <= ctx->warn_offset) // was: <= 0
Files: reasoning-budget.cpp, reasoning-budget.h, common.h, arg.cpp,
sampling.cpp. Default 0 preserves current behavior.
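A minimal sketch of the Option A logic, assuming the COUNTING state decrements a remaining-token counter each step (the names should_inject, inject_point, and warn_offset are illustrative, not the PR's actual identifiers):

```cpp
#include <cassert>

// Sketch: with warn_offset > 0 the wrap-up message fires warn_offset
// tokens before the hard limit instead of exactly at it.
static bool should_inject(int remaining, int warn_offset) {
    return remaining <= warn_offset; // was: remaining <= 0
}

// Token index at which the message is injected for a given total budget:
// remaining == budget - tokens_generated, so injection happens once
// tokens_generated reaches budget - warn_offset.
static int inject_point(int budget, int warn_offset) {
    return budget - warn_offset;
}
```

With warn_offset = 0 this reduces exactly to the current behavior, which is why a 0 default is backward-compatible.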
Option B — separate budgeting
Split total budget into thinking (t) and conclusion (s) phases, t + s = budget.
Force </think> at token t, leave s tokens for a conclusion before the answer.
Follow-up work on s1 (https://arxiv.org/abs/2505.05315) shows this outperforms
the basic offset approach. Would add a second parameter, e.g.
--reasoning-budget-conclusion M.
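The two-phase split could look roughly like the following state machine (the field names, the phase enum, and forcing the literal "</think>" string are assumptions about how --reasoning-budget-conclusion might be wired up, not the actual implementation):

```cpp
#include <cassert>
#include <string>

enum class phase { thinking, conclusion, done };

struct budget_state {
    int think_budget;       // t: tokens allowed inside the thinking trace
    int conclusion_budget;  // s: tokens reserved for the wrap-up, t + s = budget
    int used = 0;
    phase cur = phase::thinking;
};

// Returns the string to force-inject at this step, or "" to sample normally.
static std::string step(budget_state & st) {
    st.used++;
    if (st.cur == phase::thinking && st.used >= st.think_budget) {
        st.cur = phase::conclusion;
        return "</think>"; // force the delimiter at token t
    }
    if (st.cur == phase::conclusion &&
        st.used >= st.think_budget + st.conclusion_budget) {
        st.cur = phase::done; // hard stop after the conclusion window
    }
    return "";
}
```

For example, with t = 3 and s = 2 the delimiter is forced at token 3 and generation stops after token 5, so the model always gets s tokens to conclude.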
Option C — logit biasing toward </think>
Ramp up the logit weight of </think> as the budget runs low, then force it at
the hard limit. This has no dependency on an injected phrase and is
model-agnostic. NVIDIA NIM uses this approach with a 10% extension window for
sentence completion. It is more invasive (sampler changes) but the cleanest
long-term.
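One possible ramp, sketched below: zero bias while the budget is comfortable, a linear ramp over the last stretch, and a forced </think> at the hard limit. The linear shape and the ramp_len / max_bias parameters are illustrative assumptions, not NIM's actual recipe:

```cpp
#include <cassert>
#include <cmath>

// Bias added to the </think> logit as the budget runs low.
// remaining: tokens left in the budget; ramp_len: where the ramp starts;
// max_bias: bias applied just before the hard limit.
static float think_close_bias(int remaining, int ramp_len, float max_bias) {
    if (remaining >= ramp_len) return 0.0f;     // far from the limit: no bias
    if (remaining <= 0)        return INFINITY; // hard limit: force </think>
    // Linear ramp from 0 up to max_bias as remaining goes ramp_len -> 0.
    return max_bias * (1.0f - (float) remaining / (float) ramp_len);
}
```

In a sampler chain this would be added to the </think> token's logit before softmax, so the model increasingly prefers to close the trace on its own rather than being cut mid-sentence.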
Options A and B build on the existing message injection architecture. Option C
is a parallel approach that could coexist with either.
Happy to test on ROCm/gfx1100 (Qwen3.5-27B) and contribute a draft of Option A
or B if there is interest.