
Feature Request: Graceful reasoning budget termination. Avoid mid-sentence cutoff. #20632

@CleStil

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

PR #20297 (merged ~2025-03-11) added --reasoning-budget-message, which injects
a string at the hard token cutoff to signal the model to wrap up. The injection
still happens at exactly token N with zero warning. The model gets no time to
finish its current thought before the </think> boundary.

Graceful termination before the cutoff is expected to improve reasoning
performance. See Motivation and Possible Implementation for details.

Motivation

Raw truncation of the thinking trace measurably reduces answer quality compared
to graceful termination. Muennighoff et al. (s1: Simple test-time scaling,
arXiv:2501.19393) showed that appending an end-of-thinking delimiter with
"Final Answer:" before the hard cut outperforms naïve truncation. This was
confirmed by follow-up work (arXiv:2505.05315), which states directly that
"the S1 approach performs better than directly truncating the full reasoning
trajectory, underscoring the importance of preserving the solution segment."

The current --reasoning-budget-message implementation injects the wrap-up message
at exactly token N, leaving the model zero tokens to act on it. This is
functionally equivalent to truncation with a cosmetic suffix: the model cannot
produce a meaningful conclusion in zero remaining tokens.

Possible Implementation

Three approaches, in increasing order of complexity:

Option A — message offset
Inject the budget message at budget - offset instead of budget, giving the
model offset tokens to act on the wrap-up signal. The s1 paper
(https://arxiv.org/abs/2501.19393) shows this outperforms raw truncation.
Change in the COUNTING state of reasoning-budget.cpp:

if (ctx->remaining <= ctx->warn_offset)  // was: <= 0

Files: reasoning-budget.cpp, reasoning-budget.h, common.h, arg.cpp,
sampling.cpp. Default 0 preserves current behavior.
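To make the intent of Option A concrete, here is a minimal, self-contained sketch of the counter behavior, assuming a per-token callback. `BudgetState`, `warn_offset`, and `on_token` are illustrative names, not the actual llama.cpp internals:

```cpp
#include <cassert>

// Sketch of Option A (hypothetical names, not llama.cpp's actual code):
// the counter fires the wrap-up message warn_offset tokens before the
// budget is exhausted, instead of exactly at the cutoff.
struct BudgetState {
    int  remaining;        // tokens left in the reasoning budget
    int  warn_offset;      // how many tokens early to inject the message
    bool injected = false; // the message is injected at most once
};

// Called once per generated reasoning token. Returns true when the
// wrap-up message should be injected into the stream.
bool on_token(BudgetState &st) {
    st.remaining--;
    if (!st.injected && st.remaining <= st.warn_offset) { // was: <= 0
        st.injected = true;
        return true;
    }
    return false;
}
```

With `warn_offset = 0` the condition degenerates to `remaining <= 0`, which is the current behavior, so the default is backward compatible.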

Option B — separate budgeting
Split total budget into thinking (t) and conclusion (s) phases, t + s = budget.
Force </think> at token t, leave s tokens for a conclusion before the answer.
Follow-up work on s1 (https://arxiv.org/abs/2505.05315) shows this outperforms
the basic offset approach. Would add a second parameter, e.g.
--reasoning-budget-conclusion M.
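A sketch of the two-phase state machine Option B implies. The names (`SplitBudget`, `Action`, the proposed `--reasoning-budget-conclusion` plumbing) are assumptions for illustration only:

```cpp
#include <cassert>

// Sketch of Option B (hypothetical names): total budget split into a
// thinking phase of t tokens and a conclusion phase of s tokens.
enum class Phase  { THINKING, CONCLUSION, DONE };
enum class Action { PASS, FORCE_CLOSE, STOP };

struct SplitBudget {
    int   thinking;             // t: tokens for the reasoning trace
    int   conclusion;           // s: tokens reserved for the conclusion
    int   used  = 0;            // tokens consumed in the current phase
    Phase phase = Phase::THINKING;
};

// Called once per generated token. FORCE_CLOSE means "force the
// </think> token now"; STOP means the conclusion budget is exhausted.
Action on_token(SplitBudget &b) {
    b.used++;
    switch (b.phase) {
    case Phase::THINKING:
        if (b.used >= b.thinking) {
            b.phase = Phase::CONCLUSION;
            b.used  = 0;
            return Action::FORCE_CLOSE; // emit </think> at token t
        }
        return Action::PASS;
    case Phase::CONCLUSION:
        if (b.used >= b.conclusion) {
            b.phase = Phase::DONE;
            return Action::STOP;
        }
        return Action::PASS;
    default:
        return Action::STOP;
    }
}
```

The key difference from Option A is that the conclusion window is guaranteed: the model always gets exactly s tokens after </think> is forced, rather than whatever happens to remain.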

Option C — logit biasing toward </think>
Ramp up the logit weight of </think> as the budget runs low, then force it at
the hard limit. No dependency on an injected phrase, model-agnostic. NVIDIA NIM
uses this approach with a 10% extension window for sentence completion. More
invasive (sampler changes) but cleanest long-term.
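A toy version of the Option C ramp, as one possible shape. `ramp_start` and `max_bias` are made-up tuning knobs, the ramp need not be linear, and this does not reproduce NVIDIA NIM's exact mechanics:

```cpp
#include <cassert>

// Sketch of Option C (hypothetical parameters): additive logit bias for
// the </think> token as a function of tokens remaining in the budget.
// Zero while far from the budget, ramping up linearly inside the last
// ramp_start tokens, and effectively infinite (forced) at the hard limit.
float think_close_bias(int remaining, int ramp_start, float max_bias) {
    if (remaining <= 0) {
        return 1e9f; // hard limit: force </think>
    }
    if (remaining >= ramp_start) {
        return 0.0f; // far from the budget: no bias
    }
    // linear ramp toward max_bias as remaining approaches 0
    float frac = 1.0f - (float) remaining / (float) ramp_start;
    return max_bias * frac;
}
```

Because the bias is applied in the sampler rather than through an injected phrase, this works for any model whose tokenizer has a distinct </think> token.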

Options A and B build on the existing message injection architecture. Option C
is a parallel approach that could coexist with either.

Happy to test on ROCm/gfx1100 (Qwen3.5-27B) and contribute a draft of Option A
or B if there is interest.
