Prerequisites
Feature Description
PR #20297 (merged ~2025-03-11) added --reasoning-budget-message, which injects
a string at the hard token cutoff to signal the model to wrap up. However, the
injection still happens at exactly token N, with zero warning: the model gets no
time to finish its current thought before the </think> boundary.
Graceful termination before the cutoff is expected to improve reasoning
performance. See Motivation and Possible Implementation for details.
Motivation
Raw truncation of the thinking trace measurably reduces answer quality compared
to graceful termination. Muennighoff et al. (s1: Simple test-time scaling,
arXiv:2501.19393) showed that appending an end-of-thinking delimiter with
"Final Answer:" before the hard cut outperforms naïve truncation. This was
confirmed by follow-up work (arXiv:2505.05315), which states directly that
"the S1 approach performs better than directly truncating the full reasoning
trajectory, underscoring the importance of preserving the solution segment."
The current --reasoning-budget-message implementation injects a wrap-up message
at exactly token N, leaving the model zero tokens to actually act on it. This is
functionally equivalent to truncation with a cosmetic suffix — the model cannot
produce a meaningful conclusion within 0 remaining tokens.
Possible Implementation
Three approaches, in increasing order of complexity:
Option A — message offset
Inject the budget message at budget - offset instead of budget, giving the
model offset tokens to act on the wrap-up signal. The s1 paper
(https://arxiv.org/abs/2501.19393) shows this outperforms raw truncation.
Change in the COUNTING state of reasoning-budget.cpp:
if (ctx->remaining <= ctx->warn_offset) // was: <= 0
Files: reasoning-budget.cpp, reasoning-budget.h, common.h, arg.cpp,
sampling.cpp. Default 0 preserves current behavior.
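A minimal sketch of the Option A logic, assuming the COUNTING state decrements a remaining-token counter each step (the names should_inject, inject_point, and warn_offset are illustrative, not the PR's actual identifiers):

```cpp
#include <cassert>

// Sketch: with warn_offset > 0 the wrap-up message fires warn_offset
// tokens before the hard limit instead of exactly at it.
static bool should_inject(int remaining, int warn_offset) {
    return remaining <= warn_offset; // was: remaining <= 0
}

// Token index at which the message is injected for a given total budget:
// remaining == budget - tokens_generated, so injection happens once
// tokens_generated reaches budget - warn_offset.
static int inject_point(int budget, int warn_offset) {
    return budget - warn_offset;
}
```

With warn_offset = 0 this reduces exactly to the current behavior, which is why a 0 default is backward-compatible.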
Option B — separate budgeting
Split total budget into thinking (t) and conclusion (s) phases, t + s = budget.
Force </think> at token t, leave s tokens for a conclusion before the answer.
Follow-up work on s1 (https://arxiv.org/abs/2505.05315) shows this outperforms
the basic offset approach. Would add a second parameter, e.g.
--reasoning-budget-conclusion M.
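The two-phase split could look roughly like the following state machine (the field names, the phase enum, and forcing the literal "</think>" string are assumptions about how --reasoning-budget-conclusion might be wired up, not the actual implementation):

```cpp
#include <cassert>
#include <string>

enum class phase { thinking, conclusion, done };

struct budget_state {
    int think_budget;       // t: tokens allowed inside the thinking trace
    int conclusion_budget;  // s: tokens reserved for the wrap-up, t + s = budget
    int used = 0;
    phase cur = phase::thinking;
};

// Returns the string to force-inject at this step, or "" to sample normally.
static std::string step(budget_state & st) {
    st.used++;
    if (st.cur == phase::thinking && st.used >= st.think_budget) {
        st.cur = phase::conclusion;
        return "</think>"; // force the delimiter at token t
    }
    if (st.cur == phase::conclusion &&
        st.used >= st.think_budget + st.conclusion_budget) {
        st.cur = phase::done; // hard stop after the conclusion window
    }
    return "";
}
```

For example, with t = 3 and s = 2 the delimiter is forced at token 3 and generation stops after token 5, so the model always gets s tokens to conclude.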
Option C — logit biasing toward </think>
Ramp up the logit weight of </think> as the budget runs low, then force it at
the hard limit. This has no dependency on an injected phrase and is
model-agnostic. NVIDIA NIM uses this approach with a 10% extension window for
sentence completion. It is more invasive (sampler changes) but the cleanest
long-term.
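One possible ramp, sketched below: zero bias while the budget is comfortable, a linear ramp over the last stretch, and a forced </think> at the hard limit. The linear shape and the ramp_len / max_bias parameters are illustrative assumptions, not NIM's actual recipe:

```cpp
#include <cassert>
#include <cmath>

// Bias added to the </think> logit as the budget runs low.
// remaining: tokens left in the budget; ramp_len: where the ramp starts;
// max_bias: bias applied just before the hard limit.
static float think_close_bias(int remaining, int ramp_len, float max_bias) {
    if (remaining >= ramp_len) return 0.0f;     // far from the limit: no bias
    if (remaining <= 0)        return INFINITY; // hard limit: force </think>
    // Linear ramp from 0 up to max_bias as remaining goes ramp_len -> 0.
    return max_bias * (1.0f - (float) remaining / (float) ramp_len);
}
```

In a sampler chain this would be added to the </think> token's logit before softmax, so the model increasingly prefers to close the trace on its own rather than being cut mid-sentence.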
Options A and B build on the existing message injection architecture. Option C
is a parallel approach that could coexist with either.
Happy to test on ROCm/gfx1100 (Qwen3.5-27B) and contribute a draft of Option A
or B if there is interest.