
UPSTREAM PR #20297: Handle reasoning budget#1237

Open
loci-dev wants to merge 5 commits into main from loci/pr-20297-reasoning-budget

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20297

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles a single case, 0, and the only thing it does then is set enable_thinking to false.

This PR adds the following flags:

  • --enable-reasoning (short -ere) - enable reasoning via kwargs on model
  • --disable-reasoning (short -dre) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch: when the grammar's opening trigger fires, a token countdown starts and a disarm trigger is watched for. If the disarm trigger does not fire before the countdown runs out, the grammar launches.
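A minimal sketch of how such a delayed-launch countdown could work, under the description above; the type and member names (budget_watcher, on_token, etc.) are illustrative, not the PR's actual API:

```cpp
#include <cassert>
#include <string>

// Tracks one reasoning-budget countdown per generation stream.
struct budget_watcher {
    int  budget;           // tokens allowed after the opening trigger
    bool armed    = false; // opening trigger (e.g. "<think>") was seen
    bool launched = false; // countdown expired -> grammar takes over

    explicit budget_watcher(int n) : budget(n) {}

    // Feed one decoded token; returns true once the grammar should launch.
    bool on_token(const std::string & tok,
                  const std::string & open_trigger,
                  const std::string & disarm_trigger) {
        if (launched) {
            return true;
        }
        if (!armed) {
            // Not counting yet: wait for the opening trigger.
            armed = (tok == open_trigger);
            return false;
        }
        if (tok == disarm_trigger) {
            // Model closed its reasoning on its own; stand down.
            armed = false;
            return false;
        }
        if (--budget <= 0) {
            // Countdown ran out before the disarm trigger: launch.
            launched = true;
        }
        return launched;
    }
};
```

In this sketch the grammar, once launched, would be responsible for forcing the close marker (and the optional --reasoning-budget-message) into the output.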

This allows setting a real token budget on reasoning for models. It also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now distinct behavior from --disable-reasoning. Note: while possible, this isn't recommended, since a model trained to work only with reasoning may exhibit aberrant behavior (for example, trying to open an extra reasoning section).

Supersedes #17750

pwilkin and others added 5 commits March 9, 2026 00:47
@loci-review

loci-review bot commented Mar 10, 2026

Overview

Analysis of 112,727 functions across 5 commits implementing reasoning/thinking capabilities for LLMs. 302 modified (0.27%), 26 new, 0 removed, 112,399 unchanged. Power consumption impact is negligible across all binaries:

Binaries analyzed:

  • build.bin.llama-cvector-generator: +0.137%
  • build.bin.llama-tts: +0.107%
  • build.bin.libllama.so: +0.090%
  • build.bin.llama-bench: +0.148%
  • build.bin.llama-tokenize: +0.172%
  • build.bin.llama-quantize: +0.032%
  • build.bin.libmtmd.so: +0.000%
  • build.bin.libggml-base.so: +0.000%
  • build.bin.libggml-cpu.so: +0.000%
  • build.bin.libggml.so: +0.000%
  • build.bin.llama-qwen2vl-cli: +0.000%
  • build.bin.llama-gemma3-cli: +0.000%
  • build.bin.llama-gguf-split: +0.000%
  • build.bin.llama-llava-cli: +0.000%
  • build.bin.llama-minicpmv-cli: +0.000%

Impact: Minor — All changes confined to CLI argument parsing (one-time startup cost). Zero impact on inference hot paths (matrix operations, attention, KV cache, GPU kernels).

Function Analysis

Device list parsing (arg.cpp lambda 131, llama-cvector-generator/llama-tts): Response time 24ns → 13,501ns (+54,968%). Adds parse_device_list() with string splitting and device validation for --device-draft parameter. Absolute overhead: 13.5μs at startup only.

Reasoning validation (arg.cpp lambda 114, both binaries): Response time 30ns → 8,542ns (+28,265%). Implements tri-state --reasoning flag with is_truthy/is_falsey/is_autoy validation plus std::map insertion for template kwargs. Replaces JSON-based control with explicit validation.
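A hedged sketch of what a tri-state parse along these lines might look like; the helper names is_truthy/is_falsey/is_autoy come from the analysis above, but their bodies and the accepted spellings here are assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

enum class reasoning_mode { enabled, disabled, autodetect };

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

static bool is_truthy(const std::string & v) {
    const std::string s = to_lower(v);
    return s == "1" || s == "true" || s == "on" || s == "yes";
}

static bool is_falsey(const std::string & v) {
    const std::string s = to_lower(v);
    return s == "0" || s == "false" || s == "off" || s == "no";
}

static bool is_autoy(const std::string & v) {
    return to_lower(v) == "auto";
}

// Rejects anything that is not clearly true/false/auto, instead of
// silently falling through to a default.
static reasoning_mode parse_reasoning(const std::string & v) {
    if (is_truthy(v)) { return reasoning_mode::enabled;    }
    if (is_falsey(v)) { return reasoning_mode::disabled;   }
    if (is_autoy(v))  { return reasoning_mode::autodetect; }
    throw std::invalid_argument("invalid --reasoning value: " + v);
}
```

The resulting mode would then be written into the template kwargs map, which is where the std::map insertion cost noted above comes from.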

Chat template file loading (arg.cpp lambda 117, both binaries): Response time 29ns → 5,245ns (+17,769%). Adds read_file() call for --chat-template-file parameter. Expected I/O overhead for custom template loading.

Cache type validation (arg.cpp lambdas 134-136, both binaries): Response time 24ns → 1,485ns (+5,911%). Calls kv_cache_type_from_str() to validate against 9 supported formats (F32, F16, BF16, Q8_0, Q4_0, Q4_1, IQ4_NL, Q5_0, Q5_1) for speculative decoding parameters.

STL container operations (libllama.so, llama-tts): std::_Hashtable::begin() +179%, std::map::end() +230%, std::vector::begin() +215%. Compiler code generation differences in standard library template instantiations. Absolute increases: 180-186ns during graph construction/parser initialization, not inference loops.

All remaining analyzed functions show changes consistent with initialization-only overhead.

Additional Findings

Inference paths preserved: Matrix operations (GEMM), attention mechanisms, quantization kernels, KV cache operations, and all GPU backends (CUDA, Metal, Vulkan, HIP, SYCL) show zero changes. The reasoning feature operates at application logic level, using existing inference infrastructure without modification. Total startup overhead: ~33μs (0.0003-0.003% of model loading time). Changes are appropriate trade-offs for enhanced reasoning capabilities.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 10 times, most recently from 3c7b997 to 5ac00d6 on March 17, 2026 02:18
loci-dev force-pushed the main branch 12 times, most recently from 88f82d8 to 8c39ead on March 25, 2026 02:17
loci-dev force-pushed the main branch 11 times, most recently from 1497621 to a67a372 on April 3, 2026 02:17