
UPSTREAM PR #20297: Handle reasoning budget#1237

Open
loci-dev wants to merge 5 commits into main from loci/pr-20297-reasoning-budget

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20297

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles a single case, 0, and the only thing it does then is set enable_thinking to false.

This PR adds the following flags:

  • --enable-reasoning (short -ere) - enable reasoning via kwargs on model
  • --disable-reasoning (short -dre) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch: when the grammar's opening trigger fires, a token countdown starts and a disarm trigger is watched for. If the disarm trigger does not fire before the countdown runs out, the grammar launches.
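A minimal sketch of how such a delayed-launch countdown could work, under the description above; the type and member names (budget_watcher, on_token, etc.) are illustrative, not the PR's actual API:

```cpp
#include <cassert>
#include <string>

// Tracks one reasoning-budget countdown per generation stream.
struct budget_watcher {
    int  budget;           // tokens allowed after the opening trigger
    bool armed    = false; // opening trigger (e.g. "<think>") was seen
    bool launched = false; // countdown expired -> grammar takes over

    explicit budget_watcher(int n) : budget(n) {}

    // Feed one decoded token; returns true once the grammar should launch.
    bool on_token(const std::string & tok,
                  const std::string & open_trigger,
                  const std::string & disarm_trigger) {
        if (launched) {
            return true;
        }
        if (!armed) {
            // Not counting yet: wait for the opening trigger.
            armed = (tok == open_trigger);
            return false;
        }
        if (tok == disarm_trigger) {
            // Model closed its reasoning on its own; stand down.
            armed = false;
            return false;
        }
        if (--budget <= 0) {
            // Countdown ran out before the disarm trigger: launch.
            launched = true;
        }
        return launched;
    }
};
```

In this sketch the grammar, once launched, would be responsible for forcing the close marker (and the optional --reasoning-budget-message) into the output.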

This allows setting a real token budget on reasoning for models. It also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now distinct behavior from --disable-reasoning. Note: while possible, this isn't recommended, since a model trained to work only with reasoning may exhibit aberrant behavior (for example, trying to open an extra reasoning section).

Supersedes #17750

pwilkin and others added 5 commits March 9, 2026 00:47
@loci-review

loci-review bot commented Mar 10, 2026

Overview

Analysis of 112,727 functions across 5 commits implementing reasoning/thinking capabilities for LLMs. 302 modified (0.27%), 26 new, 0 removed, 112,399 unchanged. Power consumption impact is negligible across all binaries:

Binaries analyzed:

  • build.bin.llama-cvector-generator: +0.137%
  • build.bin.llama-tts: +0.107%
  • build.bin.libllama.so: +0.090%
  • build.bin.llama-bench: +0.148%
  • build.bin.llama-tokenize: +0.172%
  • build.bin.llama-quantize: +0.032%
  • build.bin.libmtmd.so: +0.000%
  • build.bin.libggml-base.so: +0.000%
  • build.bin.libggml-cpu.so: +0.000%
  • build.bin.libggml.so: +0.000%
  • build.bin.llama-qwen2vl-cli: +0.000%
  • build.bin.llama-gemma3-cli: +0.000%
  • build.bin.llama-gguf-split: +0.000%
  • build.bin.llama-llava-cli: +0.000%
  • build.bin.llama-minicpmv-cli: +0.000%

Impact: Minor — All changes confined to CLI argument parsing (one-time startup cost). Zero impact on inference hot paths (matrix operations, attention, KV cache, GPU kernels).

Function Analysis

Device list parsing (arg.cpp lambda 131, llama-cvector-generator/llama-tts): Response time 24ns → 13,501ns (+54,968%). Adds parse_device_list() with string splitting and device validation for --device-draft parameter. Absolute overhead: 13.5μs at startup only.

Reasoning validation (arg.cpp lambda 114, both binaries): Response time 30ns → 8,542ns (+28,265%). Implements tri-state --reasoning flag with is_truthy/is_falsey/is_autoy validation plus std::map insertion for template kwargs. Replaces JSON-based control with explicit validation.
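A hedged sketch of what a tri-state parse along these lines might look like; the helper names is_truthy/is_falsey/is_autoy come from the analysis above, but their bodies and the accepted spellings here are assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

enum class reasoning_mode { enabled, disabled, autodetect };

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

static bool is_truthy(const std::string & v) {
    const std::string s = to_lower(v);
    return s == "1" || s == "true" || s == "on" || s == "yes";
}

static bool is_falsey(const std::string & v) {
    const std::string s = to_lower(v);
    return s == "0" || s == "false" || s == "off" || s == "no";
}

static bool is_autoy(const std::string & v) {
    return to_lower(v) == "auto";
}

// Rejects anything that is not clearly true/false/auto, instead of
// silently falling through to a default.
static reasoning_mode parse_reasoning(const std::string & v) {
    if (is_truthy(v)) { return reasoning_mode::enabled;    }
    if (is_falsey(v)) { return reasoning_mode::disabled;   }
    if (is_autoy(v))  { return reasoning_mode::autodetect; }
    throw std::invalid_argument("invalid --reasoning value: " + v);
}
```

The resulting mode would then be written into the template kwargs map, which is where the std::map insertion cost noted above comes from.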

Chat template file loading (arg.cpp lambda 117, both binaries): Response time 29ns → 5,245ns (+17,769%). Adds read_file() call for --chat-template-file parameter. Expected I/O overhead for custom template loading.

Cache type validation (arg.cpp lambdas 134-136, both binaries): Response time 24ns → 1,485ns (+5,911%). Calls kv_cache_type_from_str() to validate against 9 supported formats (F32, F16, BF16, Q8_0, Q4_0, Q4_1, IQ4_NL, Q5_0, Q5_1) for speculative decoding parameters.

STL container operations (libllama.so, llama-tts): std::_Hashtable::begin() +179%, std::map::end() +230%, std::vector::begin() +215%. Compiler code generation differences in standard library template instantiations. Absolute increases: 180-186ns during graph construction/parser initialization, not inference loops.

All remaining analyzed functions show changes consistent with initialization-only overhead.

Additional Findings

Inference paths preserved: Matrix operations (GEMM), attention mechanisms, quantization kernels, KV cache operations, and all GPU backends (CUDA, Metal, Vulkan, HIP, SYCL) show zero changes. The reasoning feature operates at application logic level, using existing inference infrastructure without modification. Total startup overhead: ~33μs (0.0003-0.003% of model loading time). Changes are appropriate trade-offs for enhanced reasoning capabilities.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 10 times, most recently from 3c7b997 to 5ac00d6 on March 17, 2026 02:18
loci-dev force-pushed the main branch 12 times, most recently from 88f82d8 to 8c39ead on March 25, 2026 02:17
loci-dev force-pushed the main branch 11 times, most recently from 1497621 to a67a372 on April 3, 2026 02:17