Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
**Overview**

Analysis of 112,727 functions across 5 commits implementing reasoning/thinking capabilities for LLMs: 302 modified (0.27%), 26 new, 0 removed, 112,399 unchanged. Power consumption impact is negligible across all binaries analyzed.

**Impact: Minor**

All changes are confined to CLI argument parsing (a one-time startup cost). There is zero impact on inference hot paths (matrix operations, attention, KV cache, GPU kernels).

**Function Analysis**

- Device list parsing (…)
- Reasoning validation (…)
- Chat template file loading (…)
- Cache type validation (…)
- STL container operations (libllama.so, llama-tts): std::_Hashtable::begin() +179%, std::map::end() +230%, std::vector::begin() +215%. These stem from compiler code-generation differences in standard library template instantiations; the absolute increases are 180-186 ns and occur during graph construction/parser initialization, not in inference loops.

All remaining analyzed functions show changes consistent with initialization-only overhead.

**Additional Findings**

Inference paths preserved: matrix operations (GEMM), attention mechanisms, quantization kernels, KV cache operations, and all GPU backends (CUDA, Metal, Vulkan, HIP, SYCL) show zero changes. The reasoning feature operates at the application-logic level, using the existing inference infrastructure without modification.

Total startup overhead: ~33 μs (0.0003-0.003% of model loading time). The changes are an appropriate trade-off for the enhanced reasoning capabilities.

🔎 Full breakdown: Loci Inspector
Note: Source pull request: ggml-org/llama.cpp#20297
Adds proper handling for `--reasoning-budget`. Currently, `--reasoning-budget` is just a stub that handles one case, 0, and the only thing it does is set `enable_thinking` to false.

This PR adds the following flags:

- `--enable-reasoning` (short `-ere`): enable reasoning via kwargs on the model
- `--disable-reasoning` (short `-dre`): disable reasoning via kwargs on the model
- `--reasoning-budget-message`: a message appended before the reasoning close marker to inform the model that its reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, `--reasoning-budget` now adds an extra grammar with a mechanism called delayed launch. When the grammar's opening trigger fires, tokens are counted down while we also watch for the disarm trigger. If the disarm trigger is not hit before the countdown runs out, the grammar is launched.

This allows setting a real token budget on reasoning for models. It also allows disabling thinking for models that normally do not allow it, by setting the budget to 0, which is now distinct from the behavior of `--disable-reasoning`. Note: while possible, this is not recommended, since a model trained to work only with reasoning may exhibit aberrant behavior (for example, trying to open an extra reasoning section).

Supersedes #17750