
UPSTREAM PR #21240: Relax prefill parser to allow space.#1324

Open
loci-dev wants to merge 1 commit into main from loci/pr-21240-relax-prefill-parser

Conversation


@loci-dev loci-dev commented Apr 1, 2026

Note

Source pull request: ggml-org/llama.cpp#21240

Overview

As in title.

Additional information

The prefill parser strictly required the reasoning marker at the very start of the message, which interfered with models that like to insert, e.g., a newline there.

Requirements


loci-review bot commented Apr 1, 2026

Overview

Impact: Minor - No performance concerns identified.

Function Analysis: 24 modified functions (0.02% of 124,016 total). Changes isolated to chat template parser enhancement and compiler optimizations in auxiliary tools.

Binaries Analyzed (15 total):

Binary                               Power Change
build.bin.llama-tts                  -0.038%
build.bin.llama-cvector-generator    -0.046%
build.bin.libllama.so                -0.0001%
build.bin.llama-bench                +0.0003%
build.bin.libmtmd.so                  0.0%
build.bin.libggml-cpu.so              0.0%
build.bin.libggml-base.so             0.0%
build.bin.libggml.so                  0.0%
build.bin.llama-tokenize              0.0%
build.bin.llama-gemma3-cli            0.0%
build.bin.llama-gguf-split            0.0%
build.bin.llama-llava-cli             0.0%
build.bin.llama-minicpmv-cli          0.0%
build.bin.llama-quantize              0.0%
build.bin.llama-qwen2vl-cli           0.0%

Total system power consumption: -0.018% (negligible).

Function Analysis

common_chat_peg_builder::prefix() (llama-tts, llama-cvector-generator):

  • Response time: +302% (7.7μs → 31.1μs, +23.4μs)
  • Throughput time: +5.2% (110ns → 115ns, +5.7ns)
  • Justification: Intentional change adds + space() to enable flexible whitespace handling in chat templates. The 23.4μs increase occurs during one-time parser initialization, not inference. Functional improvement outweighs negligible performance impact.

Compiler optimizations (6 functions): std::vector::end (-69% response time), std::chrono::operator- (-40% throughput time), httplib::detail::parse_http_date (-44% throughput time), std::vector::_M_move_assign (-25% throughput time), std::pair constructor (-4% response time), httplib::detail::websocket_accept_key (-31% throughput time). All show improved code generation with no source changes.

Minor compiler artifacts (5 functions): nlohmann::basic_json::create (+50% throughput time, +132ns), httplib::Client::Get/Patch/Put (+18-23% throughput time, +25-26ns). Absolute impacts negligible; functions are I/O-bound or infrequent.

Other analyzed functions saw negligible changes.

Flame Graph Comparison

Function: common_chat_peg_builder::prefix() (build.bin.llama-tts)

Base version:
Flame Graph: build.bin.llama-tts::ZN23common_chat_peg_builder6prefixERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7

Target version:
Flame Graph: build.bin.llama-tts::ZN23common_chat_peg_builder6prefixERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7

The target version introduces new call chains for space() (7.7μs) and operator+() (15.7μs) that create deeper execution paths with sequence building and vector operations, explaining the response time increase. The change is intentional to support flexible whitespace matching in chat templates.

Additional Findings

No inference impact: Zero changes to llama_decode(), matrix operations, attention mechanisms, KV cache, quantization, or GPU kernels. Core inference libraries (libllama.so, libggml-*.so) show zero power consumption change, confirming changes are isolated to non-critical paths.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 126cd1f to a8215be on April 8, 2026 at 02:18
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c on April 15, 2026 at 02:19
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19