
UPSTREAM PR #21003: grammar: increase MAX_REPETITION_THRESHOLD + make it configurable via envvar #1314

Open
loci-dev wants to merge 2 commits into main from loci/pr-21003-config-max-repetition-threshold

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21003

Overview

For very large tool-calling environments (such as OpenClaw), the current limit is insufficient. Even a larger limit might not be enough, so in addition to increasing it, I'm making it configurable.

Additional information

Together with #20961, this should help with ggml-org/llama.cpp#20879.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, told Claude to add the envvar config
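The envvar-driven threshold described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `LLAMA_GRAMMAR_MAX_REPS` variable name and the 50,000 default are taken from the review below, while the helper name and error handling are assumptions.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Illustrative sketch: resolve the grammar repetition threshold once,
// at parser construction time. Falls back to the new default (50,000,
// previously a compile-time constant of 2,000) when the environment
// variable is absent or malformed.
static uint64_t resolve_max_repetition_threshold() {
    uint64_t threshold = 50000; // new default (was MAX_REPETITION_THRESHOLD = 2000)
    const char * env = std::getenv("LLAMA_GRAMMAR_MAX_REPS");
    if (env != nullptr && *env != '\0') {
        try {
            // std::stoull takes a std::string, so the char* from getenv
            // is converted to a temporary string here
            threshold = std::stoull(env);
        } catch (...) {
            // ignore malformed values and keep the default
        }
    }
    return threshold;
}
```

Because this runs once per parser instance, the parsing cost is a one-time initialization expense rather than a per-token one.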

@loci-review

loci-review bot commented Mar 29, 2026

Overview

This PR introduces runtime configuration for grammar complexity limits with minimal performance impact. Analysis of 123,190 functions across 14 binaries identified 28 modified functions (0.023%), with 0 new and 0 removed functions.

Power Consumption Changes:

  • build.bin.libllama.so: -0.029% (261,565.70 → 261,489.39 nJ)
  • build.bin.llama-tts: 0.0% (364,393.39 → 364,393.49 nJ)
  • build.bin.llama-cvector-generator: -0.0% (359,191.18 → 359,191.13 nJ)
  • build.bin.libmtmd.so: +0.0% (190,100.04 → 190,100.07 nJ)
  • build.bin.llama-bench: -0.0% (158,579.72 → 158,579.31 nJ)
  • build.bin.llama-gguf-split: 0.0% (2,864.08 nJ, unchanged)
  • build.bin.llama-llava-cli: 0.0% (277.87 nJ, unchanged)
  • build.bin.llama-minicpmv-cli: 0.0% (277.87 nJ, unchanged)
  • build.bin.llama-quantize: 0.0% (43,471.53 nJ, unchanged)
  • build.bin.llama-qwen2vl-cli: 0.0% (277.87 nJ, unchanged)
  • build.bin.llama-tokenize: 0.0% (38,146.39 nJ, unchanged)
  • build.bin.llama-gemma3-cli: 0.0% (277.87 nJ, unchanged)
  • build.bin.libggml.so: 0.0% (5,136.91 nJ, unchanged)
  • build.bin.libggml-cpu.so: 0.0% (175,774.68 nJ, unchanged)
  • build.bin.libggml-base.so: 0.0% (74,160.26 nJ, unchanged)

Function Analysis

Primary Change: llama_grammar_parser constructor (_ZN20llama_grammar_parserC1EPK11llama_vocab in build.bin.libllama.so)

  • Response time: 234ns → 1,356ns (+1,122ns, +479%)
  • Throughput time: 26ns → 139ns (+113ns, +434%)
  • Source change: Refactored from compile-time constant MAX_REPETITION_THRESHOLD = 2000 to runtime-configurable member variable with default 50,000. Added getenv("LLAMA_GRAMMAR_MAX_REPS") and stoull() parsing for environment variable support.
  • Assessment: Expected regression for cold-path initialization function. The 1.1μs overhead occurs once per parser instance, enabling runtime flexibility for complex grammar rules. Not in inference hot path.

Standard Library Functions (9 functions, compiler optimization artifacts):

  • Iterator operations: operator+ (+62%, +63ns), operator- (-44%, -73ns improvement)
  • Vector operations: vector copy constructor (-5%, -74ns improvement), _M_check_len (-3.5%, -28ns improvement), _M_realloc_insert variants (+1.7% and +0.9%, +59ns and +44ns)
  • Regex operations: find (+0.4%, +88ns), _BracketMatcher::operator() (+1.9%, +150ns), _M_insert_subexpr_begin (-0.14%, -9ns improvement)
  • Source changes: None. Performance variations result from compiler code generation differences (stack canary reorganization, code layout, instruction scheduling).
  • Assessment: Mixed results with 4 improvements and 5 regressions, all under 150ns absolute change. Not performance-critical (preprocessing/initialization only).

Flame Graph Comparison

Function: llama_grammar_parser constructor - Selected to illustrate the runtime configuration overhead introduced by environment variable parsing.

Base version:
Base version flame graph

Target version:
Target version flame graph

The target version adds expensive string processing operations (stoull, __stoa, basic_string construction totaling ~1,000ns) for environment variable parsing, replacing the simple compile-time constant initialization in the base version.

Additional Findings

No impact on inference performance: All modified functions are in preprocessing/initialization paths. Critical inference operations (matrix operations, attention mechanisms, KV cache, quantization) remain unchanged. The 25x increase in grammar complexity threshold (2,000 → 50,000) enables more sophisticated structured output rules without affecting token generation performance.
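For context, the repetition threshold bounds the `{m,n}` repetition operators that llama.cpp's GBNF grammars support, and a schema-heavy structured-output grammar can exceed the old limit. A hypothetical fragment (illustrative rule names, not from the PR) that would need the raised threshold:

```
# hypothetical GBNF fragment: the upper bound 10000 exceeds the old
# threshold of 2000 but fits within the new 50000 default
root  ::= entry{1,10000}
entry ::= [a-zA-Z0-9_]+ "\n"
```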

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@Ayman161803

Looks like a duplicate effort of this PR.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 8fec234 to 82160d6 Compare March 31, 2026 02:17
@loci-review

loci-review bot commented Mar 31, 2026

Overview

Analysis of 2 commits modifying grammar parsing functionality. Out of 123,190 total functions, 28 were modified (0.02%), with 0 new and 0 removed functions.

Power Consumption Changes:

  • build.bin.libllama.so: -0.03% (261,565.70 → 261,488.50 nJ)
  • All other binaries (build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize): 0% change

Function Analysis

llama_grammar_parser constructor (build.bin.libllama.so):

  • Response time: 234ns → 1,356ns (+1,122ns, +479%)
  • Throughput time: 26ns → 139ns (+113ns, +434%)
  • Added environment variable parsing (LLAMA_GRAMMAR_MAX_REPS) with getenv() and stoull() for runtime configuration. Increased default threshold from 2,000 to 50,000 repetitions (25x). One-time initialization cost, not in inference hot path.

find (grammar trigger pattern) (build.bin.libllama.so):

  • Response time: 22,127ns → 22,214ns (+88ns, +0.40%)
  • Throughput time: 205ns → 295ns (+89ns, +44%)
  • Compiler code layout added extra unconditional branch at entry. Called per-token during structured output validation. No source changes.

operator() (regex bracket matcher) (build.bin.libllama.so):

  • Response time: 7,792ns → 7,941ns (+149ns, +2%)
  • Throughput time: 428ns → 577ns (+149ns, +35%)
  • Entry block 6x slower (28ns → 164ns) due to compiler optimization artifact. Standard library function, no source modifications.

Improvements:

  • operator- (buffer iterator): -73ns response time (-44%), from data structure change (unordered_map → vector)
  • vector copy constructor: -74ns response time (-5%), from optimized stack canary initialization
  • _M_check_len: -28ns response time (-3%), compiler optimization

Other analyzed functions showed minor changes (<60ns) in vector reallocation and iterator operations, primarily from compiler code generation differences.

Flame Graph Comparison

Function: llama_grammar_parser constructor (build.bin.libllama.so)

Base version:
Base Flame Graph

Target version:
Target Flame Graph

The target version adds heavy string operations: stoull() (422ns) and basic_string construction (556ns) for environment variable parsing, shifting from simple container initialization (234ns total) to configuration-heavy initialization (1,356ns total).
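The `basic_string` construction shows up because `std::stoull` takes a `std::string`, so the `char*` returned by `getenv` must first be converted into a temporary. As an illustration of the trade-off (not a suggestion present in the PR), a C-style `strtoull` parse avoids that temporary allocation while still rejecting malformed values:

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative alternative: parse the environment value with strtoull,
// which works directly on the C string and allocates nothing.
static uint64_t parse_threshold_no_alloc(const char * env, uint64_t fallback) {
    if (env == nullptr || *env == '\0') {
        return fallback;
    }
    char * end = nullptr;
    unsigned long long v = std::strtoull(env, &end, 10);
    // reject an empty parse or trailing garbage, keep the fallback
    if (end == env || *end != '\0') {
        return fallback;
    }
    return (uint64_t) v;
}
```

For a once-per-parser cost of ~1μs this is an optimization that likely does not matter in practice, which is consistent with the review's assessment that the constructor is not on the inference hot path.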

Additional Findings

Changes are isolated to grammar parsing (non-critical path). Core inference operations (matrix operations, attention, KV cache) unaffected. Grammar validation adds 89-229ns per token only when structured output is enabled, representing <0.02% of typical inference time. The 25x grammar complexity increase enables sophisticated JSON/XML schemas for production use cases. No GPU operations or ML inference paths impacted.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 12 times, most recently from 126cd1f to a8215be Compare April 8, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c Compare April 15, 2026 02:19
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 1254f75 to 245e873 Compare April 16, 2026 09:24
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19