UPSTREAM PR #21003: grammar: increase MAX_REPETITION_THRESHOLD + make it configurable via envvar #1314
Conversation
Overview

This PR introduces runtime configuration for grammar complexity limits with minimal performance impact. Analysis of 123,190 functions across 14 binaries identified 28 modified functions (0.023%), with 0 new and 0 removed functions.

Power Consumption Changes:

Function Analysis

Primary Change:

Standard Library Functions (9 functions, compiler optimization artifacts):

Flame Graph Comparison

Function: The target version adds expensive string processing operations (…)

Additional Findings

No impact on inference performance: all modified functions are in preprocessing/initialization paths. Critical inference operations (matrix operations, attention mechanisms, KV cache, quantization) remain unchanged. The 25x increase in the grammar complexity threshold (2,000 → 50,000) enables more sophisticated structured output rules without affecting token generation performance.

🔎 Full breakdown: Loci Inspector
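To make the threshold numbers concrete: in GBNF (llama.cpp's grammar format), a bounded quantifier like `{m,n}` is expanded into repeated rule elements, and MAX_REPETITION_THRESHOLD caps that expansion. A hypothetical grammar illustrating the difference between the old and new caps (this grammar is not from the PR, and the exact clamping behavior is an assumption):

```
# Accepts between 1 and 10,000 digits. An upper bound of 10,000
# exceeds the old cap of 2,000 but fits within the new 50,000 default.
root  ::= digit{1,10000}
digit ::= [0-9]
```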
Looks like a duplicate effort to this PR.
force-pushed from 8fec234 to 82160d6
Overview

Analysis of 2 commits modifying grammar parsing functionality. Out of 123,190 total functions, 28 were modified (0.02%), with 0 new and 0 removed functions.

Power Consumption Changes:

Function Analysis

llama_grammar_parser constructor (build.bin.libllama.so):

find (grammar trigger pattern) (build.bin.libllama.so):

operator (regex bracket matcher) (build.bin.libllama.so):

Improvements:

Other analyzed functions showed minor changes (<60 ns) in vector reallocation and iterator operations, primarily from compiler code generation differences.

Flame Graph Comparison

Function: llama_grammar_parser constructor (build.bin.libllama.so). The target version adds heavy string operations: (…)

Additional Findings

Changes are isolated to grammar parsing (a non-critical path). Core inference operations (matrix operations, attention, KV cache) are unaffected. Grammar validation adds 89–229 ns per token only when structured output is enabled, representing <0.02% of typical inference time. The 25x grammar complexity increase enables sophisticated JSON/XML schemas for production use cases. No GPU operations or ML inference paths are impacted.

🔎 Full breakdown: Loci Inspector
force-pushed from 126cd1f to a8215be
force-pushed from e800934 to a024d9c
force-pushed from 1254f75 to 245e873
force-pushed from 7638ab4 to f1b46d5




Note
Source pull request: ggml-org/llama.cpp#21003
Overview
For very big tool-calling environments (like OpenClaw) the current limit is insufficient. Even a bigger limit might not be enough, so on top of increasing it I'm making it configurable.
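The env-var override described above could be sketched as below. This is a minimal illustration, not the PR's actual implementation: the variable name `LLAMA_MAX_REPETITION_THRESHOLD` and the helper names are assumptions; only the defaults (50,000 new, 2,000 old) come from the discussion in this thread.

```cpp
#include <cstddef>
#include <cstdlib>

// Parse an override value; fall back to the default on missing/invalid input.
// 50000 matches the raised threshold discussed in this PR (previously 2000).
static std::size_t parse_threshold(const char * env) {
    constexpr std::size_t k_default = 50000;
    if (env == nullptr) {
        return k_default;
    }
    char * end = nullptr;
    unsigned long long v = std::strtoull(env, &end, 10);
    // Reject empty strings, trailing junk, and zero.
    if (end == env || *end != '\0' || v == 0) {
        return k_default;
    }
    return (std::size_t) v;
}

// Hypothetical entry point: read the (assumed) env var once at startup.
static std::size_t get_max_repetition_threshold() {
    return parse_threshold(std::getenv("LLAMA_MAX_REPETITION_THRESHOLD"));
}
```

A user would then raise the cap without rebuilding, e.g. `LLAMA_MAX_REPETITION_THRESHOLD=200000 ./llama-server …` (the variable name again being illustrative).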
Additional information
Together with #20961, this should help with ggml-org/llama.cpp#20879.
Requirements