UPSTREAM PR #21003: grammar: increase MAX_REPETITION_THRESHOLD + make it configurable via envvar #1314
Conversation
Overview

This PR introduces runtime configuration for grammar complexity limits with minimal performance impact. Analysis of 123,190 functions across 14 binaries identified 28 modified functions (0.023%), with 0 new and 0 removed functions.

Power Consumption Changes:

Function Analysis

Primary Change:

Standard Library Functions (9 functions, compiler optimization artifacts):

Flame Graph Comparison

Function: The target version adds expensive string processing operations (…)

Additional Findings

No impact on inference performance: all modified functions are in preprocessing/initialization paths. Critical inference operations (matrix operations, attention mechanisms, KV cache, quantization) remain unchanged. The 25x increase in the grammar complexity threshold (2,000 → 50,000) enables more sophisticated structured output rules without affecting token generation performance.

🔎 Full breakdown: Loci Inspector
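To make the threshold numbers concrete: in GBNF (llama.cpp's grammar format), a bounded quantifier like `{m,n}` is expanded into repeated rule elements, and MAX_REPETITION_THRESHOLD caps that expansion. A hypothetical grammar illustrating the difference between the old and new caps (this grammar is not from the PR, and the exact clamping behavior is an assumption):

```
# Accepts between 1 and 10,000 digits. An upper bound of 10,000
# exceeds the old cap of 2,000 but fits within the new 50,000 default.
root  ::= digit{1,10000}
digit ::= [0-9]
```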
Looks like a duplicate effort to this PR.
force-pushed from 8fec234 to 82160d6
Overview

Analysis of 2 commits modifying grammar parsing functionality. Out of 123,190 total functions, 28 were modified (0.02%), with 0 new and 0 removed functions.

Power Consumption Changes:

Function Analysis

llama_grammar_parser constructor (build.bin.libllama.so):

find (grammar trigger pattern) (build.bin.libllama.so):

operator (regex bracket matcher) (build.bin.libllama.so):

Improvements:

Other analyzed functions showed minor changes (<60 ns) in vector reallocation and iterator operations, primarily from compiler code generation differences.

Flame Graph Comparison

Function: llama_grammar_parser constructor (build.bin.libllama.so). The target version adds heavy string operations: (…)

Additional Findings

Changes are isolated to grammar parsing (a non-critical path). Core inference operations (matrix operations, attention, KV cache) are unaffected. Grammar validation adds 89–229 ns per token only when structured output is enabled, representing <0.02% of typical inference time. The 25x grammar complexity increase enables sophisticated JSON/XML schemas for production use cases. No GPU operations or ML inference paths are impacted.

🔎 Full breakdown: Loci Inspector
force-pushed from 126cd1f to a8215be
force-pushed from e800934 to a024d9c
force-pushed from 1254f75 to 245e873
force-pushed from 7638ab4 to f1b46d5




Note
Source pull request: ggml-org/llama.cpp#21003
Overview
For very big tool-calling environments (like OpenClaw) the current limit is insufficient. Even a bigger limit might not be enough, so on top of increasing it I'm making it configurable.
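The env-var override described above could be sketched as below. This is a minimal illustration, not the PR's actual implementation: the variable name `LLAMA_MAX_REPETITION_THRESHOLD` and the helper names are assumptions; only the defaults (50,000 new, 2,000 old) come from the discussion in this thread.

```cpp
#include <cstddef>
#include <cstdlib>

// Parse an override value; fall back to the default on missing/invalid input.
// 50000 matches the raised threshold discussed in this PR (previously 2000).
static std::size_t parse_threshold(const char * env) {
    constexpr std::size_t k_default = 50000;
    if (env == nullptr) {
        return k_default;
    }
    char * end = nullptr;
    unsigned long long v = std::strtoull(env, &end, 10);
    // Reject empty strings, trailing junk, and zero.
    if (end == env || *end != '\0' || v == 0) {
        return k_default;
    }
    return (std::size_t) v;
}

// Hypothetical entry point: read the (assumed) env var once at startup.
static std::size_t get_max_repetition_threshold() {
    return parse_threshold(std::getenv("LLAMA_MAX_REPETITION_THRESHOLD"));
}
```

A user would then raise the cap without rebuilding, e.g. `LLAMA_MAX_REPETITION_THRESHOLD=200000 ./llama-server …` (the variable name again being illustrative).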
Additional information
Together with #20961, this should help with ggml-org/llama.cpp#20879.
Requirements