UPSTREAM PR #21139: grammar: make MAX_REPETITION_THRESHOLD configurable via env var#1315

Open
loci-dev wants to merge 1 commit into main from loci/pr-21139-patch-2

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21139

The hardcoded threshold of 2000 causes grammar parsing to fail for legitimate tool-calling schemas with many optional parameters.

Add LLAMA_GRAMMAR_MAX_REPETITIONS env var to override the default. When unset, behaviour is unchanged (default 2000).

May fix grammar failures reported in openclaw/openclaw#32916, openclaw/openclaw#38569, openclaw/openclaw#38899.

AI was used in an assistive capacity to identify the threshold constant and review the approach. The code change and testing were done manually.

Overview

Adds a helper function that reads LLAMA_GRAMMAR_MAX_REPETITIONS from the environment, replacing the hardcoded MAX_REPETITION_THRESHOLD macro at its three call sites. No API changes; fully backwards compatible.

Additional information

Tested with Qwen3.5-122B-A10B and 21 OpenClaw tools (message tool: 109 optional params; browser: 48). With the default threshold (2000), grammar parsing fails on every request; with the threshold raised to 20000, the grammar parses successfully.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - used to locate the threshold constant and review the approach. Code and testing done manually.


loci-review bot commented Mar 29, 2026

Overview

Analysis of 123,911 functions across 15 binaries shows minimal performance impact from this single commit adding grammar configuration. Only 49 functions (0.04%) were modified, all C++ STL template instantiations with no source-level changes.

Function counts: 123,911 total | 49 modified | 2 new | 0 removed | 123,860 unchanged

Power consumption changes:

  • build.bin.libllama.so: 262,227 nJ → 262,266 nJ (+0.015%)
  • build.bin.llama-cvector-generator: 360,029 nJ (0% change)
  • build.bin.llama-tts: 365,448 nJ (0% change)
  • build.bin.llama-bench: 159,922 nJ (0% change)
  • build.bin.libmtmd.so: 195,354 nJ (0% change)
  • build.bin.libggml-cpu.so: 178,502 nJ (0% change)
  • build.bin.libggml-base.so: 74,170 nJ (0% change)
  • build.bin.libggml.so: 5,137 nJ (0% change)
  • build.bin.llama-gguf-split: 2,864 nJ (0% change)
  • build.bin.llama-llava-cli: 278 nJ (0% change)
  • build.bin.llama-minicpmv-cli: 278 nJ (0% change)
  • build.bin.llama-quantize: 43,671 nJ (0% change)
  • build.bin.llama-qwen2vl-cli: 278 nJ (0% change)
  • build.bin.llama-tokenize: 38,399 nJ (0% change)
  • build.bin.llama-gemma3-cli: 278 nJ (0% change)

Function Analysis

Top regressions (compiler optimization artifacts, not source changes):

  • std::vector<std::pair<int, std::pair<unsigned long, unsigned long>>>::end() in build.bin.libllama.so: Response time 81ns → 265ns (+226%), throughput 60ns → 243ns (+307%). Used in recurrent memory serialization, not inference hot path.

  • std::_Rb_tree::begin() (ggml_backend_device map) in build.bin.libllama.so: Response time 83ns → 265ns (+220%), throughput 63ns → 245ns (+289%). Called ~80 times during model loading, total overhead ~15μs.

  • std::vector<std::pair<wchar_t, wchar_t>>::back() in build.bin.libllama.so: Response time 259ns → 448ns (+73%), throughput 70ns → 260ns (+272%). Used in Unicode tokenization, adds ~19-38μs per 100 characters.

Top improvements:

  • std::_Rb_tree::_S_key() (sequence map) in build.bin.libllama.so: Response time 299ns → 113ns (-62%), throughput 246ns → 60ns (-76%). Benefits continuous batching, saves ~19μs per decode step.

  • std::_Rb_tree::end() (llm_kv map) in build.bin.libllama.so: Response time 263ns → 80ns (-70%), throughput 243ns → 60ns (-75%). Improves model metadata lookups.

Other analyzed functions showed similar compiler-driven variations in STL template instantiations with negligible real-world impact.

Additional Findings

All changes are compiler code-generation artifacts arising from template instantiation differences; none of these functions were modified in source. The grammar configuration change (commit 90fe2f9) is isolated to llama-grammar.cpp and unrelated to the observed performance variations. There is no impact on inference hot paths: matrix operations, attention mechanisms, KV cache, and GPU kernels are all unchanged. The net effect on inference is negligible to slightly positive, due to improvements in sequence-management functions used during continuous batching.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 11 times, most recently from fd3ce9d to 1770118 Compare April 6, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 385b1fc to 06d9e10 Compare April 13, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19