fix: GLM 4.5 streaming tool-call parsing + grammar error handling by Gunther-Schulz · Pull Request #19612 · ggml-org/llama.cpp

Gunther-Schulz · 2026-02-14T00:47:13Z

Disclosure: All code and this PR description were written by AI (Cursor).

Summary

This PR fixes GLM 4.5 (and related XML tool-call) streaming behaviour so that:

The server no longer hangs when tool_choice=auto and the model never outputs the grammar trigger (parse-only path, no grammar for AUTO).
The "Failed to parse up to error: ... attempting to parse an empty input" log spam is removed when streaming partial tool-call arguments.
Plain-text arg_value content (e.g. subagent_type=explore) is parsed correctly instead of repeatedly throwing a partial-parse error, so Task/tool calls complete.
Grammar/sampler runtime errors (e.g. "Unexpected empty grammar stack") return HTTP 500 and release the slot instead of aborting the server.

Changes

1. GLM 4.5 parse-only for AUTO (existing branch commit)

common/chat.cpp (GLM 4.5 init): Use grammar only when tool_choice == required; for auto, do not set grammar/trigger so tool calls are detected by parsing decoded text (vLLM-style).
Tests: Relax test-chat assert to allow empty grammar when the test message has tool_calls.
Addresses the server hang when the model never outputs the trigger (e.g. #19068, "Misc. bug: GLM-4.7-Flash enters corrupted state with grammar trigger loop, memory leak, and gibberish output").
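The rule in this section can be sketched as follows. This is a minimal illustration, not the code from common/chat.cpp: the `tool_choice` enum, the `chat_params` struct, and `init_params` are hypothetical stand-ins, and only the condition (grammar when tools are present and tool_choice is required, never lazy) comes from the description above.

```cpp
#include <string>

// Hypothetical stand-ins for illustration; the real types live in llama.cpp's common/ code.
enum class tool_choice { none, auto_, required };

struct chat_params {
    std::string grammar;              // empty means unconstrained sampling
    bool        grammar_lazy = false; // no trigger-based (lazy) grammar
};

// Build a grammar only when tools are present and a tool call is required.
// For auto, leave the grammar empty so tool calls are detected by parsing
// the decoded text instead of waiting for a grammar trigger.
chat_params init_params(bool has_tools, tool_choice choice, const std::string & tool_grammar) {
    chat_params params;
    params.grammar_lazy = false;
    if (has_tools && choice == tool_choice::required) {
        params.grammar = tool_grammar;
    }
    return params;
}
```

With this shape, an AUTO request never depends on the model emitting a trigger, which is the hang the section describes.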

2. Avoid "parse empty input" log (json-partial + XML parser)

common/json-partial.cpp: When the SAX parser reports an error at position 0 (empty substring to re-parse), return false immediately without calling json::parse or logging "Failed to parse up to error". This avoids noisy logs for all formats (GLM 4.5, GPT-OSS, Granite, Hermes, etc.) when streaming partial or invalid JSON.
common/chat-parser-xml-toolcall.cpp: Only call try_consume_json() when the remainder after <arg_value> looks like the start of a JSON value (", {, [, digit, -, or prefix of true/false/null). Otherwise treat as incomplete/plain text and skip the JSON parser (vLLM-style), avoiding SAX error-at-position-0 for e.g. exp before explore.
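The "looks like the start of a JSON value" check could look roughly like the sketch below. The function names are hypothetical and not taken from the PR; treating an empty remainder as "not JSON yet" and ignoring leading whitespace are assumptions made here for simplicity.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string_view>

// True when the overlapping part of `s` and `literal` matches, i.e. `s` is a
// non-empty prefix of `literal` or `s` starts with `literal`.
static bool shares_prefix(std::string_view s, std::string_view literal) {
    const std::size_t n = std::min(s.size(), literal.size());
    return n > 0 && s.substr(0, n) == literal.substr(0, n);
}

// Hypothetical helper: decide whether the text after <arg_value> is worth
// handing to a JSON parser at all.
bool looks_like_json_start(std::string_view rest) {
    if (rest.empty()) {
        return false; // assumption: nothing streamed yet, so do not try JSON
    }
    const char c = rest.front();
    if (c == '"' || c == '{' || c == '[' || c == '-' ||
        std::isdigit(static_cast<unsigned char>(c))) {
        return true;
    }
    return shares_prefix(rest, "true") ||
           shares_prefix(rest, "false") ||
           shares_prefix(rest, "null");
}
```

Under these assumptions, a streamed fragment such as `exp` is rejected before any JSON parsing is attempted, while `tr` or `{"a"` is still passed on.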

3. Plain-text arg_value fallthrough + server grammar error handling

common/chat-parser-xml-toolcall.cpp: When the remainder does not look like JSON (e.g. explore), do not throw; skip try_consume_json() and fall through to the existing plain-text path (try_find_val_end()). Fixes Task tool (e.g. subagent_type=explore) never completing.
tools/server/server-context.cpp: Wrap common_sampler_accept() and common_sampler_sample_and_accept_n() in try/catch for std::runtime_error. On catch (e.g. "Unexpected empty grammar stack" from llama-grammar.cpp), log, send HTTP 500 with "Grammar constraint violation", release the slot, and continue so the server stays up.
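The error-handling pattern for the sampler calls can be sketched as below. It is a simplified model, not the server code: `slot_state` and `sample_guarded` are hypothetical, and the sampling step is passed in as a callback standing in for common_sampler_accept() or common_sampler_sample_and_accept_n().

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical, simplified view of a server slot.
struct slot_state {
    bool        busy        = true;
    int         http_status = 200;
    std::string error;
};

// Run one sampling step. If it throws std::runtime_error (e.g. a grammar
// failure), fail only this request: log, record a 500, and free the slot.
bool sample_guarded(slot_state & slot, const std::function<void()> & sample_step) {
    try {
        sample_step();
        return true;
    } catch (const std::runtime_error & e) {
        std::cerr << "sampler error: " << e.what() << '\n';
        slot.http_status = 500;
        slot.error       = "Grammar constraint violation";
        slot.busy        = false; // released, so other requests can use it
        return false;
    }
}
```

A caller would skip the rest of the processing for that slot when `sample_guarded` returns false and carry on with the remaining slots, so the process stays up.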

Testing

test-json-partial and test-chat-parser pass.
Manual: GLM 4.5 with tool_choice=auto, streaming tool calls (e.g. Task with subagent_type=explore) complete without log spam; server does not abort on grammar errors.

…QUIRED
- In common_chat_params_init_glm_4_5: set grammar_lazy=false; build grammar only when has_tools && tool_choice==REQUIRED (vLLM-style: no trigger/grammar for AUTO, detect tool calls by parsing decoded text).
- Relax test-chat assert: allow empty grammar when test message has tool_calls (GLM 4.5 AUTO no longer sets grammar).
Fixes server hang when model never outputs trigger (e.g. llama.cpp ggml-org#19068).
Co-authored-by: Cursor <cursoragent@cursor.com>

- json-partial: when SAX error is at position 0, return false without calling json::parse or logging 'Failed to parse up to error' (covers all formats: GLM 4.5, GPT-OSS, Granite, Hermes, etc.)
- chat-parser-xml-toolcall: only call try_consume_json when remainder looks like start of JSON value (", {, [, digit, -, true/false/null); otherwise treat as plain text/partial (vLLM-style, avoids SAX error-at-position-0 for e.g. 'exp' before 'explore')
Co-authored-by: Cursor <cursoragent@cursor.com>

- chat-parser-xml-toolcall: when remainder does not look like JSON start (e.g. 'explore'), skip try_consume_json and fall through to plain-text path instead of throwing; fixes Task tool (subagent_type=explore) never completing
- server-context: catch std::runtime_error from common_sampler_accept and common_sampler_sample_and_accept_n (e.g. 'Unexpected empty grammar stack'); return 500 and release slot instead of aborting
Co-authored-by: Cursor <cursoragent@cursor.com>

pwilkin · 2026-03-17T23:10:43Z

Obsoleted by #18675

Gunther-Schulz and others added 3 commits February 14, 2026 00:42

Gunther-Schulz requested review from ggerganov, ngxson and pwilkin as code owners February 14, 2026 00:47

github-actions bot added the testing, examples, and server labels Feb 14, 2026

pwilkin closed this Mar 17, 2026

Gunther-Schulz wants to merge 3 commits into ggml-org:master from Gunther-Schulz:fix/glm45-tool-parse-only-auto
