Name and Version
$ ./llama-server --version
version: 8455 (58c81f7)
built with AppleClang 17.0.0.17000319 for Darwin arm64
Operating systems
Windows, BSD, Mac, Linux
Which llama.cpp modules do you know to be affected?
libllama (core library), llama-server, llama-cli
Command line
# Repro 1: Llama 3.2
llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --jinja
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Write a hello world C program. Just the code, no explanation."}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":0,"max_tokens":200}'
# Repro 2: GPT-OSS
llama-server -m gpt-oss-20b-mxfp4.gguf --jinja
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Give me a person"}],"response_format":{"type":"json_schema","json_schema":{"name":"p","schema":{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}},"required":["name","age"]}}},"temperature":0,"max_tokens":200}'
Problem description & steps to reproduce
Since #18675 (autoparser refactoring), common_chat_peg_parse throws std::runtime_error (chat.cpp L1792) when the PEG parser cannot consume the entire model output. The models in the repro commands above produce well-formed, useful output, and the parser successfully extracts it during streaming (partial parse). Yet on the final parse of the same text, result.fail() is true and the server throws.
result.fail() here does not mean the output is malformed. It means the parser stopped consuming input before reaching end-of-input. The content up to that point was already successfully parsed and, in streaming mode, delivered to the client. The throw discards a valid response and returns HTTP 500.
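To make the distinction concrete, here is a toy, self-contained sketch (not llama.cpp's actual parser; the names parse_until_marker and ParseResult are invented for illustration) of how "stopped before end-of-input" differs from "malformed output":

```python
from dataclasses import dataclass

@dataclass
class ParseResult:
    consumed: int      # bytes successfully consumed
    content: str       # extracted, valid content
    total: int         # total input length

    def fail(self) -> bool:
        # "fail" only means we did not reach end-of-input --
        # it says nothing about whether `content` is malformed.
        return self.consumed < self.total

def parse_until_marker(text: str) -> ParseResult:
    # Consume everything up to a trailing token the grammar does not cover.
    pos = text.find("<|extra|>")
    if pos == -1:
        pos = len(text)
    return ParseResult(consumed=pos, content=text[:pos], total=len(text))

r = parse_until_marker('printf("hello");<|extra|>')
print(r.content)   # valid, complete content was extracted
print(r.fail())    # True -- yet only because trailing bytes were left over
```

Throwing whenever fail() is true discards the already-extracted content, which is exactly what happens on the final parse.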
In streaming mode, this is a major server malfunction: SSE chunks are delivered to the client, then the stream terminates without a finish_reason chunk or data: [DONE]. Clients expecting a well-formed SSE sequence get a raw JSON error object injected into the event stream instead.
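A small checker makes the client-visible breakage easy to verify against the attached streaming transcripts. This is a hypothetical helper, not part of llama.cpp; it encodes the usual OpenAI-compatible expectation that a stream ends with a finish_reason chunk followed by data: [DONE]:

```python
import json

def sse_stream_is_well_formed(transcript: str) -> bool:
    """Return True if an SSE transcript ends the way OpenAI-compatible
    clients expect: a chunk carrying finish_reason, then 'data: [DONE]'."""
    saw_finish_reason = False
    saw_done = False
    for line in transcript.splitlines():
        if not line.startswith("data: "):
            continue  # a raw JSON error object injected mid-stream is skipped here
        payload = line[len("data: "):]
        if payload == "[DONE]":
            saw_done = True
            continue
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            return False
        for choice in chunk.get("choices", []):
            if choice.get("finish_reason"):
                saw_finish_reason = True
    return saw_finish_reason and saw_done

# A healthy stream passes; the truncated streams in the attached logs do not.
good = ('data: {"choices":[{"delta":{},"finish_reason":"stop"}]}\n'
        'data: [DONE]\n')
bad = ('data: {"choices":[{"delta":{"content":"hi"},"finish_reason":null}]}\n'
       '{"error":{"message":"Failed to parse..."}}\n')
print(sse_stream_is_well_formed(good))  # True
print(sse_stream_is_well_formed(bad))   # False
```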
This also affects library consumers calling common_chat_templates_apply / common_chat_parse with tools directly (e.g. fllama, which calls the same library code path, not HTTP).
First Bad Commit
566059a
Relevant log output
gptoss_repro_server.log
llama_server_log.txt
gptoss_streaming.txt
llama_streaming.txt