Misc. bug: Autoparser throws on final parse after streaming succeeds #20814

@jpohhhhh

Description

Name and Version

$ ./llama-server --version
version: 8455 (58c81f7)
built with AppleClang 17.0.0.17000319 for Darwin arm64

Operating systems

Windows, BSD, Mac, Linux

Which llama.cpp modules do you know to be affected?

libllama (core library), llama-server, llama-cli

Command line

# Repro 1: Llama 3.2
llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --jinja
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Write a hello world C program. Just the code, no explanation."}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":0,"max_tokens":200}'

# Repro 2: GPT-OSS
llama-server -m gpt-oss-20b-mxfp4.gguf --jinja
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Give me a person"}],"response_format":{"type":"json_schema","json_schema":{"name":"p","schema":{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}},"required":["name","age"]}}},"temperature":0,"max_tokens":200}'

Problem description & steps to reproduce

Since #18675 (the autoparser refactoring), common_chat_peg_parse throws std::runtime_error (chat.cpp L1792) whenever the PEG parser cannot consume the entire model output. The models in the repros above produce well-formed, useful output, and the parser successfully extracts it during streaming (partial parse). Yet on the final parse of the same text, result.fail() is true and the server throws.

result.fail() here does not mean the output is malformed. It means the parser stopped consuming input before reaching end-of-input. The content up to that point was already successfully parsed and, in streaming mode, delivered to the client. The throw discards a valid response and returns HTTP 500.

In streaming mode, this is a major server malfunction: SSE chunks are delivered to the client, then the stream terminates without a finish_reason chunk or data: [DONE]. Clients expecting a well-formed SSE sequence get a raw JSON error object injected into the event stream instead.
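For context, an OpenAI-compatible stream is expected to terminate with a finish_reason chunk followed by the [DONE] sentinel (chunk bodies abridged and illustrative):

```
data: {"choices":[{"delta":{"content":"..."},"finish_reason":null}]}

data: {"choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

With this bug, the stream ends after the delta chunks with a bare JSON error object instead of the last two events.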

This also affects library consumers calling common_chat_templates_apply / common_chat_parse with tools directly (e.g. fllama, which calls the same library code path, not HTTP).

First Bad Commit

566059a

Relevant log output

gptoss_repro_server.log
llama_server_log.txt
gptoss_streaming.txt
llama_streaming.txt
