
Invalid JSON produced in streamed tool_call arguments when using Hermes 3 parser #3838

@fonfonya

Description


Describe the bug
When using the OpenVINO Model Server OpenAI-compatible /v3 endpoint with stream=true and tool_choice="required" (Hermes 3 tool parser), the streamed arguments deltas sometimes include fragments that do not belong to the arguments object (note the stray closing brace and </tool_call> tag in the observed output below), so the concatenated arguments are not valid JSON.
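
For context, the Hermes-style format wraps each tool call in <tool_call> ... </tool_call> tags around a single JSON object, and the parser is expected to strip the wrapper and stream only the inner arguments object. An illustrative raw completion for the request below (not captured from the server, shown only to clarify what the parser has to split) would look like:

<tool_call>
{"name": "search", "arguments": {"query": "1+1"}}
</tool_call>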


To Reproduce

  1. Prepare the model repository using export_model.py:
python export_model.py text_generation \
  --source_model Qwen/Qwen3-4B-Instruct-2507 \
  --weight-format int8 \
  --config_file_path models/config.json \
  --model_repository_path models \
  --tool_parser hermes3
  2. Launch OVMS:
docker run -d --user $(id -u):$(id -g) --rm \
  -p 8000:8000 \
  -v $(pwd)/models:/models \
  openvino/model_server:weekly \
  --rest_port 8000 \
  --model_repository_path models \
  --source_model Qwen/Qwen3-4B-Instruct-2507 \
  --tool_parser hermes3 \
  --task text_generation \
  --enable_prefix_caching true
  3. Client code:
import openai

client = openai.OpenAI(
    api_key="",
    base_url="http://127.0.0.1:8000/v3",
)

output = ""
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "1+1"}],
    model="Qwen/Qwen3-4B-Instruct-2507",
    stream=True,
    temperature=0.0,
    tool_choice="required",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "search",
                "description": "",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
):
    arguments = chunk.choices[0].delta.tool_calls[0].function.arguments
    if arguments is not None:
        output += arguments

print(output)
  4. Observed output:
{"query": "1+1"}}\n</tool_call>

Expected behavior
The streamed fragments for the arguments field should concatenate into valid JSON.
For example, the expected final output should be:

{"query": "1+1"}

Logs

[2025-12-05 07:30:30.243][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"id":"BZS5lI66X","type":"function","index":0,"function":{"name":"search"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:30.243][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:30.244][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.252][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:30.355][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:30.358][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.358][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:30.485][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:30.487][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:30.487][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:30.487][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.488][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:30.609][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:30.611][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"query"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:30.611][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:30.611][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.611][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:30.732][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:30.733][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:30.733][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:30.733][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.733][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:30.849][149][llm_executor][info][llm_executor.hpp:66] All requests: 1; Scheduled requests: 1; Cache usage 55.6%;
[2025-12-05 07:30:30.849][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:30.855][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" \""}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:30.855][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:30.855][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:30.855][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.017][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.022][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"1"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:31.022][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:31.022][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:31.022][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.141][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.142][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:31.142][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.261][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.262][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:31.262][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.383][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.385][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"+1\"}}\n"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:31.385][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:31.385][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:31.385][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.507][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.508][152][llm_calculator][debug][servable.cpp:206] Generated subsequent streaming response: data: {"choices":[{"finish_reason":null,"index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"</"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}
[2025-12-05 07:30:31.508][152][llm_calculator][debug][http_llm_calculator.cc:143] LLMCalculator  [Node: LLMExecutor] Response prepared, sending it down the graph
[2025-12-05 07:30:31.508][152][llm_calculator][debug][http_llm_calculator.cc:156] LLMCalculator  [Node: LLMExecutor] Process end
[2025-12-05 07:30:31.508][152][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator  [Node: LLMExecutor] Process start
[2025-12-05 07:30:31.621][152][llm_calculator][debug][http_llm_calculator.cc:136] LLMCalculator  [Node: LLMExecutor] Received partial execution results
[2025-12-05 07:30:31.622][152][llm_calculator][debug][servable.cpp:226] Generated complete streaming response: data: {"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"tool_call>"}}]}}],"created":1764919794,"model":"Qwen/Qwen3-4B-Instruct-2507","object":"chat.completion.chunk"}

data: [DONE]
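
Reassembling the arguments fragments that appear in the debug entries above (copied from the log, assuming no chunks are missing from the capture) shows where the extra characters come from:

fragments = ['{"', 'query', '":', ' "', '1', '+1"}}\n', '</', 'tool_call>']
print("".join(fragments))
# {"query": "1+1"}}
# </tool_call>

This suggests the Hermes 3 streaming parser emits the closing brace of the outer tool-call object and the </tool_call> tag as part of the arguments deltas instead of stripping them.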

Configuration

  1. OVMS version

    • OpenVINO Model Server: 2025.4.0.15ce0188a
    • OpenVINO backend: 2025.4.0.0rc2
  2. OVMS config.json file

  3. CPU only

  4. Model repository directory structure

  5. Model or publicly available similar model that reproduces the issue


Additional context
