
OVMS 2026.1 JSON request parsed successfully but pipeline input text sent to LLM says otherwise #4181

@jurw2201

Description

Describe the bug
When a request whose user-role message carries an array in content is sent to the GenAI endpoint (/v3/chat/completions), the OVMS TRACE log shows the original JSON that was received, but the pipeline input text passed to the LLM is not as expected: only the last item of the content array is used.
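For reference, an OpenAI-compatible server is expected to concatenate all text parts of a content array into a single user message before applying the chat template. A minimal Python sketch of that expected behavior (the function name flatten_content is illustrative only, not OVMS code):

```python
def flatten_content(content):
    """Collapse an OpenAI-style message content field into one string.

    A plain string is returned unchanged; a list of parts is reduced to
    the concatenation of every {"type": "text"} part, in order.
    """
    if isinstance(content, str):
        return content
    return "".join(part["text"] for part in content if part.get("type") == "text")

parts = [
    {"type": "text", "text": "\ninitialize memory bank\n"},
    {"type": "text", "text": "\n# task_progress RECOMMENDED\n"},
    {"type": "text", "text": "<environment_details>\nSome environment details\n</environment_details>"},
]
# All three parts should survive, not just the last one.
print(flatten_content(parts))
```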

To Reproduce
Steps to reproduce the behavior:
(using PowerShell)

  1. Start the OVMS GenAI endpoint (the model was previously exported to OpenVINO IR using optimum-cli; the standard Granite 4.1 chat template is saved in openvino_tokenizer.xml):

$ovms_config = '{"PERFORMANCE_HINT":"LATENCY","KV_CACHE_PRECISION":"u8","CACHE_DIR":"E:\\ovms_models\\cache"}'
& "D:\programs\ovms\ovms.exe" `
--rest_port=8000 `
--model_name="granite-4.1-instruct-int4" `
--model_path="E:\ovms_models\granite-4.1-instruct-int4" `
--task=text_generation `
--target_device="GPU.0" `
--max_num_batched_tokens=8192 `
--enable_prefix_caching=true `
--plugin_config=$ovms_config `
--tool_parser=hermes3 `
--enable_tool_guided_generation true `
--max_num_seqs=1 `
--log_level=TRACE

  2. Send the following command:

$uri = "http://localhost:8000/v3/chat/completions"
$body = @{
    messages = @(
        @{
            role = "system"
            content = "You are a highly skilled software engineer."
        },
        @{
            role = "user"
            content = @(
                @{
                    type = "text"
                    text = "`ninitialize memory bank`n"
                },
                @{
                    type = "text"
                    text = "`n# task_progress RECOMMENDED`n`nWhen starting a new task, it is recommended to include a todo list.`n"
                },
                @{
                    type = "text"
                    text = "<environment_details>`nSome environment details`n</environment_details>"
                }
            )
        }
    )
    model = "granite-4.1-instruct-int4"
}
$json = $body | ConvertTo-Json -Depth 20
Invoke-RestMethod `
-Uri $uri `
-Method Post `
-Body $json `
-ContentType "application/json"
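The same request can also be built from Python, which may be easier to adapt when bisecting the payload (a sketch; it assumes the server from step 1 is listening on localhost:8000, so the actual send is left commented out):

```python
import json
import urllib.request

# Same payload as the PowerShell reproduction above.
body = {
    "model": "granite-4.1-instruct-int4",
    "messages": [
        {"role": "system", "content": "You are a highly skilled software engineer."},
        {"role": "user", "content": [
            {"type": "text", "text": "\ninitialize memory bank\n"},
            {"type": "text", "text": "\n# task_progress RECOMMENDED\n\nWhen starting a new task, it is recommended to include a todo list.\n"},
            {"type": "text", "text": "<environment_details>\nSome environment details\n</environment_details>"},
        ]},
    ],
}

req = urllib.request.Request(
    "http://localhost:8000/v3/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```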

  3. Check the OVMS logs for the request body and pipeline input text:
    [llm_calculator][debug][servable.cpp:314] Request body: {"model":"granite-4.1-instruct-int4","messages":[{"role":"system","content":"You are a highly skilled software engineer."},{"role":"user","content":[{"type":"text","text":"\ninitialize memory bank\n"},{"type":"text","text":"\n# task_progress RECOMMENDED\n\nWhen starting a new task, it is recommended to include a todo list.\n"},{"type":"text","text":"<environment_details>\nSome environment details\n</environment_details>"}]}]}
    [llm_calculator][debug][servable.cpp:315] Request uri: /v3/chat/completions
    [llm_calculator][trace][servable.cpp:232] Pipeline input text: <|start_of_role|>system<|end_of_role|>You are a highly skilled software engineer.<|end_of_text|>
    <|start_of_role|>user<|end_of_role|><environment_details>
    Some environment details
    </environment_details><|end_of_text|>
    <|start_of_role|>assistant<|end_of_role|>

  4. Note that only the last item of the content array has been used.

Expected behavior
Logs should show the following pipeline input text:
<|start_of_role|>system<|end_of_role|>You are a highly skilled software engineer.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>
initialize memory bank

# task_progress RECOMMENDED

When starting a new task, it is recommended to include a todo list.

<environment_details>
Some environment details
</environment_details><|end_of_text|>
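The expected user turn above is simply the concatenation of the three content parts wrapped in the Granite role markers. A small Python sketch of that expected rendering (a hand-written approximation of the template's output, not the actual Jinja template):

```python
# The three text parts from the request, in order.
parts = [
    "\ninitialize memory bank\n",
    "\n# task_progress RECOMMENDED\n\nWhen starting a new task, it is recommended to include a todo list.\n",
    "<environment_details>\nSome environment details\n</environment_details>",
]

# Expected: all parts joined into one user message, then wrapped in role markers.
user_text = "".join(parts)
expected = (
    "<|start_of_role|>system<|end_of_role|>"
    "You are a highly skilled software engineer.<|end_of_text|>\n"
    f"<|start_of_role|>user<|end_of_role|>{user_text}<|end_of_text|>"
)

# Every part should appear in the rendered prompt, unlike the observed output.
for p in parts:
    assert p in expected
print(expected)
```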

Logs

[2026-05-05 11:40:13.780][33388][serving][debug][drogon_http_server.cpp:97] Request URI /v3/chat/completions dispatched to streaming thread pool
[2026-05-05 11:40:13.780][33388][serving][debug][http_server.cpp:173] REST request /v3/chat/completions
[2026-05-05 11:40:13.780][33388][serving][debug][http_server.cpp:184] Processing HTTP request: POST /v3/chat/completions body: 664 bytes
[2026-05-05 11:40:13.780][33388][serving][debug][http_rest_api_handler.cpp:549] Model name from deduced from JSON: granite-4.1-instruct-int4
[2026-05-05 11:40:13.781][33388][serving][debug][mediapipegraphdefinition.cpp:369] Successfully waited for mediapipe definition: granite-4.1-instruct-int4
[2026-05-05 11:40:13.781][33388][serving][debug][mediapipegraphdefinition.cpp:263] Creating Mediapipe graph executor: granite-4.1-instruct-int4
[2026-05-05 11:40:13.781][33388][serving][debug][mediapipegraphexecutor.hpp:123] Start unary KServe request mediapipe graph: granite-4.1-instruct-int4 execution
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][http_llm_calculator.cc:70] LLMCalculator [Node: LLMExecutor] Open start
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][http_llm_calculator.cc:76] LLMCalculator [Node: LLMExecutor] Open end
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][http_llm_calculator.cc:80] LLMCalculator [Node: LLMExecutor] Process start
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][servable.cpp:314] Request body: {"model":"granite-4.1-instruct-int4","messages":[{"role":"system","content":"You are a highly skilled software engineer."},{"role":"user","content":[{"type":"text","text":"\ninitialize memory bank\n"},{"type":"text","text":"\n# task_progress RECOMMENDED\n\nWhen starting a new task, it is recommended to include a todo list.\n"},{"type":"text","text":"<environment_details>\nSome environment details\n</environment_details>"}]}]}
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][servable.cpp:315] Request uri: /v3/chat/completions
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][http_llm_calculator.cc:96] LLMCalculator [Node: LLMExecutor] Request loaded successfully
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][openai_completions.cpp:419] Parsed messages successfully
[2026-05-05 11:40:13.781][44520][llm_calculator][debug][http_llm_calculator.cc:115] LLMCalculator [Node: LLMExecutor] Request parsed successfully
[2026-05-05 11:40:13.813][44520][llm_calculator][trace][servable.cpp:232] Pipeline input text: <|start_of_role|>system<|end_of_role|>You are a highly skilled software engineer.<|end_of_text|>
<|start_of_role|>user<|end_of_role|><environment_details>
Some environment details
</environment_details><|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
[2026-05-05 11:40:13.813][44520][llm_calculator][trace][servable.cpp:233] prompt_token_ids: [100264, 9125, 100265, 2675, 527, 264, 7701, 26611, 3241, 24490, 13, 100257, 198, 100264, 882, 100265, 27, 24175, 13563, 397, 8538, 4676, 3649, 198, 524, 24175, 13563, 29, 100257, 198, 100264, 78191, 100265]
[2026-05-05 11:40:13.813][44520][llm_calculator][debug][http_llm_calculator.cc:122] LLMCalculator [Node: LLMExecutor] Input for the pipeline prepared successfully
[2026-05-05 11:40:13.813][44520][llm_calculator][trace][servable.cpp:46] Notifying executor thread

Configuration

  1. OVMS version

OpenVINO Model Server 2026.1.0.72cc0624
OpenVINO backend 2026.1.0-21367-63e31528c62-releases/2026/1
OpenVINO GenAI backend 2026.1.0.0-2957-1dabb8c2255
Bazel build flags: --config=win_mp_on_py_off

  2. OVMS config.json file (the model's config.json is shown below; the server was started with CLI flags only)

{
"architectures": [
"GraniteForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"attention_multiplier": 0.0078125,
"bos_token_id": 100257,
"dtype": "bfloat16",
"embedding_multiplier": 12.0,
"eos_token_id": 100257,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.1,
"intermediate_size": 12800,
"logits_scaling": 16.0,
"max_position_embeddings": 131072,
"mlp_bias": false,
"model_type": "granite",
"num_attention_heads": 32,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"pad_token_id": 100256,
"residual_multiplier": 0.22,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000000,
"tie_word_embeddings": true,
"transformers_version": "4.57.6",
"use_cache": true,
"vocab_size": 100352
}

  3. CPU, accelerator's versions if applicable

Using an Intel Arc B580 GPU.

  4. Model repository directory structure

Directory: E:\ovms_models\granite-4.1-instruct-int4

Mode  LastWriteTime       Length     Name
----  -------------       ------     ----
-a--- 2026-05-04 7:53 PM 829 config.json
-a--- 2026-05-04 8:53 PM 196 generation_config.json
-a--- 2026-05-05 11:39 AM 1322 graph.pbtxt
-a--- 2026-05-04 7:53 PM 916646 merges.txt
-a--- 2026-05-04 7:58 PM 458 openvino_config.json
-a--- 2026-05-04 7:53 PM 1448197 openvino_detokenizer.bin
-a--- 2026-05-05 10:38 AM 13348 openvino_detokenizer.xml
-a--- 2026-05-04 7:58 PM 5314158177 openvino_model.bin
-a--- 2026-05-04 7:58 PM 3540092 openvino_model.xml
-a--- 2026-05-04 7:53 PM 3695636 openvino_tokenizer.bin
-a--- 2026-05-05 11:00 AM 30314 openvino_tokenizer.xml
-a--- 2026-05-04 7:53 PM 609 special_tokens_map.json
-a--- 2026-05-04 10:22 PM 22694 tokenizer_config.json
-a--- 2026-05-04 7:53 PM 7153421 tokenizer.json
-a--- 2026-05-04 7:53 PM 1612704 vocab.json

  5. Model or publicly available similar model that reproduces the issue
    ibm-granite/granite-4.1-8b or variants using the same chat template.

Labels: bug (Something isn't working)