Affected versions: any build after 773489e / #9244
Symptom: In streaming mode, the first content token for non-reasoning models is sent to the client twice. Reasoning models, non-streaming mode, and /v1/completions are unaffected. Visible in backend traces: chat_deltas.content has the first token duplicated, while response is correct.
Root cause: 773489e switched PredictStream to TASK_RESPONSE_TYPE_OAI_CHAT, which causes server_task_result_cmpl_partial::to_json_oaicompat_chat() to return a JSON array. For the first token (n_decoded == 1) that array has two elements: a role-init chunk {role:"assistant", content:null} followed by the actual content chunk {content:"<first token>"}.
In backend/cpp/llama-cpp/grpc-server.cpp, the loop that processes this array calls attach_chat_deltas(reply, first_result.get()) for every element with the same raw_result pointer. Since oaicompat_msg_diffs contains the first token's diff, both the role-init reply and the content reply get ChatDelta.Content = "<first token>" stamped on them. Go receives both over gRPC, accumulates both into allChatDeltas, and the streaming callback emits the first token's content twice to the SSE client.
Fix: In the array iteration loops in grpc-server.cpp (PredictStream, lines ~1721 and ~1747), skip attach_chat_deltas for role-init elements — detectable by the presence of "role" in choices[0].delta:
for (const auto & res : res_json) {
    auto reply = build_reply_from_json(res, result.get());
    // Skip role-init elements (delta has a "role" key, no actual content/reasoning diffs)
    bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
                        res["choices"][0].value("delta", json::object()).contains("role");
    if (!is_role_init) {
        attach_chat_deltas(reply, result.get());
    }
    writer->Write(reply);
}