
feat(openai): expose completion token IDs in chat completion responses #13

Open

DavidBellamy wants to merge 19 commits into main from feat/completion-token-ids-llm360-fork

Conversation

@DavidBellamy
Collaborator

Summary

Add an opt-in mechanism to return the worker-emitted completion token IDs alongside the chat completion response, so callers can avoid re-tokenizing the assistant's reply when they need exact token sequences (e.g. multi-turn conversations where the next prompt is built from prior tokens, or training pipelines that record rollout token streams).

Three small additions to the OpenAI chat-completion entrypoint (sketched in code after the list):

  1. `ChatCompletionRequest`: new `return_completion_token_ids: bool = False` request field.
  2. `ChatCompletionResponseChoice`: new `completion_token_ids: Optional[List[int]] = None` response field, omitted from the JSON when null (consistent with the existing `prompt_token_ids` / `hidden_states` fields).
  3. `serving_chat._build_chat_response`: extracts `output_ids` from the worker `ret_item` and populates the new field when the flag is set.

Plus tests covering serialization of both fields.
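
A minimal sketch of the three additions, assuming Pydantic models shaped like the existing protocol classes; field lists are trimmed to what this PR touches, and `build_choice` stands in for the relevant slice of `serving_chat._build_chat_response`:

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Existing fields elided; only those relevant here are shown.
    stream: bool = False
    return_prompt_token_ids: bool = False
    # (1) New opt-in flag. Defaults to False so existing callers are unaffected.
    return_completion_token_ids: bool = False


class ChatCompletionResponseChoice(BaseModel):
    index: int = 0
    # (2) Omitted from the serialized JSON when None (via exclude_none),
    # matching how prompt_token_ids / hidden_states are handled.
    completion_token_ids: Optional[List[int]] = None


def build_choice(
    request: ChatCompletionRequest, ret_item: Dict[str, Any]
) -> ChatCompletionResponseChoice:
    # (3) The worker already attaches the emitted IDs as ret_item["output_ids"];
    # surface them only when the caller opted in.
    token_ids = (
        ret_item.get("output_ids") if request.return_completion_token_ids else None
    )
    return ChatCompletionResponseChoice(index=0, completion_token_ids=token_ids)
```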

Why

When the application layer needs the exact token IDs the model emitted (not the detokenized string), there's currently no way to get them through the chat completion endpoint without enabling `return_meta_info` and parsing free-form metadata, or re-tokenizing the response text on the client. The latter can diverge from what the model actually emitted, especially for multimodal or special-tokenized content.

This mirrors the existing `return_prompt_token_ids` / `prompt_token_ids` pair on the request side, keeping the API surface symmetric.
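
For illustration, a client-side call exercising the new flag; the URL and model name are placeholders, and only the two new field names come from this PR:

```python
import requests

# Hypothetical local server; endpoint path follows the OpenAI-compatible API.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
        "return_completion_token_ids": True,
    },
).json()

choice = resp["choices"][0]
reply_ids = choice["completion_token_ids"]  # exact token IDs the model emitted
# A multi-turn caller can append reply_ids to its running token buffer
# directly, instead of re-tokenizing choice["message"]["content"].
```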

Constraints

  • The flag rejects streaming (`stream=true`) with a clear `ValueError`, matching the existing `return_prompt_token_ids` behavior (see the sketch after this list).
  • Default is `false`; existing callers see no change.
  • No new dependencies, no worker-side changes (IDs already on `ret_item["output_ids"]`).
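
A sketch of that guard, mirroring the existing `return_prompt_token_ids` check; the exact error message is illustrative:

```python
def validate_request(request: ChatCompletionRequest) -> None:
    # Completion token IDs are only attached to the final, non-streamed
    # response, so reject the combination up front.
    if request.stream and request.return_completion_token_ids:
        raise ValueError(
            "return_completion_token_ids is not supported with stream=true"
        )
```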

Test plan

  • Unit tests for `ChatCompletionRequest` accepting the new flag and `ChatCompletionResponseChoice` round-tripping the field.
  • Streaming-incompatibility check raises `ValueError`. Illustrative versions of both tests are sketched below.
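
These are written against the sketches above; the real tests target the actual protocol module and serving path:

```python
import pytest


def test_request_accepts_new_flag():
    req = ChatCompletionRequest(return_completion_token_ids=True)
    assert req.return_completion_token_ids is True


def test_choice_round_trips_completion_token_ids():
    choice = ChatCompletionResponseChoice(index=0, completion_token_ids=[1, 2, 3])
    assert choice.model_dump(exclude_none=True)["completion_token_ids"] == [1, 2, 3]
    # The field disappears from the payload when it was never populated.
    assert "completion_token_ids" not in ChatCompletionResponseChoice(
        index=0
    ).model_dump(exclude_none=True)


def test_streaming_combination_rejected():
    req = ChatCompletionRequest(stream=True, return_completion_token_ids=True)
    with pytest.raises(ValueError):
        validate_request(req)
```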

Provenance

This is one of five focused PRs that supersede #3. The remaining pieces of #3 land in their own PRs:

  • `mooncake_transfer_engine.py` `ibv_reg_mr` concurrency lock
  • `http_server.py` `tokenizer_sha256` endpoint
  • `server.rs` Miles `/add_worker` shim
  • `router.rs` TITO debug logging
