feat(openai): expose completion token IDs in chat completion responses #13
Open
DavidBellamy wants to merge 19 commits into main
Summary
Add an opt-in mechanism to return the worker-emitted completion token IDs alongside the chat completion response, so callers can avoid re-tokenizing the assistant's reply when they need exact token sequences (e.g. multi-turn conversations where the next prompt is built from prior tokens, or training pipelines that record rollout token streams).
Three small additions to the OpenAI chat-completion entrypoint:
1. `ChatCompletionRequest`: new `return_completion_token_ids: bool = False` request field.
2. `ChatCompletionResponseChoice`: new `completion_token_ids: Optional[List[int]] = None` response field, omitted from JSON when null (consistent with the existing `prompt_token_ids` / `hidden_states` fields).
3. `serving_chat._build_chat_response`: extracts `output_ids` from the worker `ret_item` and populates the new field when the flag is set.

Plus tests covering serialization of both fields. A sketch of the protocol changes follows.
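For reference, a minimal sketch of the protocol additions, assuming pydantic models as used by the OpenAI-compatible layer (class bodies abbreviated to the fields this PR touches; everything else is elided):

```python
from typing import List, Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # ... existing request fields elided ...
    return_prompt_token_ids: bool = False      # existing flag, shown for symmetry
    return_completion_token_ids: bool = False  # new opt-in flag


class ChatCompletionResponseChoice(BaseModel):
    # ... existing choice fields elided ...
    prompt_token_ids: Optional[List[int]] = None      # existing field
    # New field; dropped from the serialized JSON when None, e.g. via
    # model_dump(exclude_none=True), matching prompt_token_ids/hidden_states.
    completion_token_ids: Optional[List[int]] = None
```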
Why
When the application layer needs the exact token IDs the model emitted (not the detokenized string), there's currently no way to get them through the chat completion endpoint without enabling `return_meta_info` and parsing free-form metadata, or re-tokenizing the response text on the client. The latter can diverge from what the model actually emitted, especially for multimodal or special-tokenized content.
This mirrors the existing `return_prompt_token_ids` / `prompt_token_ids` pair on the request side, keeping the API surface symmetric.
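A hypothetical end-to-end call against a locally served endpoint (the URL and model name are placeholders); the flag is a plain extra field in the JSON body, so no client-library support is required:

```python
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello!"}],
        "return_completion_token_ids": True,
    },
)
choice = resp.json()["choices"][0]
# Present only because the flag was set; omitted from the JSON otherwise.
print(choice["completion_token_ids"])
```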
Constraints

- Default is false; existing callers see no change in behavior or payload shape.
- Streaming is rejected with a clear `ValueError`, matching the existing `return_prompt_token_ids` behavior.
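A hedged sketch of the serving-side logic; the helper name below is illustrative (the PR inlines this in `serving_chat._build_chat_response`), and only the `request` / `ret_item` names come from the PR description:

```python
from typing import Any, Dict, List, Optional


def extract_completion_token_ids(
    request: "ChatCompletionRequest", ret_item: Dict[str, Any]
) -> Optional[List[int]]:
    """Return the worker-emitted output_ids only when the caller opted in."""
    if request.stream and request.return_completion_token_ids:
        # Mirrors the existing return_prompt_token_ids rejection.
        raise ValueError(
            "return_completion_token_ids is not supported with stream=True"
        )
    if not request.return_completion_token_ids:
        return None  # field stays None and is omitted from the JSON
    return ret_item["output_ids"]
```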
Test plan
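Unit tests cover serialization of both fields. An illustrative shape for the null-omission check, using a stand-in pydantic model rather than the real `ChatCompletionResponseChoice` (the PR's actual tests are not quoted here):

```python
from typing import List, Optional

from pydantic import BaseModel


class Choice(BaseModel):  # stand-in for ChatCompletionResponseChoice
    index: int = 0
    completion_token_ids: Optional[List[int]] = None


def test_completion_token_ids_serialization() -> None:
    # Unset: the field should not appear in the serialized payload.
    assert "completion_token_ids" not in Choice().model_dump_json(exclude_none=True)
    # Set: the exact IDs should round-trip untouched.
    payload = Choice(completion_token_ids=[1, 2, 3]).model_dump(exclude_none=True)
    assert payload["completion_token_ids"] == [1, 2, 3]
```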
Provenance
This is one of five focused PRs that supersede #3. The remaining pieces of #3 land in their own PRs: