
feat(openai): expose completion token IDs in chat completion responses #13

Open

DavidBellamy wants to merge 19 commits into main from feat/completion-token-ids-llm360-fork

Conversation

@DavidBellamy
Collaborator

Summary

Add an opt-in mechanism to return the worker-emitted completion token IDs alongside the chat completion response, so callers can avoid re-tokenizing the assistant's reply when they need exact token sequences (e.g. multi-turn conversations where the next prompt is built from prior tokens, or training pipelines that record rollout token streams).

Three small additions to the OpenAI chat-completion entrypoint (sketched in code after the list):

  1. `ChatCompletionRequest`: new `return_completion_token_ids: bool = False` request field.
  2. `ChatCompletionResponseChoice`: new `completion_token_ids: Optional[List[int]] = None` response field, omitted from the JSON when null (consistent with the existing `prompt_token_ids` / `hidden_states` fields).
  3. `serving_chat._build_chat_response`: extracts `output_ids` from the worker `ret_item` and populates the new field when the flag is set.

Plus tests covering serialization of both fields.
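
A minimal sketch of the three additions, assuming Pydantic models shaped like the existing protocol classes; field lists are trimmed to what this PR touches, and `build_choice` stands in for the relevant slice of `serving_chat._build_chat_response`:

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Existing fields elided; only those relevant here are shown.
    stream: bool = False
    return_prompt_token_ids: bool = False
    # (1) New opt-in flag. Defaults to False so existing callers are unaffected.
    return_completion_token_ids: bool = False


class ChatCompletionResponseChoice(BaseModel):
    index: int = 0
    # (2) Omitted from the serialized JSON when None (via exclude_none),
    # matching how prompt_token_ids / hidden_states are handled.
    completion_token_ids: Optional[List[int]] = None


def build_choice(
    request: ChatCompletionRequest, ret_item: Dict[str, Any]
) -> ChatCompletionResponseChoice:
    # (3) The worker already attaches the emitted IDs as ret_item["output_ids"];
    # surface them only when the caller opted in.
    token_ids = (
        ret_item.get("output_ids") if request.return_completion_token_ids else None
    )
    return ChatCompletionResponseChoice(index=0, completion_token_ids=token_ids)
```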

Why

When the application layer needs the exact token IDs the model emitted (not the detokenized string), there's currently no way to get them through the chat completion endpoint without enabling `return_meta_info` and parsing free-form metadata, or re-tokenizing the response text on the client. The latter can diverge from what the model actually emitted, especially for multimodal or special-tokenized content.

This mirrors the existing `return_prompt_token_ids` / `prompt_token_ids` pair on the request side, keeping the API surface symmetric.
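
For illustration, a client-side call exercising the new flag; the URL and model name are placeholders, and only the two new field names come from this PR:

```python
import requests

# Hypothetical local server; endpoint path follows the OpenAI-compatible API.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello"}],
        "return_completion_token_ids": True,
    },
).json()

choice = resp["choices"][0]
reply_ids = choice["completion_token_ids"]  # exact token IDs the model emitted
# A multi-turn caller can append reply_ids to its running token buffer
# directly, instead of re-tokenizing choice["message"]["content"].
```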

Constraints

  • The flag rejects streaming (`stream=true`) with a clear `ValueError`, matching the existing `return_prompt_token_ids` behavior (see the sketch after this list).
  • Default is `false`; existing callers see no change.
  • No new dependencies, no worker-side changes (IDs already on `ret_item["output_ids"]`).
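
A sketch of that guard, mirroring the existing `return_prompt_token_ids` check; the exact error message is illustrative:

```python
def validate_request(request: ChatCompletionRequest) -> None:
    # Completion token IDs are only attached to the final, non-streamed
    # response, so reject the combination up front.
    if request.stream and request.return_completion_token_ids:
        raise ValueError(
            "return_completion_token_ids is not supported with stream=true"
        )
```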

Test plan

  • Unit tests for `ChatCompletionRequest` accepting the new flag and `ChatCompletionResponseChoice` round-tripping the field.
  • Streaming-incompatibility check raises `ValueError`. Illustrative versions of both tests are sketched below.
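
These are written against the sketches above; the real tests target the actual protocol module and serving path:

```python
import pytest


def test_request_accepts_new_flag():
    req = ChatCompletionRequest(return_completion_token_ids=True)
    assert req.return_completion_token_ids is True


def test_choice_round_trips_completion_token_ids():
    choice = ChatCompletionResponseChoice(index=0, completion_token_ids=[1, 2, 3])
    assert choice.model_dump(exclude_none=True)["completion_token_ids"] == [1, 2, 3]
    # The field disappears from the payload when it was never populated.
    assert "completion_token_ids" not in ChatCompletionResponseChoice(
        index=0
    ).model_dump(exclude_none=True)


def test_streaming_combination_rejected():
    req = ChatCompletionRequest(stream=True, return_completion_token_ids=True)
    with pytest.raises(ValueError):
        validate_request(req)
```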

Provenance

This is one of five focused PRs that supersede #3. The remaining pieces of #3 land in their own PRs:

  • `mooncake_transfer_engine.py` `ibv_reg_mr` concurrency lock
  • `http_server.py` `tokenizer_sha256` endpoint
  • `server.rs` Miles `/add_worker` shim
  • `router.rs` TITO debug logging
