fix: keep partial content when reasoning block is truncated by max_tokens by rivetphilbot · Pull Request #47 · 1CatAI/1Cat-vLLM

rivetphilbot · 2026-05-19T10:14:07Z

Problem

A non-streaming chat completion can come back with an empty content field even though the model generated coherent tokens. It happens whenever a request runs out of token budget while still inside the <think> reasoning block (i.e. max_tokens is reached before </think> is emitted).

Root cause

In OpenAIServingChat (non-streaming path, vllm/entrypoints/openai/chat_completion/serving.py), after the reasoning parser runs:

reasoning, content = reasoning_parser.extract_reasoning(parser_input_text, request=request)
if output.token_ids is not None and content is not None:
    try:
        content_ids = reasoning_parser.extract_content_ids(as_list(output.token_ids))
        content = tokenizer.decode(content_ids, skip_special_tokens=True)
    except Exception:
        pass

extract_reasoning correctly returns the partial thinking text as content. The handler then re-decodes from token ids via extract_content_ids, which returns [] when the reasoning block was never closed. tokenizer.decode([]) yields "", which overwrites the correct, non-empty content. The response ships empty.
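To make the overwrite concrete, here is a minimal, self-contained sketch of the sequence described above. `StubReasoningParser`, `VOCAB`, `decode`, and `buggy_handler` are hypothetical stand-ins written for illustration. They are not vLLM's classes and only mimic the behaviour this PR describes: partial text returned as content, and `[]` returned for an unclosed block.

```python
# Toy vocabulary: ids 1 and 2 play the role of <think> and </think>.
VOCAB = {1: "<think>", 2: "</think>", 3: "step", 4: " one", 5: "42"}
SPECIAL = {1, 2}


def decode(ids, skip_special_tokens=True):
    """Hypothetical stand-in for tokenizer.decode."""
    return "".join(
        VOCAB[i] for i in ids if not (skip_special_tokens and i in SPECIAL)
    )


class StubReasoningParser:
    """Hypothetical parser mimicking the behaviour described in this PR."""

    END_ID = 2

    def extract_reasoning(self, text):
        if "</think>" in text:
            reasoning, _, content = text.partition("</think>")
            return reasoning.replace("<think>", ""), content
        # Unclosed block: the partial text comes back as content.
        return None, text.replace("<think>", "")

    def extract_content_ids(self, token_ids):
        # No closing token seen: nothing counts as content ids.
        if self.END_ID not in token_ids:
            return []
        return token_ids[token_ids.index(self.END_ID) + 1:]


def buggy_handler(parser, text, token_ids):
    """Same shape as the handler excerpt above, before the fix."""
    reasoning, content = parser.extract_reasoning(text)
    if token_ids is not None and content is not None:
        try:
            content_ids = parser.extract_content_ids(list(token_ids))
            content = decode(content_ids)  # decode([]) == "" clobbers content
        except Exception:
            pass
    return reasoning, content


parser = StubReasoningParser()
truncated = buggy_handler(parser, "<think>step one", [1, 3, 4])
closed = buggy_handler(parser, "<think>step one</think>42", [1, 3, 4, 2, 5])
```

In this sketch `truncated` ends up with empty content even though `extract_reasoning` returned `"step one"`, while `closed` still carries `"42"`.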

Reproduced on a Qwen3 reasoning model: a prompt that asks the model to "think step by step" with a small max_tokens returns HTTP 200, finish_reason: "length", dozens of generated tokens — and empty content. The same prompt with adequate max_tokens (so </think> is reached) answers correctly.

Fix

Only override content with the re-decoded text when extract_content_ids actually returns ids. When it returns [] (unclosed think block), keep the content that extract_reasoning already produced.

The closed-think happy path is unchanged — extract_content_ids returns a non-empty list there and the re-decode proceeds exactly as before.
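The guard can be sketched as a standalone function. This is an illustration of the idea, not the literal diff: `resolve_content` is a hypothetical helper name, and the parser and tokenizer calls are passed in as plain callables so the snippet runs without vLLM installed.

```python
def resolve_content(content, token_ids, extract_content_ids, decode):
    """Return the content string the response should carry.

    `extract_content_ids` and `decode` are injected callables standing in
    for reasoning_parser.extract_content_ids and tokenizer.decode.
    """
    if token_ids is None or content is None:
        return content
    try:
        content_ids = extract_content_ids(list(token_ids))
        # Only override when the parser actually found content ids.
        if content_ids:
            return decode(content_ids)
    except Exception:
        # Mirrors the original handler: fall back to the existing content.
        pass
    # Empty list (unclosed think block): keep what extract_reasoning produced.
    return content


# Unclosed block: the parser yields [], so the partial text survives.
kept = resolve_content("partial text", [1, 3, 4], lambda ids: [], lambda ids: "")

# Closed block: non-empty ids, so the re-decoded text is used as before.
redecoded = resolve_content(
    "raw answer", [1, 3, 2, 5], lambda ids: ids[3:], lambda ids: "decoded answer"
)
```

The truthiness check (`if content_ids:`) is the whole change in behaviour: an empty list no longer reaches `decode`, so it can never produce the empty string that overwrote `content`.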

Verification

On a V100 build serving Qwen3 with the qwen3 reasoning parser:

Truncated-think request (max_tokens small) — before: empty content; after: partial reasoning text returned.
Ample-budget request (</think> reached) — unchanged, correct answer.
Long-context needle retrieval (20K, 51K tokens) — unaffected.
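For anyone wanting to script the before/after check, the symptom can be detected directly on the JSON body of a chat completion. The helper below is a hypothetical sketch: `is_empty_despite_tokens` is not part of this PR, and the two payloads are hand-written examples in the OpenAI chat completion response shape, not captured server output.

```python
def is_empty_despite_tokens(response: dict) -> bool:
    """True when a response shows the symptom this PR fixes:
    finish_reason "length", tokens were generated, yet content is empty."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    generated = response.get("usage", {}).get("completion_tokens", 0)
    return (
        choice.get("finish_reason") == "length"
        and generated > 0
        and content == ""
    )


# Hand-written illustrative payloads (token counts are made up).
before_fix = {
    "choices": [{"finish_reason": "length", "message": {"content": ""}}],
    "usage": {"completion_tokens": 48},
}
after_fix = {
    "choices": [
        {"finish_reason": "length", "message": {"content": "First, consider"}}
    ],
    "usage": {"completion_tokens": 48},
}

symptom_before = is_empty_despite_tokens(before_fix)
symptom_after = is_empty_despite_tokens(after_fix)
```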

Scoped to one block, +11/-4 lines, no line-ending changes.

The non-streaming chat handler re-decodes the completion text from token ids via `extract_content_ids` after `extract_reasoning` has already produced the content string. `extract_content_ids` returns an empty list when the reasoning block was never closed, e.g. when generation hit `max_tokens` while still inside `<think>`. `tokenizer.decode([])` then yields an empty string, which overwrites the (correct, non-empty) content that `extract_reasoning` already extracted. The response goes out with empty `content` despite the model having generated coherent tokens.

Only override `content` with the re-decoded text when `extract_content_ids` actually returns ids. When it returns `[]`, keep what `extract_reasoning` produced so truncated-think responses still carry their partial text.

The closed-think happy path is unaffected: `extract_content_ids` returns a non-empty list there and the re-decode proceeds as before.

rivetphilbot wants to merge 1 commit into 1CatAI:main from rivetphilbot:fix-empty-content-truncated-reasoning
