Fix missing disagg_request_id fallback in context responses (#12893)
Open
pich4ya wants to merge 1 commit into NVIDIA:main
Summary
This change makes the disaggregated proxy tolerant of context-phase responses that include `ctx_request_id` but omit `disagg_request_id`. Instead of failing hard in `_verify_ctx_response`, the proxy now falls back to `ctx_request_id` and logs a warning.

Why
On a live 2-node DGX Spark disaggregated setup (`gpt-oss-120b`, UCX KV-cache transfer over RoCE/RDMA), the context server returned:

- `disaggregated_params.ctx_request_id != None`
- `disaggregated_params.disagg_request_id == None`

The proxy then raised an error, which prevented an otherwise healthy context/generation/proxy stack from serving requests.
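A minimal sketch of the fallback behavior this PR describes. The class and function names below approximate the real code in `tensorrt_llm/serve/openai_disagg_service.py` and are assumptions for illustration, not the actual implementation:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


# Hypothetical simplified stand-in for the real disaggregated params type.
@dataclass
class DisaggregatedParams:
    ctx_request_id: Optional[str] = None
    disagg_request_id: Optional[str] = None


def verify_ctx_response(params: Optional[DisaggregatedParams]) -> DisaggregatedParams:
    # Still fail hard when the params object or ctx_request_id is missing.
    if params is None:
        raise ValueError("context response is missing disaggregated_params")
    if params.ctx_request_id is None:
        raise ValueError("context response is missing ctx_request_id")
    # Backfill only when disagg_request_id is the sole missing field,
    # logging a warning instead of aborting the request.
    if params.disagg_request_id is None:
        logger.warning(
            "disagg_request_id missing; falling back to ctx_request_id=%s",
            params.ctx_request_id,
        )
        params.disagg_request_id = params.ctx_request_id
    return params
```

With this shape, a context response carrying only `ctx_request_id` passes verification and keeps its original request identity.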
Rationale for the fallback
For the context-first disaggregated flow, `ctx_request_id` is already the request identifier that the generation side uses to continue the request. When `disagg_request_id` is absent but `ctx_request_id` is present, using `ctx_request_id` preserves the existing request identity instead of aborting the request.

This is intentionally conservative. The proxy still fails hard when:

- `disaggregated_params` is missing
- `ctx_request_id` is missing

and it only backfills `disagg_request_id` when it is the sole missing field.

Test
Added a regression test to verify that `_verify_ctx_response` accepts a context-phase response with:

- `ctx_request_id` set
- `disagg_request_id` unset

and backfills `disagg_request_id` from `ctx_request_id`.

Validation performed
Local validation:
python3 -m py_compile tensorrt_llm/serve/openai_disagg_service.py tests/unittest/disaggregated/test_openai_disagg_service.py

I could not run the full pytest target in this lightweight clone because the local environment does not include TensorRT-LLM's Python test dependencies (for example `transformers`).

Reproduction environment
Observed on: `1.3.0rc10`
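The regression test described in the Test section above can be sketched in a self-contained form roughly like this. `DisaggParams` and `verify_ctx_response` are simplified stand-ins for the real TensorRT-LLM types and helper, not the actual test code:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the real disaggregated params type.
@dataclass
class DisaggParams:
    ctx_request_id: Optional[str] = None
    disagg_request_id: Optional[str] = None


def verify_ctx_response(p: DisaggParams) -> DisaggParams:
    # Simplified stand-in for the proxy-side verification under test.
    if p.ctx_request_id is None:
        raise ValueError("missing ctx_request_id")
    if p.disagg_request_id is None:
        p.disagg_request_id = p.ctx_request_id  # fallback under test
    return p


def test_backfills_disagg_request_id():
    # ctx_request_id set, disagg_request_id unset -> must be backfilled.
    out = verify_ctx_response(DisaggParams(ctx_request_id="ctx-7"))
    assert out.disagg_request_id == "ctx-7"


test_backfills_disagg_request_id()
```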