Fix missing disagg_request_id fallback in context responses #12893

Open
pich4ya wants to merge 1 commit into NVIDIA:main from pich4ya:fix/disagg-missing-request-id

@pich4ya pich4ya commented Apr 9, 2026

Summary

This change makes the disaggregated proxy tolerant of context-phase responses that include ctx_request_id but omit disagg_request_id.

Instead of failing hard in _verify_ctx_response, the proxy now falls back to ctx_request_id and logs a warning.

Why

On a live 2-node DGX Spark disaggregated setup (gpt-oss-120b, UCX KV-cache transfer over RoCE/RDMA), the context server returned:

  • disaggregated_params.ctx_request_id is not None
  • disaggregated_params.disagg_request_id is None

The proxy then raised:

Invalid disaggregated params in context phase response. disagg_request_id is None

That prevented an otherwise healthy context/generation/proxy stack from serving requests.

Rationale for the fallback

In the context-first disaggregated flow, ctx_request_id is already the request identifier that the generation side uses to continue the request. When disagg_request_id is absent but ctx_request_id is present, reusing ctx_request_id preserves the existing request identity instead of aborting the request.

This is intentionally conservative:

  • still errors if disaggregated_params is missing
  • still errors if ctx_request_id is missing
  • only fills disagg_request_id when it is the sole missing field
  • emits a warning so the underlying worker-side omission remains visible
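
The rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: DisaggParams, verify_ctx_params, the logger name, and the error messages are hypothetical stand-ins, and the real _verify_ctx_response in tensorrt_llm/serve/openai_disagg_service.py may be structured differently.

```python
import logging
from dataclasses import dataclass
from typing import Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("disagg_proxy_sketch")


@dataclass
class DisaggParams:
    """Hypothetical stand-in for the response's disaggregated_params."""
    ctx_request_id: Optional[int] = None
    disagg_request_id: Optional[int] = None


def verify_ctx_params(params: Optional[DisaggParams]) -> DisaggParams:
    """Validate context-phase params, backfilling disagg_request_id if needed."""
    # Still an error: the whole params object is missing.
    if params is None:
        raise ValueError("context phase response has no disaggregated_params")
    # Still an error: there is nothing to fall back to.
    if params.ctx_request_id is None:
        raise ValueError("context phase response has no ctx_request_id")
    # Only the sole missing field is filled, and the omission is logged.
    if params.disagg_request_id is None:
        logger.warning(
            "disagg_request_id is None; falling back to ctx_request_id=%s",
            params.ctx_request_id,
        )
        params.disagg_request_id = params.ctx_request_id
    return params
```

Note that a disagg_request_id that is already set is left untouched, so the fallback cannot overwrite an identifier the worker did provide.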

Test

Added a regression test to verify _verify_ctx_response accepts a context-phase response with:

  • ctx_request_id set
  • disagg_request_id unset

and backfills disagg_request_id from ctx_request_id.
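
A test of this shape might look like the sketch below. It is self-contained and does not import TensorRT-LLM: backfill_disagg_request_id is a hypothetical stand-in for the fallback inside _verify_ctx_response, and the test names are illustrative rather than those added in tests/unittest/disaggregated/test_openai_disagg_service.py.

```python
import logging
import unittest
from types import SimpleNamespace

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("disagg_proxy_sketch")


def backfill_disagg_request_id(params):
    """Hypothetical stand-in for the fallback inside _verify_ctx_response."""
    if params is None or params.ctx_request_id is None:
        raise ValueError("invalid disaggregated params in context phase response")
    if params.disagg_request_id is None:
        logger.warning("disagg_request_id missing; using ctx_request_id")
        params.disagg_request_id = params.ctx_request_id
    return params


class FallbackTest(unittest.TestCase):
    def test_backfills_from_ctx_request_id(self):
        params = SimpleNamespace(ctx_request_id=42, disagg_request_id=None)
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("disagg_proxy_sketch", level="WARNING"):
            result = backfill_disagg_request_id(params)
        self.assertEqual(result.disagg_request_id, 42)
        self.assertEqual(result.ctx_request_id, 42)

    def test_still_rejects_missing_ctx_request_id(self):
        params = SimpleNamespace(ctx_request_id=None, disagg_request_id=None)
        with self.assertRaises(ValueError):
            backfill_disagg_request_id(params)
```

Checking the warning as well as the backfilled value keeps the worker-side omission visible in the test, which matches the stated intent of the fallback.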

Validation performed

Local validation:

  • python3 -m py_compile tensorrt_llm/serve/openai_disagg_service.py tests/unittest/disaggregated/test_openai_disagg_service.py

I could not run the full pytest target in this lightweight clone because the local environment does not include TensorRT-LLM's Python test dependencies (for example transformers).

Reproduction environment

Observed on:

  • 2-node DGX Spark setup
  • TensorRT-LLM 1.3.0rc10
  • disaggregated serving
  • context server on node 1
  • generation server on node 2
  • proxy on node 1
  • UCX KV-cache transfer over RoCE/RDMA

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error handling for missing request identifiers in disaggregated service responses. The system now implements fallback logic to automatically populate missing identifiers instead of raising errors.

@pich4ya pich4ya requested a review from a team as a code owner April 9, 2026 14:16
@pich4ya pich4ya requested a review from hchings April 9, 2026 14:16
Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8037faf3-7d40-4b0f-ba9c-5471014c9f4f

📥 Commits

Reviewing files that changed from the base of the PR and between 3e942cc and 5909f2c.

📒 Files selected for processing (2)
  • tensorrt_llm/serve/openai_disagg_service.py
  • tests/unittest/disaggregated/test_openai_disagg_service.py

📝 Walkthrough

Modified error handling in the context-phase response verification to fall back to ctx_request_id when disagg_request_id is missing, replacing the previous ValueError with a warning log. Added a corresponding unit test to verify this fallback behavior.

Changes

Cohort / File(s) | Summary
Context Response Verification Fallback (tensorrt_llm/serve/openai_disagg_service.py, tests/unittest/disaggregated/test_openai_disagg_service.py) | Changed _verify_ctx_response to log a warning and fall back to ctx_request_id instead of throwing ValueError when disagg_request_id is missing. Added test case validating the fallback behavior populates both identifiers correctly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check | ❓ Inconclusive | The PR description provides a comprehensive explanation of the issue, solution, rationale, test coverage, and validation performed, but does not follow the required repository template structure with ticket/issue reference and type prefix. | Add the required template format: start with [ticket/issue/None][type] prefix (e.g., [None][fix]) and ensure all required checklist items are addressed.

✅ Passed checks (1 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The pull request title clearly and concisely describes the main change: adding a fallback mechanism for missing disagg_request_id in context responses.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 9, 2026
3 participants