feat(frontend): Support disagg with vllm processor #9503
Conversation
Walkthrough
This PR implements routed-engine dispatch for the vLLM Python chat processor by introducing reusable PreprocessedRouting infrastructure, exposing it to Python via PyO3 bindings, integrating it into discovery, and conditionally routing vLLM-preprocessed requests through the Rust PrefillRouter when available.
Changes: vLLM Python processor routed-engine integration
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 passed | ❌ 2 failed (1 warning, 1 inconclusive)
I went back and forth with local agent during dev, and missed this. I think I was pushing them too hard to simplify and reuse. Signed-off-by: Graham King <grahamk@nvidia.com>
@coderabbitai full review
✅ Actions performed: Full review triggered.
Signed-off-by: Graham King <grahamk@nvidia.com>
Force-pushed from 356d902 to 84ab294
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/frontend/tests/test_vllm_processor_unit.py`:
- Around line 204-213: The fake routed-engine used by tests (_FakeRoutedEngine
and its generate method) yields raw dicts but real routed items are objects with
methods is_error(), comments(), and data(); update the stub so it yields objects
(e.g., a small inner class or named wrapper) that implement is_error() -> False
(or True for error cases), comments() -> appropriate metadata, and data() -> the
original dict payload (and keep the existing default item structure like
{"token_ids":[101],"index":0}); this will let VllmProcessor exercise the
routed-engine unwrap logic instead of taking the internal-error branch.
In `@components/src/dynamo/frontend/vllm_processor.py`:
- Around line 630-633: The fallback path that calls self.router.generate (inside
the _nvtx.annotate block) is not passing the request context, so request
IDs/cancellation linkage are lost when routed_engine is unavailable; update the
call to self.router.generate(dynamo_preproc, annotated=False, context=context)
(or the correct context parameter name used in this module) so the direct client
fallback receives the same context, and ensure any other non-KV fallback calls
in vllm_processor.py also forward that context to preserve request
tracing/cancellation.
📒 Files selected for processing (11)
- components/src/dynamo/frontend/sglang_processor.py
- components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
- components/src/dynamo/frontend/vllm_processor.py
- lib/bindings/python/rust/lib.rs
- lib/bindings/python/rust/llm.rs
- lib/bindings/python/rust/llm/entrypoint.rs
- lib/bindings/python/rust/llm/routed_engine.rs
- lib/llm/src/discovery/watcher.rs
- lib/llm/src/entrypoint.rs
- lib/llm/src/entrypoint/input.rs
- lib/llm/src/entrypoint/input/common.rs
Signed-off-by: Graham King <grahamk@nvidia.com>
krishung5 left a comment
Overall LGTM!
General perf question: For the chat-processor path, do we see any perf impact from the new per-chunk costs (pythonize, mpsc channel hop, etc.)?
For MM-aware routing, test_serve_deployment[mm_agg_router_chat_processor_qwen3-vl-2b] is the test that would exercise the new routed_engine path with MM routing, but it's a post_merge test. Could you trigger a post_merge pipeline or verify locally as a sanity check?
logger.debug(
    "[mm-routing] KvRouter.generate() called without "
    "mm_routing_info (text-only)"
...
if self.routed_engine is not None:
After this PR, I think self.routed_engine would never be set to None, so the `elif self.is_kv_router` and `else` branches will never be reached. Can we remove these, or did I miss some use case for those two branches?
All the details in #9440 .
Closes: #9440
Before, the vllm/sglang pre-processor would do its thing, and then call either `Client::generate` or `KvRouter::generate`, which are both in Rust and push the tokenized request to the backend.
Now it calls the new `RoutedEngine::generate` in exactly the same way, which is also Rust. This wraps `Client` or the KV router with `PrefillRouter`, which adds disagg support. It is quite elegant; the Python hardly changes.
Later we will add `Migration` in that `RoutedEngine`.
Assisted-By: Claude Opus/4.7 (plan, review)
Assisted-By: Codex GPT/5.5 (spec, execute plan, review)
... and the trusty Code Rabbit of course.
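The wrapping described in the PR description can be sketched conceptually as follows. The real `RoutedEngine` and `PrefillRouter` live in Rust behind PyO3 bindings, so every name and constructor here is illustrative only, not the actual API.

```python
class PrefillRouter:
    """Adds disaggregated prefill support on top of an inner engine."""

    def __init__(self, inner):
        self.inner = inner

    def generate(self, tokenized_request):
        # In the real Rust code this is where prefill could be routed to a
        # dedicated prefill worker before decode; here we just delegate.
        return self.inner.generate(tokenized_request)


class RoutedEngine:
    """Single entry point the Python processor calls, replacing direct
    Client.generate / KvRouter.generate calls."""

    def __init__(self, client_or_kv_router, disagg=False):
        # Wrap the Client or KV router with PrefillRouter when disagg
        # support is wanted; otherwise pass requests straight through.
        self._engine = (
            PrefillRouter(client_or_kv_router) if disagg else client_or_kv_router
        )

    def generate(self, tokenized_request):
        return self._engine.generate(tokenized_request)
```

The design point the description makes is that the Python side only swaps one callee for another: the disagg logic (and later migration) hides entirely behind the `generate` call.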