
feat(frontend): Support disagg with vllm processor#9503

Open
grahamking wants to merge 5 commits into main from gk-with-prefill

Conversation

Contributor

@grahamking grahamking commented May 13, 2026

All the details are in #9440.

Closes: #9440

Before, the vllm/sglang pre-processor would do its thing and then call either Client::generate or KvRouter::generate, both in Rust, which push the tokenized request to the backend.

Now it calls the new RoutedEngine::generate (also Rust) in exactly the same way. RoutedEngine wraps the Client or KvRouter with a PrefillRouter, which adds disagg support. It is quite elegant; the Python hardly changes.

Later we will add Migration to that RoutedEngine.
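A minimal sketch (names assumed, not taken from the diff) of how the Python processor's dispatch collapses to a single call once the engine wrapping lives in Rust:

```python
# Hypothetical sketch of the dispatch change described above; the real
# RoutedEngine is a Rust (PyO3) class, stubbed here as a plain object.
class RoutedEngine:
    """Stand-in for the Rust PrefillRouter-backed engine exposed via PyO3."""

    async def generate(self, preprocessed, context=None):
        # The real Rust implementation routes to a prefill worker first,
        # then streams decode output; here we just yield one fake item.
        yield {"token_ids": [101], "index": 0}


async def generate_and_stream(routed_engine, dynamo_preproc, context=None):
    # Before this PR the processor chose between Client.generate and
    # KvRouter.generate here; now the choice lives inside Rust.
    async for item in routed_engine.generate(dynamo_preproc, context=context):
        yield item
```

The point of the sketch is that the Python side no longer branches on router type; it hands the tokenized request to one engine object.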

Assisted-By: Claude Opus/4.7 (plan, review)
Assisted-By: Codex GPT/5.5 (spec, execute plan, review)
... and the trusty Code Rabbit of course.



Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking grahamking requested review from a team as code owners May 13, 2026 20:16
@github-actions bot added the feat and frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`) labels May 13, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

Walkthrough

This PR implements routed-engine dispatch for the vLLM Python chat processor by introducing reusable PreprocessedRouting infrastructure, exposing it to Python via PyO3 bindings, integrating it into discovery, and conditionally routing vLLM-preprocessed requests through the Rust PrefillRouter when available.

Changes

vLLM Python processor routed-engine integration

Layer / File(s) / Summary

• PreprocessedRouting infrastructure and pipeline builders
  lib/llm/src/entrypoint.rs, lib/llm/src/entrypoint/input.rs, lib/llm/src/entrypoint/input/common.rs
  New PreprocessedRouting struct encapsulates the prefill operator and routed backend. The build_preprocessed_routing builder selects the routing backend, waits for workers, and creates the PrefillRouter. Methods build_pipeline and build_prefill_pipeline wire the full and prefill-only request paths respectively. Entrypoint re-exports and type aliases are updated; ChatEngineFactoryCallback now receives a PrefillRoutedEngine argument.

• Watcher conditionally builds preprocessed routing
  lib/llm/src/discovery/watcher.rs
  Tokenizer loading, the KV chooser, the prefill router, and the worker monitor are now gated on whether local preprocessing is needed. PreprocessedRouting is built when required and its prefill pipeline is passed to the Python chat factory. Local Rust chat and completions pipelines now use PreprocessedRouting methods instead of deprecated helpers.

• Expose RoutedEngine to Python via PyO3
  lib/bindings/python/rust/llm.rs, lib/bindings/python/rust/lib.rs, lib/bindings/python/rust/llm/routed_engine.rs, lib/bindings/python/rust/llm/entrypoint.rs
  A new routed_engine module declares the RoutedEngine PyO3 class wrapping PrefillRoutedEngine. Its generate method converts Python dicts to a PreprocessedRequest, builds an execution context with stop/kill propagation, streams responses through a Tokio MPSC channel, and returns an AsyncResponseStream. The entrypoint bridge is updated to pass the wrapped RoutedEngine to the Python chat_engine_factory callback as a third argument.

• VllmProcessor routes through routed_engine when available
  components/src/dynamo/frontend/vllm_processor.py
  VllmProcessor accepts an optional routed_engine. A new _inject_routing_metadata helper merges reasoning fields into kv_kwargs extra_args. Generator, _generator_inner, and _generate_and_stream now accept an optional context parameter. In _generate_and_stream, when routed_engine is set, it calls routed_engine.generate(dynamo_preproc, context=context); otherwise it preserves the KvRouter and Client paths, with routing-metadata injection for KvRouter. EngineFactory passes routed_engine to VllmProcessor.

• Update tests and SGLang processor signature
  components/src/dynamo/frontend/tests/test_vllm_processor_unit.py, components/src/dynamo/frontend/sglang_processor.py
  The reasoning-metadata test is updated to use _inject_routing_metadata. A new async test suite verifies routed-engine dispatch: extra_args propagation with reasoning metadata and mm_processor_kwargs when is_kv_router=True, and output transformation to OpenAI chat-completion chunks when is_kv_router=False. The SglangEngineFactory.chat_engine_factory signature is updated to accept a routed_engine parameter (presently ignored).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive PR description references issue #9440 with "Closes: #9440" and briefly explains the changes, but lacks structured details matching the template. Expand description with Overview, Details, and "Where should the reviewer start?" sections following the provided template for clarity.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(frontend): Support disagg with vllm processor' accurately summarizes the main objective: adding disaggregation (disagg) support to the vLLM processor via a new Rust routed engine.
Linked Issues check ✅ Passed The PR successfully implements all primary coding objectives from #9440: new PreprocessedRouting builder with build_prefill_pipeline, PrefillRoutedEngine integration in discovery/watcher, updated chat factory callback signature accepting routed_engine, Python RoutedEngine wrapper with generate method, VllmProcessor and EngineFactory updated to accept and use routed_engine, and comprehensive Python unit tests validating the new flow.
Out of Scope Changes check ✅ Passed All changes directly support the stated objectives: frontend routing through new routed engine, discovery watcher integration, Python/Rust bindings for the routed engine, vLLM processor updates, and targeted tests. No unrelated refactoring, unplanned feature additions, or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread lib/llm/src/entrypoint/input/common.rs Outdated
Comment thread components/src/dynamo/frontend/vllm_processor.py
I went back and forth with the local agent during dev, and missed this. I
think I was pushing it too hard to simplify and reuse.

Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Full review triggered.

Signed-off-by: Graham King <grahamk@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/frontend/tests/test_vllm_processor_unit.py`:
- Around line 204-213: The fake routed-engine used by tests (_FakeRoutedEngine
and its generate method) yields raw dicts but real routed items are objects with
methods is_error(), comments(), and data(); update the stub so it yields objects
(e.g., a small inner class or named wrapper) that implement is_error() -> False
(or True for error cases), comments() -> appropriate metadata, and data() -> the
original dict payload (and keep the existing default item structure like
{"token_ids":[101],"index":0}); this will let VllmProcessor exercise the
routed-engine unwrap logic instead of taking the internal-error branch.

In `@components/src/dynamo/frontend/vllm_processor.py`:
- Around line 630-633: The fallback path that calls self.router.generate (inside
the _nvtx.annotate block) is not passing the request context, so request
IDs/cancellation linkage are lost when routed_engine is unavailable; update the
call to self.router.generate(dynamo_preproc, annotated=False, context=context)
(or the correct context parameter name used in this module) so the direct client
fallback receives the same context, and ensure any other non-KV fallback calls
in vllm_processor.py also forward that context to preserve request
tracing/cancellation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5b88f87-f302-4a75-90d1-3d71da96a0bd

📥 Commits

Reviewing files that changed from the base of the PR and between 78146fe and 356d902.

📒 Files selected for processing (11)
  • components/src/dynamo/frontend/sglang_processor.py
  • components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
  • components/src/dynamo/frontend/vllm_processor.py
  • lib/bindings/python/rust/lib.rs
  • lib/bindings/python/rust/llm.rs
  • lib/bindings/python/rust/llm/entrypoint.rs
  • lib/bindings/python/rust/llm/routed_engine.rs
  • lib/llm/src/discovery/watcher.rs
  • lib/llm/src/entrypoint.rs
  • lib/llm/src/entrypoint/input.rs
  • lib/llm/src/entrypoint/input/common.rs

Comment thread components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
Comment thread components/src/dynamo/frontend/vllm_processor.py
Signed-off-by: Graham King <grahamk@nvidia.com>
Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking grahamking requested a review from a team as a code owner May 13, 2026 22:17
@github-actions bot added the backend::vllm (Relates to the vllm backend) label May 13, 2026
@rmccorm4 rmccorm4 requested review from GuanLuo and krishung5 May 13, 2026 22:29
Contributor

@krishung5 krishung5 left a comment


Overall LGTM!

General perf question: for the chat-processor path, do we see any perf impact from the new per-chunk costs (pythonize, MPSC channel hop, etc.)?

For MM-aware routing, test_serve_deployment[mm_agg_router_chat_processor_qwen3-vl-2b] is the test that would exercise the new routed_engine path with MM routing, but it's a post_merge test. Could you trigger a post_merge pipeline or verify locally as a sanity check?

    logger.debug(
        "[mm-routing] KvRouter.generate() called without "
        "mm_routing_info (text-only)"
    )
    if self.routed_engine is not None:
Copy link
Copy Markdown
Contributor


After this PR, I think self.routed_engine would never be set to None, so the elif self.is_kv_router and else branches will never be reached. Can we remove these, or did I miss some use case for those two branches?


Labels

backend::vllm (Relates to the vllm backend), feat, frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`), size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use Rust PrefillRouter with the vLLM Python chat processor

3 participants