
feat(frontend): Support disagg with vllm processor#9503

Open
grahamking wants to merge 5 commits into main from gk-with-prefill

Conversation

Contributor

@grahamking grahamking commented May 13, 2026

All the details are in #9440.

Closes: #9440

Before, the vllm/sglang pre-processor would do its thing and then call either Client::generate or KvRouter::generate, both in Rust, which push the tokenized request to the backend.

Now it calls the new RoutedEngine::generate (also Rust) in exactly the same way. RoutedEngine wraps the Client or KvRouter with a PrefillRouter, which adds disagg support. It is quite elegant; the Python hardly changes.

Later we will add Migration to that RoutedEngine.
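A minimal sketch (names assumed, not taken from the diff) of how the Python processor's dispatch collapses to a single call once the engine wrapping lives in Rust:

```python
# Hypothetical sketch of the dispatch change described above; the real
# RoutedEngine is a Rust (PyO3) class, stubbed here as a plain object.
class RoutedEngine:
    """Stand-in for the Rust PrefillRouter-backed engine exposed via PyO3."""

    async def generate(self, preprocessed, context=None):
        # The real Rust implementation routes to a prefill worker first,
        # then streams decode output; here we just yield one fake item.
        yield {"token_ids": [101], "index": 0}


async def generate_and_stream(routed_engine, dynamo_preproc, context=None):
    # Before this PR the processor chose between Client.generate and
    # KvRouter.generate here; now the choice lives inside Rust.
    async for item in routed_engine.generate(dynamo_preproc, context=context):
        yield item
```

The point of the sketch is that the Python side no longer branches on router type; it hands the tokenized request to one engine object.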

Assisted-By: Claude Opus/4.7 (plan, review)
Assisted-By: Codex GPT/5.5 (spec, execute plan, review)
... and the trusty Code Rabbit of course.



Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking grahamking requested review from a team as code owners May 13, 2026 20:16
@github-actions bot added the feat and frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`) labels May 13, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

Walkthrough

This PR implements routed-engine dispatch for the vLLM Python chat processor by introducing reusable PreprocessedRouting infrastructure, exposing it to Python via PyO3 bindings, integrating it into discovery, and conditionally routing vLLM-preprocessed requests through the Rust PrefillRouter when available.

Changes

vLLM Python processor routed-engine integration

Layer / File(s) / Summary

• PreprocessedRouting infrastructure and pipeline builders
  lib/llm/src/entrypoint.rs, lib/llm/src/entrypoint/input.rs, lib/llm/src/entrypoint/input/common.rs
  New PreprocessedRouting struct encapsulates the prefill operator and routed backend. The build_preprocessed_routing builder selects the routing backend, waits for workers, and creates the PrefillRouter. Methods build_pipeline and build_prefill_pipeline wire the full and prefill-only request paths respectively. Entrypoint re-exports and type aliases are updated; ChatEngineFactoryCallback now receives a PrefillRoutedEngine argument.

• Watcher conditionally builds preprocessed routing
  lib/llm/src/discovery/watcher.rs
  Tokenizer loading, the KV chooser, the prefill router, and the worker monitor are now gated on whether local preprocessing is needed. PreprocessedRouting is built when required and its prefill pipeline is passed to the Python chat factory. Local Rust chat and completions pipelines now use PreprocessedRouting methods instead of deprecated helpers.

• Expose RoutedEngine to Python via PyO3
  lib/bindings/python/rust/llm.rs, lib/bindings/python/rust/lib.rs, lib/bindings/python/rust/llm/routed_engine.rs, lib/bindings/python/rust/llm/entrypoint.rs
  A new routed_engine module declares the RoutedEngine PyO3 class wrapping PrefillRoutedEngine. Its generate method converts Python dicts to a PreprocessedRequest, builds an execution context with stop/kill propagation, streams responses through a Tokio MPSC channel, and returns an AsyncResponseStream. The entrypoint bridge is updated to pass the wrapped RoutedEngine to the Python chat_engine_factory callback as a third argument.

• VllmProcessor routes through routed_engine when available
  components/src/dynamo/frontend/vllm_processor.py
  VllmProcessor accepts an optional routed_engine. A new _inject_routing_metadata helper merges reasoning fields into kv_kwargs extra_args. Generator, _generator_inner, and _generate_and_stream now accept an optional context parameter. In _generate_and_stream, when routed_engine is set, it calls routed_engine.generate(dynamo_preproc, context=context); otherwise it preserves the KvRouter and Client paths, with routing-metadata injection for KvRouter. EngineFactory passes routed_engine to VllmProcessor.

• Update tests and SGLang processor signature
  components/src/dynamo/frontend/tests/test_vllm_processor_unit.py, components/src/dynamo/frontend/sglang_processor.py
  The reasoning-metadata test is updated to use _inject_routing_metadata. A new async test suite verifies routed-engine dispatch: extra_args propagation with reasoning metadata and mm_processor_kwargs when is_kv_router=True, and output transformation to OpenAI chat-completion chunks when is_kv_router=False. The SglangEngineFactory.chat_engine_factory signature is updated to accept a routed_engine parameter (presently ignored).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive PR description references issue #9440 with "Closes: #9440" and briefly explains the changes, but lacks structured details matching the template. Expand description with Overview, Details, and "Where should the reviewer start?" sections following the provided template for clarity.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(frontend): Support disagg with vllm processor' accurately summarizes the main objective: adding disaggregation (disagg) support to the vLLM processor via a new Rust routed engine.
Linked Issues check ✅ Passed The PR successfully implements all primary coding objectives from #9440: new PreprocessedRouting builder with build_prefill_pipeline, PrefillRoutedEngine integration in discovery/watcher, updated chat factory callback signature accepting routed_engine, Python RoutedEngine wrapper with generate method, VllmProcessor and EngineFactory updated to accept and use routed_engine, and comprehensive Python unit tests validating the new flow.
Out of Scope Changes check ✅ Passed All changes directly support the stated objectives: frontend routing through new routed engine, discovery watcher integration, Python/Rust bindings for the routed engine, vLLM processor updates, and targeted tests. No unrelated refactoring, unplanned feature additions, or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread lib/llm/src/entrypoint/input/common.rs Outdated
Comment thread components/src/dynamo/frontend/vllm_processor.py
I went back and forth with the local agent during dev, and missed this. I
think I was pushing it too hard to simplify and reuse.

Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

✅ Actions performed

Full review triggered.

Signed-off-by: Graham King <grahamk@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/frontend/tests/test_vllm_processor_unit.py`:
- Around line 204-213: The fake routed-engine used by tests (_FakeRoutedEngine
and its generate method) yields raw dicts but real routed items are objects with
methods is_error(), comments(), and data(); update the stub so it yields objects
(e.g., a small inner class or named wrapper) that implement is_error() -> False
(or True for error cases), comments() -> appropriate metadata, and data() -> the
original dict payload (and keep the existing default item structure like
{"token_ids":[101],"index":0}); this will let VllmProcessor exercise the
routed-engine unwrap logic instead of taking the internal-error branch.

In `@components/src/dynamo/frontend/vllm_processor.py`:
- Around line 630-633: The fallback path that calls self.router.generate (inside
the _nvtx.annotate block) is not passing the request context, so request
IDs/cancellation linkage are lost when routed_engine is unavailable; update the
call to self.router.generate(dynamo_preproc, annotated=False, context=context)
(or the correct context parameter name used in this module) so the direct client
fallback receives the same context, and ensure any other non-KV fallback calls
in vllm_processor.py also forward that context to preserve request
tracing/cancellation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5b88f87-f302-4a75-90d1-3d71da96a0bd

📥 Commits

Reviewing files that changed from the base of the PR and between 78146fe and 356d902.

📒 Files selected for processing (11)
  • components/src/dynamo/frontend/sglang_processor.py
  • components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
  • components/src/dynamo/frontend/vllm_processor.py
  • lib/bindings/python/rust/lib.rs
  • lib/bindings/python/rust/llm.rs
  • lib/bindings/python/rust/llm/entrypoint.rs
  • lib/bindings/python/rust/llm/routed_engine.rs
  • lib/llm/src/discovery/watcher.rs
  • lib/llm/src/entrypoint.rs
  • lib/llm/src/entrypoint/input.rs
  • lib/llm/src/entrypoint/input/common.rs

Comment thread components/src/dynamo/frontend/tests/test_vllm_processor_unit.py
Comment thread components/src/dynamo/frontend/vllm_processor.py
Signed-off-by: Graham King <grahamk@nvidia.com>
Signed-off-by: Graham King <grahamk@nvidia.com>
@grahamking grahamking requested a review from a team as a code owner May 13, 2026 22:17
@github-actions bot added the backend::vllm (Relates to the vllm backend) label May 13, 2026
@rmccorm4 rmccorm4 requested review from GuanLuo and krishung5 May 13, 2026 22:29
Contributor

@krishung5 krishung5 left a comment


Overall LGTM!

General perf question: for the chat-processor path, do we see any perf impact from the new per-chunk costs (pythonize, MPSC channel hop, etc.)?

For MM-aware routing, test_serve_deployment[mm_agg_router_chat_processor_qwen3-vl-2b] is the test that would exercise the new routed_engine path with MM routing, but it's a post_merge test. Could you trigger a post_merge pipeline or verify locally as a sanity check?

    logger.debug(
        "[mm-routing] KvRouter.generate() called without "
        "mm_routing_info (text-only)"
    )
    if self.routed_engine is not None:
Copy link
Copy Markdown
Contributor


After this PR, I think self.routed_engine would never be set to None, so the elif self.is_kv_router and else branches will never be reached. Can we remove these, or did I miss some use case for those two branches?


Labels

backend::vllm (Relates to the vllm backend), feat, frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`), size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use Rust PrefillRouter with the vLLM Python chat processor

3 participants