
feat(recipes): add Qwen3.5-0.8B vLLM aggregated recipe#9285

Open
MatejKosec wants to merge 4 commits into `main` from `user/mkosec/qwen3.5-vllm-recipe`

Conversation


@MatejKosec MatejKosec commented May 7, 2026

Summary

  • Adds an aggregated single-GPU vLLM recipe for Qwen/Qwen3.5-0.8B under recipes/qwen3.5-0.8b/.
  • Targets nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.0. The recipe will not work against 1.0.0: that image's vLLM and Transformers pre-date the qwen3_5 model type.
  • Wires --dyn-reasoning-parser qwen3 and --dyn-tool-call-parser qwen3_coder on the worker. qwen3_coder is Dynamo's registered Qwen3-family XML tool-call parser.
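The parser flags above could be wired on the worker roughly like this (a sketch only: the manifest field layout and service names are assumptions, not a copy of the recipe's actual deploy.yaml):

```yaml
# Sketch of the relevant VllmWorker args; structure is illustrative.
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        args:
          - --model
          - Qwen/Qwen3.5-0.8B
          - --dyn-reasoning-parser
          - qwen3
          - --dyn-tool-call-parser
          - qwen3_coder
```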

Closes #8988.

Notes

Aggregated single-GPU vLLM recipe for Qwen/Qwen3.5-0.8B (multimodal, hybrid Mamba+Attention) targeting vllm-runtime:1.1.0. Includes the qwen3 reasoning parser and qwen3_coder tool-call parser flags on the worker. Smoke-tested end-to-end on 1x H100 with both chat completion and tool calling.
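A chat-completions request of the kind exercised in such a smoke test might look like the following JSON body, posted to the frontend's OpenAI-compatible `/v1/chat/completions` endpoint (the tool definition here is purely illustrative, not from the recipe):

```json
{
  "model": "Qwen/Qwen3.5-0.8B",
  "messages": [
    {"role": "user", "content": "What is the weather in Zurich?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }
  ]
}
```

A successful tool-calling round trip would return a `tool_calls` entry in the assistant message, produced by the `qwen3_coder` parser from the model's XML-style tool-call output.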

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@MatejKosec requested review from a team as code owners May 7, 2026 21:46
@github-actions Bot added the `documentation` (Improvements or additions to documentation) and `feat` labels May 7, 2026

coderabbitai Bot commented May 7, 2026


Walkthrough

This PR adds a complete Qwen3.5-0.8B aggregated model deployment recipe. It includes a DynamoGraphDeployment manifest (vLLM with Frontend/VllmWorker services), Kubernetes PVC definitions for model and compilation caches, a model-download Job for HuggingFace artifact preparation, and comprehensive README documentation with quick-start instructions and operational guidance.

Changes

Qwen3.5-0.8B Aggregated Deployment Recipe

| Layer / File(s) | Summary |
|---|---|
| Deployment Definition<br>`recipes/qwen3.5-0.8b/vllm/agg/deploy.yaml` | DynamoGraphDeployment `qwen3-5-0-8b-agg` specifies Frontend and VllmWorker services. VllmWorker configures vLLM with the Qwen-specific reasoning (`qwen3`) and tool-call (`qwen3_coder`) parsers, multimodal support, tensor-parallel size 1, GPU memory utilization 0.85, max model length 32768, and prefix caching. |
| Storage Infrastructure<br>`recipes/qwen3.5-0.8b/model-cache/model-cache.yaml` | Two PersistentVolumeClaim manifests: `model-cache` (20Gi) and `compilation-cache` (10Gi), both with ReadWriteOnce access and a placeholder `storageClassName`. |
| Model Download Job<br>`recipes/qwen3.5-0.8b/model-cache/model-download.yaml` | Kubernetes batch Job that downloads the Qwen/Qwen3.5-0.8B model via huggingface_hub 1.11.0, pulls the HF token from `hf-token-secret`, and writes to the model-cache PVC at `/home/dynamo/.cache/huggingface`, with `backoffLimit: 3` and a single completion/parallelism. |
| Documentation & Quick-Start<br>`recipes/qwen3.5-0.8b/README.md` | README documents configuration options, prerequisites (1x H100 80GB, Dynamo, HF token secret), quick-start commands (create secrets, apply PVCs, deploy graph, test `/v1/chat/completions`), parser wiring details, architecture notes on the vLLM model registry and EOS workaround, and validation notes (vLLM runtime 1.1.0, CLI flag `--no-enable-log-requests`, Qwen3.5 hybrid-attention caching behavior). |
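The two cache PVCs described above can be sketched roughly as follows (sizes and access mode taken from the summary; the storage class is a placeholder the operator must fill in for their cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: "your-storage-class-name"  # placeholder, cluster-specific
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: "your-storage-class-name"  # placeholder, cluster-specific
  resources:
    requests:
      storage: 10Gi
```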

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(recipes): add Qwen3.5-0.8B vLLM aggregated recipe' clearly describes the main change: adding a new recipe for Qwen3.5-0.8B with vLLM aggregated configuration. |
| Linked Issues check | ✅ Passed | The PR successfully addresses both requirements from issue #8988: it pins vLLM to version 1.1.0 (which supports the qwen3_5 model type) and uses qwen3_coder as the tool-call parser, with test confirmation of proper architecture resolution and tool-call functionality. |
| Out of Scope Changes check | ✅ Passed | All changes are directly within scope: README documentation, Kubernetes manifests for model caching, job configuration, and vLLM deployment configuration are all necessary for the single-GPU Qwen3.5-0.8B aggregated recipe implementation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The PR description provides a clear summary, links a closing issue, and includes helpful notes about prerequisites and naming conventions. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/qwen3.5-0.8b/model-cache/model-download.yaml`:
- Around lines 17-44: The `model-download` container has no securityContext and therefore runs as root. Add a securityContext block under the container spec (name: `model-download`) to harden defaults: set `runAsNonRoot: true` with a non-root `runAsUser` (e.g. 1000), set `allowPrivilegeEscalation: false`, and consider adding `readOnlyRootFilesystem: true` and a seccompProfile or runtimeClass. With this in place the pod no longer runs as root and privilege escalation is disabled.
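A hardened container spec along the lines of this comment might look like the following sketch (surrounding Job fields omitted; the UID is an example, any non-root UID works):

```yaml
containers:
  - name: model-download
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000                  # example non-root UID
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true     # optional extra hardening
      seccompProfile:
        type: RuntimeDefault           # optional extra hardening
```

Note that `readOnlyRootFilesystem: true` requires the download path to live on a writable volume mount, which the model-cache PVC mount at `/home/dynamo/.cache/huggingface` already provides.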

In `@recipes/qwen3.5-0.8b/README.md`:
- Around lines 73-78: The fenced log snippet in recipes/qwen3.5-0.8b/README.md lacks a language identifier, which triggers markdownlint MD040. Add a language token such as `text` to the opening fence of the block containing lines like `2026-05-07T21:16:07  INFO model.__post_init__: Resolved architecture: Qwen3_5ForConditionalGeneration`, and make sure the block is closed with matching backticks.
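Concretely, the fix is a one-token change to the opening fence (log line quoted from the comment above):

````markdown
```text
2026-05-07T21:16:07  INFO model.__post_init__: Resolved architecture: Qwen3_5ForConditionalGeneration
```
````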
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95d0a12e-9c04-458b-94d3-36f62406f148

📥 Commits

Reviewing files that changed from the base of the PR and between 91516b0 and f03f250.

📒 Files selected for processing (4)
  • recipes/qwen3.5-0.8b/README.md
  • recipes/qwen3.5-0.8b/model-cache/model-cache.yaml
  • recipes/qwen3.5-0.8b/model-cache/model-download.yaml
  • recipes/qwen3.5-0.8b/vllm/agg/deploy.yaml

Comment thread on `recipes/qwen3.5-0.8b/model-cache/model-download.yaml` (outdated)
Comment thread on `recipes/qwen3.5-0.8b/README.md` (outdated)
The model-cache PVC manifest and the HF download Job were template-only (`storageClassName: "your-storage-class-name"`) and don't belong in the recipe: the deploy.yaml already declares PVCs with `create: false`, leaving cache provisioning to the operator. Keep only the deploy.yaml and README, and update the Quick Start to point at the deploy directly.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@pull-request-size Bot added the `size/M` label and removed `size/L` May 7, 2026
Recipe is small enough to read directly from deploy.yaml; the parser flags and model ID are inline, no separate documentation surface needed.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>

Labels

`documentation` (Improvements or additions to documentation), `feat`, `size/M`

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Support for qwen3_xml tool call parser and qwen3_5 model type

2 participants