
feat(recipes): add Qwen3.5-0.8B vLLM aggregated recipe#9285

Open
MatejKosec wants to merge 4 commits into `main` from `user/mkosec/qwen3.5-vllm-recipe`

Conversation


@MatejKosec MatejKosec commented May 7, 2026

Summary

  • Adds an aggregated single-GPU vLLM recipe for Qwen/Qwen3.5-0.8B under recipes/qwen3.5-0.8b/.
  • Targets nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.0. The recipe will not work against 1.0.0: that image's vLLM and Transformers pre-date the qwen3_5 model type.
  • Wires --dyn-reasoning-parser qwen3 and --dyn-tool-call-parser qwen3_coder on the worker. qwen3_coder is Dynamo's registered Qwen3-family XML tool-call parser.
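The parser flags above could be wired on the worker roughly like this (a sketch only: the manifest field layout and service names are assumptions, not a copy of the recipe's actual deploy.yaml):

```yaml
# Sketch of the relevant VllmWorker args; structure is illustrative.
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        args:
          - --model
          - Qwen/Qwen3.5-0.8B
          - --dyn-reasoning-parser
          - qwen3
          - --dyn-tool-call-parser
          - qwen3_coder
```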

Closes #8988.

Notes

Aggregated single-GPU vLLM recipe for Qwen/Qwen3.5-0.8B (multimodal, hybrid Mamba+Attention) targeting vllm-runtime:1.1.0. Includes the qwen3 reasoning parser and qwen3_coder tool-call parser flags on the worker. Smoke-tested end-to-end on 1x H100 with both chat completion and tool calling.
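A chat-completions request of the kind exercised in such a smoke test might look like the following JSON body, posted to the frontend's OpenAI-compatible `/v1/chat/completions` endpoint (the tool definition here is purely illustrative, not from the recipe):

```json
{
  "model": "Qwen/Qwen3.5-0.8B",
  "messages": [
    {"role": "user", "content": "What is the weather in Zurich?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }
  ]
}
```

A successful tool-calling round trip would return a `tool_calls` entry in the assistant message, produced by the `qwen3_coder` parser from the model's XML-style tool-call output.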

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@MatejKosec requested review from a team as code owners May 7, 2026 21:46
@github-actions Bot added the `documentation` (Improvements or additions to documentation) and `feat` labels May 7, 2026

coderabbitai Bot commented May 7, 2026


Walkthrough

This PR adds a complete Qwen3.5-0.8B aggregated model deployment recipe. It includes a DynamoGraphDeployment manifest (vLLM with Frontend/VllmWorker services), Kubernetes PVC definitions for model and compilation caches, a model-download Job for HuggingFace artifact preparation, and comprehensive README documentation with quick-start instructions and operational guidance.

Changes

Qwen3.5-0.8B Aggregated Deployment Recipe

| Layer / File(s) | Summary |
|---|---|
| Deployment Definition<br>`recipes/qwen3.5-0.8b/vllm/agg/deploy.yaml` | DynamoGraphDeployment `qwen3-5-0-8b-agg` specifies Frontend and VllmWorker services. VllmWorker configures vLLM with the Qwen-specific reasoning (`qwen3`) and tool-call (`qwen3_coder`) parsers, multimodal support, tensor-parallel size 1, GPU memory utilization 0.85, max model length 32768, and prefix caching. |
| Storage Infrastructure<br>`recipes/qwen3.5-0.8b/model-cache/model-cache.yaml` | Two PersistentVolumeClaim manifests: `model-cache` (20Gi) and `compilation-cache` (10Gi), both with ReadWriteOnce access and a placeholder `storageClassName`. |
| Model Download Job<br>`recipes/qwen3.5-0.8b/model-cache/model-download.yaml` | Kubernetes batch Job that downloads the Qwen/Qwen3.5-0.8B model via huggingface_hub 1.11.0, pulls the HF token from `hf-token-secret`, and writes to the model-cache PVC at `/home/dynamo/.cache/huggingface`, with `backoffLimit: 3` and a single completion/parallelism. |
| Documentation & Quick-Start<br>`recipes/qwen3.5-0.8b/README.md` | README documents configuration options, prerequisites (1x H100 80GB, Dynamo, HF token secret), quick-start commands (create secrets, apply PVCs, deploy graph, test `/v1/chat/completions`), parser wiring details, architecture notes on the vLLM model registry and EOS workaround, and validation notes (vLLM runtime 1.1.0, CLI flag `--no-enable-log-requests`, Qwen3.5 hybrid-attention caching behavior). |
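The two cache PVCs described above can be sketched roughly as follows (sizes and access mode taken from the summary; the storage class is a placeholder the operator must fill in for their cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: "your-storage-class-name"  # placeholder, cluster-specific
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: "your-storage-class-name"  # placeholder, cluster-specific
  resources:
    requests:
      storage: 10Gi
```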

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(recipes): add Qwen3.5-0.8B vLLM aggregated recipe' clearly describes the main change: adding a new recipe for Qwen3.5-0.8B with vLLM aggregated configuration. |
| Linked Issues check | ✅ Passed | The PR successfully addresses both requirements from issue #8988: it pins vLLM to version 1.1.0 (which supports the qwen3_5 model type) and uses qwen3_coder as the tool-call parser, with test confirmation of proper architecture resolution and tool-call functionality. |
| Out of Scope Changes check | ✅ Passed | All changes are directly within scope: README documentation, Kubernetes manifests for model caching, job configuration, and vLLM deployment configuration are all necessary for the single-GPU Qwen3.5-0.8B aggregated recipe implementation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The PR description provides a clear summary, links a closing issue, and includes helpful notes about prerequisites and naming conventions. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/qwen3.5-0.8b/model-cache/model-download.yaml`:
- Around lines 17-44: The `model-download` container has no securityContext and therefore runs as root. Add a securityContext block under the container spec (name: `model-download`) to harden defaults: set `runAsNonRoot: true` with a non-root `runAsUser` (e.g. 1000), set `allowPrivilegeEscalation: false`, and consider adding `readOnlyRootFilesystem: true` and a seccompProfile or runtimeClass. With this in place the pod no longer runs as root and privilege escalation is disabled.
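A hardened container spec along the lines of this comment might look like the following sketch (surrounding Job fields omitted; the UID is an example, any non-root UID works):

```yaml
containers:
  - name: model-download
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000                  # example non-root UID
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true     # optional extra hardening
      seccompProfile:
        type: RuntimeDefault           # optional extra hardening
```

Note that `readOnlyRootFilesystem: true` requires the download path to live on a writable volume mount, which the model-cache PVC mount at `/home/dynamo/.cache/huggingface` already provides.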

In `@recipes/qwen3.5-0.8b/README.md`:
- Around lines 73-78: The fenced log snippet in recipes/qwen3.5-0.8b/README.md lacks a language identifier, which triggers markdownlint MD040. Add a language token such as `text` to the opening fence of the block containing lines like `2026-05-07T21:16:07  INFO model.__post_init__: Resolved architecture: Qwen3_5ForConditionalGeneration`, and make sure the block is closed with matching backticks.
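Concretely, the fix is a one-token change to the opening fence (log line quoted from the comment above):

````markdown
```text
2026-05-07T21:16:07  INFO model.__post_init__: Resolved architecture: Qwen3_5ForConditionalGeneration
```
````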
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95d0a12e-9c04-458b-94d3-36f62406f148

📥 Commits

Reviewing files that changed from the base of the PR and between 91516b0 and f03f250.

📒 Files selected for processing (4)
  • recipes/qwen3.5-0.8b/README.md
  • recipes/qwen3.5-0.8b/model-cache/model-cache.yaml
  • recipes/qwen3.5-0.8b/model-cache/model-download.yaml
  • recipes/qwen3.5-0.8b/vllm/agg/deploy.yaml

Comment thread on `recipes/qwen3.5-0.8b/model-cache/model-download.yaml` (outdated)
Comment thread on `recipes/qwen3.5-0.8b/README.md` (outdated)
The model-cache PVC manifest and the HF download Job were template-only (`storageClassName: "your-storage-class-name"`) and don't belong in the recipe: the deploy.yaml already declares PVCs with `create: false`, leaving cache provisioning to the operator. Keep only the deploy.yaml and README, and update the Quick Start to point at the deploy directly.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
@pull-request-size Bot added the `size/M` label and removed `size/L` May 7, 2026
Recipe is small enough to read directly from deploy.yaml; the parser flags and model ID are inline, no separate documentation surface needed.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>

Labels

`documentation` (Improvements or additions to documentation), `feat`, `size/M`

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Support for qwen3_xml tool call parser and qwen3_5 model type

2 participants