examples/models/qwen3_5_moe: CUDA Engine/Session adapter + OpenAI serving by mergennachin · Pull Request #20043 · pytorch/executorch

mergennachin · 2026-06-04T18:48:09Z

Implement Qwen35MoEEngine / Qwen35MoESession (the model-agnostic
LLMEngine / LLMSession contract) over the exported prefill/decode methods.
serving_capacity() reports a single physical session; the model is
hybrid_recurrent with seek() NotSupported (no prefix reuse). main.cpp is a thin
CLI over the engine/session.
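To make the contract above concrete, here is a minimal Python sketch of an engine that reports a single physical session and a session that refuses seek(). It is only an illustration of the shape described in this PR, not the actual C++ interface: `ToyEngine`, `ToySession`, `ServingCapacity`, its field names, and the choice to raise on a second `create_session()` are all hypothetical, and the integer "state" merely stands in for recurrent state that, being a running summary, cannot be rewound to an earlier position.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class NotSupported(Exception):
    """Raised when a session cannot honor an optional operation."""


@dataclass(frozen=True)
class ServingCapacity:
    physical_sessions: int  # sessions that can hold live state at once
    prefix_reuse: bool      # whether a session can rewind to a shared prefix


class ToySession:
    """Hypothetical stand-in for a session over prefill/decode methods."""

    def __init__(self, on_close: Callable[[], None]) -> None:
        self.position = 0
        self._state = 0  # running summary, standing in for recurrent state
        self._on_close = on_close

    def _step(self, token: int) -> None:
        # Fold the token into the summary; earlier states are not kept.
        self._state = (self._state * 31 + token) % 1_000_003
        self.position += 1

    def prefill(self, tokens: List[int]) -> None:
        for token in tokens:
            self._step(token)

    def decode_one(self, token: int) -> int:
        self._step(token)
        return self._state % 1000  # fake "next token" id

    def seek(self, position: int) -> None:
        # No per-position history exists, so rewinding is refused.
        raise NotSupported("seek: recurrent state cannot be rewound")

    def close(self) -> None:
        self._on_close()


class ToyEngine:
    """Hypothetical engine that owns at most one live session."""

    def __init__(self) -> None:
        self._active: Optional[ToySession] = None

    def serving_capacity(self) -> ServingCapacity:
        return ServingCapacity(physical_sessions=1, prefix_reuse=False)

    def create_session(self) -> ToySession:
        # Assumption: a second concurrent session is rejected outright.
        if self._active is not None:
            raise RuntimeError("the single physical session is already in use")
        self._active = ToySession(on_close=self._release)
        return self._active

    def _release(self) -> None:
        self._active = None
```

A caller that wants to reuse a prompt prefix would check `serving_capacity().prefix_reuse` first and fall back to a fresh prefill when it is false.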

OpenAI serving runs process-isolated and model execution stays in C++: serve.py
is the control plane (FastAPI, chat templating, Qwen XML tool parsing,
validation; no CUDA, no pybind) and spawns qwen3_5_moe_worker
(qwen35_moe_worker.cpp), a C++ worker that constructs the engine and one session
and speaks the same JSONL protocol as the generic text worker. Executing the
AOTI CUDA model inside a live asyncio server process segfaults in the int4
matmul; isolating it in a plain worker process makes serving reliable while
loading weights once. Single-slot: concurrent requests queue. Tool calls use the
Qwen XML <function=...> format (QwenFunctionCallDetector).
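The process-isolation and single-slot behaviour described above can be sketched as follows. This is a minimal illustration, not the code in serve.py: the `WorkerClient` class and the message fields (`id`, `prompt`, `text`, `done`) are hypothetical, since the real JSONL schema is not shown here, and a small Python script stands in for the C++ worker binary so the example runs anywhere. The point is that the server process only writes and reads JSON lines, and an `asyncio.Lock` makes concurrent requests queue behind the one in flight.

```python
import asyncio
import json
import sys

# Stand-in for the C++ worker: one JSON request per stdin line, one JSON
# reply per stdout line. The field names are hypothetical.
FAKE_WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    reply = {"id": req["id"], "text": req["prompt"].upper(), "done": True}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class WorkerClient:
    """Owns one worker process and serializes requests to it."""

    def __init__(self, argv):
        self._argv = argv
        self._proc = None
        self._lock = asyncio.Lock()  # single slot: one request in flight

    async def start(self):
        self._proc = await asyncio.create_subprocess_exec(
            *self._argv,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )

    async def generate(self, request_id, prompt):
        # Concurrent callers wait on the lock, so requests queue.
        async with self._lock:
            line = json.dumps({"id": request_id, "prompt": prompt}) + "\n"
            self._proc.stdin.write(line.encode())
            await self._proc.stdin.drain()
            reply = await self._proc.stdout.readline()
            return json.loads(reply)

    async def close(self):
        # Closing stdin ends the worker's read loop; then reap the process.
        self._proc.stdin.close()
        await self._proc.wait()


async def demo():
    client = WorkerClient([sys.executable, "-c", FAKE_WORKER])
    await client.start()
    try:
        # Two concurrent requests; the lock runs them one after the other.
        return await asyncio.gather(
            client.generate(1, "hello"),
            client.generate(2, "world"),
        )
    finally:
        await client.close()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the model lives in the child process, the worker loads weights once at startup and the asyncio process never touches CUDA; a real worker would stream several lines per request rather than a single reply.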

Review order: qwen35_moe_engine.{h,cpp} (adapter) and main.cpp; then
qwen35_moe_worker.cpp and serve.py (serving); then tests and docs.

Part of #20001

[ghstack-poisoned]

mergennachin · 2026-06-04T18:48:10Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-04T18:48:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20043

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Pending

As of commit 462416d with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

pull / android / run-emulator (gh)
The process '/usr/bin/sh' failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 767e9801d9b145094719d4a90ac656c2a6486fed3c66ab56c8d7d6478f0427ce /exec failed with exit code 1
pull / unittest / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 89a3e559d8cdd2d1db3e47ea7b024a43f4c74c219ad81b8e53121b2402813664 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[INITIAL] Update

b9daa77

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 4, 2026 18:48

meta-cla Bot added the CLA Signed label Jun 4, 2026

mergennachin marked this pull request as draft June 4, 2026 18:51

[UPDATE] Update

9c4ea67

[UPDATE] Update

cf4f995

[UPDATE] Update

f031ce8

[UPDATE] Update

955a300

mergennachin marked this pull request as ready for review June 5, 2026 19:02

[UPDATE] Update

462416d

mergennachin mentioned this pull request Jun 8, 2026

Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions #20117

Open

mergennachin wants to merge 6 commits into gh/mergennachin/6/head from gh/mergennachin/7/head
