Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions by mergennachin · Pull Request #20117 · pytorch/executorch

mergennachin · 2026-06-08T19:19:43Z

Proves the V2 core independently of any serving change: one physical AOTI-CUDA
Qwen model with weights loaded once can host multiple logical sessions, each with
its own KV/conv/recurrent state, with no cross-session bleed. The mechanism is
the AOTI user-managed-constant rebind, kept CUDA-backend-private; the public
surface stays LLMEngine/LLMSession. This PR adds no multi-session serving wiring
(that is a follow-up PR); the worker/runner still create a single session.
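
As a rough illustration of the isolation property described above, here is a toy Python sketch. Everything in it (`ToyEngine`, `new_session_state`, `step`, the `kv_cache` key) is hypothetical and stands in for the real engine only conceptually; it does not use the ExecuTorch or AOTI APIs. It shows shared weights, per-session state cloned from a template, and a rebind under a lock before each step.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for one shared model; not the ExecuTorch API."""

    weights: tuple          # loaded once, shared by every session
    state_template: dict    # initial per-session state, captured at load
    _bound: dict = field(default=None, repr=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def new_session_state(self) -> dict:
        # Clone the template so no two sessions share a buffer.
        return {name: list(buf) for name, buf in self.state_template.items()}

    def step(self, session_state: dict, token: int) -> int:
        with self._lock:
            # "Rebind": point the shared model at this session's buffers.
            self._bound = session_state
            history = self._bound["kv_cache"]
            history.append(token)
            # Toy forward pass: depends only on weights and this session's history.
            return (sum(history) * self.weights[0]) % 1000


engine = ToyEngine(weights=(7,), state_template={"kv_cache": []})
tokens = [3, 1, 4]

# Session A running alone.
solo_state = engine.new_session_state()
solo = [engine.step(solo_state, t) for t in tokens]

# Session A interleaved with an unrelated session B.
a, b = engine.new_session_state(), engine.new_session_state()
inter = []
for t in tokens:
    inter.append(engine.step(a, t))
    engine.step(b, t + 100)  # B's traffic must not affect A's outputs
```

In this sketch `inter` equals `solo`, which mirrors the "solo vs. interleaved" comparison used for validation in this PR, though the real check compares generated token IDs from the actual model.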

Review order:

1. examples/models/qwen3_5_moe/export.py - the model-facing change: emits
   get_mutable_buffer_metadata ({"version":1,"mutable_buffers":[fqn...]}) naming
   the model's per-session state buffers (FQN list only; the backend owns the
   tensor descriptors). The FQNs come from the model's explicit module contract.
2. backends/cuda/runtime/cuda_mutable_state.h - CUDA-private API, keyed by a
   per-engine context (not a generic Module/Method/BackendInterface API, and not
   process-global).
3. backends/cuda/runtime/cuda_mutable_state.cpp - context-keyed manager:
   descriptor table + initial-template capture at load, per-session GPU buffers
   cloned from the template, rebind via update_user_managed_constant_buffer_pairs,
   and declared-vs-discovered FQN coverage validation.
4. backends/cuda/runtime/cuda_backend.cpp - two hooks: note_handle in init(),
   rebind_for_execute in execute().
5. examples/models/qwen3_5_moe/qwen35_moe_engine.{h,cpp} - the engine owns one
   shared Module and its context; sessions rebind their own state under the
   engine lock; adds serving_capacity, capacity enforcement, the coverage check,
   and context teardown. Also removes the now-incompatible cuda_graph flag (a
   captured graph's baked pointers would ignore per-session rebinds), dropping it
   from the engine config, the qwen3_5_moe_worker/runner (main.cpp) CLIs, and the
   README/model.md docs.
6. backends/cuda/{CMakeLists.txt,runtime/TARGETS} and the qwen CMakeLists - build
   wiring (CMake + Buck).
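
To make the metadata shape and the declared-vs-discovered coverage check more concrete, here is a minimal Python sketch. Only the JSON shape `{"version":1,"mutable_buffers":[...]}` comes from this PR; the helper names (`build_mutable_buffer_metadata`, `check_coverage`), the sorted ordering, and the example FQNs are assumptions made for illustration and may not match the actual export.py or the C++ validation.

```python
import json


def build_mutable_buffer_metadata(fqns):
    """Serialize the declared per-session buffer names (FQNs only)."""
    return json.dumps(
        {"version": 1, "mutable_buffers": sorted(fqns)},  # sorting is an assumption
        separators=(",", ":"),
    )


def check_coverage(metadata_json, discovered_fqns):
    """Return (ok, missing, undeclared) for declared vs. discovered FQNs."""
    meta = json.loads(metadata_json)
    if meta.get("version") != 1:
        raise ValueError(f"unsupported metadata version: {meta.get('version')!r}")
    declared = set(meta["mutable_buffers"])
    discovered = set(discovered_fqns)
    missing = sorted(declared - discovered)      # declared but not found at load
    undeclared = sorted(discovered - declared)   # found at load but never declared
    return (not missing and not undeclared), missing, undeclared


# Hypothetical FQNs, for illustration only.
declared = ["layers.0.attn.k_cache", "layers.0.attn.v_cache", "layers.1.conv_state"]
metadata = build_mutable_buffer_metadata(declared)

ok, missing, undeclared = check_coverage(metadata, declared)
partial = check_coverage(metadata, declared[:2] + ["layers.1.recurrent_state"])
```

Treating any mismatch in either direction as a failure is one plausible reading of "do not fully match"; the real check may apply different rules.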

Validated on the real 35B-A3B model (HQQ-INT4) with two interleaved sessions:

after engine load: 17983 MB | capacity max_sessions=4, est 117309440 B/session
A solo : 348 10 4838 1665 15 16 17 18 19 20 21 22
A inter: 348 10 4838 1665 15 16 17 18 19 20 21 22 (identical -> no bleed)
GPU: engine=17983MB, +2 sessions=+108MB (weights once; ~112 MiB state/session)
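
As a quick unit check on the figures in the log, the short Python sketch below converts the estimated bytes per session to MiB. Multiplying by max_sessions is an assumption about how the estimate scales, not a measurement from this PR.

```python
MIB = 1024 * 1024


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


EST_BYTES_PER_SESSION = 117_309_440  # "est 117309440 B/session" from the log
MAX_SESSIONS = 4                     # "capacity max_sessions=4" from the log

per_session_mib = bytes_to_mib(EST_BYTES_PER_SESSION)
# Assumption: state cost scales linearly with the number of sessions.
full_capacity_mib = per_session_mib * MAX_SESSIONS

print(f"{per_session_mib:.3f} MiB/session, "
      f"{full_capacity_mib:.1f} MiB at max_sessions={MAX_SESSIONS}")
# prints "111.875 MiB/session, 447.5 MiB at max_sessions=4"
```

The 111.875 MiB result is consistent with the "~112 MiB state/session" figure quoted in the log.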

Fails closed to single-session capacity if the AOTI constant-management symbols
are absent or the declared mutable FQNs do not fully match the loaded methods.
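
The fallback rule above can be sketched as a small Python function. The function name and parameters here are hypothetical and only restate the two conditions from the description; the actual serving_capacity logic lives in the C++ engine and may consider more inputs (for example, available GPU memory).

```python
def serving_capacity(requested_max_sessions: int,
                     rebind_symbols_present: bool,
                     coverage_ok: bool) -> int:
    """Hypothetical sketch: allow many sessions only when rebinding is provably safe."""
    if not (rebind_symbols_present and coverage_ok):
        # Either precondition failing drops the engine to one session.
        return 1
    return max(1, requested_max_sessions)
```

For example, `serving_capacity(4, True, True)` returns 4, while `serving_capacity(4, False, True)` and `serving_capacity(4, True, False)` both return 1.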

MLX V2 (shared constants + per-session MutableBufferData) is the next backend and
is not addressed here.

Part of #20001


mergennachin · 2026-06-08T19:19:45Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-08T19:19:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20117

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Pending

As of commit 394b0c1 with merge base eeb0646:

NEW FAILURES - The following jobs have failed:

pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 55df65f0a71d55f23ee6c83d0834540dfe6eb2c5314aea34ed124425c8149ca8 /exec failed with exit code 1
pull / unittest / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 8674918466c11fb6a0416c12590490cc2c7cff9f10afe110b684cfc630413977 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.


[INITIAL] Update

76dd40c


mergennachin requested review from kirklandsign and larryliu0820 as code owners June 8, 2026 19:19

meta-cla Bot added the CLA Signed label Jun 8, 2026

[UPDATE] Update

394b0c1


mergennachin wants to merge 2 commits into gh/mergennachin/7/head from gh/mergennachin/8/head
