Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions #20117
Open
mergennachin wants to merge 2 commits into
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20117
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Pending
As of commit 394b0c1 with merge base eeb0646:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This was referenced Jun 8, 2026
mergennachin added a commit that referenced this pull request Jun 8, 2026
Proves the V2 core independently of any serving change: one physical AOTI-CUDA
Qwen model with weights loaded once can host multiple logical sessions, each with
its own KV/conv/recurrent state, with no cross-session bleed. The mechanism is
the AOTI user-managed-constant rebind, kept CUDA-backend-private; the public
surface stays LLMEngine/LLMSession. No multi-session serving wiring here (that is
a follow-up PR); the worker/runner still create a single session.
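
As a rough illustration of the idea in the paragraph above (not the PR's actual C++ implementation), the following toy sketch models one shared set of weights with per-session state that is rebound before each step. `ToyEngine`, `SessionState`, and the `kv_cache` buffer name are hypothetical names chosen for this example; the real mechanism rebinds GPU buffers through the AOTI user-managed-constant path.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SessionState:
    """Per-session mutable buffers, cloned from the load-time template."""
    buffers: dict[str, list[float]]


@dataclass
class ToyEngine:
    """One shared 'model' (weights loaded once) hosting many sessions."""
    weights: tuple[float, ...]
    template: dict[str, list[float]]
    bound: SessionState | None = None  # state currently bound for execution

    def new_session(self) -> SessionState:
        # Clone the template so sessions never alias each other's buffers.
        return SessionState({k: list(v) for k, v in self.template.items()})

    def step(self, session: SessionState, token: float) -> float:
        # Rebind: point the shared model at this session's state, then run.
        self.bound = session
        cache = self.bound.buffers["kv_cache"]
        cache.append(token)
        return sum(self.weights) * sum(cache)


engine = ToyEngine(weights=(0.5, 1.5), template={"kv_cache": []})

# Session A run alone.
solo = engine.new_session()
solo_out = [engine.step(solo, t) for t in (1.0, 2.0, 3.0)]

# Session A interleaved with session B on the same engine.
a, b = engine.new_session(), engine.new_session()
inter_out = []
for t in (1.0, 2.0, 3.0):
    inter_out.append(engine.step(a, t))
    engine.step(b, t * 10)  # B mutates only its own buffers

# Identical outputs mean B's steps did not leak into A's state.
print(solo_out == inter_out)
```

The solo-versus-interleaved comparison mirrors the shape of the validation reported further down: if the interleaved run of a session matches its solo run, the other session's state did not bleed into it.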
Review order:
1. examples/models/qwen3_5_moe/export.py - the model-facing change: emits
get_mutable_buffer_metadata ({"version":1,"mutable_buffers":[fqn...]}) naming
the model's per-session state buffers (FQN list only; the backend owns the
tensor descriptors). The FQNs come from the model's explicit module contract.
2. backends/cuda/runtime/cuda_mutable_state.h - CUDA-private API, keyed by a
per-engine context (not a generic Module/Method/BackendInterface API, and not
process-global).
3. backends/cuda/runtime/cuda_mutable_state.cpp - context-keyed manager:
descriptor table + initial-template capture at load, per-session GPU buffers
cloned from the template, rebind via update_user_managed_constant_buffer_pairs,
and declared-vs-discovered FQN coverage validation.
4. backends/cuda/runtime/cuda_backend.cpp - two hooks: note_handle in init(),
rebind_for_execute in execute().
5. examples/models/qwen3_5_moe/qwen35_moe_engine.{h,cpp} - the engine owns one
shared Module and its context; sessions rebind their own state under the
engine lock; adds serving_capacity, capacity enforcement, the coverage check,
and context teardown. Also removes the now-incompatible cuda_graph flag (a
captured graph's baked pointers would ignore per-session rebinds), dropping it
from the engine config, the qwen3_5_moe_worker/runner (main.cpp) CLIs, and the
README/model.md docs.
6. backends/cuda/{CMakeLists.txt,runtime/TARGETS} and the qwen CMakeLists - build
wiring (CMake + Buck).
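
To make items 1 and 3 more concrete, here is a minimal sketch of parsing a metadata payload of the shape `{"version":1,"mutable_buffers":[fqn...]}` and comparing the declared FQNs with those the backend discovers. The function names and the example FQNs are assumptions made for illustration; the actual check lives in the C++ runtime and may differ in detail.

```python
import json


def parse_mutable_buffer_metadata(payload: str) -> list[str]:
    """Parse the metadata JSON and return the declared buffer FQNs."""
    doc = json.loads(payload)
    if doc.get("version") != 1:
        raise ValueError(f"unsupported metadata version: {doc.get('version')!r}")
    fqns = doc.get("mutable_buffers")
    if not isinstance(fqns, list) or not all(isinstance(f, str) for f in fqns):
        raise ValueError("mutable_buffers must be a list of strings")
    return fqns


def check_coverage(declared: list[str], discovered: list[str]):
    """Return (ok, missing, undeclared) for declared-vs-discovered FQNs."""
    declared_set, discovered_set = set(declared), set(discovered)
    missing = declared_set - discovered_set      # declared but not found
    undeclared = discovered_set - declared_set   # found but not declared
    return (not missing and not undeclared), missing, undeclared


# Hypothetical FQNs, for illustration only.
payload = json.dumps(
    {"version": 1, "mutable_buffers": ["layers.0.k_cache", "layers.0.v_cache"]}
)
declared = parse_mutable_buffer_metadata(payload)

ok, missing, undeclared = check_coverage(
    declared, ["layers.0.k_cache", "layers.0.v_cache"]
)
print(ok, sorted(missing), sorted(undeclared))

bad_ok, bad_missing, bad_undeclared = check_coverage(
    declared, ["layers.0.k_cache", "layers.0.conv_state"]
)
print(bad_ok, sorted(bad_missing), sorted(bad_undeclared))
```

Treating any mismatch in either direction as a failure is one plausible reading of "fully match"; a looser policy would need a separate justification.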
Validated on the real 35B-A3B model (HQQ-INT4) with two interleaved sessions:
after engine load: 17983 MB | capacity max_sessions=4, est 117309440 B/session
A solo : 348 10 4838 1665 15 16 17 18 19 20 21 22
A inter: 348 10 4838 1665 15 16 17 18 19 20 21 22 (identical -> no bleed)
GPU: engine=17983MB, +2 sessions=+108MB (weights once; ~112 MiB state/session)
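
For reference, the per-session estimate in the log can be converted to MiB with simple arithmetic. The snippet below only restates the reported numbers; the multiplication by `max_sessions` is an illustrative upper bound on state memory, not a figure taken from the PR.

```python
MIB = 1024 * 1024

est_bytes_per_session = 117_309_440  # "est 117309440 B/session" from the log
max_sessions = 4                     # "capacity max_sessions=4" from the log

per_session_mib = est_bytes_per_session / MIB
total_state_mib = per_session_mib * max_sessions  # illustrative upper bound

print(f"{per_session_mib:.3f} MiB per session")
print(f"{total_state_mib:.1f} MiB of state for {max_sessions} sessions")
```

The per-session value works out to 111.875 MiB, which matches the "~112 MiB state/session" note in the log.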
Fails closed: capacity drops to a single session if the AOTI constant-management
symbols are absent or the declared mutable FQNs do not fully match the loaded methods.
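
The fallback rule described above can be sketched as a small decision function. This is an assumption-laden illustration: `serving_capacity` is named in the review notes, but its signature, parameters, and exact policy here are hypothetical.

```python
def serving_capacity(
    requested_max_sessions: int,
    aoti_symbols_present: bool,
    declared_fqns: list[str],
    discovered_fqns: list[str],
) -> int:
    """Return the number of sessions the engine may host (hypothetical policy)."""
    if requested_max_sessions < 1:
        raise ValueError("requested_max_sessions must be at least 1")
    # Without the constant-management symbols, rebinding is impossible.
    if not aoti_symbols_present:
        return 1
    # Any mismatch means some state might be shared, so refuse to multiplex.
    if set(declared_fqns) != set(discovered_fqns):
        return 1
    return requested_max_sessions


# Hypothetical FQNs, for illustration only.
fqns = ["layers.0.k_cache", "layers.0.v_cache"]
print(serving_capacity(4, True, fqns, fqns))
print(serving_capacity(4, False, fqns, fqns))
print(serving_capacity(4, True, fqns, fqns[:1]))
```

Returning 1 rather than raising keeps the single-session path usable when multi-session isolation cannot be guaranteed.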
MLX V2 (shared constants + per-session MutableBufferData) is the next backend and
is not addressed here.
ghstack-source-id: 6bd59c4
ghstack-comment-id: 4652591759
Pull-Request: #20117
mergennachin added a commit that referenced this pull request Jun 8, 2026
Part of #20001
ghstack-source-id: b3d390f
ghstack-comment-id: 4652591759
Pull-Request: #20117