Qwen3.5-MoE CUDA V2 foundation: one model, many isolated sessions #20117

Open
mergennachin wants to merge 2 commits into gh/mergennachin/7/head from gh/mergennachin/8/head

mergennachin (Contributor) commented Jun 8, 2026

Proves the V2 core independently of any serving change: one physical AOTI-CUDA
Qwen model with weights loaded once can host multiple logical sessions, each with
its own KV/conv/recurrent state, with no cross-session bleed. The mechanism is
the AOTI user-managed-constant rebind, kept CUDA-backend-private; the public
surface stays LLMEngine/LLMSession. No multi-session serving wiring here (that is
a follow-up PR); the worker/runner still create a single session.
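
To make the isolation claim concrete, here is a toy Python sketch of the idea, not the CUDA implementation. `ToyEngine`, `ToySession`, and the arithmetic in `_execute` are hypothetical stand-ins: one read-only weight set, a state template captured at load, per-session clones of that template, and a rebind of the active state under the engine lock before each execute.

```python
import threading
from dataclasses import dataclass


@dataclass
class ToySession:
    # Per-session mutable state, cloned from the engine's initial template.
    state: list


class ToyEngine:
    """Hypothetical stand-in: one shared weight set, many logical sessions."""

    def __init__(self, weights, state_template):
        self._weights = tuple(weights)         # loaded once, never copied per session
        self._template = list(state_template)  # initial state captured at load
        self._bound = None                     # state the "model" currently sees
        self._lock = threading.Lock()

    def create_session(self):
        # Each session gets a private clone of the template.
        return ToySession(state=list(self._template))

    def step(self, session, token):
        with self._lock:
            self._bound = session.state        # rebind before execute
            return self._execute(token)

    def _execute(self, token):
        # Toy recurrence that reads and mutates only the bound state.
        s = self._bound
        s[0] = (s[0] * self._weights[0] + token) % 9973
        return s[0]


engine = ToyEngine(weights=[31], state_template=[7])
prompt = [348, 10, 4838]

# Session A run alone.
solo_session = engine.create_session()
solo = [engine.step(solo_session, t) for t in prompt]

# Session A again, with session B stepping between each of A's steps.
a, b = engine.create_session(), engine.create_session()
inter = []
for t in prompt:
    inter.append(engine.step(a, t))
    engine.step(b, t + 1)

print("no bleed" if inter == solo else "bleed detected")
```

If `step` skipped the rebind, session B's updates would land in whichever state was bound last, and the interleaved output would diverge from the solo run.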

Review order:

  1. examples/models/qwen3_5_moe/export.py - the model-facing change: emits
    get_mutable_buffer_metadata ({"version":1,"mutable_buffers":[fqn...]}) naming
    the model's per-session state buffers (FQN list only; the backend owns the
    tensor descriptors). The FQNs come from the model's explicit module contract.
  2. backends/cuda/runtime/cuda_mutable_state.h - CUDA-private API, keyed by a
    per-engine context (not a generic Module/Method/BackendInterface API, and not
    process-global).
  3. backends/cuda/runtime/cuda_mutable_state.cpp - context-keyed manager:
    descriptor table + initial-template capture at load, per-session GPU buffers
    cloned from the template, rebind via update_user_managed_constant_buffer_pairs,
    and declared-vs-discovered FQN coverage validation.
  4. backends/cuda/runtime/cuda_backend.cpp - two hooks: note_handle in init(),
    rebind_for_execute in execute().
  5. examples/models/qwen3_5_moe/qwen35_moe_engine.{h,cpp} - the engine owns one
    shared Module and its context; sessions rebind their own state under the
    engine lock; adds serving_capacity, capacity enforcement, the coverage check,
    and context teardown. Also removes the now-incompatible cuda_graph flag (a
    captured graph's baked pointers would ignore per-session rebinds), dropping it
    from the engine config, the qwen3_5_moe_worker/runner (main.cpp) CLIs, and the
    README/model.md docs.
  6. backends/cuda/{CMakeLists.txt,runtime/TARGETS} and the qwen CMakeLists - build
    wiring (CMake + Buck).
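
Items 1 and 3 can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the code in export.py or cuda_mutable_state.cpp: the helper names, the sorted ordering, and the example FQNs are hypothetical; only the `{"version":1,"mutable_buffers":[...]}` shape comes from the description above.

```python
import json


def build_mutable_buffer_metadata(fqns):
    """Serialize the declared per-session buffer FQNs in the shape quoted above."""
    # Sorting is an assumption made here for stable output.
    return json.dumps({"version": 1, "mutable_buffers": sorted(fqns)})


def coverage_ok(metadata_json, discovered_fqns):
    """True only when the declared and discovered FQN sets match exactly."""
    meta = json.loads(metadata_json)
    if meta.get("version") != 1:
        return False
    return set(meta.get("mutable_buffers", [])) == set(discovered_fqns)


# Made-up FQNs for illustration only.
declared = ["layers.0.attn.k_cache", "layers.0.attn.v_cache", "layers.1.conv_state"]
metadata = build_mutable_buffer_metadata(declared)

print(coverage_ok(metadata, declared))      # every declared buffer was discovered
print(coverage_ok(metadata, declared[:2]))  # one declared buffer is missing
```

The comparison is an exact set match in both directions: a declared buffer that the backend never discovers and a discovered buffer that the model never declared both count as a mismatch.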

Validated on the real 35B-A3B model (HQQ-INT4) with two interleaved sessions:

after engine load: 17983 MB | capacity max_sessions=4, est 117309440 B/session
A solo : 348 10 4838 1665 15 16 17 18 19 20 21 22
A inter: 348 10 4838 1665 15 16 17 18 19 20 21 22 (identical -> no bleed)
GPU: engine=17983MB, +2 sessions=+108MB (weights once; ~112 MiB state/session)
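
As a quick check of the per-session figure, the logged estimate converts to MiB as follows. Only the byte count and `max_sessions=4` come from the log above; the all-sessions total is a simple extrapolation, not a measured value.

```python
BYTES_PER_SESSION = 117_309_440  # "est ... B/session" from the log above
MAX_SESSIONS = 4                 # "capacity max_sessions=4" from the log above
MIB = 1024 * 1024

per_session_mib = BYTES_PER_SESSION / MIB
all_sessions_mib = per_session_mib * MAX_SESSIONS

print(f"{per_session_mib:.3f} MiB per session")           # 111.875
print(f"{all_sessions_mib:.1f} MiB if all slots are used")  # 447.5
```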

Fails closed: capacity falls back to a single session if the AOTI
constant-management symbols are absent or the declared mutable FQNs do not fully
match the loaded methods.
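
The single-session fallback and the capacity enforcement can be sketched as below. This is a hypothetical illustration in Python: `serving_capacity` here is not the engine's actual signature, and `SessionSlots` is an invented helper.

```python
def serving_capacity(configured_max, symbols_present, declared, discovered):
    """Return how many sessions may be served; 1 means single-session fallback."""
    if not symbols_present:
        return 1  # constant-management symbols missing: cannot rebind
    if set(declared) != set(discovered):
        return 1  # incomplete coverage: some state would be shared
    return max(1, configured_max)


class SessionSlots:
    """Hypothetical capacity enforcement: refuse sessions beyond the limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = 0

    def acquire(self):
        if self.active >= self.capacity:
            raise RuntimeError("session capacity reached")
        self.active += 1

    def release(self):
        self.active = max(0, self.active - 1)


fqns = ["state.a", "state.b"]  # made-up names
slots = SessionSlots(serving_capacity(4, False, fqns, fqns))
slots.acquire()                # the single permitted session
try:
    slots.acquire()            # a second session is refused
    refused = False
except RuntimeError:
    refused = True
print(refused)
```

Returning 1 rather than raising keeps the existing single-session worker/runner path usable even when multi-session isolation cannot be guaranteed.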

MLX V2 (shared constants + per-session MutableBufferData) is the next backend and
is not addressed here.

Part of #20001

[ghstack-poisoned]
pytorch-bot Bot commented Jun 8, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20117

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Pending

As of commit 394b0c1 with merge base eeb0646 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 8, 2026
mergennachin added a commit that referenced this pull request Jun 8, 2026
ghstack-source-id: 6bd59c4
ghstack-comment-id: 4652591759
Pull-Request: #20117
[ghstack-poisoned]
mergennachin added a commit that referenced this pull request Jun 8, 2026
Part of #20001

ghstack-source-id: b3d390f
ghstack-comment-id: 4652591759
Pull-Request: #20117