
[None][feat] Enable mamba/linear attention cache reuse in scheduler #12896

Draft
VALLIS-NERIA wants to merge 93 commits into NVIDIA:main from VALLIS-NERIA:user/xiweny/mamba_cache_reuse

Conversation

@VALLIS-NERIA
Collaborator

Summary

  • Extend KVCacheManager C++ layer with mutable cache block ID accessors and logging for linear cache memory budget calculation
  • Refactor MambaCacheManager to support block-level cache operations (add/remove/reuse) instead of flat tensor management
  • Update scheduler to handle linear attention cache reuse alongside KV cache reuse
  • Wire through enable_cache_reuse flag for mamba cache in executor and resource manager
  • Add integration test for mamba2 hybrid model cache reuse
  • Add unit tests for KV cache manager with linear attention metadata
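The block-level add/remove/reuse described above can be sketched as a toy Python model. All names here (`ToyMambaBlockPool`, `LinearCacheBlock`, `prefix_hash`) are illustrative assumptions, not the actual TensorRT-LLM API; the point is only the reuse-by-prefix bookkeeping that replaces flat tensor management:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinearCacheBlock:
    """One recurrent-state cache block; `prefix_hash` identifies the token prefix it encodes."""
    block_id: int
    prefix_hash: int
    ref_count: int = 0


class ToyMambaBlockPool:
    """Block-level add/remove/reuse over a fixed pool of block IDs."""

    def __init__(self, num_blocks: int) -> None:
        self.free_ids = list(range(num_blocks))
        self.by_prefix: dict[int, LinearCacheBlock] = {}

    def add(self, prefix_hash: int) -> Optional[LinearCacheBlock]:
        """Return a block for this prefix, reusing an existing one when possible."""
        block = self.by_prefix.get(prefix_hash)
        if block is not None:        # reuse path: another request already cached this prefix
            block.ref_count += 1
            return block
        if not self.free_ids:        # pool exhausted: caller must evict or fall back
            return None
        block = LinearCacheBlock(self.free_ids.pop(), prefix_hash, ref_count=1)
        self.by_prefix[prefix_hash] = block
        return block

    def release(self, block: LinearCacheBlock) -> None:
        """Drop one reference; return the block to the free list when unused."""
        block.ref_count -= 1
        if block.ref_count == 0:
            del self.by_prefix[block.prefix_hash]
            self.free_ids.append(block.block_id)
```

Two requests with the same prompt prefix would get the same block (reference-counted), which is what lets recurrent state be shared across requests.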

Test plan

  • Unit tests in tests/unittest/_torch/executor/test_kv_cache_manager.py
  • Integration test for mamba2 hybrid model with cache reuse
  • Existing KV cache reuse tests still pass

🤖 Generated with Claude Code

SimengLiu-nv and others added 30 commits March 5, 2026 15:29
…ock/mNextBlocks with lookup-node pointers.

Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
…use_new

…er_mask

1. Restore the enable_block_reuse=False model defaults for the NemotronH and
   Qwen3Next hybrid models. Commit b4e54e7 removed these defaults, which
   turned on block reuse for hybrid linear models and caused Executor worker
   errors.

2. Fix AutoDeploy cached_sequence_interface TypeError by constructing
   proper mamba_layer_mask and layer_mask in _create_and_assign_state_views
   instead of passing None from _get_mamba_state_params.
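For the second fix, constructing a proper layer mask instead of passing None might look roughly like this. The helper name `build_mamba_layer_mask` and the layer-type strings are hypothetical, not the actual AutoDeploy API:

```python
# Hypothetical sketch: build a boolean per-layer mask instead of passing None,
# so downstream code can index it without a TypeError.
def build_mamba_layer_mask(layer_types: list[str]) -> list[bool]:
    """True for layers that hold recurrent (mamba/linear-attention) state."""
    return [t in ("mamba", "linear_attention") for t in layer_types]


mask = build_mamba_layer_mask(["attention", "mamba", "mamba", "attention"])
# state views would then be created only for layers where mask[i] is True
```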

…use_new

# Conflicts:
#	tests/integration/defs/accuracy/test_llm_api_pytorch.py
…use_new


# Conflicts:
#	tensorrt_llm/_torch/attention_backend/trtllm.py
VALLIS-NERIA and others added 29 commits April 3, 2026 15:27
…cache manager (By Agent)

- Remove USE_FAKE_POOL debug env var from resource_manager
- Remove duplicate max_total_draft_tokens assignment (merge artifact)
- Remove duplicate no-op pydantic validator for max_attention_window
- Add NotImplementedError stub for update_mamba_states in CppMambaHybridCacheManager
- Convert hot-path assert to RuntimeError in _setup_state_indices
- Add dtype alignment checks in _get_ssm_states/_get_conv_states
- Fix MixedMambaHybridCacheManager.free_resources to forward pin_on_release
- Fix _setup_state_indices return type annotation
- Fix shadowed layer_idx variable in state accessors
- Add docstring to get_num_attention_layers explaining dual behavior
- Fix VSWA log message for linear attention case
- Remove dead code (self.iter, self._request_block_ids, unused import)
- Remove redundant ceil_div redefinition in model_config
- Fix calc_context_stop_positions to skip unnecessary 0 entry
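The "hot-path assert to RuntimeError" item above follows a standard pattern; a minimal sketch (function name and signature are illustrative, not the real `_setup_state_indices`):

```python
def setup_state_indices(indices: list[int], pool_size: int) -> list[int]:
    """Validate per-request state indices against the state pool size."""
    # `assert` statements are stripped under `python -O`, so an out-of-range
    # index would silently corrupt neighboring state; raise explicitly instead.
    bad = [i for i in indices if not 0 <= i < pool_size]
    if bad:
        raise RuntimeError(
            f"state indices {bad} out of range for pool of size {pool_size}")
    return indices
```

Raising `RuntimeError` keeps the check active in optimized builds while still failing fast with a descriptive message.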

…use_new


# Conflicts:
#	tensorrt_llm/_torch/models/modeling_qwen3_next.py
#	tensorrt_llm/_torch/pyexecutor/_util.py
…use_new

…or disagg support

…use_new


# Conflicts:
#	tests/integration/test_lists/waives.txt
…aNet.forward()

…n3_next.py, port attn_dp change to gdn_mixer.py

… user/xiweny/linear_reuse_new

…By Agent)

Add cache block reuse support for mamba/linear attention models in the
scheduler and cache manager. This allows recurrent state cache blocks to
be shared across requests, improving memory efficiency for hybrid models.

Key changes:
- Extend KVCacheManager C++ layer with mutable cache block ID accessors
  and logging for linear cache memory budget calculation
- Refactor MambaCacheManager to support block-level cache operations
  (add/remove/reuse) instead of flat tensor management
- Update scheduler to handle linear attention cache reuse alongside KV
  cache reuse
- Wire through enable_cache_reuse flag for mamba cache in executor and
  resource manager
- Add integration test for mamba2 hybrid model cache reuse
- Add unit tests for KV cache manager with linear attention metadata
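The "linear cache memory budget calculation" mentioned in the first bullet might look roughly like the following. All names and the per-block layout (one SSM state plus one conv state per linear layer) are assumptions for illustration, not the actual implementation:

```python
def linear_cache_num_blocks(budget_bytes: int, num_linear_layers: int,
                            ssm_state_bytes: int, conv_state_bytes: int) -> int:
    """Number of linear-attention cache blocks that fit in `budget_bytes`,
    assuming each block stores one SSM state and one conv state per layer."""
    per_block = num_linear_layers * (ssm_state_bytes + conv_state_bytes)
    if per_block <= 0:
        raise ValueError("per-block footprint must be positive")
    return budget_bytes // per_block


# e.g. a 1 GiB budget, 12 linear layers, 256 KiB SSM + 32 KiB conv state per layer
n = linear_cache_num_blocks(1 << 30, 12, 256 << 10, 32 << 10)  # -> 303
```

Logging this number (and the per-block footprint) is what makes the budget calculation auditable at startup.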
