
[None][feat] Enable mamba/linear attention cache reuse in scheduler #12896

Draft
VALLIS-NERIA wants to merge 93 commits into NVIDIA:main from VALLIS-NERIA:user/xiweny/mamba_cache_reuse

Conversation

@VALLIS-NERIA
Collaborator

Summary

  • Extend KVCacheManager C++ layer with mutable cache block ID accessors and logging for linear cache memory budget calculation
  • Refactor MambaCacheManager to support block-level cache operations (add/remove/reuse) instead of flat tensor management
  • Update scheduler to handle linear attention cache reuse alongside KV cache reuse
  • Wire through enable_cache_reuse flag for mamba cache in executor and resource manager
  • Add integration test for mamba2 hybrid model cache reuse
  • Add unit tests for KV cache manager with linear attention metadata
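The block-level add/remove/reuse described above can be sketched as a toy Python model. All names here (`ToyMambaBlockPool`, `LinearCacheBlock`, `prefix_hash`) are illustrative assumptions, not the actual TensorRT-LLM API; the point is only the reuse-by-prefix bookkeeping that replaces flat tensor management:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinearCacheBlock:
    """One recurrent-state cache block; `prefix_hash` identifies the token prefix it encodes."""
    block_id: int
    prefix_hash: int
    ref_count: int = 0


class ToyMambaBlockPool:
    """Block-level add/remove/reuse over a fixed pool of block IDs."""

    def __init__(self, num_blocks: int) -> None:
        self.free_ids = list(range(num_blocks))
        self.by_prefix: dict[int, LinearCacheBlock] = {}

    def add(self, prefix_hash: int) -> Optional[LinearCacheBlock]:
        """Return a block for this prefix, reusing an existing one when possible."""
        block = self.by_prefix.get(prefix_hash)
        if block is not None:        # reuse path: another request already cached this prefix
            block.ref_count += 1
            return block
        if not self.free_ids:        # pool exhausted: caller must evict or fall back
            return None
        block = LinearCacheBlock(self.free_ids.pop(), prefix_hash, ref_count=1)
        self.by_prefix[prefix_hash] = block
        return block

    def release(self, block: LinearCacheBlock) -> None:
        """Drop one reference; return the block to the free list when unused."""
        block.ref_count -= 1
        if block.ref_count == 0:
            del self.by_prefix[block.prefix_hash]
            self.free_ids.append(block.block_id)
```

Two requests with the same prompt prefix would get the same block (reference-counted), which is what lets recurrent state be shared across requests.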

Test plan

  • Unit tests in tests/unittest/_torch/executor/test_kv_cache_manager.py
  • Integration test for mamba2 hybrid model with cache reuse
  • Existing KV cache reuse tests still pass

🤖 Generated with Claude Code

SimengLiu-nv and others added 30 commits March 5, 2026 15:29
…ock/mNextBlocks with lookup-node pointers.

Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
…use_new

…er_mask

1. Restore the enable_block_reuse=False model defaults for the NemotronH and
   Qwen3Next hybrid models. Commit b4e54e7 removed these defaults, which
   turned on block reuse for hybrid linear models and caused Executor worker
   errors.

2. Fix AutoDeploy cached_sequence_interface TypeError by constructing
   proper mamba_layer_mask and layer_mask in _create_and_assign_state_views
   instead of passing None from _get_mamba_state_params.
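For the second fix, constructing a proper layer mask instead of passing None might look roughly like this. The helper name `build_mamba_layer_mask` and the layer-type strings are hypothetical, not the actual AutoDeploy API:

```python
# Hypothetical sketch: build a boolean per-layer mask instead of passing None,
# so downstream code can index it without a TypeError.
def build_mamba_layer_mask(layer_types: list[str]) -> list[bool]:
    """True for layers that hold recurrent (mamba/linear-attention) state."""
    return [t in ("mamba", "linear_attention") for t in layer_types]


mask = build_mamba_layer_mask(["attention", "mamba", "mamba", "attention"])
# state views would then be created only for layers where mask[i] is True
```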

…use_new

# Conflicts:
#	tests/integration/defs/accuracy/test_llm_api_pytorch.py
…use_new


# Conflicts:
#	tensorrt_llm/_torch/attention_backend/trtllm.py
VALLIS-NERIA and others added 29 commits April 3, 2026 15:27
…cache manager (By Agent)

- Remove USE_FAKE_POOL debug env var from resource_manager
- Remove duplicate max_total_draft_tokens assignment (merge artifact)
- Remove duplicate no-op pydantic validator for max_attention_window
- Add NotImplementedError stub for update_mamba_states in CppMambaHybridCacheManager
- Convert hot-path assert to RuntimeError in _setup_state_indices
- Add dtype alignment checks in _get_ssm_states/_get_conv_states
- Fix MixedMambaHybridCacheManager.free_resources to forward pin_on_release
- Fix _setup_state_indices return type annotation
- Fix shadowed layer_idx variable in state accessors
- Add docstring to get_num_attention_layers explaining dual behavior
- Fix VSWA log message for linear attention case
- Remove dead code (self.iter, self._request_block_ids, unused import)
- Remove redundant ceil_div redefinition in model_config
- Fix calc_context_stop_positions to skip unnecessary 0 entry
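The "hot-path assert to RuntimeError" item above follows a standard pattern; a minimal sketch (function name and signature are illustrative, not the real `_setup_state_indices`):

```python
def setup_state_indices(indices: list[int], pool_size: int) -> list[int]:
    """Validate per-request state indices against the state pool size."""
    # `assert` statements are stripped under `python -O`, so an out-of-range
    # index would silently corrupt neighboring state; raise explicitly instead.
    bad = [i for i in indices if not 0 <= i < pool_size]
    if bad:
        raise RuntimeError(
            f"state indices {bad} out of range for pool of size {pool_size}")
    return indices
```

Raising `RuntimeError` keeps the check active in optimized builds while still failing fast with a descriptive message.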

…use_new


# Conflicts:
#	tensorrt_llm/_torch/models/modeling_qwen3_next.py
#	tensorrt_llm/_torch/pyexecutor/_util.py
…use_new

…or disagg support

…use_new


# Conflicts:
#	tests/integration/test_lists/waives.txt
…aNet.forward()

…n3_next.py, port attn_dp change to gdn_mixer.py

… user/xiweny/linear_reuse_new

…By Agent)

Add cache block reuse support for mamba/linear attention models in the
scheduler and cache manager. This allows recurrent state cache blocks to
be shared across requests, improving memory efficiency for hybrid models.

Key changes:
- Extend KVCacheManager C++ layer with mutable cache block ID accessors
  and logging for linear cache memory budget calculation
- Refactor MambaCacheManager to support block-level cache operations
  (add/remove/reuse) instead of flat tensor management
- Update scheduler to handle linear attention cache reuse alongside KV
  cache reuse
- Wire through enable_cache_reuse flag for mamba cache in executor and
  resource manager
- Add integration test for mamba2 hybrid model cache reuse
- Add unit tests for KV cache manager with linear attention metadata
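The "linear cache memory budget calculation" mentioned in the first bullet might look roughly like the following. All names and the per-block layout (one SSM state plus one conv state per linear layer) are assumptions for illustration, not the actual implementation:

```python
def linear_cache_num_blocks(budget_bytes: int, num_linear_layers: int,
                            ssm_state_bytes: int, conv_state_bytes: int) -> int:
    """Number of linear-attention cache blocks that fit in `budget_bytes`,
    assuming each block stores one SSM state and one conv state per layer."""
    per_block = num_linear_layers * (ssm_state_bytes + conv_state_bytes)
    if per_block <= 0:
        raise ValueError("per-block footprint must be positive")
    return budget_bytes // per_block


# e.g. a 1 GiB budget, 12 linear layers, 256 KiB SSM + 32 KiB conv state per layer
n = linear_cache_num_blocks(1 << 30, 12, 256 << 10, 32 << 10)  # -> 303
```

Logging this number (and the per-block footprint) is what makes the budget calculation auditable at startup.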
