[None][feat] Enable mamba/linear attention cache reuse in scheduler #12896
Draft
VALLIS-NERIA wants to merge 93 commits into NVIDIA:main from
Conversation
…ock/mNextBlocks with lookup-node pointers. Signed-off-by: SimengLiu-nv <simengl@nvidia.com>
…er_mask
1. Restore the enable_block_reuse=False model defaults for NemotronH and Qwen3Next hybrid models. Commit b4e54e7 removed these defaults, which enabled block reuse for hybrid linear models and caused Executor worker errors.
2. Fix an AutoDeploy cached_sequence_interface TypeError by constructing a proper mamba_layer_mask and layer_mask in _create_and_assign_state_views instead of passing None from _get_mamba_state_params.
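The layer-mask fix above amounts to producing explicit per-layer boolean masks rather than forwarding None. A minimal sketch of that idea, assuming a hypothetical HybridConfig type (the real AutoDeploy interface is not shown in this PR excerpt):

```python
# Hypothetical sketch of per-layer mask construction for a hybrid model.
# HybridConfig and build_layer_masks are illustrative names, not the
# actual AutoDeploy API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HybridConfig:
    num_layers: int
    mamba_layer_indices: List[int]  # which layers are mamba/linear blocks


def build_layer_masks(cfg: HybridConfig) -> Tuple[List[bool], List[bool]]:
    """Return (mamba_layer_mask, attention layer_mask), one boolean per
    layer, instead of passing None downstream."""
    mamba_set = set(cfg.mamba_layer_indices)
    mamba_mask = [i in mamba_set for i in range(cfg.num_layers)]
    attn_mask = [not m for m in mamba_mask]  # remaining layers are attention
    return mamba_mask, attn_mask
```

With an interleaved 4-layer model where layers 0 and 2 are mamba, this yields `([True, False, True, False], [False, True, False, True])`.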
…cache manager (By Agent)
- Remove USE_FAKE_POOL debug env var from resource_manager
- Remove duplicate max_total_draft_tokens assignment (merge artifact)
- Remove duplicate no-op pydantic validator for max_attention_window
- Add NotImplementedError stub for update_mamba_states in CppMambaHybridCacheManager
- Convert hot-path assert to RuntimeError in _setup_state_indices
- Add dtype alignment checks in _get_ssm_states/_get_conv_states
- Fix MixedMambaHybridCacheManager.free_resources to forward pin_on_release
- Fix _setup_state_indices return type annotation
- Fix shadowed layer_idx variable in state accessors
- Add docstring to get_num_attention_layers explaining dual behavior
- Fix VSWA log message for linear attention case
- Remove dead code (self.iter, self._request_block_ids, unused import)
- Remove redundant ceil_div redefinition in model_config
- Fix calc_context_stop_positions to skip unnecessary 0 entry
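Two of the cleanups above (the assert-to-RuntimeError conversion and the dtype alignment checks on state accessors) follow a common pattern: validation that must survive optimized builds. A hedged illustration, with invented names (StatePool and get_ssm_state are not the real TensorRT-LLM cache-manager API):

```python
# Hypothetical illustration of two cleanups: converting a hot-path assert
# into a RuntimeError, and checking dtype before handing out a state view.
from dataclasses import dataclass
from typing import List


@dataclass
class StatePool:
    dtype: str                 # e.g. "float32"
    states: List[List[float]]  # one state vector per slot


def get_ssm_state(pool: StatePool, index: int,
                  expected_dtype: str = "float32") -> List[float]:
    # dtype alignment check: fail loudly instead of silently reinterpreting
    if pool.dtype != expected_dtype:
        raise RuntimeError(
            f"SSM state pool dtype {pool.dtype!r} != expected {expected_dtype!r}")
    # was an assert; asserts vanish under `python -O`, so raise instead
    if not 0 <= index < len(pool.states):
        raise RuntimeError(f"state index {index} out of range")
    return pool.states[index]
```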
…or disagg support
…aNet.forward()
…n3_next.py, port attn_dp change to gdn_mixer.py
…By Agent)
Add cache block reuse support for mamba/linear attention models in the scheduler and cache manager. This allows recurrent state cache blocks to be shared across requests, improving memory efficiency for hybrid models.

Key changes:
- Extend the KVCacheManager C++ layer with mutable cache block ID accessors and logging for the linear cache memory budget calculation
- Refactor MambaCacheManager to support block-level cache operations (add/remove/reuse) instead of flat tensor management
- Update the scheduler to handle linear attention cache reuse alongside KV cache reuse
- Wire through the enable_cache_reuse flag for the mamba cache in the executor and resource manager
- Add an integration test for mamba2 hybrid model cache reuse
- Add unit tests for the KV cache manager with linear attention metadata
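Block-level reuse for recurrent state caches hinges on one constraint that plain KV-cache reuse does not have: a mamba/linear-attention state summarizes the entire prefix, so a cached block is only valid if every preceding token matches. A minimal sketch of prefix-hash lookup under that constraint, assuming a hypothetical RecurrentBlockPool and a fixed block size (none of these names are the actual TensorRT-LLM API):

```python
# Hypothetical sketch of prefix-hash based block reuse for recurrent
# (mamba/linear attention) state caches. RecurrentBlockPool, BLOCK_TOKENS,
# and the block-id scheme are illustrative assumptions.
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_TOKENS = 4  # tokens covered per cached state block (assumed)


def prefix_block_keys(tokens: List[int]) -> List[str]:
    """Hash each full block of the token prefix. The hash is cumulative,
    so a block's key encodes the *entire* prefix up to that block — a
    recurrent state is only reusable when the whole prefix matches."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        for t in tokens[i:i + BLOCK_TOKENS]:
            h.update(t.to_bytes(4, "little"))
        keys.append(h.hexdigest())
    return keys


class RecurrentBlockPool:
    """Maps prefix hashes to cached state-block ids so a new request can
    resume decoding from the longest matching prefix instead of
    re-prefilling from scratch."""

    def __init__(self) -> None:
        self._blocks: Dict[str, int] = {}
        self._next_id = 0

    def store(self, tokens: List[int]) -> None:
        for key in prefix_block_keys(tokens):
            if key not in self._blocks:
                self._blocks[key] = self._next_id
                self._next_id += 1

    def longest_reusable(self, tokens: List[int]) -> Tuple[int, Optional[int]]:
        """Return (matched token count, id of the deepest reusable block)."""
        matched, block_id = 0, None
        for i, key in enumerate(prefix_block_keys(tokens)):
            if key not in self._blocks:
                break
            matched, block_id = (i + 1) * BLOCK_TOKENS, self._blocks[key]
        return matched, block_id
```

For example, after storing an 8-token request, a second request sharing only the first 4 tokens can reuse the first block's state and skip prefilling those tokens, while a request with a divergent first token reuses nothing.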
Summary
enable_cache_reuse flag for mamba cache in executor and resource manager

Test plan
tests/unittest/_torch/executor/test_kv_cache_manager.py

🤖 Generated with Claude Code