fix(vllm): raise ValueError for Mamba/hybrid when KV event utility is unavailable #9500
MatejKosec wants to merge 1 commit into
Conversation
When `configure_kv_event_block_size` falls back because `get_kv_cache_group_metadata` throws, Mamba and speculative/hybrid models must raise a clear `ValueError` instead of silently falling back to `cache_config.block_size` (16). This prevents the KV router from dropping events due to a block-size mismatch. Pure-attention models retain the existing silent fallback for backward compatibility.

Signed-off-by: mkosec@nvidia.com <mkosec@nvidia.com>
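A minimal sketch of the guard this description outlines, not the PR's actual diff: the signature of `configure_kv_event_block_size` is an assumption, `fetch_metadata` stands in for the engine's `get_kv_cache_group_metadata` utility, and `is_mamba_or_hybrid` is passed in rather than computed, purely to keep the sketch self-contained.

```python
def configure_kv_event_block_size(vllm_config, cache_config,
                                  fetch_metadata, is_mamba_or_hybrid):
    """Sketch only: fetch_metadata stands in for the engine's
    get_kv_cache_group_metadata utility (an assumed shape, not the real API)."""
    try:
        # Preferred path: take the block size from the engine's KV cache groups.
        return fetch_metadata(vllm_config).block_size
    except Exception as e:
        if is_mamba_or_hybrid:
            # Mamba/hybrid block sizes differ from cache_config.block_size (16),
            # so a silent fallback would make the KV router drop events on a
            # block-size mismatch. Fail loudly at startup instead.
            architectures = getattr(
                vllm_config.model_config.hf_config, "architectures", None)
            raise ValueError(
                f"Failed to fetch KV cache group metadata for hybrid/Mamba "
                f"model (architectures={architectures}). Original error: {e}"
            ) from e
        # Pure-attention models keep the legacy silent fallback.
        return cache_config.block_size
```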
Walkthrough: This PR adds detection of Mamba-based and hybrid/speculative-decode vLLM models via architecture inspection, then updates KV cache configuration to raise a `ValueError` for those models when the KV event utility is unavailable.

Changes: Mamba/Hybrid Model Detection and KV Cache Configuration
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/cache_info.py (1)
13-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`sink_full_attention` is misclassified as a main-attention kind. Line 16 currently includes `"sink_full_attention"` in `MAIN_ATTENTION_KV_CACHE_KINDS`, so sink-only metadata is treated as primary and won't fall back. That conflicts with the new fallback behavior exercised by the tests.

Suggested fix:
```diff
 MAIN_ATTENTION_KV_CACHE_KINDS = {
     "full_attention",
     "mla_attention",
-    "sink_full_attention",
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/vllm/cache_info.py` around lines 13 - 17, The set MAIN_ATTENTION_KV_CACHE_KINDS incorrectly includes the sink-only kind "sink_full_attention", causing sink metadata to be treated as main-attention and preventing fallback; remove "sink_full_attention" from MAIN_ATTENTION_KV_CACHE_KINDS so that only true main-attention kinds ("full_attention", "mla_attention") remain and sink-only entries will fall back as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/vllm/cache_info.py`:
- Around line 99-101: The ValueError message in cache_info.py misleadingly
hardcodes "speculative_config is not None" even when speculative_config is None;
update the error string constructed where architectures and speculative_config
are referenced (the f-string that begins with "Failed to fetch KV cache group
metadata...") so it reflects the actual condition (e.g., include the real
speculative_config value or a conditional phrase like
"speculative_config={speculative_config}" or "speculative_config is set" / "not
set") and mention get_kv_cache_group_metadata by name to make the log accurate;
adjust only the message text (leave logic intact) in the function or block that
raises this ValueError.
In `@components/src/dynamo/vllm/tests/test_vllm_cache_info.py`:
- Around line 17-21: The module-level pytest markers (pytestmark) in
test_vllm_cache_info.py currently include scheduling and type markers but lack
the required GPU marker; update the pytestmark list (the module-scope variable
named pytestmark) to include pytest.mark.gpu so the module has scheduling + GPU
+ type markers (e.g., add pytest.mark.gpu alongside pytest.mark.unit and
pytest.mark.vllm).
---
Outside diff comments:
In `@components/src/dynamo/vllm/cache_info.py`:
- Around line 13-17: The set MAIN_ATTENTION_KV_CACHE_KINDS incorrectly includes
the sink-only kind "sink_full_attention", causing sink metadata to be treated as
main-attention and preventing fallback; remove "sink_full_attention" from
MAIN_ATTENTION_KV_CACHE_KINDS so that only true main-attention kinds
("full_attention", "mla_attention") remain and sink-only entries will fall back
as intended.
📒 Files selected for processing (2)
- components/src/dynamo/vllm/cache_info.py
- components/src/dynamo/vllm/tests/test_vllm_cache_info.py
| f"Failed to fetch KV cache group metadata for hybrid/Mamba model " | ||
| f"(architectures={architectures}, speculative_config is not None). " | ||
| f"The get_kv_cache_group_metadata engine utility must be available " |
`ValueError` message is inaccurate for non-speculative Mamba models.

Line 100 always says `speculative_config is not None`, which is false on the pure-Mamba path (e.g., `speculative_config = None`).
Suggested fix:

```diff
 raise ValueError(
     f"Failed to fetch KV cache group metadata for hybrid/Mamba model "
-    f"(architectures={architectures}, speculative_config is not None). "
+    f"(architectures={architectures}, "
+    f"speculative_config_present={vllm_config.speculative_config is not None}). "
     f"The get_kv_cache_group_metadata engine utility must be available "
     f"to determine the correct KV event block size. Original error: {e}"
 ) from e
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/vllm/cache_info.py` around lines 99 - 101, The
ValueError message in cache_info.py misleadingly hardcodes "speculative_config
is not None" even when speculative_config is None; update the error string
constructed where architectures and speculative_config are referenced (the
f-string that begins with "Failed to fetch KV cache group metadata...") so it
reflects the actual condition (e.g., include the real speculative_config value
or a conditional phrase like "speculative_config={speculative_config}" or
"speculative_config is set" / "not set") and mention get_kv_cache_group_metadata
by name to make the log accurate; adjust only the message text (leave logic
intact) in the function or block that raises this ValueError.
```python
pytestmark = [
    pytest.mark.unit,
    pytest.mark.vllm,
    pytest.mark.pre_merge,
]
```
Add the required GPU test marker at module scope.
Lines 17-21 include scheduling and type markers, but the required GPU marker is missing for this test module.
Suggested fix:

```diff
 pytestmark = [
     pytest.mark.unit,
     pytest.mark.vllm,
+    pytest.mark.gpu,
     pytest.mark.pre_merge,
 ]
```

As per coding guidelines: "ensure every test has required markers (scheduling + GPU + type)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/vllm/tests/test_vllm_cache_info.py` around lines 17 -
21, The module-level pytest markers (pytestmark) in test_vllm_cache_info.py
currently include scheduling and type markers but lack the required GPU marker;
update the pytestmark list (the module-scope variable named pytestmark) to
include pytest.mark.gpu so the module has scheduling + GPU + type markers (e.g.,
add pytest.mark.gpu alongside pytest.mark.unit and pytest.mark.vllm).
```python
}

# Known Mamba architecture identifiers present in vLLM's HF config.
_MAMBA_ARCHITECTURES = {
```
The architecture allow-list misses vLLM Mamba/hybrid classes such as `FalconMambaForCausalLM`, `Mamba2ForCausalLM`, and `JambaForCausalLM`, so those models can silently fall back to `cache_config.block_size` when the utility is unavailable. Fix: detect via vLLM's model hybrid/attention-free metadata or include all supported Mamba/hybrid architecture names.
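One hedged sketch of the metadata-first approach this comment suggests: the `is_hybrid` and `is_attention_free` attributes are assumptions about the installed vLLM version (hence the `getattr` guards), and the `_is_mamba_or_hybrid` name simply mirrors the flag used elsewhere in this review.

```python
def _is_mamba_or_hybrid(model_config) -> bool:
    """Sketch: prefer engine metadata over a hand-maintained allow-list."""
    # is_hybrid / is_attention_free are assumed attributes; getattr keeps
    # this safe if the installed vLLM version does not expose them.
    if getattr(model_config, "is_hybrid", False):
        return True
    if getattr(model_config, "is_attention_free", False):  # pure-SSM models
        return True
    # Fallback: substring match catches MambaForCausalLM, Mamba2ForCausalLM,
    # FalconMambaForCausalLM, JambaForCausalLM, etc. without enumerating each.
    hf_config = getattr(model_config, "hf_config", None)
    architectures = getattr(hf_config, "architectures", None) or []
    return any("Mamba" in arch or "Jamba" in arch for arch in architectures)
```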
```python
)
except Exception as e:
    if is_mamba_or_hybrid:
        model_cls = type(vllm_config.model_config.hf_config).__name__
```
The error path dereferences `vllm_config.model_config.hf_config` even though speculative configs are classified as hybrid without requiring `hf_config`, so a missing HF config raises `AttributeError` instead of the intended `ValueError`. Fix: fetch `hf_config` with `getattr(..., None)` before formatting the error message.
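A minimal sketch of the suggested `getattr` guard, as it might sit inside the `except` block quoted above; the variable names mirror that snippet and the `"<unknown>"` placeholder is an assumption.

```python
# Defensive lookup: speculative configs can be classified as hybrid without
# an hf_config, so avoid the attribute chain that would raise AttributeError.
hf_config = getattr(vllm_config.model_config, "hf_config", None)
model_cls = type(hf_config).__name__ if hf_config is not None else "<unknown>"
```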
Summary
- Raise a `ValueError` when a Mamba or hybrid (Mamba+attention) model is used with a KV event utility that is unavailable; prevents silent misconfiguration that would cause incorrect routing at runtime
- New tests in `test_vllm_cache_info.py` covering all supported architecture kinds and the Mamba/hybrid rejection path

Root cause: The KV event utility availability check was missing for Mamba and hybrid architectures. When these models were loaded with an incompatible KV routing configuration, no error was raised at startup; the misconfiguration would only surface at inference time as incorrect behaviour.
Testing
- Unit tests (`test_vllm_cache_info.py`): 15 passed, 1 skipped; all Mamba/hybrid guard paths covered, including edge cases for mixed architecture configs
- Byte-compile check (`py_compile`): passed on all changed files
- Full vLLM unit suite (`test_vllm_unit.py`): could not run; requires the `dynamo.llm` native extension (Rust/maturin build), not available in the factory sandbox