fix(vllm): raise ValueError for Mamba/hybrid when KV event utility is unavailable #9500
MatejKosec wants to merge 1 commit into
Conversation
When `configure_kv_event_block_size` falls back because `get_kv_cache_group_metadata` throws, Mamba and speculative/hybrid models must raise a clear `ValueError` instead of silently falling back to `cache_config.block_size` (16). This prevents the KV router from dropping events due to a block-size mismatch. Pure-attention models retain the existing silent fallback for backward compatibility.

Signed-off-by: mkosec@nvidia.com <mkosec@nvidia.com>
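A minimal sketch of the guard this description outlines, not the PR's actual diff: the signature of `configure_kv_event_block_size` is an assumption, `fetch_metadata` stands in for the engine's `get_kv_cache_group_metadata` utility, and `is_mamba_or_hybrid` is passed in rather than computed, purely to keep the sketch self-contained.

```python
def configure_kv_event_block_size(vllm_config, cache_config,
                                  fetch_metadata, is_mamba_or_hybrid):
    """Sketch only: fetch_metadata stands in for the engine's
    get_kv_cache_group_metadata utility (an assumed shape, not the real API)."""
    try:
        # Preferred path: take the block size from the engine's KV cache groups.
        return fetch_metadata(vllm_config).block_size
    except Exception as e:
        if is_mamba_or_hybrid:
            # Mamba/hybrid block sizes differ from cache_config.block_size (16),
            # so a silent fallback would make the KV router drop events on a
            # block-size mismatch. Fail loudly at startup instead.
            architectures = getattr(
                vllm_config.model_config.hf_config, "architectures", None)
            raise ValueError(
                f"Failed to fetch KV cache group metadata for hybrid/Mamba "
                f"model (architectures={architectures}). Original error: {e}"
            ) from e
        # Pure-attention models keep the legacy silent fallback.
        return cache_config.block_size
```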
Walkthrough: This PR adds detection of Mamba-based and hybrid/speculative-decode vLLM models via architecture inspection, then updates KV cache configuration to raise a `ValueError` for those models when the KV event utility is unavailable.

Changes: Mamba/Hybrid Model Detection and KV Cache Configuration
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/cache_info.py (1)
13-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`sink_full_attention` is misclassified as a main-attention kind. Line 16 currently includes `"sink_full_attention"` in `MAIN_ATTENTION_KV_CACHE_KINDS`, so sink-only metadata is treated as primary and won't fall back. That conflicts with the new fallback behavior exercised by the tests.

Suggested fix:
```diff
 MAIN_ATTENTION_KV_CACHE_KINDS = {
     "full_attention",
     "mla_attention",
-    "sink_full_attention",
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/vllm/cache_info.py` around lines 13 - 17, The set MAIN_ATTENTION_KV_CACHE_KINDS incorrectly includes the sink-only kind "sink_full_attention", causing sink metadata to be treated as main-attention and preventing fallback; remove "sink_full_attention" from MAIN_ATTENTION_KV_CACHE_KINDS so that only true main-attention kinds ("full_attention", "mla_attention") remain and sink-only entries will fall back as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/vllm/cache_info.py`:
- Around line 99-101: The ValueError message in cache_info.py misleadingly
hardcodes "speculative_config is not None" even when speculative_config is None;
update the error string constructed where architectures and speculative_config
are referenced (the f-string that begins with "Failed to fetch KV cache group
metadata...") so it reflects the actual condition (e.g., include the real
speculative_config value or a conditional phrase like
"speculative_config={speculative_config}" or "speculative_config is set" / "not
set") and mention get_kv_cache_group_metadata by name to make the log accurate;
adjust only the message text (leave logic intact) in the function or block that
raises this ValueError.
In `@components/src/dynamo/vllm/tests/test_vllm_cache_info.py`:
- Around line 17-21: The module-level pytest markers (pytestmark) in
test_vllm_cache_info.py currently include scheduling and type markers but lack
the required GPU marker; update the pytestmark list (the module-scope variable
named pytestmark) to include pytest.mark.gpu so the module has scheduling + GPU
+ type markers (e.g., add pytest.mark.gpu alongside pytest.mark.unit and
pytest.mark.vllm).
---
Outside diff comments:
In `@components/src/dynamo/vllm/cache_info.py`:
- Around line 13-17: The set MAIN_ATTENTION_KV_CACHE_KINDS incorrectly includes
the sink-only kind "sink_full_attention", causing sink metadata to be treated as
main-attention and preventing fallback; remove "sink_full_attention" from
MAIN_ATTENTION_KV_CACHE_KINDS so that only true main-attention kinds
("full_attention", "mla_attention") remain and sink-only entries will fall back
as intended.
📒 Files selected for processing (2)
- components/src/dynamo/vllm/cache_info.py
- components/src/dynamo/vllm/tests/test_vllm_cache_info.py
| f"Failed to fetch KV cache group metadata for hybrid/Mamba model " | ||
| f"(architectures={architectures}, speculative_config is not None). " | ||
| f"The get_kv_cache_group_metadata engine utility must be available " |
`ValueError` message is inaccurate for non-speculative Mamba models.

Line 100 always says `speculative_config is not None`, which is false on the pure-Mamba path (e.g., `speculative_config = None`).
Suggested fix:

```diff
 raise ValueError(
     f"Failed to fetch KV cache group metadata for hybrid/Mamba model "
-    f"(architectures={architectures}, speculative_config is not None). "
+    f"(architectures={architectures}, "
+    f"speculative_config_present={vllm_config.speculative_config is not None}). "
     f"The get_kv_cache_group_metadata engine utility must be available "
     f"to determine the correct KV event block size. Original error: {e}"
 ) from e
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/vllm/cache_info.py` around lines 99 - 101, The
ValueError message in cache_info.py misleadingly hardcodes "speculative_config
is not None" even when speculative_config is None; update the error string
constructed where architectures and speculative_config are referenced (the
f-string that begins with "Failed to fetch KV cache group metadata...") so it
reflects the actual condition (e.g., include the real speculative_config value
or a conditional phrase like "speculative_config={speculative_config}" or
"speculative_config is set" / "not set") and mention get_kv_cache_group_metadata
by name to make the log accurate; adjust only the message text (leave logic
intact) in the function or block that raises this ValueError.
```python
pytestmark = [
    pytest.mark.unit,
    pytest.mark.vllm,
    pytest.mark.pre_merge,
]
```
Add the required GPU test marker at module scope.
Lines 17-21 include scheduling and type markers, but the required GPU marker is missing for this test module.
Suggested fix:

```diff
 pytestmark = [
     pytest.mark.unit,
     pytest.mark.vllm,
+    pytest.mark.gpu,
     pytest.mark.pre_merge,
 ]
```

As per coding guidelines: "ensure every test has required markers (scheduling + GPU + type)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/src/dynamo/vllm/tests/test_vllm_cache_info.py` around lines 17 -
21, The module-level pytest markers (pytestmark) in test_vllm_cache_info.py
currently include scheduling and type markers but lack the required GPU marker;
update the pytestmark list (the module-scope variable named pytestmark) to
include pytest.mark.gpu so the module has scheduling + GPU + type markers (e.g.,
add pytest.mark.gpu alongside pytest.mark.unit and pytest.mark.vllm).
```python
}

# Known Mamba architecture identifiers present in vLLM's HF config.
_MAMBA_ARCHITECTURES = {
```
The architecture allow-list misses vLLM Mamba/hybrid classes such as `FalconMambaForCausalLM`, `Mamba2ForCausalLM`, and `JambaForCausalLM`, so those models can silently fall back to `cache_config.block_size` when the utility is unavailable. Fix: detect via vLLM's model hybrid/attention-free metadata or include all supported Mamba/hybrid architecture names.
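One hedged sketch of the metadata-first approach this comment suggests: the `is_hybrid` and `is_attention_free` attributes are assumptions about the installed vLLM version (hence the `getattr` guards), and the `_is_mamba_or_hybrid` name simply mirrors the flag used elsewhere in this review.

```python
def _is_mamba_or_hybrid(model_config) -> bool:
    """Sketch: prefer engine metadata over a hand-maintained allow-list."""
    # is_hybrid / is_attention_free are assumed attributes; getattr keeps
    # this safe if the installed vLLM version does not expose them.
    if getattr(model_config, "is_hybrid", False):
        return True
    if getattr(model_config, "is_attention_free", False):  # pure-SSM models
        return True
    # Fallback: substring match catches MambaForCausalLM, Mamba2ForCausalLM,
    # FalconMambaForCausalLM, JambaForCausalLM, etc. without enumerating each.
    hf_config = getattr(model_config, "hf_config", None)
    architectures = getattr(hf_config, "architectures", None) or []
    return any("Mamba" in arch or "Jamba" in arch for arch in architectures)
```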
```python
)
except Exception as e:
    if is_mamba_or_hybrid:
        model_cls = type(vllm_config.model_config.hf_config).__name__
```
The error path dereferences `vllm_config.model_config.hf_config` even though speculative configs are classified as hybrid without requiring `hf_config`, so a missing HF config raises `AttributeError` instead of the intended `ValueError`. Fix: fetch `hf_config` with `getattr(..., None)` before formatting the error message.
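A minimal sketch of the suggested `getattr` guard, as it might sit inside the `except` block quoted above; the variable names mirror that snippet and the `"<unknown>"` placeholder is an assumption.

```python
# Defensive lookup: speculative configs can be classified as hybrid without
# an hf_config, so avoid the attribute chain that would raise AttributeError.
hf_config = getattr(vllm_config.model_config, "hf_config", None)
model_cls = type(hf_config).__name__ if hf_config is not None else "<unknown>"
```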
Summary
- Raise a `ValueError` when a Mamba or hybrid (Mamba+attention) model is used with a KV event utility that is unavailable; prevents silent misconfiguration that would cause incorrect routing at runtime
- New tests in `test_vllm_cache_info.py` covering all supported architecture kinds and the Mamba/hybrid rejection path

Root cause: The KV event utility availability check was missing for Mamba and hybrid architectures. When these models were loaded with an incompatible KV routing configuration, no error was raised at startup; the misconfiguration would only surface at inference time as incorrect behaviour.
Testing
- Unit tests (`test_vllm_cache_info.py`): 15 passed, 1 skipped; all Mamba/hybrid guard paths covered, including edge cases for mixed architecture configs
- Byte-compile check (`py_compile`): passed on all changed files
- Full vLLM unit suite (`test_vllm_unit.py`): could not run; requires the `dynamo.llm` native extension (Rust/maturin build), not available in the factory sandbox