
fix(vllm): raise ValueError for Mamba/hybrid when KV event utility is unavailable #9500

Open
MatejKosec wants to merge 1 commit into main from user/MatejKosec/agent/9376-vllm-mamba-kv-fallback

Conversation

Contributor

MatejKosec commented May 13, 2026

Summary

  • Add ValueError when a Mamba or hybrid (Mamba+attention) model is used with a KV event utility that is unavailable — prevents silent misconfiguration that would cause incorrect routing at runtime
  • Add unit tests for the new guard: 15 tests in test_vllm_cache_info.py covering all supported architecture kinds and the Mamba/hybrid rejection path

Root cause: The KV event utility availability check was missing for Mamba and hybrid architectures. When these models were loaded with an incompatible KV routing configuration, no error was raised at startup — the misconfiguration would only surface at inference time as incorrect behavior.

Testing

  • New unit tests (test_vllm_cache_info.py): 15 passed, 1 skipped — all Mamba/hybrid guard paths covered, including edge cases for mixed architecture configs
  • Python syntax check (py_compile): passed on all changed files
  • Existing vLLM unit suite (test_vllm_unit.py): could not run — requires dynamo.llm native extension (Rust/maturin build); not available in factory sandbox
  • Hardware tested: none required — pure Python guard added at config-validation time, no GPU needed to verify the ValueError path

fix(vllm): raise ValueError for Mamba/hybrid when KV event utility is unavailable

When configure_kv_event_block_size falls back because get_kv_cache_group_metadata throws, Mamba and speculative/hybrid models must raise a clear ValueError instead of silently falling back to cache_config.block_size (16). This prevents the KV router from dropping events due to a block-size mismatch. Pure-attention models retain the existing silent fallback for backward compatibility.

Signed-off-by: mkosec@nvidia.com <mkosec@nvidia.com>
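
For orientation, here is a minimal sketch of the guard the commit message describes. Only the function and helper names (configure_kv_event_block_size, get_kv_cache_group_metadata, detect_mamba_hybrid_model, select_main_attention_block_size) come from this PR; the signature, the fetch_metadata stand-in, and the config attribute paths are assumptions, not the code as merged.

```python
import logging

logger = logging.getLogger(__name__)


def configure_kv_event_block_size(vllm_config, fetch_metadata) -> int:
    """Pick the KV event block size, failing loudly for Mamba/hybrid models.

    `fetch_metadata` stands in for the get_kv_cache_group_metadata engine
    utility; detect_mamba_hybrid_model and select_main_attention_block_size
    are the helpers this PR adds to cache_info.py.
    """
    is_mamba_or_hybrid = detect_mamba_hybrid_model(vllm_config)
    try:
        metadata = fetch_metadata()
    except Exception as e:
        if is_mamba_or_hybrid:
            # A silent fallback would leave the KV router with a mismatched
            # block size and make it drop events, so fail at startup instead.
            raise ValueError(
                "Failed to fetch KV cache group metadata for hybrid/Mamba "
                f"model. Original error: {e}"
            ) from e
        # Pure-attention models keep the previous warning-and-fallback path.
        logger.warning("KV metadata unavailable; falling back to block_size: %s", e)
        return vllm_config.cache_config.block_size
    return select_main_attention_block_size(metadata)
```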
MatejKosec requested review from a team as code owners May 13, 2026 19:54

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the fix and backend::vllm (Relates to the vllm backend) labels May 13, 2026
Contributor

coderabbitai Bot commented May 13, 2026

Walkthrough

This PR adds detection of Mamba-based and hybrid/speculative-decode vLLM models via architecture inspection, then updates KV cache configuration to raise ValueError for these models when utility metadata is unavailable instead of silently falling back. Non-hybrid models preserve the previous warning-and-fallback behavior.
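
A minimal sketch of how that detection might look, based only on the behavior described in this walkthrough; the attribute paths, the architecture names, and the exact signature are assumptions rather than the PR's code.

```python
from types import SimpleNamespace

# Illustrative allow-list; the real _MAMBA_ARCHITECTURES in cache_info.py
# is authoritative and may contain different identifiers.
_MAMBA_ARCHITECTURES = {"MambaForCausalLM", "Mamba2ForCausalLM"}


def detect_mamba_hybrid_model(vllm_config) -> bool:
    """Return True for speculative-decode configs or Mamba architectures."""
    if getattr(vllm_config, "speculative_config", None) is not None:
        return True
    hf_config = getattr(vllm_config.model_config, "hf_config", None)
    architectures = getattr(hf_config, "architectures", None) or []
    return any(arch in _MAMBA_ARCHITECTURES for arch in architectures)


# Example: a pure-attention model with no speculative config is not flagged.
cfg = SimpleNamespace(
    speculative_config=None,
    model_config=SimpleNamespace(
        hf_config=SimpleNamespace(architectures=["LlamaForCausalLM"])
    ),
)
assert detect_mamba_hybrid_model(cfg) is False
```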

Changes

Mamba/Hybrid Model Detection and KV Cache Configuration

  • Mamba/Hybrid Model Detection (components/src/dynamo/vllm/cache_info.py, components/src/dynamo/vllm/tests/test_vllm_cache_info.py): Introduces the _MAMBA_ARCHITECTURES constant and a detect_mamba_hybrid_model() function that returns True for models with speculative configs or Mamba architecture identifiers. Tests validate detection across speculative configs, Mamba architectures, missing HuggingFace configs, and empty architecture lists.
  • KV Cache Configuration with Model-Type Error Handling (components/src/dynamo/vllm/cache_info.py, components/src/dynamo/vllm/tests/test_vllm_cache_info.py): Updates configure_kv_event_block_size() to detect the model type and raise ValueError when get_kv_cache_group_metadata fails for Mamba/hybrid models; non-hybrid models silently fall back. Tests cover successful metadata storage, fallback behavior, and ValueError cases.
  • Supporting Block Size and Getter Tests (components/src/dynamo/vllm/tests/test_vllm_cache_info.py): Tests for select_main_attention_block_size() and get_configured_kv_event_block_size() covering metadata fallback scenarios and cached vs. default block size retrieval.
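
To make the test surface concrete, here is a hedged sketch of the shape one of the ValueError tests could take, reusing the simplified configure_kv_event_block_size signature from the sketch under the commit message; the real tests in test_vllm_cache_info.py will differ in fixtures, patching, and call sites.

```python
import pytest
from types import SimpleNamespace


def test_mamba_model_raises_when_metadata_unavailable():
    # Hypothetical config object; the architecture name is illustrative only.
    vllm_config = SimpleNamespace(
        speculative_config=None,
        cache_config=SimpleNamespace(block_size=16),
        model_config=SimpleNamespace(
            hf_config=SimpleNamespace(architectures=["MambaForCausalLM"])
        ),
    )

    def failing_metadata():
        raise RuntimeError("engine utility unavailable")

    with pytest.raises(ValueError, match="hybrid/Mamba"):
        configure_kv_event_block_size(vllm_config, failing_metadata)
```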

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Title check (✅ Passed): The title accurately describes the main change: raising a ValueError for Mamba/hybrid models when the KV event utility is unavailable, which aligns with the PR's core modification to configure_kv_event_block_size() behavior.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): The pull request provides a clear summary with root cause analysis, testing details, and specific changes. It addresses all key aspects but lacks the structured sections from the template.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/vllm/cache_info.py (1)

13-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

sink_full_attention is misclassified as a main-attention kind.

Line 16 currently includes "sink_full_attention" in MAIN_ATTENTION_KV_CACHE_KINDS, so sink-only metadata is treated as primary and won’t fall back. That conflicts with the new fallback behavior exercised by the tests.

Suggested fix
 MAIN_ATTENTION_KV_CACHE_KINDS = {
     "full_attention",
     "mla_attention",
-    "sink_full_attention",
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/vllm/cache_info.py` around lines 13 - 17, The set
MAIN_ATTENTION_KV_CACHE_KINDS incorrectly includes the sink-only kind
"sink_full_attention", causing sink metadata to be treated as main-attention and
preventing fallback; remove "sink_full_attention" from
MAIN_ATTENTION_KV_CACHE_KINDS so that only true main-attention kinds
("full_attention", "mla_attention") remain and sink-only entries will fall back
as intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0014d183-7e45-431c-8ed4-e93f302ee7fc

📥 Commits

Reviewing files that changed from the base of the PR and between 0ff011b and 042b88e.

📒 Files selected for processing (2)
  • components/src/dynamo/vllm/cache_info.py
  • components/src/dynamo/vllm/tests/test_vllm_cache_info.py

Comment on lines +99 to +101
f"Failed to fetch KV cache group metadata for hybrid/Mamba model "
f"(architectures={architectures}, speculative_config is not None). "
f"The get_kv_cache_group_metadata engine utility must be available "
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

ValueError message is inaccurate for non-speculative Mamba models.

Line 100 always says speculative_config is not None, which is false on the pure-Mamba path (e.g., speculative_config = None).

Suggested fix
             raise ValueError(
                 f"Failed to fetch KV cache group metadata for hybrid/Mamba model "
-                f"(architectures={architectures}, speculative_config is not None). "
+                f"(architectures={architectures}, "
+                f"speculative_config_present={vllm_config.speculative_config is not None}). "
                 f"The get_kv_cache_group_metadata engine utility must be available "
                 f"to determine the correct KV event block size. Original error: {e}"
             ) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/vllm/cache_info.py` around lines 99 - 101, The
ValueError message in cache_info.py misleadingly hardcodes "speculative_config
is not None" even when speculative_config is None; update the error string
constructed where architectures and speculative_config are referenced (the
f-string that begins with "Failed to fetch KV cache group metadata...") so it
reflects the actual condition (e.g., include the real speculative_config value
or a conditional phrase like "speculative_config={speculative_config}" or
"speculative_config is set" / "not set") and mention get_kv_cache_group_metadata
by name to make the log accurate; adjust only the message text (leave logic
intact) in the function or block that raises this ValueError.

Comment on lines +17 to +21
pytestmark = [
pytest.mark.unit,
pytest.mark.vllm,
pytest.mark.pre_merge,
]
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required GPU test marker at module scope.

Lines 17-21 include scheduling and type markers, but the required GPU marker is missing for this test module.

Suggested fix
 pytestmark = [
     pytest.mark.unit,
     pytest.mark.vllm,
+    pytest.mark.gpu,
     pytest.mark.pre_merge,
 ]

As per coding guidelines: "ensure every test has required markers (scheduling + GPU + type)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/vllm/tests/test_vllm_cache_info.py` around lines 17 -
21, The module-level pytest markers (pytestmark) in test_vllm_cache_info.py
currently include scheduling and type markers but lack the required GPU marker;
update the pytestmark list (the module-scope variable named pytestmark) to
include pytest.mark.gpu so the module has scheduling + GPU + type markers (e.g.,
add pytest.mark.gpu alongside pytest.mark.unit and pytest.mark.vllm).

}

# Known Mamba architecture identifiers present in vLLM's HF config.
_MAMBA_ARCHITECTURES = {


The architecture allow-list misses vLLM Mamba/hybrid classes such as FalconMambaForCausalLM, Mamba2ForCausalLM, and JambaForCausalLM, so those models can silently fall back to cache_config.block_size when the utility is unavailable. Fix: detect via vLLM's model hybrid/attention-free metadata or include all supported Mamba/hybrid architecture names.
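
If this finding holds, one low-risk option is simply widening the allow-list. The sketch below contains only the class names cited in this comment; it is not an exhaustive or verified list, and detecting via vLLM's own hybrid/attention-free metadata may be the more robust fix.

```python
# Candidate additions taken from the review comment above; verify each name
# against the vLLM model registry before extending the real constant.
_MAMBA_ARCHITECTURES = {
    "FalconMambaForCausalLM",
    "Mamba2ForCausalLM",
    "JambaForCausalLM",
    # ...plus the identifiers the PR already lists here.
}
```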

)
except Exception as e:
if is_mamba_or_hybrid:
model_cls = type(vllm_config.model_config.hf_config).__name__


The error path dereferences vllm_config.model_config.hf_config even though speculative configs are classified as hybrid without requiring hf_config, so a missing HF config raises AttributeError instead of the intended ValueError. Fix: fetch hf_config with getattr(..., None) before formatting the error message.
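
A small sketch of that defensive lookup, with attribute names taken from the excerpt above and a SimpleNamespace stand-in for the config (purely illustrative):

```python
from types import SimpleNamespace

# Stand-in config with no hf_config attribute, mimicking the speculative path.
vllm_config = SimpleNamespace(model_config=SimpleNamespace())

hf_config = getattr(vllm_config.model_config, "hf_config", None)
model_cls = type(hf_config).__name__ if hf_config is not None else "<unknown>"
print(model_cls)  # prints "<unknown>" instead of raising AttributeError
```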


Labels

backend::vllm (Relates to the vllm backend), fix, size/L
