[Draft] feat: LCF KV connector hooks for streaming attention and KV paging (LCF-323/P4-5) by arthurrasmusson-lb · Pull Request #2 · LightBitsLabs/vllm

arthurrasmusson-lb · 2026-04-15T02:00:08Z

Summary

Add three small hooks that enable KV connectors to perform windowed streaming attention for layers whose KV cache exceeds GPU capacity (e.g., NoPE layers in iRoPE models like Llama-4-Scout).

3 files changed, 68 insertions, 1 deletion.

Changes

kv_transfer_utils.py (+23 lines): Extend the maybe_transfer_kv_layer decorator to check connector.is_streaming_layer() and, for streaming layers, divert to compute_streaming_attention() instead of native attention.
base.py (+34 lines): Add is_streaming_layer() and compute_streaming_attention() as default methods on KVConnectorBase_V1. They return False and None respectively, so existing connectors are unaffected.
kv_cache_utils.py (+12 lines): Skip _check_enough_kv_cache_memory when the connector advertises supports_kv_paging=True, since such a connector can stream KV from fabric on demand.
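The first two changes can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: SketchConnectorBase, maybe_stream_kv_layer, ToyStreamingConnector, make_attention, the layer names, and all argument lists are hypothetical, and falling back to native attention when compute_streaming_attention() returns None is an assumption about the intended behavior.

```python
import functools
from typing import Any, Callable, Optional


class SketchConnectorBase:
    """Hypothetical stand-in for KVConnectorBase_V1 (not the real vLLM class)."""

    def is_streaming_layer(self, layer_name: str) -> bool:
        # Default: no layer is streamed, so existing connectors keep native attention.
        return False

    def compute_streaming_attention(self, layer_name: str, query: Any) -> Optional[Any]:
        # Default: no streaming result is produced.
        return None


def maybe_stream_kv_layer(get_connector: Callable[[], Optional[SketchConnectorBase]]):
    """Decorator sketch: divert to the connector only when it claims the layer."""

    def decorator(native_attention: Callable[[str, Any], Any]):
        @functools.wraps(native_attention)
        def wrapper(layer_name: str, query: Any) -> Any:
            connector = get_connector()
            if connector is not None and connector.is_streaming_layer(layer_name):
                out = connector.compute_streaming_attention(layer_name, query)
                if out is not None:  # assumption: None means "fall through"
                    return out
            return native_attention(layer_name, query)

        return wrapper

    return decorator


class ToyStreamingConnector(SketchConnectorBase):
    """Hypothetical connector that streams exactly one named layer."""

    def is_streaming_layer(self, layer_name: str) -> bool:
        return layer_name == "layers.3.nope_attn"

    def compute_streaming_attention(self, layer_name: str, query: Any) -> Optional[Any]:
        return ("streamed", layer_name, query)


def make_attention(connector: Optional[SketchConnectorBase]):
    """Build a toy attention function wrapped by the decorator sketch."""

    @maybe_stream_kv_layer(lambda: connector)
    def attention(layer_name: str, query: Any) -> Any:
        return ("native", layer_name, query)

    return attention
```

With the base class alone, every call reaches the native path. Only a subclass that overrides both hooks changes the result, and only for the layers it claims.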

Design

These hooks are connector-agnostic: any fabric-backed connector can implement the streaming protocol, not just LCF.

The upstream code changes here rely on functions in LCF. Some functions of the vLLM KV Connector V1 implementation should therefore be moved to Rust before we distribute code, so that our proprietary implementation is not copied. We only want to expose common or shared primitives for the OpenKV API while we decide the scope of our optimized implementation. All of those implementation details should move out of Python into compiled Rust binaries, obfuscated and stripped of debug symbols, to protect our IP from future theft. The risk is that a company such as DeepSeek collaborates on the common primitives and then takes our optimized implementation because we left the optimizations readable in Python in the artifacts we distribute.

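No existing functions are removed, and existing behavior changes only when a connector opts in. All new methods have safe defaults (is_streaming_layer() returns False, compute_streaming_attention() returns None), so existing connectors work unchanged.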
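The opt-in nature of the supports_kv_paging bypass can be illustrated with a small sketch. The function check_kv_cache_memory, its parameters, the exception type, and both connector classes are hypothetical; only the supports_kv_paging flag name comes from this PR, and reading it with getattr is an assumption about how a missing flag would be treated.

```python
from typing import Optional


class InsufficientKVCacheMemory(ValueError):
    """Hypothetical error raised when the KV cache cannot fit on the GPU."""


class LegacyConnector:
    """Hypothetical existing connector: it never declares the paging flag."""


class PagingConnector:
    """Hypothetical fabric-backed connector that opts in to KV paging."""

    supports_kv_paging = True


def check_kv_cache_memory(
    needed_bytes: int, available_bytes: int, connector: Optional[object] = None
) -> bool:
    """Return True if the capacity check ran, False if it was bypassed."""
    # A connector without the attribute (or no connector at all) reads as False,
    # so the capacity check still applies to it.
    if getattr(connector, "supports_kv_paging", False) is True:
        return False
    if needed_bytes > available_bytes:
        raise InsufficientKVCacheMemory(
            f"need {needed_bytes} bytes of KV cache, only {available_bytes} available"
        )
    return True
```

In this sketch a connector that does not set the flag still hits the capacity error, which matches the claim that existing connectors see no behavior change.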

Context

Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S using the Light Coretex Fabric connector with these hooks.

Jira: LCF-323

Test plan

Existing connectors unaffected (is_streaming_layer returns False by default)
maybe_transfer_kv_layer falls through to native attention when not streaming
supports_kv_paging bypass only activates when connector sets the flag
LCF connector implements streaming protocol and produces correct 10M-token output
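The correctness item above depends on windowed attention reproducing full attention once the per-window partial outputs are merged. Below is a minimal pure-Python sketch of that merge for a single query vector, using the standard log-sum-exp rescaling. It is not the merge_attn_states kernel and not the LCF implementation; all function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def partial_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> Tuple[List[float], float]:
    """Softmax attention over one window; returns (output, log-sum-exp of scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(out_a: Vector, lse_a: float, out_b: Vector, lse_b: float) -> Tuple[List[float], float]:
    """Combine two partial outputs, weighting each by its softmax mass exp(lse)."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    norm = wa + wb
    merged = [(wa * a + wb * b) / norm for a, b in zip(out_a, out_b)]
    return merged, m + math.log(norm)


def streaming_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector], window: int) -> List[float]:
    """Process the KV sequence one window at a time, keeping only a running result."""
    out, lse = partial_attention(q, keys[:window], values[:window])
    for start in range(window, len(keys), window):
        stop = start + window
        part, part_lse = partial_attention(q, keys[start:stop], values[start:stop])
        out, lse = merge_partials(out, lse, part, part_lse)
    return out
```

A test in this style would compare streaming_attention against partial_attention over the whole sequence and require agreement within floating-point tolerance.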

Add three small hooks that enable KV connectors to perform windowed streaming attention for layers whose KV cache exceeds GPU capacity (e.g., NoPE layers in iRoPE models like Llama-4-Scout):

1. kv_transfer_utils.py: extend maybe_transfer_kv_layer decorator to check connector.is_streaming_layer() and divert to compute_streaming_attention() instead of native attention
2. base.py: add is_streaming_layer() and compute_streaming_attention() default methods to KVConnectorBase_V1 (both return False/None by default so existing connectors are unaffected)
3. kv_cache_utils.py: skip _check_enough_kv_cache_memory when the connector advertises supports_kv_paging=True (the connector can stream KV from fabric on demand)

These hooks are connector-agnostic: any fabric-backed connector can implement the streaming protocol, not just LCF. The merge_attn_states Triton kernel needed for partial output merging already exists in vLLM.

Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S using the Light Coretex Fabric connector with these hooks.

Jira: LCF-323

Signed-off-by: Arthur Hanson Rasmusson <arthur@vgpu.io>

arthurrasmusson-lb force-pushed the feat/lcf-kv-connector-hooks branch from 9cf6c51 to dc55765 Compare April 15, 2026 02:03

arthurrasmusson-lb force-pushed the feat/lcf-kv-connector-hooks branch from dc55765 to a4cb8a9 Compare April 15, 2026 02:10

arthurrasmusson-lb wants to merge 1 commit into main from feat/lcf-kv-connector-hooks
