[Draft] feat: LCF KV connector hooks for streaming attention and KV paging (LCF-323/P4-5)#2

Open: arthurrasmusson-lb wants to merge 1 commit into main from feat/lcf-kv-connector-hooks


arthurrasmusson-lb (Collaborator) commented Apr 15, 2026

Summary

Add three small hooks that enable KV connectors to perform windowed streaming attention for layers whose KV cache exceeds GPU capacity (e.g., NoPE layers in iRoPE models like Llama-4-Scout).

3 files changed, 68 insertions, 1 deletion.

Changes

  1. kv_transfer_utils.py (+23 lines): Extend maybe_transfer_kv_layer decorator to check connector.is_streaming_layer() and divert to compute_streaming_attention() instead of native attention

  2. base.py (+34 lines): Add is_streaming_layer() and compute_streaming_attention() default methods to KVConnectorBase_V1. The defaults return False and None respectively, so existing connectors are unaffected

  3. kv_cache_utils.py (+12 lines): Skip _check_enough_kv_cache_memory when the connector advertises supports_kv_paging=True — the connector can stream KV from fabric on demand
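
As a rough illustration of changes 1 and 2, here is a minimal, self-contained sketch of how the default hooks and the decorator diversion could fit together. It does not import vLLM. The method signatures, the `maybe_stream` decorator, the `FabricConnector` class, and the fallback to native attention when the hook returns None are all assumptions made for illustration; only the hook names `is_streaming_layer` and `compute_streaming_attention` come from this PR.

```python
import functools
from typing import Any, Callable, Optional, Set


class ConnectorBase:
    """Stand-in for KVConnectorBase_V1; only the two new hooks are modeled."""

    def is_streaming_layer(self, layer_name: str) -> bool:
        # Default: no layer streams, so existing connectors are unaffected.
        return False

    def compute_streaming_attention(self, layer_name: str, query: Any) -> Optional[Any]:
        # Default: no streaming result is produced.
        return None


class FabricConnector(ConnectorBase):
    """Hypothetical connector that streams KV for a fixed set of layers."""

    def __init__(self, streaming_layers: Set[str]) -> None:
        self.streaming_layers = streaming_layers

    def is_streaming_layer(self, layer_name: str) -> bool:
        return layer_name in self.streaming_layers

    def compute_streaming_attention(self, layer_name: str, query: Any) -> Optional[Any]:
        # Placeholder; a real connector would run windowed attention here.
        return f"streamed({layer_name}, {query})"


def maybe_stream(get_connector: Callable[[], Optional[ConnectorBase]]):
    """Decorator sketch: divert streaming layers to the connector."""

    def decorator(native_attention):
        @functools.wraps(native_attention)
        def wrapper(layer_name, query):
            connector = get_connector()
            if connector is not None and connector.is_streaming_layer(layer_name):
                out = connector.compute_streaming_attention(layer_name, query)
                if out is not None:  # assumed fallback when the hook declines
                    return out
            return native_attention(layer_name, query)

        return wrapper

    return decorator


connector = FabricConnector({"layers.3"})


@maybe_stream(lambda: connector)
def attention(layer_name, query):
    return f"native({layer_name}, {query})"
```

With this sketch, `attention("layers.3", "q")` takes the connector path, while any layer the connector does not claim still goes through the native function.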

Design

These hooks are connector-agnostic: any fabric-backed connector can implement the streaming protocol, not just LCF.

Note that the upstream code changes here rely on functions in LCF. Some functions of the vLLM KV Connector V1 should therefore be moved to Rust before we distribute code, so that our proprietary implementation is not copied. We only want to expose common or shared primitives for the OpenKV API while we decide the scope of our optimized implementation, and we want to move all of those implementation details out of Python into binary Rust to protect our IP from future theft. For example, a DeepSeek or similar company could collaborate on common primitives and then steal our optimized implementation if we left the optimizations readable in Python in the artifacts we distribute, instead of shipping a compiled, obfuscated implementation with debug symbols stripped.

No existing functions are removed, and existing code paths change only when a connector opts in. All new methods have safe defaults (False/None), so existing connectors work unchanged.

Context

Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S using the Light Coretex Fabric connector with these hooks.

Jira: LCF-323

Test plan

  • Existing connectors unaffected (is_streaming_layer returns False by default)
  • maybe_transfer_kv_layer falls through to native attention when not streaming
  • supports_kv_paging bypass only activates when connector sets the flag
  • LCF connector implements streaming protocol and produces correct 10M-token output
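
The third item in the test plan could be exercised with something like the following sketch. It is not the vLLM implementation: the function name `check_enough_kv_cache_memory`, its byte-count parameters, and both connector classes are hypothetical stand-ins. Only the `supports_kv_paging` flag name comes from this PR.

```python
class PlainConnector:
    """Hypothetical connector that does not advertise paging."""


class PagingConnector:
    """Hypothetical connector that can stream KV from fabric on demand."""

    supports_kv_paging = True


def check_enough_kv_cache_memory(needed_bytes, available_bytes, connector=None):
    """Raise if the KV cache does not fit, unless the connector pages KV."""
    # Bypass only when the connector explicitly sets the flag to True.
    if getattr(connector, "supports_kv_paging", False) is True:
        return
    if needed_bytes > available_bytes:
        raise ValueError(
            f"KV cache needs {needed_bytes} bytes but only "
            f"{available_bytes} bytes are available"
        )


# A paging connector skips the check even when memory is insufficient.
check_enough_kv_cache_memory(10, 5, PagingConnector())

# A connector without the flag still hits the original check.
try:
    check_enough_kv_cache_memory(10, 5, PlainConnector())
    bypassed = True
except ValueError:
    bypassed = False
```

Reading the flag with `getattr` and a False default means connectors that never define it, or the absence of a connector, keep the existing behavior.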

@arthurrasmusson-lb arthurrasmusson-lb force-pushed the feat/lcf-kv-connector-hooks branch from 9cf6c51 to dc55765 Compare April 15, 2026 02:03
Add three small hooks that enable KV connectors to perform windowed
streaming attention for layers whose KV cache exceeds GPU capacity
(e.g., NoPE layers in iRoPE models like Llama-4-Scout):

1. kv_transfer_utils.py: extend maybe_transfer_kv_layer decorator to
   check connector.is_streaming_layer() and divert to
   compute_streaming_attention() instead of native attention

2. base.py: add is_streaming_layer() and compute_streaming_attention()
   default methods to KVConnectorBase_V1 (both return False/None by
   default so existing connectors are unaffected)

3. kv_cache_utils.py: skip _check_enough_kv_cache_memory when the
   connector advertises supports_kv_paging=True (the connector can
   stream KV from fabric on demand)

These hooks are connector-agnostic — any fabric-backed connector can
implement the streaming protocol, not just LCF. The merge_attn_states
Triton kernel needed for partial output merging already exists in vLLM.

Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S
using the Light Coretex Fabric connector with these hooks.

Jira: LCF-323
Signed-off-by: Arthur Hanson Rasmusson <arthur@vgpu.io>
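
The commit message mentions merging partial attention outputs. The sketch below shows the usual log-sum-exp math behind such a merge in pure Python, for a single query and without the 1/sqrt(d) scaling. It is a reference illustration only: it does not model the signature or behavior of vLLM's merge_attn_states kernel, and the function names `attend`, `merge`, and `streaming_attention` are made up here.

```python
import math


def attend(query, keys, values):
    """Softmax attention for one query over one KV window.

    Returns the output vector and the log-sum-exp of the scores.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    lse = peak + math.log(total)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into one."""
    peak = max(lse_a, lse_b)
    lse = peak + math.log(math.exp(lse_a - peak) + math.exp(lse_b - peak))
    # Each partial output is reweighted by its share of the softmax mass.
    weight_a = math.exp(lse_a - lse)
    weight_b = math.exp(lse_b - lse)
    out = [weight_a * a + weight_b * b for a, b in zip(out_a, out_b)]
    return out, lse


def streaming_attention(query, keys, values, window):
    """Attend window by window, merging partial results as they arrive."""
    out, lse = attend(query, keys[:window], values[:window])
    for start in range(window, len(keys), window):
        end = start + window
        part_out, part_lse = attend(query, keys[start:end], values[start:end])
        out, lse = merge(out, lse, part_out, part_lse)
    return out, lse
```

Because each window carries its own log-sum-exp, the merged result matches attention over the full KV sequence up to floating-point error, so only one window of KV needs to be resident at a time.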
@arthurrasmusson-lb arthurrasmusson-lb force-pushed the feat/lcf-kv-connector-hooks branch from dc55765 to a4cb8a9 Compare April 15, 2026 02:10