[Draft] feat: LCF KV connector hooks for streaming attention and KV paging (LCF-323/P4-5) #2
Open
arthurrasmusson-lb wants to merge 1 commit into
9cf6c51 to dc55765
Add three small hooks that enable KV connectors to perform windowed streaming attention for layers whose KV cache exceeds GPU capacity (e.g., NoPE layers in iRoPE models like Llama-4-Scout):

1. kv_transfer_utils.py: extend the maybe_transfer_kv_layer decorator to check connector.is_streaming_layer() and divert to compute_streaming_attention() instead of native attention
2. base.py: add is_streaming_layer() and compute_streaming_attention() default methods to KVConnectorBase_V1 (both return False/None by default so existing connectors are unaffected)
3. kv_cache_utils.py: skip _check_enough_kv_cache_memory when the connector advertises supports_kv_paging=True (the connector can stream KV from fabric on demand)

These hooks are connector-agnostic: any fabric-backed connector can implement the streaming protocol, not just LCF. The merge_attn_states Triton kernel needed for partial output merging already exists in vLLM.

Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S using the Light Coretex Fabric connector with these hooks.

Jira: LCF-323

Signed-off-by: Arthur Hanson Rasmusson <arthur@vgpu.io>
dc55765 to a4cb8a9
Summary
Add three small hooks that enable KV connectors to perform windowed streaming attention for layers whose KV cache exceeds GPU capacity (e.g., NoPE layers in iRoPE models like Llama-4-Scout).
3 files changed, 68 insertions, 1 deletion.
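Windowed streaming attention depends on merging partial attention outputs computed over separate KV windows. The sketch below illustrates that math in plain Python for a single query vector: each window yields an output and the log-sum-exp (LSE) of its scores, and two partial results are combined by weighting each with its share of the total softmax mass. It is not vLLM's merge_attn_states kernel; the function names and the list-based data layout are assumptions made for illustration only.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV window.

    Returns (output, lse), where lse is the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if attention had seen both windows."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    norm = w_a + w_b
    merged = [(w_a * a + w_b * b) / norm for a, b in zip(out_a, out_b)]
    return merged, m + math.log(norm)


def streaming_attention(q, keys, values, window):
    """Walk the KV cache one window at a time, merging as we go."""
    out, lse = None, None
    for start in range(0, len(keys), window):
        part_out, part_lse = partial_attention(
            q, keys[start:start + window], values[start:start + window]
        )
        if out is None:
            out, lse = part_out, part_lse
        else:
            out, lse = merge_partials(out, lse, part_out, part_lse)
    return out, lse
```

Because only one window of keys and values is needed at a time, the full KV cache for a layer never has to be resident at once, which is the property the streaming hooks rely on.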
Changes
- kv_transfer_utils.py (+23 lines): Extend the maybe_transfer_kv_layer decorator to check connector.is_streaming_layer() and divert to compute_streaming_attention() instead of native attention
- base.py (+34 lines): Add is_streaming_layer() and compute_streaming_attention() default methods to KVConnectorBase_V1; both return False/None by default so existing connectors are unaffected
- kv_cache_utils.py (+12 lines): Skip _check_enough_kv_cache_memory when the connector advertises supports_kv_paging=True, since the connector can stream KV from fabric on demand

Design
These hooks are connector-agnostic: any fabric-backed connector can implement the streaming protocol, not just LCF.

Note on distribution: the upstream code changes here rely on functions in LCF. Some functions of the vLLM KV Connector V1 should therefore be moved to Rust before any code is distributed, to avoid copying our proprietary implementation. We only want to expose common or shared primitives for the OpenKV API while we decide the scope of our optimized implementation. All of those implementation details should move out of Python into an obfuscated, compiled Rust binary with debug symbols stripped, to protect our IP from future theft. For example, DeepSeek or a similar company could collaborate on the common primitives and then take our optimized implementation because we left the optimizations readable in Python in the artifacts we distribute.
No existing functions are modified or removed. All new methods have safe defaults (False/None) so existing connectors work unchanged.
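The three hooks listed under Changes can be sketched roughly as follows. This is a minimal, standalone illustration and not the diff itself: ConnectorSketch stands in for KVConnectorBase_V1, divert_streaming_layers stands in for the extended maybe_transfer_kv_layer decorator, and check_kv_cache_memory stands in for the guarded call to _check_enough_kv_cache_memory. The argument lists, the fallback to native attention when the connector returns None, and the byte-count parameters are all assumptions.

```python
import functools


class ConnectorSketch:
    """Hypothetical stand-in for KVConnectorBase_V1 (new hooks only)."""

    # Assumed attribute form of the supports_kv_paging advertisement.
    supports_kv_paging = False

    def is_streaming_layer(self, layer_name):
        # Safe default: no layer is streamed.
        return False

    def compute_streaming_attention(self, layer_name, query, key, value):
        # Safe default: no streaming result is produced.
        return None


def divert_streaming_layers(connector):
    """Hypothetical decorator: route streaming layers to the connector."""

    def decorator(native_attention):
        @functools.wraps(native_attention)
        def wrapper(layer_name, query, key, value):
            if connector is not None and connector.is_streaming_layer(layer_name):
                result = connector.compute_streaming_attention(
                    layer_name, query, key, value
                )
                if result is not None:
                    return result
            # Assumption: anything else falls through to native attention.
            return native_attention(layer_name, query, key, value)

        return wrapper

    return decorator


def check_kv_cache_memory(connector, needed_bytes, available_bytes):
    """Hypothetical capacity check that is skipped for paging connectors."""
    if connector is not None and getattr(connector, "supports_kv_paging", False):
        return  # the connector can stream KV from fabric on demand
    if needed_bytes > available_bytes:
        raise ValueError(
            f"KV cache needs {needed_bytes} bytes, only {available_bytes} available"
        )
```

With the defaults in place, a connector that overrides neither method behaves exactly like the native path, which is the property the paragraph above claims for existing connectors.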
Context
Demonstrated at GTC 2026: 26-second TTFT at 10M tokens on 4x L40S using the Light Coretex Fabric connector with these hooks.
Jira: LCF-323
Test plan