RFC: Session-aware KV cache management #18

@cyr20040123

Problem statement / motivation

RFC: Session-Aware KV Cache Management in agentic-api

Abstract

Session cache is a fundamental feature of the Responses API. It optimizes multi-turn dialogue inference by automatically caching a session's KV cache, which cuts redundant computation, lowers API costs, and shortens response latency for context-aware LLM interactions.

This RFC proposes introducing a Session Cache Manager into the Core Orchestration Layer of agentic-api. It maintains mappings between KV cache blocks and sessions, providing unified lifecycle management (tiering, migration, eviction, and invalidation) of KV cache on the agentic-api side. agentic-api acts as the authoritative KV cache scheduler, interacting with the vLLM server either through the KVConnector interface or through a KV store connected to vLLM. This design also lays the groundwork for advanced optimizations such as the priority-based eviction and proactive cache invalidation proposed in vLLM RFC #37003 and RFC #37168.


1. Motivation

1.1 Current Pain Points

  • Agentic workloads break prefix caching: In a typical agent turn, over 90% of tokens repeat the previous turn's prefix. However, during tool-call pauses (which account for 40–60% of wall time), KV cache blocks have no active references and are evicted by LRU. When the session resumes, the entire context (70K–200K tokens) must be re-prefilled from scratch.
  • vLLM's LRU policy is session-agnostic: LRU only considers recency; it has no awareness of which sessions are still active and which have ended, causing zombie blocks to occupy GPU HBM indefinitely.
  • No home for stateful session management: vLLM core remains stateless, and the Responses API layer lacks a system-level mapping between sessions and KV cache blocks.

1.2 Why Implement This in agentic-api

  • agentic-api is the natural owner of stateful Responses API logic, holding semantic information such as session_id, previous_response_id, and session liveness.
  • This aligns with the architectural separation of concerns: agentic-api owns cache policy, while vLLM owns cache mechanism.
  • It enables session routing and migration across vLLM instances — something a stateless vLLM cannot achieve on its own.

2. Background

  • The OpenAI Responses API supports multi-turn session chaining via previous_response_id.
  • Alibaba's Session Cache (x-dashscope-session-cache) has validated the feasibility of server-side automatic context caching (5-minute TTL, reset on cache hit).
  • vLLM RFC #37003: Proposes a RetentionDirective API that allows an orchestrator to set per-range block eviction priority (0–100) and TTL, with multi-tenant isolation via retention_scope (session_id).
  • vLLM RFC #37168: Proposes a proactive invalidation endpoint POST /release_kv_cache, session-aware reference counting, and a dual-zone scheduling strategy (Aging Zone / Fresh Zone), achieving a 26% reduction in TTFT under high-concurrency workloads.
  • vLLM KVConnector ecosystem: MooncakeStore, LMCache, and similar backends provide multi-tier, cross-node KV block transfer capabilities.

3. Design Goals

| Goal | Description |
|------|-------------|
| G1 | agentic-api serves as the authoritative KV cache lifecycle scheduler, supporting session-based cache management |
| G2 | Support multi-tier storage (GPU HBM → DRAM → NVMe → object storage) |
| G3 | Support session migration across vLLM instances (horizontal scaling) |
| G4 | Zero overhead for non-agentic stateless workloads |
| G5 | Provide the foundation for implementing vLLM RFC #37003 and #37168 |

Proposed solution

4. Architecture

4.1 Approach Comparison

Approach A: Shared KV Cache Store with vLLM Server

agentic-api (ResponsesStore for metadata)
    ↕ KV$ block metadata sync
Mooncake Store / LMCache (shared)
    ↕ KVConnector
vLLM Server (GPU HBM)

Pros:

  • Zero-copy: vLLM and agentic-api access the same physical blocks with no data movement.
  • Lowest latency: On a cache hit, vLLM reads blocks directly with no cross-tier loading.
  • Natural alignment with RFC #37003: agentic-api knows which sessions are active and can issue priority directives directly.

Cons:

  • Tight coupling to the vLLM runtime; agentic-api upgrades or restarts must be coordinated with vLLM.
  • MooncakeStore is currently optimized for disaggregated prefill; session-level persistence requires customization.
  • Complex session routing in multi-node vLLM deployments.

Approach B: Independent External KV Cache Store

agentic-api (ResponsesStore for metadata + External KV$ Store)
    ↓ on-demand load when a session request arrives
vLLM KVConnector (P2P load to KV$ store that can be operated by vLLM)
    ↓
vLLM Server (GPU HBM)

Multi-tier storage options:

| Tier | Backend | Use Case |
|------|---------|----------|
| L1-Hot | Mooncake Store (RDMA) / LMCache | Fast cross-node migration / CPU offloading |
| L2-Hot | Redis / Valkey (cluster mode) | For less-active sessions |
| L3-Cold | NVMe/SSD (io_uring) | Long-idle sessions |
| L4-Archive | S3 / MinIO object storage | Long-term persistence |

Pros:

  • Fully consistent with the agentic-api architecture philosophy: the Core Orchestration Layer owns all state; vLLM executes statelessly.
  • Enables cross-node session migration: sessions can be routed to any vLLM instance.
  • Finer-grained lifecycle control: explicit eviction, tiered TTL, quota management.

Cons:

  • Cold-start load latency (milliseconds to hundreds of milliseconds for L2/L3 tiers).
  • Requires serialization/deserialization of the KV block format across different storage backends.

4.2 Recommended Approach: Hybrid Multi-Tier Architecture

Rather than choosing one approach, the two should be composed in layers (with the hot tier directly accessible by vLLM):

┌─────────────────────────────────────────────────┐
│         agentic-api (Session Cache Manager)     │
│  ┌───────────────────┐   ┌────────────────────┐ │
│  │Meta Data Storage  │   │ Orchestrator       │ │
│  │(block2session map)│   │                    │ │
│  └───────────────────┘   └────────────────────┘ │
└────────────────────┬────────────────────────────┘
     Eviction/Mgmt   │  On-demand Load / Offload
┌────────────────────▼────────────────────────────┐
│         Unified KV Cache Storage (Tiered)       │
│  Hot:  Mooncake Store / LMCache (DRAM/RDMA)     │
│  Warm: Redis / SSD-backed LMCache               │
│  Cold: Object Storage (S3/MinIO)                │
└────────────────────┬────────────────────────────┘
                     │ KVConnector
┌────────────────────▼────────────────────────────┐
│             vLLM Server (GPU HBM)               │
└─────────────────────────────────────────────────┘

4.3 Data Model: Block-Session Mapping

from dataclasses import dataclass
from datetime import datetime
from typing import Literal


@dataclass
class CacheBlockRecord:
    block_hash: str           # vLLM prefix hash (see Section 5.1)
    session_id: str           # Responses API session
    token_start: int
    token_end: int
    tier: Literal["hot", "warm", "cold"]
    storage_location: str     # Mooncake/Redis/S3 URI
    created_at: datetime
    last_hit_at: datetime
    ttl_seconds: int
    ref_count: int            # for cross-session sharing
    eviction_priority: int    # fed to RFC #37003

Note: block_hash is a rolling hash computed internally by vLLM based on token content; it is not directly accessible from the agentic-api side. See Section 5.1 for a detailed analysis of this core challenge and the proposed solutions.


5. Detailed Design

5.1 Core Challenge: Establishing the Session-to-KV-Block Mapping

This is the most critical engineering challenge in this proposal.

The fundamental tension:

agentic-api has: session_id, previous_response_id, full message history
vLLM has:        token sequences, block_hash (content hash), GPU memory state

No shared identifier exists between the two sides → no direct mapping is possible

vLLM's prefix caching is entirely based on content hashing (a rolling hash over token sequences). It neither accepts nor stores any session semantics. From the agentic-api side, KV cache blocks are a black box.

Three viable approaches (priority order: Path 1 > Path 2 > Path 3):

Path 1: Content Hash Reconstruction (Near-Term Feasibility — No vLLM Changes Required)

Since agentic-api holds the full message history, it can reproduce vLLM's block hash computation locally, independently establishing the session-to-block mapping:

session messages
    ↓ call vLLM /tokenize API (avoid re-implementing the tokenizer)
token_ids
    ↓ split by vLLM block_size (read from vLLM config)
    ↓ reproduce vLLM rolling hash algorithm
[block_hash_1, block_hash_2, ..., block_hash_N]
    ↓
session_id → [block_hash] mapping written to ResponsesStore

vLLM's prefix cache hashes are deterministic (identical token sequences always yield identical block hashes). As long as agentic-api retains the relevant data (e.g., cache_salt) and reproduces the same algorithm, it can stay in sync with vLLM's internal state.
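
The chained structure of such a hash can be sketched as follows. This is a minimal illustration only and is not vLLM's actual hash function: the name `chain_block_hashes`, the SHA-256 over a JSON encoding, and the use of `cache_salt` as the seed of the chain are all assumptions made for the example. A real Path 1 implementation would have to match vLLM's algorithm exactly for the targeted version.

```python
import hashlib
import json


def chain_block_hashes(
    token_ids: list[int], block_size: int, cache_salt: str = ""
) -> list[str]:
    """Hash each full block together with its parent's hash (illustrative only)."""
    hashes: list[str] = []
    parent = cache_salt  # assumption: the salt seeds the chain
    full_blocks = len(token_ids) // block_size  # this sketch skips a trailing partial block
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = json.dumps([parent, block]).encode("utf-8")
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


if __name__ == "__main__":
    turn_1 = chain_block_hashes(list(range(8)), block_size=4)
    turn_2 = chain_block_hashes(list(range(8)) + [50, 51, 52, 53], block_size=4)
    # The second turn extends the first, so the shared prefix yields the same hashes.
    print(turn_2[:2] == turn_1)
```

Because each hash includes its parent, two sessions share a block hash only when their entire token prefix up to that block is identical, which is what makes the session-to-block mapping reproducible from message history alone.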

Pros:

  • Requires no changes to vLLM; implementable today.
  • The mapping is fully managed by agentic-api, independent of any vLLM API changes.
  • Feasibility can be validated independently, establishing a baseline for future migration to other paths.

Cons and Risks:

  • Tightly coupled to vLLM's internal hash algorithm; must be updated whenever vLLM changes its hashing implementation.
  • Each request requires an additional /tokenize call (one extra RTT), mitigated by caching token_ids.
  • Requires reading block_size from vLLM config; cross-version compatibility must be maintained.

Path 2: vLLM Returns Block Metadata in Responses (Alternative Approach)

Submit a small upstream PR to vLLM to expose the newly created and reused blocks for each request in the usage field of inference responses, providing accurate mapping data directly from vLLM:

{
  "usage": {
    "prompt_tokens": 4096,
    "cached_tokens": 3840,
    "kv_cache_blocks": {
      "created": ["hash_A", "hash_B"],
      "reused":  ["hash_C", "hash_D", "hash_E"]
    }
  }
}

agentic-api then builds the mapping directly from the response:

response_id → [hash_A, hash_B, hash_C, ...]
     ↓  (ResponsesStore already has response_id → session_id)
session_id  → [hash_A, hash_B, hash_C, ...]
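
A minimal sketch of that bookkeeping is shown below, assuming the hypothetical `kv_cache_blocks` usage field from the JSON above is present. The function name `record_response_blocks` and the two plain dictionaries standing in for ResponsesStore are illustrative, not part of any existing API.

```python
def record_response_blocks(
    response: dict,
    response_id: str,
    response_to_session: dict[str, str],
    session_blocks: dict[str, list[str]],
) -> list[str]:
    """Attach the block hashes reported for one response to its session."""
    meta = response.get("usage", {}).get("kv_cache_blocks", {})
    # Hypothetical field layout: "created" and "reused" lists of block hashes.
    hashes = list(meta.get("created", [])) + list(meta.get("reused", []))
    session_id = response_to_session[response_id]
    known = session_blocks.setdefault(session_id, [])
    for block_hash in hashes:
        if block_hash not in known:  # keep one entry per block per session
            known.append(block_hash)
    return hashes


if __name__ == "__main__":
    response = {
        "usage": {
            "prompt_tokens": 4096,
            "cached_tokens": 3840,
            "kv_cache_blocks": {
                "created": ["hash_A", "hash_B"],
                "reused": ["hash_C", "hash_D", "hash_E"],
            },
        }
    }
    sessions: dict[str, list[str]] = {}
    record_response_blocks(response, "resp_1", {"resp_1": "sess_1"}, sessions)
    print(sessions)
```

A response without the field simply yields an empty list, so the same code path can serve vLLM versions that do not expose block metadata.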

Pros: The mapping is provided directly by vLLM, accurate and reliable, with no hash algorithm coupling risk. The vLLM change is minimal (block metadata already exists internally; it only needs to be exposed). This can be submitted as a standalone PR to vLLM, independent of RFC #37003/#37168.

Prerequisites: Requires acceptance of the upstream PR by the vLLM community. Once Path 1 validates feasibility, this can be pursued as a more elegant long-term alternative.

Path 3: Native session_id Passthrough in vLLM (Supplementary Approach)

Add a session_id field to vLLM's request parameters and propagate it internally so that vLLM writes session_id directly into KVCacheBlock objects at allocation time. After inference completes, agentic-api asynchronously queries vLLM via an exposed interface to retrieve and maintain the session-to-block association.

agentic-api sends request with:
  {"session_id": "session-{session_id}", "model": "...", "input": "..."}
          ↓ vLLM internal propagation
  KVCacheBlock.session_ids.add("session-{session_id}")   ← block tagged at allocation
          ↓ after inference completes
  vLLM exposes a query interface for agentic-api:
      session_id → [block_hash_1, block_hash_2, ...]
          ↓
  agentic-api maintains lifecycle based on this mapping

Required vLLM changes:

  1. Add an optional session_id field to inference request parameters.
  2. Add a session_ids attribute to KVCacheBlock, populated from the request at block allocation time.
  3. Expose an interface to query the list of allocated blocks by session_id.

Pros: The mapping is maintained natively inside vLLM, eliminating the need for agentic-api to reproduce the hash algorithm. More stable than Path 1; provides more complete block lifecycle management than Path 2.

Prerequisites: Requires moderate upstream modifications to vLLM. Similar proposals already exist in the community (e.g., RFC #37003, PR #38514), which can serve as a foundation — though merge uncertainty remains.

Note (Potential Complexity): In prefix caching, blocks with identical token prefixes are shared across multiple sessions. Therefore, the session_id field in KVCacheBlock must be a collection type (e.g., set[session_id]) to track all sessions holding a reference to that block. Actual block release should only occur when all associated sessions have released their references. The agentic-api side must also handle reference counting logic accordingly when asynchronously maintaining the mapping. The data structures and logic involved are non-trivial.
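
The reference-counting behavior described in this note can be sketched on the agentic-api side as follows. The class `BlockRefTracker` and its method names are hypothetical and only illustrate the set-of-sessions idea; they do not correspond to any vLLM data structure.

```python
class BlockRefTracker:
    """Track which sessions hold a reference to each shared block (sketch)."""

    def __init__(self) -> None:
        self._holders: dict[str, set[str]] = {}

    def attach(self, session_id: str, block_hashes: list[str]) -> None:
        for block_hash in block_hashes:
            self._holders.setdefault(block_hash, set()).add(session_id)

    def ref_count(self, block_hash: str) -> int:
        return len(self._holders.get(block_hash, ()))

    def release_session(self, session_id: str) -> list[str]:
        """Drop the session's references; return blocks no session holds anymore."""
        freed: list[str] = []
        for block_hash in list(self._holders):
            holders = self._holders[block_hash]
            holders.discard(session_id)
            if not holders:
                del self._holders[block_hash]
                freed.append(block_hash)
        return freed


if __name__ == "__main__":
    tracker = BlockRefTracker()
    tracker.attach("sess_1", ["system_prompt", "s1_turn"])
    tracker.attach("sess_2", ["system_prompt", "s2_turn"])
    # Only the block private to sess_1 becomes releasable; the shared one stays.
    print(tracker.release_session("sess_1"))
    print(tracker.ref_count("system_prompt"))
```

Only the hashes returned by `release_session` would be candidates for an actual release request, since every other block is still referenced by at least one live session.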

Comparison of the Three Paths:

| | Path 1 | Path 2 | Path 3 |
|---|--------|--------|--------|
| Approach | Content hash reconstruction | vLLM returns block metadata in response | Native session_id passthrough in vLLM |
| vLLM Changes | None | Minor (response field extension) | Moderate (request params + KVCacheBlock + query interface) |
| Mapping Accuracy | High (deterministic hash) | High (provided directly by vLLM) | Highest (natively embedded in block) |
| Coupling to vLLM version | Yes (hash algorithm) | No | No |
| Additional Overhead | /tokenize RTT | None | None |

5.2 Session Cache Request Flow

The flow below describes the current implementation using Path 1 (content hash reconstruction).

  1. A request arrives at agentic-api carrying previous_response_id.
  2. Session Cache Manager queries ResponsesStore, resolves the session chain, and retrieves the full message history.
  3. Establish block mapping (Path 1):
    • Call vLLM's /tokenize endpoint to obtain token_ids (read from local cache if available).
    • Split by vLLM block_size and reproduce the rolling hash to derive [block_hash_1, ..., block_hash_N].
    • Query the Core Orchestration Layer to determine the current tier of each block.
  4. Decide based on the block's tier:
    • HBM hit: Attach retention_directives (with retention_scope = session_id) to raise eviction priority, then proceed to inference.
    • DRAM/NVMe hit: Load blocks into the region accessible by vLLM KVConnector, then proceed to inference.
    • Cache miss: Normal prefill.
  5. After inference completes, write the list of block_hash values for this response into ResponsesStore, and update last_hit_at, ref_count, and eviction_priority.
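
The tier decision in step 4 can be sketched as below. The tier labels `"hbm"`, `"dram"`, and `"nvme"`, the `Action` enum, and the function `plan_request` are assumptions made for this example. The sketch also assumes that only a contiguous leading run of known blocks is reusable, since each block hash depends on its predecessors.

```python
from enum import Enum


class Action(Enum):
    PIN_AND_INFER = "pin_and_infer"      # all reusable blocks already in HBM
    LOAD_THEN_INFER = "load_then_infer"  # some reusable blocks sit in a lower tier
    PREFILL = "prefill"                  # nothing reusable: normal prefill


def plan_request(
    block_hashes: list[str], tier_of: dict[str, str]
) -> tuple[Action, int]:
    """Return the action and the number of leading blocks that can be reused."""
    hit_tiers: list[str] = []
    for block_hash in block_hashes:
        tier = tier_of.get(block_hash)
        if tier is None:  # first miss ends the reusable prefix
            break
        hit_tiers.append(tier)
    if not hit_tiers:
        return Action.PREFILL, 0
    if all(tier == "hbm" for tier in hit_tiers):
        return Action.PIN_AND_INFER, len(hit_tiers)
    return Action.LOAD_THEN_INFER, len(hit_tiers)


if __name__ == "__main__":
    tiers = {"h1": "hbm", "h2": "dram"}
    print(plan_request(["h1", "h2", "h3"], tiers))
```

Returning the reusable block count alongside the action lets the caller estimate how many tokens still need prefill before it commits to loading blocks from a slower tier.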

5.3 Lifecycle Management

| Event | agentic-api Action | vLLM Action |
|-------|--------------------|-------------|
| Session first created | Register block-session mapping, tier=hot | Prefill; blocks reside in HBM |
| Session active | Refresh TTL / raise eviction priority | Normal prefix cache reuse |
| Tool-call pause | Maintain priority; optionally offload to DRAM | Retain blocks by priority |
| Context compression/pruning | Release stale blocks | Call RFC #37168 /release_kv_cache to proactively release zombie blocks |
| Session ended | Evict blocks | LRU natural cleanup |
| Resource pressure | Migrate blocks to warm/cold tier (LRU + priority) | Offload to Mooncake/LMCache |
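
One possible reading of the "LRU + priority" policy for the resource-pressure case is sketched below. The `BlockState` type and the function names are hypothetical, and the ordering rule is an assumption: expired blocks go first, then blocks with the lowest `eviction_priority` (higher values are assumed to mean "keep longer"), with ties broken by the oldest `last_hit_at`.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    block_hash: str
    eviction_priority: int  # 0-100; assumed: higher means keep longer
    last_hit_at: float      # seconds since epoch
    ttl_seconds: int


def is_expired(block: BlockState, now: float) -> bool:
    return now - block.last_hit_at > block.ttl_seconds


def pick_demotions(blocks: list[BlockState], now: float, count: int) -> list[BlockState]:
    """Choose up to `count` blocks to move from the hot tier to a lower tier."""
    expired = [b for b in blocks if is_expired(b, now)]
    live = sorted(
        (b for b in blocks if not is_expired(b, now)),
        key=lambda b: (b.eviction_priority, b.last_hit_at),
    )
    return (expired + live)[:count]


if __name__ == "__main__":
    now = 1000.0
    blocks = [
        BlockState("active_session", 80, 990.0, 300),
        BlockState("idle_session", 10, 900.0, 300),
        BlockState("ended_session", 90, 100.0, 300),  # TTL elapsed long ago
    ]
    print([b.block_hash for b in pick_demotions(blocks, now, 2)])
```

Refreshing the TTL on a cache hit amounts to updating `last_hit_at`, which moves a block out of the expired group and later in the LRU order within its priority level.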

5.4 API Design

a. Responses API Extension (User-Facing)

POST /v1/responses
X-Session-Cache: enable          # enable session cache
X-Session-Cache-Ttl: 300         # custom TTL in seconds (optional)
X-Session-Cache-Tier: hot        # target storage tier (optional)

{
  "model": "...",
  "input": "...",
  "previous_response_id": "resp_xxx"
}
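
On the server side, the three headers could be parsed along these lines. This is a sketch under assumptions: the `SessionCacheOptions` type, the function name, the default TTL of 300 seconds, and the default tier `hot` are illustrative choices, not decisions made by this RFC.

```python
from dataclasses import dataclass
from typing import Mapping

VALID_TIERS = {"hot", "warm", "cold"}


@dataclass(frozen=True)
class SessionCacheOptions:
    enabled: bool
    ttl_seconds: int = 300  # assumed default
    tier: str = "hot"       # assumed default


def parse_session_cache_headers(headers: Mapping[str, str]) -> SessionCacheOptions:
    """Read the X-Session-Cache* headers; header names are case-insensitive."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    if lowered.get("x-session-cache", "").lower() != "enable":
        return SessionCacheOptions(enabled=False)
    ttl_seconds = int(lowered.get("x-session-cache-ttl", "300"))
    tier = lowered.get("x-session-cache-tier", "hot").lower()
    if ttl_seconds <= 0:
        raise ValueError("X-Session-Cache-Ttl must be a positive integer")
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown X-Session-Cache-Tier: {tier!r}")
    return SessionCacheOptions(enabled=True, ttl_seconds=ttl_seconds, tier=tier)


if __name__ == "__main__":
    print(parse_session_cache_headers({
        "X-Session-Cache": "enable",
        "X-Session-Cache-Ttl": "300",
        "X-Session-Cache-Tier": "hot",
    }))
```

Requests that omit `X-Session-Cache` return early with caching disabled, which keeps the stateless path free of any extra work.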

b. Session Cache Manager Internal Interface

class SessionCacheManager:
    """Internal interface sketch; method bodies are intentionally left as stubs."""
    def register_blocks(self, session_id: str, blocks: list[CacheBlockRecord]) -> None: ...
    def lookup_blocks(self, session_id: str) -> list[CacheBlockRecord]: ...
    def refresh_ttl(self, session_id: str) -> None: ...
    def promote_to_hot(self, session_id: str) -> None: ...   # feeds RFC #37003
    def invalidate_blocks(self, session_id: str, token_range: tuple[int, int]) -> None: ...  # calls RFC #37168
    def evict_session(self, session_id: str) -> None: ...
    def migrate_session(self, session_id: str, target_node: str) -> None: ...  # cross-instance

6. KV Cache Storage Backend Selection

| Backend | Characteristics | Recommended Use Case |
|---------|-----------------|----------------------|
| Mooncake Store | High RDMA bandwidth, officially supported by vLLM, cross-node transfer | Hot tier, disaggregated prefill |
| LMCache | Multi-backend (CPU/disk), vLLM KVConnector support, actively developed open-source | Warm tier, large-memory local nodes |
| Redis / Valkey | Native TTL support, cluster HA, easy to deploy | Metadata + small-scale warm cache |
| NVMe SSD (io_uring) | High IOPS, low-latency local storage | Warm-cold boundary, large contexts |
| S3 / MinIO | Unlimited capacity, low cost | Cold tier, long-term archival |

7. Features and Key Problems Addressed

Core Problems Solved

  1. Premature KV cache eviction during agent pauses: Priority hints protect active session blocks, addressing the "tool-call pause → blocks evicted by LRU → full context re-prefill on resume" problem.
  2. Zombie cache accumulation: The proactive invalidation interface immediately releases stale blocks during context compression, rather than waiting for delayed LRU cleanup.
  3. High TTFT for multi-turn sessions: Historical context is restored directly from cache, eliminating redundant prefill. Expected TTFT reduction: 20–30% (based on RFC #37168 benchmark data).
  4. Sessions cannot migrate across vLLM instances: The block-session mapping in the Core Orchestration Layer makes session routing to any instance possible.

New Capabilities Added

  • Automatic session caching: Users only need to provide previous_response_id; no manual cache management required.
  • Transparent multi-tier storage: The application layer is unaware of tier transitions; Session Cache Manager automatically handles hot/warm/cold migration.
  • Session quotas and cost attribution: Per-session cache usage tracking, supporting per-user and per-tenant billing.
  • Cross-session prefix sharing: Blocks for identical system prompts or tool definitions are shared across sessions (RFC #37168 cache_sharing semantics).
  • Session migration and elastic scaling: When a vLLM instance goes offline, its session blocks are offloaded to the warm tier and reloaded when traffic is routed to a new instance.

8. Performance Expectations

| Scenario | Expected Improvement (reference: RFC #37168) |
|----------|----------------------------------------------|
| Session resume after tool-call pause (hot cache hit) | ~60% reduction in TTFT |
| High-concurrency multi-agent (20+ concurrent) | ~20% reduction in TTFT |
| Cache hit rate after context compression | ~25% improvement |
| Prompt throughput | ~15% improvement |

9. Security and Compatibility

  • Access control: Only the session owner may issue /release_kv_cache-equivalent requests.
  • cache_sharing option: Users can control whether blocks are allowed to be shared across sessions (disabled by default in tenant-isolation scenarios).
  • Backward compatibility: When session-related parameters are omitted, the system falls back to standard LRU behavior with zero additional overhead.

10. Open Questions

  1. [Critical] Architecture design: How should agentic-api store the metadata for KV$ blocks, manage and operate the KV$ storage, and interact with vLLM?
  2. [Critical] Hash algorithm drift risk in Path 1: vLLM's prefix cache rolling hash algorithm is not a stable public API and may change in future releases. Should agentic-api introduce a version-detection mechanism, or should hash algorithm synchronization be incorporated into a vLLM version compatibility test matrix?
  3. [Critical] Performance impact of the /tokenize RTT in Path 1: Path 1 requires an additional /tokenize call per request. Is this RTT acceptable under high-concurrency workloads? Should token_ids be cached at the response level or the session level?
  4. vLLM upstream PR strategy (Path 2): Once Path 1 is stable, the team should push Path 2 to the vLLM community (adding a kv_cache_blocks field to the response usage object). What is the right timing and priority for this PR?
  5. How should KV block format serialization be standardized between Mooncake Store (hot tier) and vLLM's internal format?
  6. What is the right metadata store for Session Cache Manager — SQLite for single-node deployments, or PostgreSQL for production?
  7. What triggers session migration across multiple vLLM instances?

Alternatives considered

No response

Additional context

11. Implementation Milestones (TBD)

| Milestone | Scope |
|-----------|-------|
| M1 | Path 1: Call /tokenize + reproduce block hash algorithm to establish precise block-session mapping; implement hash algorithm version-compatibility mechanism |
| M2 | Warm tier management using LMCache / Mooncake Store |
| M3 | Cold tier management + archival to object storage (S3) |
| M4 | Cross-instance session migration logic for KV cache blocks |
| M5 | Path 2 (alternative): Submit upstream PR to vLLM to add kv_cache_blocks to the response usage field |
| M6 | Path 3 (alternative): Submit upstream PR to vLLM to add session_id passthrough in request parameters and KVCacheBlock; refactor agentic-api to maintain the mapping via asynchronous queries, eliminating hash algorithm coupling entirely |

12. Key Supplementary Notes

  1. Engineering path for session-block mapping: Path 1 (content hash reconstruction) is the primary near-term implementation, requiring no upstream vLLM changes. Path 2 (vLLM response metadata) and Path 3 (native session_id passthrough in vLLM) are treated as future evolution options to be pursued after Path 1 is stable.
  2. Coordination with ADR-02 storage design: The block-session mapping can be implemented as a sub-table within the Multi-table structure rather than as an independent component, reducing architectural fragmentation.
  3. Explicit non-goals: agentic-api will not re-implement the tokenizer (Path 1 calls vLLM's /tokenize endpoint instead); it will not replace vLLM's internal LRU mechanism (it augments and guides it). This is consistent with the architectural principles in ADR-01.
