RFC: Session-aware KV cache management

### Problem statement / motivation

# RFC: Session-Aware KV Cache Management in agentic-api

## Abstract

Session caching is a foundational feature of the Responses API. It optimizes multi-turn dialogue inference by automatically caching a session's KV cache, which cuts redundant computation, lowers API costs, and shortens response latency for context-aware LLM interactions.

This RFC proposes introducing a **Session Cache Manager** into the Core Orchestration Layer of agentic-api. It maintains mappings between KV cache blocks and sessions, providing unified lifecycle management (tiering, migration, eviction, and invalidation) of KV cache on the agentic-api side. agentic-api acts as the authoritative KV cache scheduler, interacting with the vLLM server either through the KVConnector interface or through a KV store connected to vLLM. This design also lays the groundwork for advanced optimizations such as the priority-based eviction and proactive cache invalidation proposed in vLLM RFC #37003 and RFC #37168.

---

## 1. Motivation

### 1.1 Current Pain Points

- **Agentic workloads break prefix caching**: In a typical agent turn, over 90% of tokens repeat the previous turn's prefix. However, during tool-call pauses (which account for 40–60% of wall time), KV cache blocks have no active references and are evicted by LRU. When the session resumes, the entire context (70K–200K tokens) must be re-prefilled from scratch.
- **vLLM's LRU policy is session-agnostic**: LRU only considers recency; it has no awareness of which sessions are still active and which have ended, causing zombie blocks to occupy GPU HBM indefinitely.
- **No home for stateful session management**: vLLM core remains stateless, and the Responses API layer lacks a system-level mapping between sessions and KV cache blocks.

### 1.2 Why Implement This in agentic-api

- agentic-api is the natural owner of stateful Responses API logic, holding semantic information such as `session_id`, `previous_response_id`, and session liveness.
- This aligns with the architectural separation of concerns: agentic-api owns cache policy, while vLLM owns cache mechanism.
- It enables session routing and migration across vLLM instances — something a stateless vLLM cannot achieve on its own.

---

## 2. Background

- The OpenAI Responses API supports multi-turn session chaining via `previous_response_id`.
- Alibaba's Session Cache (`x-dashscope-session-cache`) has validated the feasibility of server-side automatic context caching (5-minute TTL, reset on cache hit).
- **vLLM RFC #37003**: Proposes a `RetentionDirective` API that allows an orchestrator to set per-range block eviction priority (0–100) and TTL, with multi-tenant isolation via `retention_scope` (session_id).
- **vLLM RFC #37168**: Proposes a proactive invalidation endpoint `POST /release_kv_cache`, session-aware reference counting, and a dual-zone scheduling strategy (Aging Zone / Fresh Zone), achieving a 26% reduction in TTFT under high-concurrency workloads.
- **vLLM KVConnector ecosystem**: MooncakeStore, LMCache, and similar backends provide multi-tier, cross-node KV block transfer capabilities.

---

## 3. Design Goals

| Goal | Description |
|------|-------------|
| **G1** | agentic-api serves as the authoritative KV cache lifecycle scheduler, supporting session-based cache management |
| **G2** | Support multi-tier storage (GPU HBM → DRAM → NVMe → object storage) |
| **G3** | Support session migration across vLLM instances (horizontal scaling) |
| **G4** | Zero overhead for non-agentic stateless workloads |
| **G5** | Provide the foundation for implementing vLLM RFC #37003 and #37168 |

---


### Proposed solution


## 4. Architecture

### 4.1 Approach Comparison

#### Approach A: Shared KV Cache Store with vLLM Server

```
agentic-api (ResponsesStore for metadata)
    ↕ KV$ block metadata sync
Mooncake Store / LMCache (shared)
    ↕ KVConnector
vLLM Server (GPU HBM)
```

**Pros**:
- Zero-copy: vLLM and agentic-api access the same physical blocks with no data movement.
- Lowest latency: On a cache hit, vLLM reads blocks directly with no cross-tier loading.
- Natural alignment with RFC #37003: agentic-api knows which sessions are active and can issue priority directives directly.

**Cons**:
- Tight coupling to the vLLM runtime; agentic-api upgrades or restarts must be coordinated with vLLM.
- MooncakeStore is currently optimized for disaggregated prefill; session-level persistence requires customization.
- Complex session routing in multi-node vLLM deployments.

#### Approach B: Independent External KV Cache Store

```
agentic-api (ResponsesStore for metadata + External KV$ Store)
    ↓ on-demand load when a session request arrives
vLLM KVConnector (P2P load to KV$ store that can be operated by vLLM)
    ↓
vLLM Server (GPU HBM)
```

Multi-tier storage options:

| Tier | Backend | Use Case |
|------|---------|----------|
| L1-Hot | Mooncake Store (RDMA) / LMCache | Fast cross-node migration / CPU offloading |
| L2-Warm | Redis / Valkey (cluster mode) | For less-active sessions |
| L3-Cold | NVMe/SSD (io_uring) | Long-idle sessions |
| L4-Archive | S3 / MinIO object storage | Long-term persistence |

**Pros**:
- Fully consistent with the agentic-api architecture philosophy: the Core Orchestration Layer owns all state; vLLM executes statelessly.
- Enables cross-node session migration: sessions can be routed to any vLLM instance.
- Finer-grained lifecycle control: explicit eviction, tiered TTL, quota management.

**Cons**:
- Cold-start load latency (milliseconds to hundreds of milliseconds for L2/L3 tiers).
- Requires serialization/deserialization of the KV block format across different storage backends.

### 4.2 Recommended Approach: Hybrid Multi-Tier Architecture

Rather than choosing one approach, the two should be composed in layers (with the hot tier directly accessible by vLLM):

```
┌─────────────────────────────────────────────────┐
│         agentic-api (Session Cache Manager)     │
│  ┌───────────────────┐   ┌────────────────────┐ │
│  │Meta Data Storage  │   │ Orchestrator       │ │
│  │(block2session map)│   │                    │ │
│  └───────────────────┘   └────────────────────┘ │
└────────────────────┬────────────────────────────┘
     Eviction/Mgmt   │  On-demand Load / Offload
┌────────────────────▼────────────────────────────┐
│         Unified KV Cache Storage (Tiered)       │
│  Hot:  Mooncake Store / LMCache (DRAM/RDMA)     │
│  Warm: Redis / SSD-backed LMCache               │
│  Cold: Object Storage (S3/MinIO)                │
└────────────────────┬────────────────────────────┘
                     │ KVConnector
┌────────────────────▼────────────────────────────┐
│             vLLM Server (GPU HBM)               │
└─────────────────────────────────────────────────┘
```

### 4.3 Data Model: Block-Session Mapping

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal


@dataclass
class CacheBlockRecord:
    block_hash: str           # vLLM prefix hash (see Section 5.1)
    session_id: str           # Responses API session
    token_start: int
    token_end: int
    tier: Literal["hot", "warm", "cold"]
    storage_location: str     # Mooncake/Redis/S3 URI
    created_at: datetime
    last_hit_at: datetime
    ttl_seconds: int
    ref_count: int            # for cross-session sharing
    eviction_priority: int    # fed to RFC #37003
```

> **Note**: `block_hash` is a rolling hash computed internally by vLLM based on token content; it is not directly accessible from the agentic-api side. See Section 5.1 for a detailed analysis of this core challenge and the proposed solutions.
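
As an illustration of how these records might be populated, the sketch below derives one record per full block from an ordered list of block hashes. The helper `build_records`, the exclusive `token_end`, the default priority of 50, and the `hbm://` location string are all assumptions made for this example and are not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal, Optional


@dataclass
class CacheBlockRecord:
    block_hash: str
    session_id: str
    token_start: int
    token_end: int
    tier: Literal["hot", "warm", "cold"]
    storage_location: str
    created_at: datetime
    last_hit_at: datetime
    ttl_seconds: int
    ref_count: int
    eviction_priority: int


def build_records(
    session_id: str,
    block_hashes: list[str],
    block_size: int,
    ttl_seconds: int = 300,
    now: Optional[datetime] = None,
) -> list[CacheBlockRecord]:
    """Create one record per full block, in prefix order."""
    now = now or datetime.now(timezone.utc)
    return [
        CacheBlockRecord(
            block_hash=block_hash,
            session_id=session_id,
            token_start=index * block_size,
            token_end=(index + 1) * block_size,  # exclusive end (assumption)
            tier="hot",  # new sessions start in the hot tier
            storage_location=f"hbm://{block_hash}",  # placeholder URI scheme
            created_at=now,
            last_hit_at=now,
            ttl_seconds=ttl_seconds,
            ref_count=1,
            eviction_priority=50,  # hypothetical midpoint of the 0-100 range
        )
        for index, block_hash in enumerate(block_hashes)
    ]
```

With `block_size=16`, three hashes yield records covering token ranges 0-16, 16-32, and 32-48.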

---

## 5. Detailed Design

### 5.1 Core Challenge: Establishing the Session-to-KV-Block Mapping

This is the most critical engineering challenge in this proposal.

**The fundamental tension**:

```
agentic-api has: session_id, previous_response_id, full message history
vLLM has:        token sequences, block_hash (content hash), GPU memory state

No shared identifier exists between the two sides → no direct mapping is possible
```

vLLM's prefix caching is entirely based on **content hashing** (a rolling hash over token sequences). It neither accepts nor stores any session semantics. From the agentic-api side, KV cache blocks are a black box.

**Three viable approaches** (priority order: Path 1 > Path 2 > Path 3):

#### Path 1: Content Hash Reconstruction (Near-Term Feasibility — No vLLM Changes Required)

Since agentic-api holds the full message history, it can **reproduce vLLM's block hash computation** locally, independently establishing the session-to-block mapping:

```
session messages
    ↓ call vLLM /tokenize API (avoid re-implementing the tokenizer)
token_ids
    ↓ split by vLLM block_size (read from vLLM config)
    ↓ reproduce vLLM rolling hash algorithm
[block_hash_1, block_hash_2, ..., block_hash_N]
    ↓
session_id → [block_hash] mapping written to ResponsesStore
```

vLLM's prefix cache hashes are deterministic (identical token sequences always yield identical block hashes). As long as agentic-api retains the relevant data (e.g., `cache_salt`) and reproduces the same algorithm, it can stay in sync with vLLM's internal state.
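
The general shape of this reconstruction can be sketched as follows. This is not vLLM's actual hash function: the SHA-256 chaining, the payload encoding, and the `salt` handling are stand-ins chosen for illustration, and a real Path 1 implementation would have to mirror the exact algorithm of the deployed vLLM version. The sketch also assumes that only full blocks are hashed.

```python
import hashlib


def chained_block_hashes(
    token_ids: list[int], block_size: int, salt: bytes = b""
) -> list[str]:
    """Hash each full block together with the hash of the block before it.

    Because every hash depends on its parent, a block hash identifies the
    whole token prefix up to that block, not just the block's own tokens.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    hashes: list[str] = []
    parent = salt
    # A trailing partial block is skipped (assumption: only full blocks count).
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes
```

Two sessions that share the same first N full blocks of tokens (and the same salt) produce the same first N hashes, which is what allows agentic-api to predict which blocks a follow-up turn can reuse.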

**Pros**:
- Requires no changes to vLLM; implementable today.
- The mapping is fully managed by agentic-api, independent of any vLLM API changes.
- Feasibility can be validated independently, establishing a baseline for future migration to other paths.

**Cons and Risks**:
- Tightly coupled to vLLM's internal hash algorithm; must be updated whenever vLLM changes its hashing implementation.
- Each request requires an additional `/tokenize` call (one extra RTT), mitigated by caching `token_ids`.
- Requires reading `block_size` from vLLM config; cross-version compatibility must be maintained.
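
The `token_ids` caching mentioned above could look roughly like the sketch below. `TokenIdCache` and its `tokenize_fn` parameter are hypothetical names; `tokenize_fn` stands in for whatever client code performs the `/tokenize` call, which is deliberately not shown here.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Callable


class TokenIdCache:
    """Small LRU cache keyed by a digest of the message history."""

    def __init__(
        self,
        tokenize_fn: Callable[[list[dict]], list[int]],
        max_entries: int = 1024,
    ) -> None:
        self._tokenize_fn = tokenize_fn
        self._max_entries = max_entries
        self._entries: OrderedDict[str, list[int]] = OrderedDict()

    def get(self, messages: list[dict]) -> list[int]:
        key = hashlib.sha256(
            json.dumps(messages, sort_keys=True).encode()
        ).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        token_ids = self._tokenize_fn(messages)  # the extra RTT happens here
        self._entries[key] = token_ids
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop least recently used
        return token_ids
```

Keying on the full message history means a repeated or retried request pays the round trip only once; whether to key at the response level or the session level remains an open question (see Section 10).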

#### Path 2: vLLM Returns Block Metadata in Responses (Alternative Approach)

Submit a small upstream PR to vLLM to expose the newly created and reused blocks for each request in the `usage` field of inference responses, providing accurate mapping data directly from vLLM:

```json
{
  "usage": {
    "prompt_tokens": 4096,
    "cached_tokens": 3840,
    "kv_cache_blocks": {
      "created": ["hash_A", "hash_B"],
      "reused":  ["hash_C", "hash_D", "hash_E"]
    }
  }
}
```

agentic-api then builds the mapping directly from the response:
```
response_id → [hash_A, hash_B, hash_C, ...]
     ↓  (ResponsesStore already has response_id → session_id)
session_id  → [hash_A, hash_B, hash_C, ...]
```
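
A minimal sketch of that bookkeeping, assuming the proposed `kv_cache_blocks` field exists in the response. `SessionBlockIndex` and its methods are illustrative names, and the plain dict stands in for the `response_id → session_id` lookup that ResponsesStore already provides.

```python
from collections import defaultdict


class SessionBlockIndex:
    """Accumulates block hashes per session from response usage metadata."""

    def __init__(self, response_to_session: dict[str, str]) -> None:
        self._response_to_session = response_to_session
        self._blocks: dict[str, list[str]] = defaultdict(list)

    def record(self, response_id: str, response: dict) -> list[str]:
        kv = response.get("usage", {}).get("kv_cache_blocks")
        if kv is None:
            return []  # vLLM build without the proposed field
        session_id = self._response_to_session[response_id]
        hashes = kv.get("created", []) + kv.get("reused", [])
        known = self._blocks[session_id]
        for block_hash in hashes:
            if block_hash not in known:  # reused blocks are often already known
                known.append(block_hash)
        return hashes

    def blocks_for(self, session_id: str) -> list[str]:
        return list(self._blocks.get(session_id, []))
```

Responses from a vLLM version that lacks the field are simply skipped, so the same code path can run before and after the upstream change lands.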

**Pros**: The mapping is provided directly by vLLM, accurate and reliable, with no hash algorithm coupling risk. The vLLM change is minimal (block metadata already exists internally; it only needs to be exposed). This can be submitted as a standalone PR to vLLM, independent of RFC #37003/#37168.

**Prerequisites**: Requires acceptance of the upstream PR by the vLLM community. Once Path 1 validates feasibility, this can be pursued as a more elegant long-term alternative.

#### Path 3: Native `session_id` Passthrough in vLLM (Supplementary Approach)

Add a `session_id` field to vLLM's request parameters and propagate it internally so that vLLM writes `session_id` directly into `KVCacheBlock` objects at allocation time. After inference completes, agentic-api asynchronously queries vLLM via an exposed interface to retrieve and maintain the session-to-block association.

```
agentic-api sends request with:
  {"session_id": "session-{session_id}", "model": "...", "input": "..."}
          ↓ vLLM internal propagation
  KVCacheBlock.session_ids.add("session-{session_id}")   ← block tagged at allocation
          ↓ after inference completes
  vLLM exposes a query interface for agentic-api:
      session_id → [block_hash_1, block_hash_2, ...]
          ↓
  agentic-api maintains lifecycle based on this mapping
```

Required vLLM changes:
1. Add an optional `session_id` field to inference request parameters.
2. Add a `session_ids` attribute to `KVCacheBlock`, populated from the request at block allocation time.
3. Expose an interface to query the list of allocated blocks by `session_id`.

**Pros**: The mapping is maintained natively inside vLLM, eliminating the need for agentic-api to reproduce the hash algorithm. More stable than Path 1; provides more complete block lifecycle management than Path 2.

**Prerequisites**: Requires moderate upstream modifications to vLLM. Similar proposals already exist in the community (e.g., RFC #37003, PR #38514), which can serve as a foundation — though merge uncertainty remains.

> **Note (Potential Complexity)**: In prefix caching, blocks with identical token prefixes are shared across multiple sessions. Therefore, the session tag on `KVCacheBlock` must be a collection (e.g., a `session_ids: set[str]` attribute) that tracks every session holding a reference to the block. A block should be released only after all associated sessions have released their references. The agentic-api side must apply the same reference-counting logic when it asynchronously maintains the mapping. The data structures and logic involved are non-trivial.
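
The release rule in the note can be sketched on the agentic-api side as below. `SharedBlockTable` is a hypothetical name, and the sketch ignores concurrency and persistence, both of which a real implementation would need to handle.

```python
class SharedBlockTable:
    """Tracks which sessions hold each block; frees a block when none remain."""

    def __init__(self) -> None:
        self._holders: dict[str, set[str]] = {}

    def acquire(self, block_hash: str, session_id: str) -> None:
        self._holders.setdefault(block_hash, set()).add(session_id)

    def ref_count(self, block_hash: str) -> int:
        return len(self._holders.get(block_hash, ()))

    def release_session(self, session_id: str) -> list[str]:
        """Drop one session's references; return blocks that are now unheld."""
        freed: list[str] = []
        for block_hash in list(self._holders):
            holders = self._holders[block_hash]
            holders.discard(session_id)
            if not holders:
                del self._holders[block_hash]
                freed.append(block_hash)
        return freed
```

A shared system-prompt block held by two sessions survives the first session's release and is returned as freed only when the second session releases it.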

**Comparison of the Three Paths**:

| | Path 1 | Path 2 | Path 3 |
|---|---|---|---|
| **Approach** | Content hash reconstruction | vLLM returns block metadata in response | Native `session_id` passthrough in vLLM |
| **vLLM Changes** | None | Minor (response field extension) | Moderate (request params + KVCacheBlock + query interface) |
| **Mapping Accuracy** | High (deterministic hash) | High (provided directly by vLLM) | Highest (natively embedded in block) |
| **Coupling to vLLM version** | Yes (hash algorithm) | No | No |
| **Additional Overhead** | `/tokenize` RTT | None | None |

### 5.2 Session Cache Request Flow

> The flow below describes the proposed request handling using **Path 1 (content hash reconstruction)**.

1. A request arrives at agentic-api carrying `previous_response_id`.
2. Session Cache Manager queries ResponsesStore, resolves the session chain, and retrieves the full message history.
3. **Establish block mapping** (Path 1):
   - Call vLLM's `/tokenize` endpoint to obtain `token_ids` (read from local cache if available).
   - Split by vLLM `block_size` and reproduce the rolling hash to derive `[block_hash_1, ..., block_hash_N]`.
   - Query the Core Orchestration Layer to determine the current tier of each block.
4. Decide based on the block's tier:
   - **HBM hit**: Attach `retention_directives` (with `retention_scope = session_id`) to raise eviction priority, then proceed to inference.
   - **DRAM/NVMe hit**: Load blocks into the region accessible by vLLM KVConnector, then proceed to inference.
   - **Cache miss**: Normal prefill.
5. After inference completes, write the list of `block_hash` values for this response into ResponsesStore, and update `last_hit_at`, `ref_count`, and `eviction_priority`.
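
Step 4 can be sketched as a small planning function. The tier labels (`"hbm"`, `"dram"`, `"nvme"`), the `RequestPlan` type, and the rule that the usable prefix stops at the first unknown block are assumptions of this example; the last one follows from prefix caching, where a block is reusable only if every block before it is also available.

```python
from dataclasses import dataclass, field


@dataclass
class RequestPlan:
    retain: list[str] = field(default_factory=list)  # HBM hits: raise priority
    load: list[str] = field(default_factory=list)    # DRAM/NVMe hits: load first
    prefill_from_block: int = 0                      # first block needing prefill


def plan_request(block_hashes: list[str], tier_of: dict[str, str]) -> RequestPlan:
    """Walk blocks in prefix order and stop at the first cache miss."""
    plan = RequestPlan()
    for block_hash in block_hashes:
        tier = tier_of.get(block_hash)
        if tier is None:
            break  # everything from here on is a normal prefill
        if tier == "hbm":
            plan.retain.append(block_hash)
        else:
            plan.load.append(block_hash)
        plan.prefill_from_block += 1
    return plan
```

A block that is cached but sits after a missing block (for example, after context pruning changed an earlier block) is not counted as a hit.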

### 5.3 Lifecycle Management

| Event | agentic-api Action | vLLM Action |
|-------|-------------------|-------------|
| Session first created | Register block-session mapping, tier=hot | Prefill; blocks reside in HBM |
| Session active | Refresh TTL / raise eviction priority | Normal prefix cache reuse |
| Tool-call pause | Maintain priority; optionally offload to DRAM | Retain blocks by priority |
| Context compression/pruning | Release stale blocks | Call RFC #37168 `/release_kv_cache` to proactively release zombie blocks |
| Session ended | Evict blocks | LRU natural cleanup |
| Resource pressure | Migrate blocks to warm/cold tier (LRU + priority) | Offload to Mooncake/LMCache |
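
For the resource-pressure row, one possible "LRU + priority" ordering is sketched below. The policy is an assumption made for illustration: blocks whose TTL has expired are demoted first, then live blocks in ascending priority with least-recent hit as the tie-breaker. `BlockState` and `select_for_demotion` are hypothetical names, and a higher `eviction_priority` is assumed to mean "keep longer".

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    block_hash: str
    eviction_priority: int  # 0-100; higher means keep longer (assumption)
    last_hit_at: float      # seconds on any monotonic clock
    ttl_seconds: int


def select_for_demotion(
    blocks: list[BlockState], now: float, count: int
) -> list[str]:
    """Pick up to `count` blocks to move to a lower tier."""
    expired = [b for b in blocks if now - b.last_hit_at > b.ttl_seconds]
    live = [b for b in blocks if now - b.last_hit_at <= b.ttl_seconds]
    expired.sort(key=lambda b: b.last_hit_at)
    live.sort(key=lambda b: (b.eviction_priority, b.last_hit_at))
    return [b.block_hash for b in (expired + live)[:count]]
```

Under this ordering a paused but high-priority session keeps its blocks in the hot tier until lower-priority or expired blocks have been demoted.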

### 5.4 API Design

**a. Responses API Extension (User-Facing)**

```http
POST /v1/responses
X-Session-Cache: enable          # enable session cache
X-Session-Cache-Ttl: 300         # custom TTL in seconds (optional)
X-Session-Cache-Tier: hot        # target storage tier (optional)

{
  "model": "...",
  "input": "...",
  "previous_response_id": "resp_xxx"
}
```
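
Server-side handling of these headers might look like the sketch below. The defaults (300 seconds, `hot`) mirror the example values above but are assumptions, as are the `SessionCacheOptions` type and the choice to reject invalid values with `ValueError`.

```python
from dataclasses import dataclass
from typing import Mapping

VALID_TIERS = {"hot", "warm", "cold"}


@dataclass(frozen=True)
class SessionCacheOptions:
    enabled: bool = False
    ttl_seconds: int = 300
    tier: str = "hot"


def parse_session_cache_headers(headers: Mapping[str, str]) -> SessionCacheOptions:
    """Translate the X-Session-Cache* headers into validated options."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    if normalized.get("x-session-cache", "").lower() != "enable":
        return SessionCacheOptions()  # stateless path, no session cache work
    ttl_seconds = int(normalized.get("x-session-cache-ttl", "300"))
    if ttl_seconds <= 0:
        raise ValueError("X-Session-Cache-Ttl must be a positive integer")
    tier = normalized.get("x-session-cache-tier", "hot").lower()
    if tier not in VALID_TIERS:
        raise ValueError(f"unsupported X-Session-Cache-Tier: {tier!r}")
    return SessionCacheOptions(enabled=True, ttl_seconds=ttl_seconds, tier=tier)
```

Requests without the `X-Session-Cache` header return the disabled default immediately, which keeps the stateless path free of extra work (goal G4).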

**b. Session Cache Manager Internal Interface**

```python
class SessionCacheManager:
    """Internal interface sketch; `CacheBlockRecord` is defined in Section 4.3."""

    def register_blocks(self, session_id: str, blocks: list[CacheBlockRecord]) -> None: ...
    def lookup_blocks(self, session_id: str) -> list[CacheBlockRecord]: ...
    def refresh_ttl(self, session_id: str) -> None: ...
    def promote_to_hot(self, session_id: str) -> None: ...  # feeds RFC #37003
    def invalidate_blocks(self, session_id: str, token_range: tuple[int, int]) -> None: ...  # calls RFC #37168
    def evict_session(self, session_id: str) -> None: ...
    def migrate_session(self, session_id: str, target_node: str) -> None: ...  # cross-instance
```
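
To make the TTL semantics concrete, here is a minimal in-memory sketch of four of these methods. It stores bare block hashes instead of full `CacheBlockRecord` objects, takes an injectable clock for testing, and expires sessions lazily on lookup; all of these are simplifications assumed for the example, not the proposed implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class _SessionEntry:
    block_hashes: list[str]
    ttl_seconds: int
    last_hit_at: float


class InMemorySessionCacheManager:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sessions: dict[str, _SessionEntry] = {}

    def register_blocks(
        self, session_id: str, block_hashes: list[str], ttl_seconds: int = 300
    ) -> None:
        entry = self._sessions.get(session_id)
        if entry is None:
            entry = _SessionEntry([], ttl_seconds, self._clock())
            self._sessions[session_id] = entry
        for block_hash in block_hashes:
            if block_hash not in entry.block_hashes:
                entry.block_hashes.append(block_hash)
        entry.last_hit_at = self._clock()

    def lookup_blocks(self, session_id: str) -> list[str]:
        entry = self._sessions.get(session_id)
        if entry is None:
            return []
        if self._clock() - entry.last_hit_at > entry.ttl_seconds:
            del self._sessions[session_id]  # lazy expiry on access
            return []
        return list(entry.block_hashes)

    def refresh_ttl(self, session_id: str) -> None:
        entry = self._sessions.get(session_id)
        if entry is not None:
            entry.last_hit_at = self._clock()  # TTL restarts on every hit

    def evict_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```

Calling `refresh_ttl` on each cache hit reproduces the reset-on-hit behavior described for the DashScope session cache in Section 2.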

---

## 6. KV Cache Storage Backend Selection

| Backend | Characteristics | Recommended Use Case |
|---------|-----------------|---------------------|
| **Mooncake Store** | High RDMA bandwidth, officially supported by vLLM, cross-node transfer | Hot tier, disaggregated prefill |
| **LMCache** | Multi-backend (CPU/disk), vLLM KVConnector support, actively developed open-source | Warm tier, large-memory local nodes |
| **Redis / Valkey** | Native TTL support, cluster HA, easy to deploy | Metadata + small-scale warm cache |
| **NVMe SSD (io_uring)** | High IOPS, low-latency local storage | Warm-cold boundary, large contexts |
| **S3 / MinIO** | Unlimited capacity, low cost | Cold tier, long-term archival |

---

## 7. Features and Key Problems Addressed

### Core Problems Solved

1. **Premature KV cache eviction during agent pauses**: Priority hints protect active session blocks, addressing the "tool-call pause → blocks evicted by LRU → full context re-prefill on resume" problem.
2. **Zombie cache accumulation**: The proactive invalidation interface immediately releases stale blocks during context compression, rather than waiting for delayed LRU cleanup.
3. **High TTFT for multi-turn sessions**: Historical context is restored directly from cache, eliminating redundant prefill. Expected TTFT reduction: 20–30% (based on RFC #37168 benchmark data).
4. **Sessions cannot migrate across vLLM instances**: The block-session mapping in the Core Orchestration Layer makes session routing to any instance possible.

### New Capabilities Added

- **Automatic session caching**: Users only need to provide `previous_response_id`; no manual cache management required.
- **Transparent multi-tier storage**: The application layer is unaware of tier transitions; Session Cache Manager automatically handles hot/warm/cold migration.
- **Session quotas and cost attribution**: Per-session cache usage tracking, supporting per-user and per-tenant billing.
- **Cross-session prefix sharing**: Blocks for identical system prompts or tool definitions are shared across sessions (RFC #37168 `cache_sharing` semantics).
- **Session migration and elastic scaling**: When a vLLM instance goes offline, its session blocks are offloaded to the warm tier and reloaded when traffic is routed to a new instance.

---

## 8. Performance Expectations

| Scenario | Expected Improvement (reference: RFC #37168) |
|----------|----------------------------------------------|
| Session resume after tool-call pause (hot cache hit) | ~60% reduction in TTFT |
| High-concurrency multi-agent (20+ concurrent) | ~20% reduction in TTFT |
| Cache hit rate after context compression | ~25% improvement |
| Prompt throughput | ~15% improvement |

---

## 9. Security and Compatibility

- **Access control**: Only the session owner may issue `/release_kv_cache`-equivalent requests.
- **`cache_sharing` option**: Users can control whether blocks are allowed to be shared across sessions (disabled by default in tenant-isolation scenarios).
- **Backward compatibility**: When session-related parameters are omitted, the system falls back to standard LRU behavior with zero additional overhead.

---

## 10. Open Questions

0. **[Critical] Design the architecture**: Define how agentic-api stores the metadata for KV$ blocks, manages and operates the KV$ storage, and interacts with vLLM.
1. **[Critical] Hash algorithm drift risk in Path 1**: vLLM's prefix cache rolling hash algorithm is not a stable public API and may change in future releases. Should agentic-api introduce a version-detection mechanism, or should hash algorithm synchronization be incorporated into a vLLM version compatibility test matrix?
2. **[Critical] Performance impact of the `/tokenize` RTT in Path 1**: Path 1 requires an additional `/tokenize` call per request. Is this RTT acceptable under high-concurrency workloads? Should `token_ids` be cached at the response level or the session level?
3. **vLLM upstream PR strategy (Path 2)**: Once Path 1 is stable, the team should push Path 2 to the vLLM community (adding a `kv_cache_blocks` field to the response `usage` object). What is the right timing and priority for this PR?
4. How should KV block format serialization be standardized between Mooncake Store (hot tier) and vLLM's internal format?
5. What is the right metadata store for Session Cache Manager — SQLite for single-node deployments, or PostgreSQL for production?
6. What triggers session migration across multiple vLLM instances?

---


### Alternatives considered

_No response_

### Additional context


## 11. Implementation Milestones (TBD)

| Milestone | Scope |
|-----------|-------|
| M1 | **Path 1**: Call `/tokenize` + reproduce block hash algorithm to establish precise block-session mapping; implement hash algorithm version-compatibility mechanism |
| M2 | Warm tier management using LMCache / Mooncake Store |
| M3 | Cold tier management + archival to object storage (S3) |
| M4 | Cross-instance session migration logic for KV cache blocks |
| M5 | **Path 2 (alternative)**: Submit upstream PR to vLLM to add `kv_cache_blocks` to the response `usage` field |
| M6 | **Path 3 (alternative)**: Submit upstream PR to vLLM to add `session_id` passthrough in request parameters and `KVCacheBlock`; refactor agentic-api to maintain the mapping via asynchronous queries, eliminating hash algorithm coupling entirely |

---

## 12. Key Supplementary Notes

1. **Engineering path for session-block mapping**: Path 1 (content hash reconstruction) is the primary near-term implementation, requiring no upstream vLLM changes. Path 2 (vLLM response metadata) and Path 3 (native `session_id` passthrough in vLLM) are treated as future evolution options to be pursued after Path 1 is stable.
2. **Coordination with ADR-02 storage design**: The block-session mapping can be implemented as a sub-table within the Multi-table structure rather than as an independent component, reducing architectural fragmentation.
3. **Explicit non-goals**: agentic-api will not re-implement the tokenizer (Path 1 calls vLLM's `/tokenize` endpoint instead); it will not replace vLLM's internal LRU mechanism (it augments and guides it). This is consistent with the architectural principles in ADR-01.

---

## References

- [vllm-project/agentic-api](https://github.com/vllm-project/agentic-api)
- [vLLM RFC #37003: Context-Aware KV-Cache Retention API (Prioritized Evictions)](https://github.com/vllm-project/vllm/issues/37003)
- [vLLM RFC #37168: Active Coordination and Two-Zone Scheduling Mechanism for KV Cache in Long-Running Agents](https://github.com/vllm-project/vllm/issues/37168)
- [Alibaba DashScope Session Cache](https://help.aliyun.com/zh/model-studio/context-cache)
- [Mooncake Store vLLM Connector](https://docs.vllm.ai/en/latest/features/mooncake_connector_usage/)
- [kvcache-ai/Mooncake](https://github.com/kvcache-ai/Mooncake/)

