
Commit 9d41318

hlin99 and deng451e authored
feat(kv_cache): enable asymmetric store/retrieve storages in PD backend (LMCache#2509)
* feat(kv_cache): enable asymmetric save/remote storage in PD backend

  Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow:

  1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend.
  2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse.
  3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT.

  This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference.

  [ Compute Layer ]
  +----------------------+                 +------------------+
  | Prefill Node         | ===============>| Decode Node      |
  | (Hit-Remote & GenKV) |  (1) PDBackend  | (Hit-PD & GenKV) |
  +-------^--------------+                 +-------+----------+
          |                                        |
          :                                        :
  --------|----------------------------------------|------------
  [ Storage Layer ]
          |                                        |
          | (3) pd_retrieve_locations              | (2) pd_store_location
          |     (Pool -> Prefill)                  |     (Decode -> Pool)
          |                                        v
  +-------+-----------------------------------------------+
  |               Distributed Storage Pool                |
  |    [Node A]     [Node B]     [Node C]     [Node D]    |
  |    <======= (Object Storage / NFS / DFS) =======>     |
  +-------------------------------------------------------+

  Workflow:
  1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn.
  2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence.
  3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues.
* small refactor

* config examples for pd + remote backends

* refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location

  Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components.

* move retrieve & store locations from storage manager to cache engine

* add parameter validation check

* config: replace hardcoded IP with placeholder in decoder remote configs

* resolve conflicts and rebase to the latest

* address review comments

* add description in configurations.rst

---------

Signed-off-by: Tony Lin <tony.lin@intel.com>
Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com>
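The three-step flow above can be sketched with toy in-memory backends. Everything here (the `Backend` class, `store`/`retrieve` helpers) is illustrative, not LMCache's real API; only the backend names mirror the config values used in this commit.

```python
# Toy sketch of the asymmetric KV-cache flow. Backend names mirror the
# config values ("PDBackend", "RemoteBackend", ...); the classes and
# functions are illustrative assumptions, not LMCache's real API.

class Backend:
    def __init__(self):
        self.kv = {}

    def put(self, key, val):
        self.kv[key] = val

    def get(self, key):
        return self.kv.get(key)

backends = {
    "LocalCPUBackend": Backend(),
    "PDBackend": Backend(),
    "RemoteBackend": Backend(),
}

def store(key, val, location=None):
    # store_location=None -> store to all backends; else only the named one.
    targets = backends.values() if location is None else [backends[location]]
    for b in targets:
        b.put(key, val)

def retrieve(key, locations=None):
    # retrieve_locations=None -> search all backends in order.
    names = locations or list(backends)
    for name in names:
        hit = backends[name].get(key)
        if hit is not None:
            return hit, name
    return None, None

# (1) Turn 1: prefill pushes KV to decode via PDBackend (one-way channel).
backends["PDBackend"].put("turn1", "kv-from-prefill")
# Decode consumes it (decoder retrieve_locations=["PDBackend"]) ...
kv, _ = retrieve("turn1", ["PDBackend"])
# (2) ... then writes its updated KV to the pool (store_location="RemoteBackend").
store("turn1+decode", kv + "+decode", "RemoteBackend")
# (3) Turn 2: next-turn prefill finds the history in the remote pool.
hit, src = retrieve("turn1+decode", ["LocalCPUBackend", "RemoteBackend"])
# hit == "kv-from-prefill+decode", src == "RemoteBackend"
```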
1 parent d6661f1 commit 9d41318

8 files changed

Lines changed: 149 additions & 11 deletions

docs/source/api_reference/configurations.rst

Lines changed: 7 additions & 1 deletion
@@ -79,6 +79,12 @@ Basic cache settings that control the core functionality of LMCache.
    * - min_retrieve_tokens
      - LMCACHE_MIN_RETRIEVE_TOKENS
      - Minimum number of hit tokens required to perform retrieve. If hit tokens < this value, skip retrieve but still record the hits to avoid re-storing existing chunks. See :ref:`performance_tuning` for a working example. Default: 0 (disabled)
+   * - store_location
+     - LMCACHE_STORE_LOCATION
+     - A single storage backend name to store KV caches into. When specified, only the matching backend receives store operations. Valid values are the backend class names registered in the storage manager, including ``"LocalCPUBackend"``, ``"LocalDiskBackend"``, ``"RemoteBackend"``, ``"PDBackend"``, ``"P2PBackend"``, ``"GdsBackend"``, etc., and any storage plugin backends. Note: ``"PDBackend"`` cannot be used as a store location for a decoder instance in a PD setup, since PDBackend is one-way from prefiller to decoder only. Default: null (store to all active backends)
+   * - retrieve_locations
+     - LMCACHE_RETRIEVE_LOCATIONS
+     - List of storage backend names to search when retrieving or looking up KV caches. When specified, only the listed backends are searched. Valid values are the backend class names registered in the storage manager, including ``"LocalCPUBackend"``, ``"LocalDiskBackend"``, ``"RemoteBackend"``, ``"PDBackend"``, ``"P2PBackend"``, ``"GdsBackend"``, etc., and any storage plugin backends. Default: null (search all active backends)
    * - extra_config
      - LMCACHE_EXTRA_CONFIG={"key": value, ...}
      - Additional configuration as JSON dict. For NUMA manual mode, include "gpu_to_numa_mapping": {gpu_id: numa_node, ...}. Default: {}
@@ -475,4 +481,4 @@ These configurations are deprecated and may be removed in future versions.
    * - audit_actual_remote_url
      - LMCACHE_AUDIT_ACTUAL_REMOTE_URL
-     - (Deprecated) URL of actual remote LMCache instance for auditing. Use extra_config['audit_actual_remote_url'] instead
+     - (Deprecated) URL of actual remote LMCache instance for auditing. Use extra_config['audit_actual_remote_url'] instead
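The two documented environment variables can be sketched as follows. How LMCache actually parses the list value (JSON vs. comma-separated) is an assumption here; `load_locations` is a hypothetical helper, not part of the library.

```python
import json
import os

# Hedged sketch of reading the two new options from the environment
# variables documented above. JSON-list parsing for
# LMCACHE_RETRIEVE_LOCATIONS is an assumption, not LMCache's
# documented parser.

def load_locations(environ=os.environ):
    # A single backend name, or None (store to all active backends).
    store = environ.get("LMCACHE_STORE_LOCATION")
    # A list of backend names, or None (search all active backends).
    raw = environ.get("LMCACHE_RETRIEVE_LOCATIONS")
    retrieve = json.loads(raw) if raw else None
    return store, retrieve

env = {
    "LMCACHE_STORE_LOCATION": "RemoteBackend",
    "LMCACHE_RETRIEVE_LOCATIONS": '["PDBackend"]',
}
store, retrieve = load_locations(env)
# store == "RemoteBackend", retrieve == ["PDBackend"]
```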
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
local_cpu: True
max_local_cpu_size: 5

remote_url: "lm://localhost:6800"
remote_serde: "cachegen"

retrieve_locations: ["PDBackend"]
store_location: "RemoteBackend"

enable_pd: True
transfer_channel: "nixl"
pd_role: "receiver"
pd_peer_host: "localhost"
pd_peer_init_port: 7300
pd_peer_alloc_port: 7400
pd_buffer_size: 2147483648 # 2GB
pd_buffer_device: "cuda"
nixl_backends: [UCX]

save_decode_cache: true
save_unfull_chunk: true
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
local_cpu: True
max_local_cpu_size: 5

remote_url: "lm://localhost:6800"
remote_serde: "cachegen"

retrieve_locations: ["LocalCPUBackend", "RemoteBackend"]

enable_pd: True
transfer_channel: "nixl"
pd_role: "sender"
pd_proxy_host: "localhost"
pd_proxy_port: 7500
pd_buffer_size: 1073741824 # 1GB
pd_buffer_device: "cuda"
nixl_backends: [UCX]

save_unfull_chunk: true
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
local_cpu: True
max_local_cpu_size: 5

remote_url: "lm://<your remote server IP>:<port>"
remote_serde: "cachegen"

retrieve_locations: ["PDBackend"]
store_location: "RemoteBackend"

enable_pd: True
transfer_channel: "nixl"
pd_role: "receiver"
pd_peer_host: "localhost"
pd_peer_init_port: 7300
pd_peer_alloc_port: 7400
pd_buffer_size: 2147483648 # 2GB
pd_buffer_device: "cuda"
nixl_backends: [UCX]

save_decode_cache: true
save_unfull_chunk: true
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
local_cpu: True
max_local_cpu_size: 5

remote_url: "lm://<your remote server IP>:<port>"
remote_serde: "cachegen"

retrieve_locations: ["PDBackend"]
store_location: "RemoteBackend"

enable_pd: True
transfer_channel: "nixl"
pd_role: "receiver"
pd_peer_host: "localhost"
pd_peer_init_port: 7301
pd_peer_alloc_port: 7401
pd_buffer_size: 2147483648 # 2GB
pd_buffer_device: "cuda"
nixl_backends: [UCX]

save_decode_cache: true
save_unfull_chunk: true
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
local_cpu: True
max_local_cpu_size: 5

remote_url: "lm://localhost:6800"
remote_serde: "cachegen"

retrieve_locations: ["LocalCPUBackend", "RemoteBackend"]

enable_pd: True
transfer_channel: "nixl"
pd_role: "sender"
pd_proxy_host: "localhost"
pd_proxy_port: 7500
pd_buffer_size: 1073741824 # 1GB
pd_buffer_device: "cuda"
nixl_backends: [UCX]

save_unfull_chunk: true
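The receiver-side constraint these decoder configs must satisfy (added to `_validate_config` in the `lmcache/v1/config.py` diff in this commit) can be exercised against a transcription of the decoder example above. The `validate` helper here is a sketch mirroring the asserts, not the real LMCache class.

```python
# Transcription of the decoder example config above (YAML -> dict),
# plus a sketch of the receiver-side validation this commit adds to
# _validate_config. validate() is illustrative, not LMCache's class.

decoder_cfg = {
    "enable_pd": True,
    "pd_role": "receiver",
    "retrieve_locations": ["PDBackend"],
    "store_location": "RemoteBackend",
    "save_decode_cache": True,
}

def validate(cfg):
    if cfg.get("enable_pd") and cfg.get("pd_role") == "receiver":
        # PDBackend is one-way (prefiller -> decoder), so a receiver
        # can retrieve from it but never store into it.
        assert cfg.get("store_location") != "PDBackend", \
            "store_location cannot be PDBackend for receiver"
        assert cfg.get("retrieve_locations") in (None, ["PDBackend"]), \
            'pd receiver must retrieve from ["PDBackend"] (or leave it unset)'
    return True

validate(decoder_cfg)  # passes: store to RemoteBackend, retrieve via PDBackend
```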

lmcache/v1/cache_engine.py

Lines changed: 25 additions & 5 deletions
@@ -180,6 +180,11 @@ def __init__(
         # at decoder.
         self.remove_after_retrieve = config.enable_pd and config.pd_role == "receiver"

+        # asymmetric store/retrieve location can be specified
+        # this is typically used (but not limited) in PD system
+        self.store_location = config.store_location
+        self.retrieve_locations = config.retrieve_locations
+
         self.num_layers = metadata.kv_shape[0]
         self.fmt = None
         if self.use_layerwise:
@@ -532,7 +537,10 @@ def store(
         # TODO: we implicitly rely on batched_put to call ref_count_down
         # this management should be done in a cleaner way
         self.storage_manager.batched_put(
-            keys, memory_objs, transfer_spec=transfer_spec
+            keys,
+            memory_objs,
+            transfer_spec=transfer_spec,
+            location=self.store_location,
         )

         self.stats_monitor.on_store_finished(
@@ -640,7 +648,9 @@ def store_layer(

             keys_multi_layer = key.split_layers(self.num_layers)
             # Only check the first layer
-            if self.storage_manager.contains(keys_multi_layer[0]):
+            if self.storage_manager.contains(
+                keys_multi_layer[0], self.retrieve_locations
+            ):
                 continue

             # Allocate the memory object
@@ -715,7 +725,9 @@ def store_layer(
         for layer_id in range(self.num_layers):
             yield
             next(mem_obj_generator)
-            self.storage_manager.batched_put(keys[layer_id], memory_objs[layer_id])
+            self.storage_manager.batched_put(
+                keys[layer_id], memory_objs[layer_id], location=self.store_location
+            )

         tot_time = time.perf_counter() - t_start
         logger.info(
@@ -848,7 +860,7 @@ def retrieve(
         for key, memory_obj, _, _ in reordered_chunks:
             if self.remove_after_retrieve and not self._is_passive():
                 assert self.storage_manager is not None
-                self.storage_manager.remove(key)
+                self.storage_manager.remove(key, self.retrieve_locations)
             if not self.async_loading:
                 memory_obj.ref_count_down()

@@ -956,7 +968,9 @@ def retrieve_layer(
         keys_multi_layer = key.split_layers(self.num_layers)

         # NOTE: Only check the first layer
-        if current_location := self.storage_manager.contains(keys_multi_layer[0]):
+        if current_location := self.storage_manager.contains(
+            keys_multi_layer[0], self.retrieve_locations
+        ):
             if location is None:
                 location = current_location
             else:
@@ -1082,6 +1096,9 @@ def lookup(
         assert hashes is not None
         lookup_stats = self.stats_monitor.on_lookup_request(sum(offsets))

+        if search_range is None:
+            search_range = self.retrieve_locations
+
         res = 0
         try:
             chunk_info_iterator = self.token_database.process_tokens(
@@ -1243,6 +1260,9 @@ def async_lookup_and_prefetch(
         keys: list[CacheEngineKey] = []
         cum_chunk_lengths = [0]

+        if search_range is None:
+            search_range = self.retrieve_locations
+
         # TODO(Jiayi): make token database able to return list.
         for start, end, key in self.token_database.process_tokens(
             tokens=tokens,
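The cache-engine changes above lean on a storage manager whose `contains` accepts a search range and returns the hit location (used with the walrus operator in `retrieve_layer`). A minimal stand-in, with class and method shapes assumed rather than taken from LMCache:

```python
# Sketch of the storage-manager behavior the cache_engine diff relies
# on: contains() restricted to a search range, returning the (truthy)
# name of the backend that holds the key. Illustrative only.

class StorageManager:
    def __init__(self, backends):
        self.backends = backends  # name -> set of cached keys

    def contains(self, key, search_range=None):
        # search_range=None -> check all backends, mirroring the
        # "Default: null (search all active backends)" semantics.
        names = search_range or list(self.backends)
        for name in names:
            if key in self.backends[name]:
                return name
        return None

sm = StorageManager({"LocalCPUBackend": set(), "RemoteBackend": {"chunk-0"}})

# Mirrors the diff's pattern:
#   if current_location := self.storage_manager.contains(key, self.retrieve_locations):
if current_location := sm.contains("chunk-0", ["LocalCPUBackend", "RemoteBackend"]):
    hit_from = current_location  # "RemoteBackend"

# Restricting retrieve_locations hides backends outside the range.
assert sm.contains("chunk-0", ["LocalCPUBackend"]) is None
```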

lmcache/v1/config.py

Lines changed: 15 additions & 5 deletions
@@ -120,6 +120,8 @@
     },
     "blend_min_tokens": {"type": int, "default": 256, "env_converter": int},
     "blend_special_str": {"type": str, "default": " # # ", "env_converter": str},
+    "retrieve_locations": {"type": Optional[list[str]], "default": None},
+    "store_location": {"type": Optional[str], "default": None},
     # P2P configurations
     "enable_p2p": {
         "type": bool,
@@ -544,11 +546,6 @@ def _validate_config(self):
             assert self.pd_role is not None
             assert self.pd_buffer_size is not None
             assert self.pd_buffer_device is not None
-
-            assert self.remote_url is None, "PD only supports remote_url=None"
-            assert self.save_decode_cache is False, (
-                "PD only supports save_decode_cache=False"
-            )
             assert self.enable_p2p is False, "PD only supports enable_p2p=False"

             # PD requires save_unfull_chunk=True for complete KV cache transfer
@@ -568,6 +565,19 @@ def _validate_config(self):
                 "including partial chunks will be transferred to decode node"
             )

+            # for receiver, PDBackend is for retrieve location
+            # can't take PDBackend as store location
+            # as PDBackend is now one way from producer to receiver only
+            if self.pd_role == "receiver":
+                assert self.store_location != "PDBackend", (
+                    "store_location cannot be PDBackend for receiver"
+                )
+                assert self.retrieve_locations in (None, ["PDBackend"]), (
+                    "for pd receiver, "
+                    'retrieve_locations are expected to be ["PDBackend"], '
+                    f"now, it is {self.retrieve_locations}"
+                )
+
         if enable_nixl_storage:
             assert self.extra_config.get("nixl_backend") is not None
             assert self.extra_config.get("nixl_pool_size") is not None
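The two schema entries added at the top of config.py follow the `{"type": ..., "default": ...}` pattern used by the surrounding options. A minimal stand-in for how such a table can drive attribute defaults (illustrative; the hypothetical `from_dict` helper is not LMCache's real loader):

```python
# Minimal sketch of a schema-driven config loader for the two new
# options. SCHEMA entries copy the diff above; from_dict() is a
# hypothetical helper, not LMCache's actual implementation.

from typing import Optional

SCHEMA = {
    "retrieve_locations": {"type": Optional[list[str]], "default": None},
    "store_location": {"type": Optional[str], "default": None},
}

def from_dict(raw):
    # Fill any unset option from its schema default.
    return {k: raw.get(k, spec["default"]) for k, spec in SCHEMA.items()}

cfg = from_dict({"store_location": "RemoteBackend"})
# cfg == {"retrieve_locations": None, "store_location": "RemoteBackend"}
```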
