
[Feat] Mooncake & Mooncake|Posix #945

Open

UESTC-AHao wants to merge 1 commit into ModelEngine-Group:develop from UESTC-AHao:dev_fh_mooncake

Conversation

@UESTC-AHao
Contributor

Purpose

UCM + Mooncake & Mooncake|Posix (Ascend 0.18.0)

Modifications

unified-cache-management/examples/ucm_mooncake_config.yaml

Test

master_server_address: "127.0.0.1:50088"
metadata_server: "P2PHANDSHAKE"
protocol: "ascend"
global_segment_size: 32212254720
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explain the meaning of these parameters?

Contributor Author


I will add comments to provide brief explanations for these parameters.
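For example, an annotated version of the example config could look like this (the descriptions are my reading of these Mooncake parameters, so please double-check them against the Mooncake docs):

```yaml
# Address of the Mooncake master service that coordinates the store.
master_server_address: "127.0.0.1:50088"
# "P2PHANDSHAKE" lets instances exchange metadata peer-to-peer
# instead of requiring a separate metadata service (e.g. etcd).
metadata_server: "P2PHANDSHAKE"
# Transfer protocol; "ascend" targets NPU device-memory transfers.
protocol: "ascend"
# Size in bytes of the memory segment this instance contributes to
# the global store (32212254720 bytes = 30 GiB).
global_segment_size: 32212254720
```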

return self.store_.Check(task.task_id)

def register_memory(self, base_addr: int, total_size: int) -> None:
self.store_.RegisterMemory(base_addr, total_size)
Contributor


Should the return value of RegisterMemory be validated?

Comment thread: ucm/store/ucmstore_v1.h
* @param total_size Total size of the memory region in bytes.
* @return Status::OK on success, error code on failure.
*/
virtual Status RegisterMemory(void* base_addr, size_t total_size) = 0;
Contributor

@sumingZero May 6, 2026


If RegisterMemory is a Mooncake-specific operation, should it be handled internally within MooncakeStore rather than being added as a public interface in the StoreV1 base class? The current design forces all other store implementations to provide empty stub methods, violating the Interface Segregation Principle.
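One alternative is an optional capability interface that only memory-registering backends implement; callers probe for it instead of every backend stubbing it. A sketch (class and method names are illustrative, shown in Python for brevity rather than mirroring the C++ StoreV1 exactly):

```python
from abc import ABC, abstractmethod

class StoreV1(ABC):
    """Base interface: only operations every backend supports."""
    @abstractmethod
    def check(self, task_id: int) -> int: ...

class MemoryRegistrar(ABC):
    """Optional capability for backends needing pre-registered memory."""
    @abstractmethod
    def register_memory(self, base_addr: int, total_size: int) -> None: ...

class MooncakeStore(StoreV1, MemoryRegistrar):
    def check(self, task_id: int) -> int:
        return 0
    def register_memory(self, base_addr: int, total_size: int) -> None:
        pass  # would call into the Mooncake binding here

class PosixStore(StoreV1):
    """No empty register_memory stub needed."""
    def check(self, task_id: int) -> int:
        return 0

def maybe_register(store: StoreV1, addr: int, size: int) -> bool:
    # Probe for the capability instead of assuming it exists.
    if isinstance(store, MemoryRegistrar):
        store.register_memory(addr, size)
        return True
    return False
```

In C++ the same shape would be a separate pure-virtual interface plus a `dynamic_cast` (or an explicit capability query) at the call site.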

metadata_server,
protocol,
global_segment_size,
replica_num,
Contributor


create_mooncake_store() is called with an extra argument, worker_num, but its function signature does not define worker_num.

Contributor Author


Yes, I'll fix it.
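For reference, the fix could simply declare the parameter so the signature matches the call site. The exact signature, defaults, and return value below are hypothetical:

```python
def create_mooncake_store(
    master_server_address: str,
    metadata_server: str,
    protocol: str,
    global_segment_size: int,
    replica_num: int,
    worker_num: int = 1,  # newly declared so the call site matches
) -> dict:
    # Minimal sketch: validate and bundle the parameters.
    if worker_num < 1:
        raise ValueError("worker_num must be >= 1")
    return {
        "master_server_address": master_server_address,
        "metadata_server": metadata_server,
        "protocol": protocol,
        "global_segment_size": global_segment_size,
        "replica_num": replica_num,
        "worker_num": worker_num,
    }
```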

@@ -0,0 +1,50 @@
find_path(ASCEND_ACL_INCLUDE_DIR NAMES acl/acl.h PATHS /usr/local/Ascend/ascend-toolkit/latest/include NO_DEFAULT_PATH)
find_library(ASCEND_ACL_LIBRARY NAMES ascendcl PATHS /usr/local/Ascend/ascend-toolkit/latest/lib64 NO_DEFAULT_PATH)
Contributor


The path /vllm-workspace/Mooncake/mooncake-store/lib is hardcoded here. This won't work in our CI environment where Mooncake is installed at a different location. Could we use an environment variable like MOONCAKE_STORE_ROOT instead?

Contributor Author


The current intention is to make it directly usable in the Ascend environment: the Mooncake library path in the vLLM-Ascend image is fixed at /vllm-workspace/Mooncake/mooncake-store/lib. However, for future scalability and support for the CUDA ecosystem, switching to environment variables is indeed more reasonable.
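A possible CMake sketch: prefer an environment variable when set, and fall back to the current image layout (the variable name follows the reviewer's suggestion; the library name here is illustrative):

```cmake
# Prefer an explicitly configured root; fall back to the vLLM-Ascend image path.
if(DEFINED ENV{MOONCAKE_STORE_ROOT})
    set(MOONCAKE_STORE_ROOT $ENV{MOONCAKE_STORE_ROOT})
else()
    set(MOONCAKE_STORE_ROOT /vllm-workspace/Mooncake/mooncake-store)
endif()
find_library(MOONCAKE_STORE_LIBRARY
             NAMES mooncake_store
             PATHS ${MOONCAKE_STORE_ROOT}/lib
             NO_DEFAULT_PATH)
```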

file(GLOB_RECURSE SOURCES "./cc/*.cc")
add_library(mooncakestore SHARED ${SOURCES})

target_compile_features(mooncakestore PUBLIC cxx_std_20)
Contributor


Same issue here with the include path. It would be nice if MOONCAKE_STORE_INCLUDE_DIR could be passed as a CMake option.

Contributor Author


Understood. I'll make this configurable.

for (auto sz : shard.sizes) { totalSize += sz; }
if (totalSize == 0) { continue; }

void* buf = bufPool.AcquireWithTimeout(std::chrono::milliseconds(30000));
Contributor


The 30-second timeout here is hardcoded. In some scenarios we might want a longer timeout for large transfers. Could this be configurable via the Config struct?
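Sketched in Python with a hypothetical config field (the field name and the pool's `acquire` signature are illustrative; the real change would add the field to the C++ Config struct and thread it through to AcquireWithTimeout):

```python
from dataclasses import dataclass

@dataclass
class TransferConfig:
    # Buffer-acquire timeout in milliseconds, defaulting to the
    # currently hardcoded 30 s so existing behavior is unchanged.
    acquire_timeout_ms: int = 30_000

def acquire_buffer(pool, config: TransferConfig):
    # The wait time now comes from config instead of a literal.
    return pool.acquire(timeout_ms=config.acquire_timeout_ms)
```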


std::vector<int> RpcBatchIsExist(const std::vector<std::string>& keys)
{
if (keys.empty()) { return std::vector<int>(keys.size(), -1); }
Contributor


When keys.empty(), returning std::vector<int>(keys.size(), -1) yields an empty vector, which is correct. But the logic is a bit odd: why check keys.empty() and then reference keys.size()? It could be simplified to return {};


if (shard.addrs.size() > config_.tensorSizeList.size()) {
UC_WARN(
"BuildShards: key={} has {} addrs but tensorSizeList has only {}, truncating",
Contributor


This warning is logged but the operation continues with truncated data. Should this be an error instead? Truncating tensor addresses seems like it could cause correctness issues.
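For example, failing fast instead of truncating (a Python sketch mirroring the C++ names; a mismatched shard likely indicates a corrupted request rather than a recoverable state):

```python
def build_shards(addrs: list, tensor_size_list: list) -> list:
    # Raise instead of silently truncating when the address list is
    # longer than the configured tensor sizes.
    if len(addrs) > len(tensor_size_list):
        raise ValueError(
            f"BuildShards: {len(addrs)} addrs but tensorSizeList has "
            f"only {len(tensor_size_list)} entries"
        )
    # Pair each address with its tensor size.
    return list(zip(addrs, tensor_size_list))
```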

num_blocks = kv_layer.shape[0]
total_size = kv_layer.numel() * kv_layer.element_size()
self.store.register_memory(kv_layer.data_ptr(), total_size)
else:
Contributor


The error message doesn't include layer_name. When this happens in production, it's hard to know which layer caused the issue. Could you include the layer name in the message?
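A sketch of what that could look like (the helper name is illustrative, and the stub types in the test stand in for torch tensors; the point is that every failure path carries layer_name):

```python
def register_kv_layer(store, layer_name: str, kv_layer) -> None:
    # Include layer_name in every failure path so production logs
    # identify the offending layer.
    if hasattr(kv_layer, "data_ptr"):
        total_size = kv_layer.numel() * kv_layer.element_size()
        store.register_memory(kv_layer.data_ptr(), total_size)
    elif isinstance(kv_layer, tuple):
        for tensor in kv_layer:
            total_size = tensor.numel() * tensor.element_size()
            store.register_memory(tensor.data_ptr(), total_size)
    else:
        raise TypeError(
            f"register_kv_layer: unsupported kv_layer type "
            f"{type(kv_layer).__name__} for layer '{layer_name}'"
        )
```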

elif isinstance(kv_layer, Tuple):
for tensor in kv_layer:
total_size = tensor.numel() * tensor.element_size()
self.store.register_memory(tensor.data_ptr(), total_size)
Contributor


Similarly, the type error doesn't mention which layer; adding layer_name here would help debugging too.

config: Dict[str, object], pipeline: ucmpipelinestore.PipelineStore
):
store_dir = Path(__file__).resolve().parent.parent
posix_config = copy.deepcopy(config)
Contributor


Why is tensor_size only set when device_id >= 0? If Posix is used as a fallback backend in non-NPU scenarios, shouldn't tensor_size still be needed?



4 participants