
[Feat] Mooncake & Mooncake|Posix #945

Open

UESTC-AHao wants to merge 1 commit into ModelEngine-Group:develop from UESTC-AHao:dev_fh_mooncake

Conversation

@UESTC-AHao
Contributor

Purpose

UCM + Mooncake & Mooncake|Posix (Ascend 0.18.0)

Modifications

unified-cache-management/examples/ucm_mooncake_config.yaml

Test

master_server_address: "127.0.0.1:50088"
metadata_server: "P2PHANDSHAKE"
protocol: "ascend"
global_segment_size: 32212254720
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explain the meaning of these parameters?

Contributor Author


I will add comments to provide brief explanations for these parameters.
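For example, an annotated version of the example config could look like this (the descriptions are my reading of these Mooncake parameters, so please double-check them against the Mooncake docs):

```yaml
# Address of the Mooncake master service that coordinates the store.
master_server_address: "127.0.0.1:50088"
# "P2PHANDSHAKE" lets instances exchange metadata peer-to-peer
# instead of requiring a separate metadata service (e.g. etcd).
metadata_server: "P2PHANDSHAKE"
# Transfer protocol; "ascend" targets NPU device-memory transfers.
protocol: "ascend"
# Size in bytes of the memory segment this instance contributes to
# the global store (32212254720 bytes = 30 GiB).
global_segment_size: 32212254720
```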

return self.store_.Check(task.task_id)

def register_memory(self, base_addr: int, total_size: int) -> None:
self.store_.RegisterMemory(base_addr, total_size)
Contributor


Should the return value of RegisterMemory be validated?

Comment thread: ucm/store/ucmstore_v1.h
* @param total_size Total size of the memory region in bytes.
* @return Status::OK on success, error code on failure.
*/
virtual Status RegisterMemory(void* base_addr, size_t total_size) = 0;
Contributor

@sumingZero May 6, 2026


If RegisterMemory is a Mooncake-specific operation, should it be handled internally within MooncakeStore rather than being added as a public interface in the StoreV1 base class? The current design forces all other store implementations to provide empty stub methods, violating the Interface Segregation Principle.
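One alternative is an optional capability interface that only memory-registering backends implement; callers probe for it instead of every backend stubbing it. A sketch (class and method names are illustrative, shown in Python for brevity rather than mirroring the C++ StoreV1 exactly):

```python
from abc import ABC, abstractmethod

class StoreV1(ABC):
    """Base interface: only operations every backend supports."""
    @abstractmethod
    def check(self, task_id: int) -> int: ...

class MemoryRegistrar(ABC):
    """Optional capability for backends needing pre-registered memory."""
    @abstractmethod
    def register_memory(self, base_addr: int, total_size: int) -> None: ...

class MooncakeStore(StoreV1, MemoryRegistrar):
    def check(self, task_id: int) -> int:
        return 0
    def register_memory(self, base_addr: int, total_size: int) -> None:
        pass  # would call into the Mooncake binding here

class PosixStore(StoreV1):
    """No empty register_memory stub needed."""
    def check(self, task_id: int) -> int:
        return 0

def maybe_register(store: StoreV1, addr: int, size: int) -> bool:
    # Probe for the capability instead of assuming it exists.
    if isinstance(store, MemoryRegistrar):
        store.register_memory(addr, size)
        return True
    return False
```

In C++ the same shape would be a separate pure-virtual interface plus a `dynamic_cast` (or an explicit capability query) at the call site.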

metadata_server,
protocol,
global_segment_size,
replica_num,
Contributor


create_mooncake_store() is called with an extra argument, worker_num, but its function signature does not define worker_num.

Contributor Author


Yes, I'll fix it.
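For reference, the fix could simply declare the parameter so the signature matches the call site. The exact signature, defaults, and return value below are hypothetical:

```python
def create_mooncake_store(
    master_server_address: str,
    metadata_server: str,
    protocol: str,
    global_segment_size: int,
    replica_num: int,
    worker_num: int = 1,  # newly declared so the call site matches
) -> dict:
    # Minimal sketch: validate and bundle the parameters.
    if worker_num < 1:
        raise ValueError("worker_num must be >= 1")
    return {
        "master_server_address": master_server_address,
        "metadata_server": metadata_server,
        "protocol": protocol,
        "global_segment_size": global_segment_size,
        "replica_num": replica_num,
        "worker_num": worker_num,
    }
```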

@@ -0,0 +1,50 @@
find_path(ASCEND_ACL_INCLUDE_DIR NAMES acl/acl.h PATHS /usr/local/Ascend/ascend-toolkit/latest/include NO_DEFAULT_PATH)
find_library(ASCEND_ACL_LIBRARY NAMES ascendcl PATHS /usr/local/Ascend/ascend-toolkit/latest/lib64 NO_DEFAULT_PATH)
Contributor


The path /vllm-workspace/Mooncake/mooncake-store/lib is hardcoded here. This won't work in our CI environment where Mooncake is installed at a different location. Could we use an environment variable like MOONCAKE_STORE_ROOT instead?

Contributor Author


The current intention is to make it directly usable in the Ascend environment: the Mooncake library path in the vLLM-Ascend image is fixed at /vllm-workspace/Mooncake/mooncake-store/lib. However, for future scalability and support for the CUDA ecosystem, switching to environment variables is indeed more reasonable.
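A possible CMake sketch: prefer an environment variable when set, and fall back to the current image layout (the variable name follows the reviewer's suggestion; the library name here is illustrative):

```cmake
# Prefer an explicitly configured root; fall back to the vLLM-Ascend image path.
if(DEFINED ENV{MOONCAKE_STORE_ROOT})
    set(MOONCAKE_STORE_ROOT $ENV{MOONCAKE_STORE_ROOT})
else()
    set(MOONCAKE_STORE_ROOT /vllm-workspace/Mooncake/mooncake-store)
endif()
find_library(MOONCAKE_STORE_LIBRARY
             NAMES mooncake_store
             PATHS ${MOONCAKE_STORE_ROOT}/lib
             NO_DEFAULT_PATH)
```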

file(GLOB_RECURSE SOURCES "./cc/*.cc")
add_library(mooncakestore SHARED ${SOURCES})

target_compile_features(mooncakestore PUBLIC cxx_std_20)
Contributor


Same issue here with the include path. It would be nice if MOONCAKE_STORE_INCLUDE_DIR could be passed as a CMake option.

Contributor Author


Understood. I'll make this configurable.

for (auto sz : shard.sizes) { totalSize += sz; }
if (totalSize == 0) { continue; }

void* buf = bufPool.AcquireWithTimeout(std::chrono::milliseconds(30000));
Contributor


The 30-second timeout here is hardcoded. In some scenarios we might want a longer timeout for large transfers. Could this be configurable via the Config struct?
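Sketched in Python with a hypothetical config field (the field name and the pool's `acquire` signature are illustrative; the real change would add the field to the C++ Config struct and thread it through to AcquireWithTimeout):

```python
from dataclasses import dataclass

@dataclass
class TransferConfig:
    # Buffer-acquire timeout in milliseconds, defaulting to the
    # currently hardcoded 30 s so existing behavior is unchanged.
    acquire_timeout_ms: int = 30_000

def acquire_buffer(pool, config: TransferConfig):
    # The wait time now comes from config instead of a literal.
    return pool.acquire(timeout_ms=config.acquire_timeout_ms)
```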


std::vector<int> RpcBatchIsExist(const std::vector<std::string>& keys)
{
if (keys.empty()) { return std::vector<int>(keys.size(), -1); }
Contributor


When keys.empty(), returning std::vector<int>(keys.size(), -1) yields an empty vector, which is correct. But the logic is a bit odd: why check keys.empty() and then reference keys.size()? It could be simplified to return {};


if (shard.addrs.size() > config_.tensorSizeList.size()) {
UC_WARN(
"BuildShards: key={} has {} addrs but tensorSizeList has only {}, truncating",
Contributor


This warning is logged but the operation continues with truncated data. Should this be an error instead? Truncating tensor addresses seems like it could cause correctness issues.
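For example, failing fast instead of truncating (a Python sketch mirroring the C++ names; a mismatched shard likely indicates a corrupted request rather than a recoverable state):

```python
def build_shards(addrs: list, tensor_size_list: list) -> list:
    # Raise instead of silently truncating when the address list is
    # longer than the configured tensor sizes.
    if len(addrs) > len(tensor_size_list):
        raise ValueError(
            f"BuildShards: {len(addrs)} addrs but tensorSizeList has "
            f"only {len(tensor_size_list)} entries"
        )
    # Pair each address with its tensor size.
    return list(zip(addrs, tensor_size_list))
```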

num_blocks = kv_layer.shape[0]
total_size = kv_layer.numel() * kv_layer.element_size()
self.store.register_memory(kv_layer.data_ptr(), total_size)
else:
Contributor


The error message doesn't include layer_name. When this happens in production, it's hard to know which layer caused the issue. Could you include the layer name in the message?
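A sketch of what that could look like (the helper name is illustrative, and the stub types in the test stand in for torch tensors; the point is that every failure path carries layer_name):

```python
def register_kv_layer(store, layer_name: str, kv_layer) -> None:
    # Include layer_name in every failure path so production logs
    # identify the offending layer.
    if hasattr(kv_layer, "data_ptr"):
        total_size = kv_layer.numel() * kv_layer.element_size()
        store.register_memory(kv_layer.data_ptr(), total_size)
    elif isinstance(kv_layer, tuple):
        for tensor in kv_layer:
            total_size = tensor.numel() * tensor.element_size()
            store.register_memory(tensor.data_ptr(), total_size)
    else:
        raise TypeError(
            f"register_kv_layer: unsupported kv_layer type "
            f"{type(kv_layer).__name__} for layer '{layer_name}'"
        )
```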

elif isinstance(kv_layer, Tuple):
for tensor in kv_layer:
total_size = tensor.numel() * tensor.element_size()
self.store.register_memory(tensor.data_ptr(), total_size)
Contributor


Similarly, the type error doesn't mention which layer; adding layer_name here would help debugging too.

config: Dict[str, object], pipeline: ucmpipelinestore.PipelineStore
):
store_dir = Path(__file__).resolve().parent.parent
posix_config = copy.deepcopy(config)
Contributor


Why is tensor_size only set when device_id >= 0? If Posix is used as a fallback backend in non-NPU scenarios, shouldn't tensor_size still be needed?



4 participants