[Feat] Mooncake & Mooncake|Posix#945
Conversation
5e3d768 to
b6df4c9
Compare
b6df4c9 to
716c6b3
Compare
| master_server_address: "127.0.0.1:50088" | ||
| metadata_server: "P2PHANDSHAKE" | ||
| protocol: "ascend" | ||
| global_segment_size: 32212254720 |
There was a problem hiding this comment.
Should we explain the meaning of these parameters
There was a problem hiding this comment.
I will add comments to provide brief explanations for these parameters.
| return self.store_.Check(task.task_id) | ||
|
|
||
| def register_memory(self, base_addr: int, total_size: int) -> None: | ||
| self.store_.RegisterMemory(base_addr, total_size) |
There was a problem hiding this comment.
Should the return value of RegisterMemory be validated
| * @param total_size Total size of the memory region in bytes. | ||
| * @return Status::OK on success, error code on failure. | ||
| */ | ||
| virtual Status RegisterMemory(void* base_addr, size_t total_size) = 0; |
There was a problem hiding this comment.
If RegisterMemory is Mooncake-specific operation, should it be handled internally within MooncakeStore rather than being added as a public interface in StoreV1 base class? The current design forces all other store implementations to provide empty stub methods, violating Interface Segregation Principle.
| metadata_server, | ||
| protocol, | ||
| global_segment_size, | ||
| replica_num, |
There was a problem hiding this comment.
create_mooncake_store() is called with an extra argument worker_num, but its function signature does not define worker_num
| @@ -0,0 +1,50 @@ | |||
| find_path(ASCEND_ACL_INCLUDE_DIR NAMES acl/acl.h PATHS /usr/local/Ascend/ascend-toolkit/latest/include NO_DEFAULT_PATH) | |||
| find_library(ASCEND_ACL_LIBRARY NAMES ascendcl PATHS /usr/local/Ascend/ascend-toolkit/latest/lib64 NO_DEFAULT_PATH) | |||
There was a problem hiding this comment.
The path /vllm-workspace/Mooncake/mooncake-store/lib is hardcoded here. This won't work in our CI environment where Mooncake is installed at a different location. Could we use an environment variable like MOONCAKE_STORE_ROOT instead?
There was a problem hiding this comment.
The current intention is to make it directly usable in the Ascend environment: because the Mooncake source code path in the Vllm-Ascend image is located at /vllm-workspace/Mooncake/mooncake-store/lib (a fixed path). However, considering future scalability and support for the CUDA ecosystem, switching to environment variables is indeed more reasonable.
| file(GLOB_RECURSE SOURCES "./cc/*.cc") | ||
| add_library(mooncakestore SHARED ${SOURCES}) | ||
|
|
||
| target_compile_features(mooncakestore PUBLIC cxx_std_20) |
There was a problem hiding this comment.
Same issue here with the include path. Would be nice if MOONCAKE_STORE_INCLUDE_DIR could be passed as a cmake option.
There was a problem hiding this comment.
Understood. I will make it more reasonable.
| for (auto sz : shard.sizes) { totalSize += sz; } | ||
| if (totalSize == 0) { continue; } | ||
|
|
||
| void* buf = bufPool.AcquireWithTimeout(std::chrono::milliseconds(30000)); |
There was a problem hiding this comment.
The 30-second timeout here is hardcoded. In some scenarios we might want a longer timeout for large transfers. Could this be configurable via the Config struct?
|
|
||
| std::vector<int> RpcBatchIsExist(const std::vector<std::string>& keys) | ||
| { | ||
| if (keys.empty()) { return std::vector<int>(keys.size(), -1); } |
There was a problem hiding this comment.
When keys.empty(), returning std::vector<int>(keys.size(), -1) returns an empty vector, which is correct. But the logic feels a bit odd - why check keys.empty() and then reference keys.size()? Could simplify to just return {};
|
|
||
| if (shard.addrs.size() > config_.tensorSizeList.size()) { | ||
| UC_WARN( | ||
| "BuildShards: key={} has {} addrs but tensorSizeList has only {}, truncating", |
There was a problem hiding this comment.
This warning is logged but the operation continues with truncated data. Should this be an error instead? Truncating tensor addresses seems like it could cause correctness issues.
| num_blocks = kv_layer.shape[0] | ||
| total_size = kv_layer.numel() * kv_layer.element_size() | ||
| self.store.register_memory(kv_layer.data_ptr(), total_size) | ||
| else: |
There was a problem hiding this comment.
The error message doesn't include layer_name. When this happens in production, it's hard to know which layer caused the issue. Could you include the layer name in the message?
| elif isinstance(kv_layer, Tuple): | ||
| for tensor in kv_layer: | ||
| total_size = tensor.numel() * tensor.element_size() | ||
| self.store.register_memory(tensor.data_ptr(), total_size) |
There was a problem hiding this comment.
Similarly, the type error doesn't mention which layer. Would help debugging to add layer_name here too.
| config: Dict[str, object], pipeline: ucmpipelinestore.PipelineStore | ||
| ): | ||
| store_dir = Path(__file__).resolve().parent.parent | ||
| posix_config = copy.deepcopy(config) |
There was a problem hiding this comment.
Why is tensor_size only set when device_id >= 0? If Posix is used as a fallback backend in non-NPU scenarios, shouldn't tensor_size still be needed?
Purpose
UCM + Mooncake & Mooncake|Posix (Ascend 0.18.0)
Modifications
unified-cache-management\examples\ucm_mooncake_config.yaml
Test