[Feat] Adapt Deepseek-v4-Flash on CUDA and Ascend #950
Conversation
| """ | ||
|
|
||
| group_ucm_block_ids: list[list[bytes]] = field(default_factory=list) | ||
| group_vllm_block_ids: list[list[int]] = field(default_factory=list) |
This `_build_layout` method duplicates significant logic from `HMAKVCacheLayout._build_layout`. Consider extracting the common ptr/tensor_size extraction into a shared helper method to reduce duplication; a sketch follows.
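A minimal sketch of the kind of helper being suggested, assuming the layout classes hold their per-layer KV caches as `torch.Tensor`s; the helper name and signature here are hypothetical, not the PR's actual code:

```python
import torch


def _extract_ptrs_and_sizes(
    kv_caches: list[torch.Tensor],
) -> tuple[list[int], list[int]]:
    """Hypothetical shared helper: collect (data_ptr, tensor_size) pairs
    from the per-layer KV cache tensors so both _build_layout variants can
    reuse one loop instead of duplicating it."""
    ptrs: list[int] = []
    sizes: list[int] = []
    for tensor in kv_caches:
        ptrs.append(tensor.data_ptr())
        sizes.append(tensor.numel() * tensor.element_size())
    return ptrs, sizes
```

Each `_build_layout` could then call this helper and keep only its layout-specific arithmetic.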
for (size_t i = 0, offset = 0; i < number; i++) {
    auto pHost = (void*)(((int8_t*)host) + offset);
    auto pDevice = device[i];
    if (!pDevice) { continue; }
The `if (!pDevice)` check is good, but should `pHost` also be checked for null? Mirroring the symmetric guard in `dump_queue.cc` would improve safety.
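A sketch of the symmetric guard being suggested; the loop shape follows the diff above, while the function wrapper, the `sizes` array, and where `offset` advances are assumptions made only to keep the example self-contained:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical standalone version of the loop with both endpoints checked.
void TransferAll(void* host, void** device, const size_t* sizes, size_t number)
{
    for (size_t i = 0, offset = 0; i < number; i++) {
        auto pHost = (void*)(((int8_t*)host) + offset);
        offset += sizes[i];  // assumption: host buffer is packed per entry
        auto pDevice = device[i];
        // pHost is null when `host` is null and offset == 0; checking it
        // mirrors the guard the review attributes to dump_queue.cc.
        if (!pHost || !pDevice) { continue; }
        // ... copy sizes[i] bytes between pHost and pDevice ...
    }
}
```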
load_tok_end = total_hit_tokens
start_blk = load_tok_start // group.block_size
end_blk = load_tok_end // group.block_size
if start_blk >= end_blk:
For SW groups, `load_tok_start = total_hit_tokens - group.sliding_window`. Consider adding a comment explaining why this calculation ensures the sliding-window tail is loaded correctly on resume.
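A hedged sketch of that explanatory comment and the arithmetic behind it; the function wrapper is hypothetical, and the `max(..., 0)` clamp is a defensive assumption not present in the diff:

```python
def sw_load_block_range(
    total_hit_tokens: int, sliding_window: int, block_size: int
) -> tuple[int, int]:
    """Only the last `sliding_window` tokens of the hit prefix are ever
    attended to by a sliding-window group, so on resume it suffices to
    load the window tail instead of the whole prefix."""
    load_tok_start = max(total_hit_tokens - sliding_window, 0)
    load_tok_end = total_hit_tokens
    # Floor division yields the half-open block range [start_blk, end_blk);
    # when the tail fits inside one block (start_blk >= end_blk), there is
    # nothing to load.
    return load_tok_start // block_size, load_tok_end // block_size
```

For example, `sw_load_block_range(1000, 128, 16)` yields the block range `[54, 62)`.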
class AscendDSV4Layout(HMAKVCacheLayout):
    def __init__(
`AscendDSV4Layout` duplicates significant logic from `HMAKVCacheLayout._build_layout`; the ptr/tensor_size extraction loop is nearly identical. Consider extracting it into a shared helper method, as in the sketch below.
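Building on the hypothetical `_extract_ptrs_and_sizes` helper sketched earlier, the subclass could then shrink to its layout-specific pieces. Everything below, including the base-class stub, is illustrative rather than the PR's actual code:

```python
import torch


class HMAKVCacheLayout:
    """Stub of the base class, for illustration only."""

    def __init__(self, kv_caches: list[torch.Tensor]) -> None:
        self.kv_caches = kv_caches

    def _extract_ptrs_and_sizes(
        self, kv_caches: list[torch.Tensor]
    ) -> tuple[list[int], list[int]]:
        # The shared loop from the earlier sketch.
        ptrs = [t.data_ptr() for t in kv_caches]
        sizes = [t.numel() * t.element_size() for t in kv_caches]
        return ptrs, sizes


class AscendDSV4Layout(HMAKVCacheLayout):
    """Sketch: only the DSV4-on-Ascend geometry would live here."""

    def _build_layout(self) -> None:
        ptrs, sizes = self._extract_ptrs_and_sizes(self.kv_caches)
        # ... build the Ascend/DSV4-specific block layout from ptrs/sizes ...
```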
inherited ``ucm_block_ids``.
- ``group_vllm_block_ids[gid]``: per-group VLLM physical block ids; this
  is initialised as an empty list per group here and populated later by
  the dispatch path (still a TODO for HMA dump/load).
This TODO indicates an incomplete implementation for HMA dump/load. Should it be tracked as a separate issue, or addressed before merge?
Purpose
Modifications
Test