[Feat] Add layerwise load-ahead window#952
Conversation
|
|
||
|
|
||
| def _make_metadata(*, include_failing_request: bool = False): | ||
| request_meta = { |
There was a problem hiding this comment.
Missing test coverage: no test for load_ahead > total_layer_count. What behavior is expected when the window exceeds available layers? Consider adding a test case for this edge condition.
| connector.kv_cache_layout = FakeKVCacheLayout(len(connector.layer_ids)) | ||
| connector.tp_rank = 0 | ||
| connector.tp_size = 1 | ||
| connector.is_mla = False |
There was a problem hiding this comment.
Missing test for validation errors: _get_layerwise_load_ahead should raise ValueError for negative or non-integer values, but there's no test verifying this behavior.
| @@ -836,6 +882,8 @@ def wait_for_layer_load(self, layer_name: str) -> None: | |||
| return | |||
There was a problem hiding this comment.
The should_refill_window check adds to _waited_load_layers before popping from load_tasks. If the same layer is revisited (e.g., rollback or MTP paths), this could cause issues. Should the check happen after the pop?
| def _submit_next_load_layers( | ||
| self, metadata: "UCMConnectorMetadata", count: int | ||
| ) -> None: | ||
| submitted_count = 0 |
There was a problem hiding this comment.
The _submit_next_load_layers loop condition self._next_load_layer_index < len(self.layer_ids) stops silently when index exceeds. Should there be a debug log or assertion to clarify expected behavior?
Summary
layerwise_load_aheadfor vLLM layerwise KV loading.1to preserve the existing one-layer lookahead behavior.Validation
pytestinstalled.python -m compileall ucm\integration\vllm\ucm_connector.py ucm\integration\vllm\tests\test_layerwise_load_ahead.py.Notes
layerwise_load_ahead=1preserves the previous behavior. Values like2or4can improve overlap when layer load latency is higher than per-layer inference latency, but should be tuned with Cache Store queue, host buffer, and H2D stream pressure in mind.