feat(vllm): DfkvStoreConnector — direct vLLM KV connector (GPUDirect RDMA, bypass LMCache) [WIP]#46
Update: scatter-gather key coalescing (perf) + honest performance characterization
Functional correctness is fully solved (separate from the WIP note below): all 5 DeepSeek-V4-Flash
Scatter-gather coalescing (new, additive): added
Performance reality (single test node, hd04-gpu1-0065):
Where it pays off: a production multi-node ring distributes keys across N servers/disks (≈N× read bandwidth → load becomes RDMA-bandwidth-bound, sub-second → prefill-skip dominates), or faster local storage / a buffered-I/O (page-cache) server mode for hot prefixes. Those weren't demonstrable on the single isolated test node.
Final: performance optimization journey + resolved findings (single-node hd04-gpu1-0065, DeepSeek-V4-Flash, dp4·tp2)
Correctness (proven)
Optimizations (load latency, 12k prompt = 834MB)
TTFT decomposition (warm hit, after JIT warm)
Scope note
H100 FP8 MLA prefill is very fast (12k=2.7 s, 48k=4.7 s); KV load grows linearly. So dfkv wins for short/medium prompts and is load-bound for long ones on a single disk; long-context needs the distributed ring (per-key fan-out across many disks/NICs). Co-locating N servers on one node/NIC backfires (QP/connection overhead) and is not representative of a real ring.
Correction to the earlier perf comment:
| DFKV_RDMA_DEPTH | GET GB/s | GET p50 |
|---|---|---|
| 1 | 1.22 / 1.24 / 1.25 | ~0.41 ms |
| 32 | 1.22 / 1.25 / 1.24 | ~0.41 ms |
Indistinguishable. GET (and PUT) are depth-flat — the per-connection serve loop is in-order, so pipelining N requests on one connection just queues them. This matches docs/datapath-perf-notes.md (which already documented PUT-flat; now extended to GET). Depth is a latency-hider, not a throughput knob. README + perf-notes updated accordingly.
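To make the "latency-hider, not a throughput knob" point concrete, here is a toy model of one connection whose server answers strictly in order. It is a sketch under stated assumptions, not the dfkv serve loop: `last_reply_time`, the fixed per-request service time, and the round-trip values are all hypothetical and chosen only for illustration.

```python
def last_reply_time(num_requests: int, depth: int, service_s: float, rtt_s: float) -> float:
    """Toy model: one connection, server handles requests strictly in order."""
    server_free = 0.0   # earliest time the serve loop can start the next request
    reply_at = []       # time at which each reply reaches the client
    for i in range(num_requests):
        # The client keeps at most `depth` requests in flight on this connection.
        send = reply_at[i - depth] if i >= depth else 0.0
        arrive = send + rtt_s / 2
        start = max(arrive, server_free)       # queued behind earlier requests
        server_free = start + service_s
        reply_at.append(server_free + rtt_s / 2)
    return reply_at[-1]


if __name__ == "__main__":
    # Hypothetical numbers: 0.4 ms of service per request, two network round-trip times.
    for rtt in (0.0, 0.002):
        t1 = last_reply_time(1000, 1, 0.0004, rtt)
        t32 = last_reply_time(1000, 32, 0.0004, rtt)
        print(f"rtt={rtt * 1e3:.1f} ms  depth=1: {t1:.3f} s  depth=32: {t32:.3f} s")
```

In this model, when the round trip is negligible next to the service time, depth 1 and depth 32 finish at the same time because extra in-flight requests only wait in the queue. Depth shortens the total only when the round trip is a visible share of each request.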
What actually optimizes the connector load (this PR):
- SG key coalescing (batch_put_sg/batch_get_auto_sg): one key per chunk instead of one per layer-segment → 25392 → 1242 keys (20×) → far fewer per-key RDMA round-trips. This is the real, robust connector-side load lever.
- Throughput levers in general (per perf-notes): multi-connection fan-out (batch_concurrency, default 8) and fewer/larger keys.
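The coalescing in the list above can be sketched as a grouping step: instead of one key per (chunk, layer) segment, all of a chunk's segments travel under one key. This is a minimal illustration, not the connector's code; `Segment`, the key format, and the addresses are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    chunk_id: int  # token chunk the bytes belong to
    layer: int     # layer that produced them
    addr: int      # buffer address (illustrative value only)
    length: int    # segment size in bytes


def per_segment_keys(prefix, segments):
    """Pre-SG layout: one key per (chunk, layer) segment."""
    return [f"{prefix}/c{s.chunk_id}/l{s.layer}" for s in segments]


def coalesce_by_chunk(prefix, segments):
    """SG layout: one key per chunk, carrying that chunk's (addr, length) list."""
    grouped = defaultdict(list)
    for s in sorted(segments, key=lambda s: (s.chunk_id, s.layer)):
        grouped[f"{prefix}/c{s.chunk_id}"].append((s.addr, s.length))
    return dict(grouped)


# Hypothetical geometry: 3 chunks x 4 layers.
segments = [Segment(c, l, 0x1000 * (c * 4 + l), 4096) for c in range(3) for l in range(4)]
flat = per_segment_keys("demo", segments)
sg = coalesce_by_chunk("demo", segments)
print(len(flat), "keys before,", len(sg), "keys after")
```

Each value in `sg` is the gather list for one put (or the scatter list for one get), which is why the key count drops by roughly the number of segments per chunk.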
Side observation: single-connection raw GET is ~1.24 GB/s, while the connector's per-rank load was ~0.2–0.5 GB/s — i.e. there is connector-side headroom (SG scatter / GPUDirect / random-key disk reads), a separate track from depth.
Net: keep DFKV_RDMA_DEPTH at its default unless you're explicitly network-latency-bound; the load win is SG + fan-out, not depth.
…gated io_uring async GET
Two additive dfkv-core features that back the new vLLM KV connector; the existing batch_put/batch_get_auto and the sync serve loop are untouched (CI green: TSan + clang/g++ + Soft-RoCE RDMA loopback).
- Scatter-gather API: dfkv_batch_put_sg / dfkv_batch_get_auto_sg gather/scatter N non-contiguous buffers under one key via a single RDMA multi-SGE work request (SGE0=header + <=29 payload segs on max_sge=30 HCAs; server stays opaque, one blob per key). QP max_send/recv_sge raised 2 -> min(30, device cap); the legacy 2-SGE path is unchanged, so it's additive. Lets the connector coalesce a chunk's per-layer segments into one key (25392 -> 1242 keys, ~20x fewer per-key disk reads).
- io_uring async GET (DFKV_SERVER_URING, default off; built under -DDFKV_WITH_URING): the serve loop submits a WaitComp batch's reads concurrently and drains them in request order, preserving the zero-copy in-order-reply invariant; falls back to the verbatim sync loop on ring-init failure. Default-off = production byte-identical.
- Tests: sg_test (N=1/2/29 + >29 guard), rdma_loopback SG/uring roundtrips, c_api guard.
- docs/datapath-perf-notes: DFKV_RDMA_DEPTH is throughput-flat for GET too (benchmarked, GET 1.24 GB/s at depth 1 == 32); it is a latency-hider, not a throughput knob.
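The "submit concurrently, drain in request order" shape of the async GET path can be illustrated without io_uring. The sketch below uses a thread pool as a stand-in, so it shows only the ordering invariant, not the server's actual mechanism. `batch_read_in_order` is a hypothetical helper, and `os.pread` is Unix-only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def batch_read_in_order(fd, requests, workers=8):
    """Issue every (offset, length) read at once, then collect replies in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reads may complete in any order on the pool's threads.
        futures = [pool.submit(os.pread, fd, length, offset) for offset, length in requests]
        # Walking the list (rather than as_completed) keeps reply order == request order.
        return [f.result() for f in futures]


with tempfile.TemporaryFile() as tmp:
    tmp.write(bytes(range(256)))
    tmp.flush()
    replies = batch_read_in_order(tmp.fileno(), [(5, 4), (0, 4), (10, 4)])
```

The replies come back in the order the requests were listed even if the second read finishes first, which is the property an in-order-reply protocol needs.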
…Direct RDMA, bypass LMCache)
A KVConnectorBase_V1 connector under integration/vllm that stores/loads KV cache directly to/from a dfkv cluster over GPUDirect RDMA, occupying the same --kv-transfer-config slot as MooncakeStoreConnector. Pure-Python ctypes over libdfkv.so; bypasses LMCache (which does not work for DeepSeek-V4 on this stack). Validated on hd04 H100 + IB, DeepSeek-V4-Flash (dp4/tp2): all 5 kv_cache_groups offload correctly (per-group MR registration handles the aliased MLA storage), full cross-restart AND cross-DP hit, output byte-identical to cold, vLLM skips the prefill. Uses the scatter-gather core API to coalesce per-chunk keys. Design/plan docs under docs/superpowers/; perf characterization (single-disk storage-bound load; one-time per-DP Triton JIT on the first warm request) is in the PR thread.
7180224 to aca4b66
…t hang (P1 from /review)
Fresh-eyes /review caught two real fail-soft defeats in the connector transfer threads:
- SAVE except handler logged undefined flat_keys (leftover from pre-SG code) -> NameError on the exact dfkv-transport-exception path it exists to absorb. Use sg_keys.
- recv thread did tp_rank % len(key_list) with no empty guard -> ZeroDivisionError before set_finished_request when every chunk is masked out (SWA edge) -> req never reported done -> vLLM hangs in WAITING_FOR_REMOTE_KVS. Add the empty-keylist early finish.
Also documents a known cross-HCA limitation: kSgMaxPayloadSegs is a fixed 29, not the live max_sge-1; on max_sge<30 devices the SG batch fail-softs (recompute, no corruption) rather than failing only the offending key. TODO for heterogeneous-hardware support.
io_uring P2s (EINTR CQE reap, partial-submit, short-read retry) are behind the default-off DFKV_SERVER_URING flag; tracked as follow-ups.
… hardening)
The recv thread completed the request (set_finished_request + task_done) as trailing statements outside any finally, unlike the send thread. Any exception before that point (prepare_value, _group_segments_sg, load_mask, ...) skipped completion -> the req was never reported done -> vLLM hung in WAITING_FOR_REMOTE_KVS. Now the whole body is in try/except/finally: any unexpected failure marks the request's blocks for recompute and the finally always completes the request. Subsumes the empty-keylist guard's symptom.
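The hardening described above follows a common worker-thread pattern: whatever happens while loading, the request is always reported done. Below is a minimal sketch of that pattern. The callbacks `load_fn`, `mark_for_recompute`, and `set_finished_request` are hypothetical stand-ins passed in as parameters, not the connector's real signatures.

```python
import logging
import queue

log = logging.getLogger("recv_sketch")


def recv_worker(tasks, load_fn, mark_for_recompute, set_finished_request):
    """Drain `tasks`; every request is completed even if loading raises."""
    while True:
        req = tasks.get()
        if req is None:            # sentinel: stop the worker
            tasks.task_done()
            break
        try:
            load_fn(req)
        except Exception:
            # Fail soft: log, then ask the engine to recompute this request's blocks.
            log.exception("load failed for %r; falling back to recompute", req)
            mark_for_recompute(req)
        finally:
            # Runs on success and on failure, so the request can never be left pending.
            set_finished_request(req)
            tasks.task_done()


finished, recompute = [], []


def flaky_load(req):
    if req == "bad":
        raise ZeroDivisionError("empty key list")


q = queue.Queue()
for item in ("ok", "bad", None):
    q.put(item)
recv_worker(q, flaky_load, recompute.append, finished.append)
```

Here the failing request ends up in both `recompute` and `finished`, so a scheduler waiting on completion would not stall on it.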
…bustness, empty-key skip, true out_lens
Fix 1 [P1] rdma_transport.cc CacheFromMulti/RangeIntoMulti: an oversized SG key (total>max_payload / header>control_cap / segs>max_sge-1) no longer poisons its whole node batch. Replace the up-front std::fill(kInvalid)+return with a per-item bad[] mask; offenders are marked kInvalid and SKIPPED in the window (no MR, no recv/send WR, no completion consumed, so need=2*posted), siblings proceed. Connector treats per-key kInvalid as a fail-soft save/load miss.
Fix 2 [P2] uring_reader.h BatchRead: (a) EINTR from submit_and_wait re-waits instead of resubmitting (no double-queue/user_data desync); (b) short submit is fully flushed via a submit loop so no stale SQEs linger; (c) short O_DIRECT reads re-prep the residual [off+done,len-done) and reap again, bounded by kMaxReaps, mirroring the sync PreadRangeDirectTo loop. Per-desc done_ tracks progress.
Fix 3 [P2] kv_client.cc BatchPutSg/BatchGetAutoSg: skip items with empty key (out_ok/out_hit stay 0) so no junk empty-key blob is written and no wasted GET is issued; this matches the C-ABI null guard intent.
Fix 4 [P3] transport.h default RangeIntoMulti: out_lens[i] now reports the TRUE received payload length (min(header.payload_len, sum_caps)) instead of sum-of-caps, matching the RDMA override's received_bytes-header semantics.
Tests: +Sg.EmptyKeySkipped (TCP) and +RdmaLoopback.ScatterGatherOversizedFailsOnlyOffender (real CacheFromMulti/RangeIntoMulti path). Full ctest green with DFKV_WITH_URING ON (187/187, DFKV_SERVER_URING=1) and OFF (187/187).
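The per-item mask from Fix 1 can be pictured as a planning pass over the batch: offenders are flagged and skipped, and only the remaining items are posted. The sketch below is written under assumptions and is not a port of rdma_transport.cc. `SgItem`, `plan_batch`, and the status constants are hypothetical, and the header>control_cap check is left out for brevity.

```python
from dataclasses import dataclass

K_INVALID = -1  # hypothetical stand-in for the transport's kInvalid status
K_PENDING = 0   # hypothetical "posted, awaiting completion" status


@dataclass
class SgItem:
    key: str
    seg_lens: list  # byte length of each payload segment


def plan_batch(items, max_payload, max_payload_segs):
    """Return (status per item, indices to post, completions to wait for)."""
    status, posted = [], []
    for i, item in enumerate(items):
        oversized = (len(item.seg_lens) > max_payload_segs
                     or sum(item.seg_lens) > max_payload)
        if oversized:
            status.append(K_INVALID)  # only the offender fails
        else:
            status.append(K_PENDING)
            posted.append(i)          # siblings still go out
    # One send and one receive completion per posted item; skipped items consume none.
    return status, posted, 2 * len(posted)


batch = [
    SgItem("a", [4096] * 4),
    SgItem("b", [4096] * 30),   # one segment too many for a 29-segment cap
    SgItem("c", [4096] * 29),
]
status, posted, need = plan_batch(batch, max_payload=1 << 20, max_payload_segs=29)
```

With this shape, a caller can treat each `K_INVALID` entry as a miss for that key alone while the rest of the batch proceeds.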
…pth claim
- §5: correct the DFKV_RDMA_DEPTH note: depth is throughput-flat (2026-06 benchmark), not a write-bandwidth booster; the lever is batch_concurrency + fewer/larger keys.
- §9 (new): which post-dingodb#46 features do NOT apply to the HiCache/MLA path and why (SG = nothing to coalesce for one-object-per-page; io_uring = flat on single disk; depth = flat).
Plus: vLLM and HiCache instances of the SAME model can share the dfkv cluster/ring but do NOT reuse each other's KV (different key schemes + KV layouts); share nodes/capacity, isolate keyspace via distinct model_hash/name.
…ence (#47)
* docs(vllm): document the DfkvStoreConnector: README + full deploy guide + config reference
  The merged vLLM connector (PR #46) had only a terse integration/vllm/README and no deploy doc. Add complete docs reflecting the shipped code:
  - README.md: add the vLLM connector to Engine integrations + Layout + bump the test count (53 -> 88 ctest entries, add the RDMA datapath CI job).
  - integration/vllm/README.md: complete the env-var table (incl. the critical PYTHONHASHSEED=0 for cross-process/restart key determinism), the full kv_connector_extra_config keys with defaults (load_async, enable_cross_layers_blocks, lookup_rpc_port), a geometry guard for shared pools, and the SG + JIT notes.
  - docs/vllm/DEPLOY.md (new): end-to-end deploy (build -> dfkv cluster -> connector -> vLLM -> verify) with a full config reference and per-scenario recommended settings (single/multi-DP/shared-pool/long-context), geometry guard, measured results, and a troubleshooting table. Mirrors docs/lmcache/DEPLOY.md.
* docs: broaden README tagline beyond SGLang: dfkv now serves SGLang HiCache, LMCache, and vLLM
  The 'distributed KV cache for SGLang HiCache' title undersold the repo now that it backs three engines. Lead with LLM inference + list the three adapters. Also fix the now-contradictory 'without ... MDS ... dependency' line (dfkv ships its own dfkv_mds).
* docs(hicache): add engine/feature-boundary section + correct stale depth claim
  - §5: correct the DFKV_RDMA_DEPTH note: depth is throughput-flat (2026-06 benchmark), not a write-bandwidth booster; the lever is batch_concurrency + fewer/larger keys.
  - §9 (new): which post-#46 features do NOT apply to the HiCache/MLA path and why (SG = nothing to coalesce for one-object-per-page; io_uring = flat on single disk; depth = flat).
  Plus: vLLM and HiCache instances of the SAME model can share the dfkv cluster/ring but do NOT reuse each other's KV (different key schemes + KV layouts); share nodes/capacity, isolate keyspace via distinct model_hash/name.
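The PYTHONHASHSEED=0 requirement mentioned in the docs commit above is easiest to see with Python's own string hashing, which is randomized per process unless the variable is pinned. The sketch below demonstrates only that general effect on the built-in `hash()`. It is not vLLM's block-hash code, and `hash_in_fresh_process` is a hypothetical helper.

```python
import os
import subprocess
import sys
from typing import Optional


def hash_in_fresh_process(text: str, seed: Optional[str]) -> int:
    """Return hash(text) as computed by a brand-new interpreter process."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)   # start from a clean slate
    if seed is not None:
        env["PYTHONHASHSEED"] = seed  # pin the hash seed for the child process
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)


# With the seed pinned to 0, every fresh process computes the same value.
pinned = {hash_in_fresh_process("block-0", "0") for _ in range(3)}
print("distinct hashes with PYTHONHASHSEED=0:", len(pinned))
```

Calling the helper with `seed=None` several times will usually give different values per process, which is the situation in which keys derived from such hashes would not match across restarts.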
New direct vLLM integration + scatter-gather datapath since v1.5.2:
- vLLM DfkvStoreConnector (KVConnectorBase_V1, GPUDirect RDMA, bypass LMCache) (#46)
- Scatter-gather batch API (batch_put_sg/batch_get_auto_sg, QP max_sge 2->30): one multi-SGE RDMA per chunk, ~20x fewer keys/disk-reads (#46)
- io_uring async GET serve loop (opt-in DFKV_SERVER_URING, default off) (#46)
- 7 fresh-eyes review fixes (per-item SG failure, recv-thread hardening, empty-key skip, io_uring EINTR/short-read, true out_lens) + 2 regression tests (#46)
- Docs: vLLM deploy guide + config reference, README multi-engine, HiCache boundary (#47)
No wire change (kProtoVersion still 1); v1.5.x compatible. CI green incl. TSan + RDMA datapath.
DfkvStoreConnector — direct vLLM KV connector for dfkv (GPUDirect RDMA, bypass LMCache)
A vLLM KVConnectorBase_V1 connector (vLLM 0.23.0) that stores/loads KV cache directly to/from a dfkv cluster over GPUDirect RDMA, occupying the same --kv-transfer-config slot as MooncakeStoreConnector. It bypasses LMCache entirely (LMCache's in-process LMCacheConnectorV1 does not work for DeepSeek-V4 on this stack).
Out-of-tree plugin, code under integration/vllm/ (symmetric to integration/lmcache/); pure Python (ctypes over libdfkv.so), no native build. Design/plan: docs/superpowers/specs|plans/2026-06-18-dfkv-vllm-store-connector*.
Architecture
connector.py (role dispatch) + scheduler.py (hit decision) + worker.py (register / async transfer threads / lookup server). All dfkv contact is isolated in dfkv_client.py (DfkvDeviceClient, raw GPU device pointers). register_kv_caches registers the paged KV region once via dfkv_register_memory (an ibv_reg_mr that, under nvidia-peermem, yields a GPUDirect MR). Transfers go RDMA directly to/from GPU memory, no host bounce. No layerwise transfer: load/save is issued per-request in get_finished().
Validated on real hardware (hd04 H100 + IB, DeepSeek-V4-Flash, DP4·TP2)
dfkv client transport=rdma; 0 crash.
failed=0).
Bug chain fixed during bring-up (each its own commit)
- dfkv_batch_put per-key convention: dfkv returns out[i]=1 for OK (opposite of single dfkv_put); normalized in dfkv_client.
- @seg{n}; lookup probes @seg0.
- PYTHONHASHSEED=0 is required so vLLM's block-hash seed is deterministic across processes; otherwise dfkv keys never match across a restart/instance and the L3 cache is useless beyond one process lifetime.
- load_mask delegates to store_mask so the consumer reloads every group the producer stored.
- kv_cache_group base-addr partition in register_kv_caches: the Mooncake template handed one flat addr list to every group's token DB. That is correct for single-group models, but for multi-group models (DeepSeek-V4) it scattered loaded KV into the wrong group's blocks. Now each group's token DB addresses only its own layers.
- store_mask (SlidingWindow groups keep only in-window tail chunks).
Known limitation (WIP, draft)
DeepSeek-V4-Flash exposes 5 kv_cache_groups (1 full-MLA + 4 SlidingWindow / partial-state, block sizes 256/64/64/4/8). Full prefix-reuse across all SWA groups under chunked prefill is not yet complete: a fresh cross-restart lookup currently sees only a subset present (present≈138/1058 → hit_length=0), because SAVE (per-step store_mask at chunked token-lengths) and LOOKUP (store_mask at full token-length) don't yet agree on the windowed groups' chunk set at every window boundary. The mechanism and the single-/simple-group path are proven (15× speedup); robust coverage for V4-Flash's full 5-group SWA structure is the remaining work and likely needs a unified per-(group,chunk) enumeration shared by save/load/lookup rather than the Mooncake per-spec masks.
Testing
Isolated node (hd04-gpu1-0065, fault pool), on-node ctr/nerdctl, never touches production. dfkv_server on-host (--rdma-dev ib7s400p0, member uses rdma-port); vLLM container mounts the package + libdfkv.so; DFKV_RDMA=1 DFKV_RDMA_DEV=ib7s400p0, kv_connector_extra_config={members, model_hash, lib}.
Not for merge yet; opening for visibility/review of the approach while the 5-group SWA coverage is finished.
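For readers reproducing the test setup described above, the snippet below assembles a --kv-transfer-config value and the matching environment variables. It is a hedged sketch. The keys members, model_hash and lib, and the DFKV_RDMA / DFKV_RDMA_DEV / PYTHONHASHSEED variables come from this PR. Every value shown (addresses, paths, hash string, the list shape of members) is a hypothetical placeholder. kv_role="kv_both" is assumed here, and how the out-of-tree connector gets registered with vLLM is not covered.

```python
import json

# All values below are placeholders; only the key names are taken from the PR text.
kv_transfer_config = {
    "kv_connector": "DfkvStoreConnector",
    "kv_role": "kv_both",                         # assumption: same instance saves and loads
    "kv_connector_extra_config": {
        "members": ["10.0.0.1:7000"],             # hypothetical dfkv member address
        "model_hash": "deepseek-v4-flash-demo",   # hypothetical keyspace label
        "lib": "/opt/dfkv/lib/libdfkv.so",        # hypothetical path to libdfkv.so
    },
}

env = {
    "DFKV_RDMA": "1",
    "DFKV_RDMA_DEV": "ib7s400p0",
    "PYTHONHASHSEED": "0",   # keeps block hashes stable across restarts
}

cli_arg = json.dumps(kv_transfer_config)
print("--kv-transfer-config", cli_arg)
```

Building the JSON with `json.dumps` rather than by hand avoids quoting mistakes when the string is passed on a command line.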