feat(vllm): DfkvStoreConnector — direct vLLM KV connector (GPUDirect RDMA, bypass LMCache) [WIP]#46

Merged: ketor merged 6 commits into dingodb:main from ketor:feat/vllm_store_connector on Jun 19, 2026
ketor commented Jun 19, 2026

Contributor

DfkvStoreConnector — direct vLLM KV connector for dfkv (GPUDirect RDMA, bypass LMCache)

A vLLM KVConnectorBase_V1 connector (vLLM 0.23.0) that stores/loads KV cache directly to/from a dfkv cluster over GPUDirect RDMA, occupying the same --kv-transfer-config slot as MooncakeStoreConnector. It bypasses LMCache entirely (LMCache's in-process LMCacheConnectorV1 does not work for DeepSeek-V4 on this stack).

Out-of-tree plugin, code under integration/vllm/ (symmetric to integration/lmcache/); pure Python (ctypes over libdfkv.so), no native build. Design/plan: docs/superpowers/specs|plans/2026-06-18-dfkv-vllm-store-connector*.

Architecture

  • Mirrors Mooncake's store connector: thin connector.py (role dispatch) + scheduler.py (hit decision) + worker.py (register / async transfer threads / lookup server). All dfkv contact is isolated in dfkv_client.py (DfkvDeviceClient, raw GPU device pointers).
  • register_kv_caches registers the paged KV region once via dfkv_register_memory (an ibv_reg_mr that, under nvidia-peermem, yields a GPUDirect MR). Transfers go RDMA directly to/from GPU memory, no host bounce. No layerwise transfer — load/save issued per-request in get_finished().
  • Fail-soft invariant: any dfkv failure only means "fewer hits = more compute"; it never blocks/503s inference.
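
The fail-soft invariant in the last bullet can be sketched as a small wrapper. This is a minimal illustration under stated assumptions, not the connector's code: `fail_soft`, `lookup_hits`, and `FlakyClient` are hypothetical names, and the only behaviour taken from the description above is that any dfkv failure degrades to a miss instead of surfacing to the inference path.

```python
import functools
import logging

logger = logging.getLogger("dfkv_fail_soft_sketch")


def fail_soft(miss_value):
    """Turn any exception from a dfkv call into a cache miss (hypothetical helper)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:  # never propagate into the inference path
                logger.warning("dfkv call %s failed, treating as miss: %s",
                               fn.__name__, exc)
                return miss_value
        return wrapper
    return decorator


class FlakyClient:
    """Stand-in for a dfkv client whose transport is down (illustrative only)."""
    def exists(self, key):
        raise ConnectionError("transport down")


@fail_soft(miss_value=0)
def lookup_hits(client, keys):
    """Count the leading keys present in the store; 0 means recompute everything."""
    hits = 0
    for key in keys:
        if not client.exists(key):
            break
        hits += 1
    return hits
```

With a failing client the lookup reports zero hits, so the engine would simply prefill from scratch.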

Validated on real hardware (hd04 H100 + IB, DeepSeek-V4-Flash, DP4·TP2)

  • GPUDirect RDMA round-trip from GPU memory: byte-correct (P0).
  • Connector loads + serves under vLLM 0.23.0; dfkv client transport=rdma; 0 crashes.
  • Save path writes KV to dfkv (failed=0).
  • Cross-restart load: after a vLLM restart the lookup finds the dfkv-resident KV and the connector loads it back; vLLM skips the prefill — TTFT 18 s → ~1.2 s (≈15×) demonstrated.

Bug chain fixed during bring-up (each its own commit)

  1. dfkv_batch_put per-key convention: dfkv returns out[i]=1 for OK (opposite of single dfkv_put); normalized in dfkv_client.
  2. SAVE/LOAD/LOOKUP key alignment: always suffix @seg{n}; lookup probes @seg0.
  3. PYTHONHASHSEED=0 is required so vLLM's block-hash seed is deterministic across processes — otherwise dfkv keys never match across a restart/instance and the L3 cache is useless beyond one process lifetime.
  4. load_mask delegates to store_mask so the consumer reloads every group the producer stored.
  5. Per-kv_cache_group base-addr partition in register_kv_caches: the Mooncake template handed one flat addr list to every group's token DB — correct for single-group models, but for multi-group models (DeepSeek-V4) it scattered loaded KV into the wrong group's blocks. Now each group's token DB addresses only its own layers.
  6. Lookup candidates gated by store_mask (SlidingWindow groups keep only in-window tail chunks).
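
Item 3 relies on a general Python behaviour: string hashing is randomized per process unless `PYTHONHASHSEED` is fixed. The sketch below only demonstrates that underlying behaviour with two fresh interpreter processes; it does not touch vLLM, and `child_hash` is a hypothetical helper written for this illustration.

```python
import os
import subprocess
import sys


def child_hash(text, seed):
    """Return hash(text) as computed by a fresh Python process.

    seed=None leaves PYTHONHASHSEED unset (randomized str hashing);
    seed="0" disables the randomization.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


# Two independent processes agree once the seed is pinned to 0.
first = child_hash("block-prefix", "0")
second = child_hash("block-prefix", "0")
stable_across_processes = first == second
```

Calling `child_hash("block-prefix", None)` twice will usually give two different values, which is the situation in which keys would stop matching across a restart.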

Known limitation (WIP — draft)

DeepSeek-V4-Flash exposes 5 kv_cache_groups (1 full-MLA + 4 SlidingWindow / partial-state, block sizes 256/64/64/4/8). Full prefix-reuse across all SWA groups under chunked prefill is not yet complete: a fresh cross-restart lookup currently sees only a subset present (present≈138/1058 → hit_length=0), because SAVE (per-step store_mask at chunked token-lengths) and LOOKUP (store_mask at full token-length) don't yet agree on the windowed groups' chunk set at every window boundary. The mechanism and the single-/simple-group path are proven (15× speedup); robust coverage for V4-Flash's full 5-group SWA structure is the remaining work and likely needs a unified per-(group,chunk) enumeration shared by save/load/lookup rather than the Mooncake per-spec masks.

Testing

Isolated node (hd04-gpu1-0065, fault pool), on-node ctr/nerdctl, never touches production. dfkv_server on-host (--rdma-dev ib7s400p0, member uses rdma-port); vLLM container mounts the package + libdfkv.so; DFKV_RDMA=1 DFKV_RDMA_DEV=ib7s400p0, kv_connector_extra_config={members, model_hash, lib}.
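
For orientation, the test setup above might be assembled roughly as follows. This is a sketch under assumptions: only the key names `members`, `model_hash`, and `lib`, the two `DFKV_RDMA*` variables, and `PYTHONHASHSEED=0` come from this PR; every value, the address format, the `kv_role` choice, and the library path are placeholders, and any module-path setting an out-of-tree plugin may need is omitted because the text does not state it.

```python
import json

# Placeholder values; only the three key names are taken from the PR text.
extra_config = {
    "members": "10.0.0.1:9000",           # hypothetical member address format
    "model_hash": "deepseek-v4-flash",    # hypothetical keyspace label
    "lib": "/opt/dfkv/lib/libdfkv.so",    # hypothetical mount path in the container
}

kv_transfer_config = {
    "kv_connector": "DfkvStoreConnector",
    "kv_role": "kv_both",                 # assumed role, not stated in the PR
    "kv_connector_extra_config": extra_config,
}

# Environment for the vLLM process, as listed in the Testing paragraph and item 3.
env = {
    "DFKV_RDMA": "1",
    "DFKV_RDMA_DEV": "ib7s400p0",
    "PYTHONHASHSEED": "0",
}

# JSON string suitable for passing as the value of --kv-transfer-config.
cli_arg = json.dumps(kv_transfer_config)
```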

Not for merge yet — opening for visibility/review of the approach while the 5-group SWA coverage is finished.

ketor commented Jun 19, 2026

Contributor Author

Update: scatter-gather key coalescing (perf) + honest performance characterization

Functional correctness is fully solved (separate from the WIP note below): all 5 DeepSeek-V4-Flash kv_cache_groups now offload correctly (the bug was register_kv_caches deduping aliased MLA storage — g3 is 168 views of g0's blocks — in the same loop that collected per-layer addrs; decoupled so every layer gets its segment). Cross-restart hit is full (present=1058/1058, hit_length=11776), load failed=0, output byte-identical to cold, and vLLM reliably skips the prefill (verified by trace num_new_tokens=224, and by local-cache hits at 0.23s).

Scatter-gather coalescing (new, additive): added dfkv_batch_put_sg/dfkv_batch_get_auto_sg to the dfkv core (one key gathers ≤29 non-contiguous GPU segments via an RDMA multi-SGE work request; server stays opaque/unchanged; QP max_sge 2→30; old API untouched; 170/170 + 183/183 ctest green incl. a real multi-SGE RDMA roundtrip). The connector now coalesces each chunk's per-layer segments into one @sg{n} key. Measured: keys 25392 → 1242 (~20×), save/load failed=0, output correct.
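
The coalescing step can be pictured as grouping a chunk's per-layer segments into batches of at most 29 and giving each batch one `@sg{n}` key. The following is a minimal sketch, not the connector's implementation: `coalesce_chunk`, the `chunk0` key prefix, and the example addresses are hypothetical, and only the 29-segment limit and the `@sg{n}` suffix come from the description above.

```python
from typing import Dict, List, Tuple

MAX_SEGS_PER_KEY = 29  # payload segments per key, from the PR description

Segment = Tuple[int, int]  # (device address, length in bytes)


def coalesce_chunk(chunk_key: str, segments: List[Segment],
                   max_segs: int = MAX_SEGS_PER_KEY) -> Dict[str, List[Segment]]:
    """Group one chunk's per-layer segments under as few @sg{n} keys as possible."""
    if max_segs < 1:
        raise ValueError("max_segs must be >= 1")
    grouped: Dict[str, List[Segment]] = {}
    for n, start in enumerate(range(0, len(segments), max_segs)):
        grouped[f"{chunk_key}@sg{n}"] = segments[start:start + max_segs]
    return grouped


# 61 hypothetical layer segments of 4 KiB each, spaced 1 MiB apart in GPU memory.
layer_segments = [(0x7F0000000000 + i * (1 << 20), 4096) for i in range(61)]
keys = coalesce_chunk("chunk0", layer_segments)
```

In this toy case 61 per-segment keys collapse to 3, the same kind of reduction as the measured 25392 → 1242.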

Performance reality (single test node, hd04-gpu1-0065):

  • H100 FP8 MLA prefill is extremely fast: 12k tokens 2.7s, 48k tokens 4.7s.
  • dfkv warm load is bottlenecked by O_DIRECT reads from ONE shared disk (~290–358 MB/s/rank, 8 ranks sharing one MegaRAID SSD): 12k load 2.9s, 48k load 9.5s. RDMA (400G) is not the bottleneck (6.7GB ≈ 0.13s at line rate); the single shared disk is.
  • ⇒ On this single-node setup warm-load ≈/> cold-prefill, so the L3 cache is net-neutral/negative for TTFT here. SG eliminated the per-key overhead (the connector-controllable factor); the residual is storage bandwidth/distribution.

Where it pays off: a production multi-node ring distributes keys across N servers/disks (≈N× read bandwidth → load becomes RDMA-bandwidth-bound, sub-second → prefill-skip dominates), or faster local storage / a buffered-I/O (page-cache) server mode for hot prefixes. Those weren't demonstrable on the single isolated test node.
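
The bandwidth argument can be checked with back-of-the-envelope arithmetic. The model below is an idealised sketch (load time is the slower of aggregate disk read and network transfer, with no per-key overhead). It assumes the 6.7 GB figure is the total across the 8 ranks for the 12k load, which is consistent with 8 × ~290 MB/s over ~2.9 s; the 8-disk case is a hypothetical ring, not a measurement.

```python
def line_rate_seconds(total_bytes: float, link_gbit_per_s: float) -> float:
    """Time to move total_bytes over a link at its nominal line rate."""
    return total_bytes * 8 / (link_gbit_per_s * 1e9)


def load_seconds(total_bytes: float, n_disks: int,
                 disk_mb_per_s: float, link_gbit_per_s: float) -> float:
    """Idealised load time: the slower of aggregate disk read and network transfer."""
    disk_time = total_bytes / (n_disks * disk_mb_per_s * 1e6)
    return max(disk_time, line_rate_seconds(total_bytes, link_gbit_per_s))


TOTAL = 6.7e9                 # bytes, from the comment above
SHARED_DISK_MB_S = 8 * 290    # 8 ranks at ~290 MB/s each on one shared SSD

rdma_only = line_rate_seconds(TOTAL, 400)                     # about 0.13 s
single_disk = load_seconds(TOTAL, 1, SHARED_DISK_MB_S, 400)   # about 2.9 s
ring_of_8 = load_seconds(TOTAL, 8, SHARED_DISK_MB_S, 400)     # about 0.36 s (hypothetical)
```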

ketor commented Jun 19, 2026

Contributor Author

Final: performance optimization journey + resolved findings (single-node hd04-gpu1-0065, DeepSeek-V4-Flash, dp4·tp2)

Correctness (proven)

  • All 5 kv_cache_groups offload correctly (decoupled per-group MR registration for aliased MLA storage). Cross-restart hit is full (present=1058/1058, hit_length=11776), failed=0, output byte-identical to cold, vLLM reliably skips the prefill (num_new_tokens=224).
  • Cross-DP reuse is reliable (NOT DP-dependent). Keys = model@tp_rank@pp_rank@block_hash@sg, no DP field; PYTHONHASHSEED=0 is in effect in every DP engine subprocess (hash probe byte-identical across DP0-3), tp_rank is DP-local. Proven: DP3 saved → after restart DP2 got a full hit on DP3's data. Hit/miss is temporal ("has this prompt been saved yet"), not spatial.
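
Why cross-DP reuse works follows directly from the key layout: no field depends on the DP rank. A minimal sketch, assuming simple string encodings for each field (`make_key` and the example values are hypothetical; only the field order `model@tp_rank@pp_rank@block_hash@sg` is taken from the bullet above):

```python
def make_key(model: str, tp_rank: int, pp_rank: int, block_hash: str, sg: int) -> str:
    """Compose model@tp_rank@pp_rank@block_hash@sg{n}; field encodings are assumed."""
    return f"{model}@{tp_rank}@{pp_rank}@{block_hash}@sg{sg}"


# The DP rank is deliberately not an input, so two DP engines that hash the same
# prompt block to the same value (PYTHONHASHSEED=0) address the same stored object.
saved_by_dp3 = make_key("dsv4-flash", tp_rank=0, pp_rank=0, block_hash="ab12cd", sg=0)
looked_up_by_dp2 = make_key("dsv4-flash", tp_rank=0, pp_rank=0, block_hash="ab12cd", sg=0)
```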

Optimizations (load latency, 12k prompt = 834MB)

| step | load |
|---|---|
| baseline (no SG, depth=1, slow disk) | 5–11 s |
| + SG coalescing (25392 → 1242 keys, 20×) | 2.9 s |
| + DFKV_RDMA_DEPTH=32 + fast disk | 1.6 s |
| + io_uring async GET | 1.6 s (no gain) |

  • Biggest finding: DFKV_RDMA_DEPTH defaults to 1 (RDMA pipeline disabled → one in-flight read per connection). Setting it to 32 cut the load ~5× and the save ~4×. This is a one-line env that also benefits the production SGLang HiCache ring. Recommend DFKV_RDMA_DEPTH>1 in production.
  • SG coalescing (this PR): batch_put_sg/batch_get_auto_sg (multi-SGE RDMA, server opaque, QD≤29 segs/key) → one key per chunk instead of one per layer-segment → load becomes bandwidth-bound, not per-key-round-trip-bound.
  • io_uring async GET (this PR, DFKV_SERVER_URING, default off): batch-and-wait model (matches Mooncake's uring_file.cpp batch_read), preserves in-order replies. Correctly implemented + ctest green, but no measurable gain on a single disk — the disk is already ~75% saturated by the per-connection pipeline (depth=32). It's there for genuinely read-serial-bound deployments; default-off = production-safe.

TTFT decomposition (warm hit, after JIT warm)

lookup 0.35 s + dfkv load 1.5 s + resume 0.03 s + tail-forward 0.2 s ≈ 2.0 s, vs cold prefill 2.7 s → dfkv wins for 12k once warm. The first warm request per DP rank pays a one-time Triton JIT compile (~1.9 s) of the resumed-prefill + SWA-index kernels (a shape startup warmup doesn't cover); subsequent requests run that forward in ~0.2 s. Recommend pre-warming (dp_size synthetic hits at startup, or extend _dummy_run).
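
The decomposition above adds up as follows; the snippet only restates the quoted figures and leaves the one-time JIT cost out of the steady-state total.

```python
# Warm-hit TTFT components in seconds, copied from the paragraph above.
warm = {"lookup": 0.35, "dfkv_load": 1.5, "resume": 0.03, "tail_forward": 0.2}
cold_prefill = 2.7

warm_total = round(sum(warm.values()), 2)        # steady-state warm TTFT
saving = round(cold_prefill - warm_total, 2)     # time saved versus cold prefill
load_share = round(warm["dfkv_load"] / warm_total, 2)  # fraction spent in dfkv load
```

The dfkv load is roughly 70% of the warm TTFT, which is why storage bandwidth dominates the remaining optimization headroom.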

Scope note

H100 FP8 MLA prefill is very fast (12k=2.7 s, 48k=4.7 s); KV load grows linearly. So dfkv wins for short/medium prompts and is load-bound for long ones on a single disk — long-context needs the distributed ring (per-key fan-out across many disks/NICs). Co-locating N servers on one node/NIC backfires (QP/connection overhead) and is not representative of a real ring.

@ketor ketor marked this pull request as ready for review June 19, 2026 15:55

ketor commented Jun 19, 2026

Contributor Author

Correction to the earlier perf comment: DFKV_RDMA_DEPTH is depth-flat (retracting the "depth is the big win" claim)

The earlier comment credited a ~5× load speedup to DFKV_RDMA_DEPTH=32. That was confounded — it was measured while also switching to a faster, but shared, SSD whose read latency varies several-fold with co-tenant I/O. A controlled, isolated benchmark settles it:

dfkv_bench --threads 1 --bc 1 --size 512KiB --count 2000, single RC connection, io_uring server, depth 1 vs 32 interleaved ×3:

| DFKV_RDMA_DEPTH | GET GB/s (3 runs) | GET p50 |
|---|---|---|
| 1 | 1.22 / 1.24 / 1.25 | ~0.41 ms |
| 32 | 1.22 / 1.25 / 1.24 | ~0.41 ms |

Indistinguishable. GET (and PUT) are depth-flat — the per-connection serve loop is in-order, so pipelining N requests on one connection just queues them. This matches docs/datapath-perf-notes.md (which already documented PUT-flat; now extended to GET). Depth is a latency-hider, not a throughput knob. README + perf-notes updated accordingly.
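
A simple queueing model makes the "latency-hider, not a throughput knob" point concrete. This is an illustrative sketch, not a measurement: it assumes one connection, a strictly in-order server that finishes one request every `service_s` seconds, and a client that keeps `depth` requests in flight; `gets_per_second` is a hypothetical helper.

```python
def gets_per_second(depth: int, service_s: float, rtt_s: float) -> float:
    """Steady-state GET rate for one connection served strictly in order.

    The server completes at most one request per service_s; the client keeps
    `depth` requests in flight, each costing rtt_s + service_s end to end.
    """
    server_limit = 1.0 / service_s
    client_limit = depth / (rtt_s + service_s)
    return min(server_limit, client_limit)


SERVICE = 0.41e-3  # seconds, the GET p50 from the table above

# Negligible network latency: depth changes nothing, the server is the limit.
flat_1 = gets_per_second(1, SERVICE, rtt_s=0.0)
flat_32 = gets_per_second(32, SERVICE, rtt_s=0.0)

# Hypothetical 2 ms round trip: depth now hides the latency.
slow_1 = gets_per_second(1, SERVICE, rtt_s=2e-3)
slow_32 = gets_per_second(32, SERVICE, rtt_s=2e-3)
```

At 512 KiB per GET, `flat_1` corresponds to roughly 1.28 GB/s, in the same range as the measured 1.22 to 1.25 GB/s.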

What actually optimizes the connector load (this PR):

  • SG key coalescing (batch_put_sg/batch_get_auto_sg): one key per chunk instead of one per layer-segment → 25392 → 1242 keys (20×) → far fewer per-key RDMA round-trips. This is the real, robust connector-side load lever.
  • Throughput levers in general (per perf-notes): multi-connection fan-out (batch_concurrency, default 8) and fewer/larger keys.

Side observation: single-connection raw GET is ~1.24 GB/s, while the connector's per-rank load was ~0.2–0.5 GB/s — i.e. there is connector-side headroom (SG scatter / GPUDirect / random-key disk reads), a separate track from depth.

Net: keep DFKV_RDMA_DEPTH at its default unless you're explicitly network-latency-bound; the load win is SG + fan-out, not depth.

ketor added 2 commits June 20, 2026 00:56
…gated io_uring async GET

Two additive dfkv-core features that back the new vLLM KV connector; the existing
batch_put/batch_get_auto and the sync serve loop are untouched (CI green: TSan +
clang/g++ + Soft-RoCE RDMA loopback).

- Scatter-gather API: dfkv_batch_put_sg / dfkv_batch_get_auto_sg gather/scatter N
  non-contiguous buffers under one key via a single RDMA multi-SGE work request
  (SGE0=header + <=29 payload segs on max_sge=30 HCAs; server stays opaque, one blob
  per key). QP max_send/recv_sge raised 2 -> min(30, device cap); the legacy 2-SGE
  path is unchanged, so it's additive. Lets the connector coalesce a chunk's per-layer
  segments into one key (25392 -> 1242 keys, ~20x fewer per-key disk reads).
- io_uring async GET (DFKV_SERVER_URING, default off; built under -DDFKV_WITH_URING):
  the serve loop submits a WaitComp batch's reads concurrently and drains them in
  request order, preserving the zero-copy in-order-reply invariant; falls back to the
  verbatim sync loop on ring-init failure. Default-off = production byte-identical.
- Tests: sg_test (N=1/2/29 + >29 guard), rdma_loopback SG/uring roundtrips, c_api guard.
- docs/datapath-perf-notes: DFKV_RDMA_DEPTH is throughput-flat for GET too (benchmarked,
  GET 1.24 GB/s at depth 1 == 32) — a latency-hider, not a throughput knob.
…Direct RDMA, bypass LMCache)

A KVConnectorBase_V1 connector under integration/vllm that stores/loads KV cache
directly to/from a dfkv cluster over GPUDirect RDMA, occupying the same
--kv-transfer-config slot as MooncakeStoreConnector. Pure-Python ctypes over
libdfkv.so; bypasses LMCache (which does not work for DeepSeek-V4 on this stack).

Validated on hd04 H100 + IB, DeepSeek-V4-Flash (dp4/tp2): all 5 kv_cache_groups
offload correctly (per-group MR registration handles the aliased MLA storage), full
cross-restart AND cross-DP hit, output byte-identical to cold, vLLM skips the prefill.
Uses the scatter-gather core API to coalesce per-chunk keys. Design/plan docs under
docs/superpowers/; perf characterization (single-disk storage-bound load; one-time
per-DP Triton JIT on the first warm request) is in the PR thread.
@ketor ketor force-pushed the feat/vllm_store_connector branch from 7180224 to aca4b66 Compare June 19, 2026 16:56
ketor added 4 commits June 20, 2026 01:06
…t hang (P1 from /review)

Fresh-eyes /review caught two real fail-soft defeats in the connector transfer threads:
- SAVE except handler logged undefined flat_keys (leftover from pre-SG code) -> NameError
  on the exact dfkv-transport-exception path it exists to absorb. Use sg_keys.
- recv thread did tp_rank % len(key_list) with no empty guard -> ZeroDivisionError before
  set_finished_request when every chunk is masked out (SWA edge) -> req never reported
  done -> vLLM hangs in WAITING_FOR_REMOTE_KVS. Add the empty-keylist early finish.

Also documents a known cross-HCA limitation: kSgMaxPayloadSegs is a fixed 29, not the
live max_sge-1; on max_sge<30 devices the SG batch fail-softs (recompute, no corruption)
rather than failing only the offending key — TODO for heterogeneous-hardware support.

io_uring P2s (EINTR CQE reap, partial-submit, short-read retry) are behind the
default-off DFKV_SERVER_URING flag; tracked as follow-ups.
… hardening)

The recv thread completed the request (set_finished_request + task_done) as trailing
statements outside any finally, unlike the send thread. Any exception before that point
(prepare_value, _group_segments_sg, load_mask, ...) skipped completion -> the req was
never reported done -> vLLM hung in WAITING_FOR_REMOTE_KVS. Now the whole body is in
try/except/finally: any unexpected failure marks the request's blocks for recompute and
the finally always completes the request. Subsumes the empty-keylist guard's symptom.
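
The hardening described in this commit message follows a common worker-thread pattern: completion lives in `finally`, so no exception can skip it. The sketch below is a simplified stand-in, not the connector's recv thread. `RecvWorker`, `load_fn`, and the list-based bookkeeping are hypothetical; only the `set_finished_request` and `task_done` names and the recompute-on-failure behaviour come from the text.

```python
import queue
import threading


class RecvWorker(threading.Thread):
    """Load thread that always reports completion (names are illustrative)."""

    def __init__(self, load_fn):
        super().__init__(daemon=True)
        self.tasks = queue.Queue()
        self.load_fn = load_fn    # may raise; stands in for the dfkv load path
        self.finished = []        # request ids reported done
        self.recompute = []       # request ids whose blocks must be recomputed
        self._lock = threading.Lock()

    def set_finished_request(self, req_id):
        with self._lock:
            self.finished.append(req_id)

    def run(self):
        while True:
            item = self.tasks.get()
            if item is None:      # shutdown sentinel
                self.tasks.task_done()
                return
            req_id, keys = item
            try:
                if keys:          # empty key list: nothing to load, just finish
                    self.load_fn(keys)
            except Exception:
                # Fail-soft: fall back to recompute instead of propagating.
                with self._lock:
                    self.recompute.append(req_id)
            finally:
                # Runs on success, on failure, and for the empty-key case.
                self.set_finished_request(req_id)
                self.tasks.task_done()
```

Because the completion call sits in `finally`, a request whose load raises is still reported done, which is what prevents the WAITING_FOR_REMOTE_KVS hang.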
…bustness, empty-key skip, true out_lens

Fix 1 [P1] rdma_transport.cc CacheFromMulti/RangeIntoMulti: an oversized SG key
  (total>max_payload / header>control_cap / segs>max_sge-1) no longer poisons its
  whole node batch. Replace the up-front std::fill(kInvalid)+return with a per-item
  bad[] mask; offenders are marked kInvalid and SKIPPED in the window (no MR, no
  recv/send WR, no completion consumed — need=2*posted), siblings proceed. Connector
  treats per-key kInvalid as a fail-soft save/load miss.

Fix 2 [P2] uring_reader.h BatchRead: (a) EINTR from submit_and_wait re-waits
  instead of resubmitting (no double-queue/user_data desync); (b) short submit is
  fully flushed via a submit loop so no stale SQEs linger; (c) short O_DIRECT reads
  re-prep the residual [off+done,len-done) and reap again, bounded by kMaxReaps,
  mirroring the sync PreadRangeDirectTo loop. Per-desc done_ tracks progress.

Fix 3 [P2] kv_client.cc BatchPutSg/BatchGetAutoSg: skip items with empty key
  (out_ok/out_hit stay 0) so no junk empty-key blob is written and no wasted GET
  is issued — matches the C-ABI null guard intent.

Fix 4 [P3] transport.h default RangeIntoMulti: out_lens[i] now reports the TRUE
  received payload length (min(header.payload_len, sum_caps)) instead of sum-of-caps,
  matching the RDMA override's received_bytes-header semantics.
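
Fix 1 replaces an all-or-nothing rejection with a per-item mask. The real change is in the C++ transport; the Python sketch below only illustrates the control flow. `SgItem`, `validate_batch`, `post_batch`, and the status values are hypothetical, while the three limits and the `need = 2 * posted` accounting are taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

K_OK, K_INVALID = 1, -1  # illustrative status codes, not dfkv's actual values


@dataclass
class SgItem:
    header_len: int
    seg_lens: List[int]


def validate_batch(items: List[SgItem], max_payload: int,
                   control_cap: int, max_sge: int) -> List[bool]:
    """Return a per-item bad[] mask instead of rejecting the whole batch."""
    bad = []
    for it in items:
        bad.append(
            sum(it.seg_lens) > max_payload       # total > max_payload
            or it.header_len > control_cap       # header > control_cap
            or len(it.seg_lens) > max_sge - 1    # segs > max_sge - 1 (SGE0 is the header)
        )
    return bad


def post_batch(items: List[SgItem], max_payload: int,
               control_cap: int, max_sge: int) -> Tuple[List[int], int]:
    """Mark offenders invalid, skip them, and count completions for the rest."""
    bad = validate_batch(items, max_payload, control_cap, max_sge)
    status = [K_INVALID if b else K_OK for b in bad]
    posted = sum(1 for b in bad if not b)
    completions_needed = 2 * posted              # need = 2 * posted
    return status, completions_needed
```

An oversized item therefore costs only its own key; its siblings in the same batch still go through.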

Tests: +Sg.EmptyKeySkipped (TCP) and +RdmaLoopback.ScatterGatherOversizedFailsOnlyOffender
  (real CacheFromMulti/RangeIntoMulti path). Full ctest green with
  DFKV_WITH_URING ON (187/187, DFKV_SERVER_URING=1) and OFF (187/187).
@ketor ketor merged commit bbf19b2 into dingodb:main Jun 19, 2026
6 checks passed
ketor added a commit to ketor/dfkv that referenced this pull request Jun 19, 2026
…pth claim

- §5: correct the DFKV_RDMA_DEPTH note — depth is throughput-flat (2026-06 benchmark),
  not a write-bandwidth booster; the lever is batch_concurrency + fewer/larger keys.
- §9 (new): which post-dingodb#46 features do NOT apply to the HiCache/MLA path and why
  (SG = nothing to coalesce for one-object-per-page; io_uring = flat on single disk;
  depth = flat). Plus: vLLM and HiCache instances of the SAME model can share the
  dfkv cluster/ring but do NOT reuse each other's KV (different key schemes + KV
  layouts) — share nodes/capacity, isolate keyspace via distinct model_hash/name.
ketor added a commit that referenced this pull request Jun 19, 2026
…ence (#47)

* docs(vllm): document the DfkvStoreConnector — README + full deploy guide + config reference

The merged vLLM connector (PR #46) had only a terse integration/vllm/README and no
deploy doc. Add complete docs reflecting the shipped code:

- README.md: add the vLLM connector to Engine integrations + Layout + bump the test
  count (53 -> 88 ctest entries, add the RDMA datapath CI job).
- integration/vllm/README.md: complete the env-var table (incl. the critical
  PYTHONHASHSEED=0 for cross-process/restart key determinism), the full
  kv_connector_extra_config keys with defaults (load_async, enable_cross_layers_blocks,
  lookup_rpc_port), a geometry guard for shared pools, and the SG + JIT notes.
- docs/vllm/DEPLOY.md (new): end-to-end deploy (build -> dfkv cluster -> connector ->
  vLLM -> verify) with a full config reference and per-scenario recommended settings
  (single/multi-DP/shared-pool/long-context), geometry guard, measured results, and a
  troubleshooting table. Mirrors docs/lmcache/DEPLOY.md.

* docs: broaden README tagline beyond SGLang — dfkv now serves SGLang HiCache, LMCache, and vLLM

The 'distributed KV cache for SGLang HiCache' title undersold the repo now that it
backs three engines. Lead with LLM inference + list the three adapters. Also fix the
now-contradictory 'without ... MDS ... dependency' line (dfkv ships its own dfkv_mds).

* docs(hicache): add engine/feature-boundary section + correct stale depth claim

- §5: correct the DFKV_RDMA_DEPTH note — depth is throughput-flat (2026-06 benchmark),
  not a write-bandwidth booster; the lever is batch_concurrency + fewer/larger keys.
- §9 (new): which post-#46 features do NOT apply to the HiCache/MLA path and why
  (SG = nothing to coalesce for one-object-per-page; io_uring = flat on single disk;
  depth = flat). Plus: vLLM and HiCache instances of the SAME model can share the
  dfkv cluster/ring but do NOT reuse each other's KV (different key schemes + KV
  layouts) — share nodes/capacity, isolate keyspace via distinct model_hash/name.
ketor added a commit that referenced this pull request Jun 19, 2026
New direct vLLM integration + scatter-gather datapath since v1.5.2:
- vLLM DfkvStoreConnector (KVConnectorBase_V1, GPUDirect RDMA, bypass LMCache) — #46
- Scatter-gather batch API (batch_put_sg/batch_get_auto_sg, QP max_sge 2->30): one
  multi-SGE RDMA per chunk, ~20x fewer keys/disk-reads — #46
- io_uring async GET serve loop (opt-in DFKV_SERVER_URING, default off) — #46
- 7 fresh-eyes review fixes (per-item SG failure, recv-thread hardening, empty-key
  skip, io_uring EINTR/short-read, true out_lens) + 2 regression tests — #46
- Docs: vLLM deploy guide + config reference, README multi-engine, HiCache boundary — #47

No wire change (kProtoVersion still 1); v1.5.x compatible. CI green incl. TSan + RDMA datapath.