A small, self-contained distributed key-value cache that plugs into SGLang's HiCache as its L3 external KV store. Built to pool GPU-node NVMe SSDs into a shared, large-capacity KVCache pool for LLM inference (e.g. GLM-5.1 / MLA), without any DingoFS / brpc / MDS / S3-RADOS dependency — it runs on its own.
Origin: extracted from the DingoFS branch feat/kvcache-sglang (src/cache/kvclient). The portable core has zero coupling to DingoFS, so it lives here as an independent repo. To instead fuse these semantics into the production dingo-cache (brpc + MDS), see docs/INTEGRATION.md.
- dfkv_server: a cache-node daemon. Disk + LRU, cache-only (a miss is a clean NotFound; no object-store fallback), synchronous durable-visible writes. Supports multiple NVMe SSDs per node (--dir d1,d2,d3, intra-node Ketama). With --mds, --group, --id, --advertise, --weight it registers into the MDS tier; the old static --members flag has been removed.
- dfkv_mds: stateless Membership Directory Service daemon. Flags: --listen <port> and --etcd <host:port> (default 127.0.0.1:2379). The only etcd client in the system; holds each node's etcd lease on its behalf. Deploy as N replicas; no load balancer is needed, because nodes and clients each pick any reachable MDS and fail over automatically.
- libdfkv.so: C ABI client (key→consistent-hash routing, value header with CRC + model/page/dtype/layer geometry guard, Put/Get/Exist).
- python/dfkv_hicache.py: SGLang HiCacheStorage plugin loaded via --hicache-storage-backend dynamic (no SGLang fork). MLA: one packed-latent object per page, no tp_rank suffix, backup_skip (only tp_rank 0 writes).
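The value header mentioned for libdfkv.so (CRC plus a model/page/dtype/layer geometry guard) can be pictured with a minimal Python sketch. The field layout, the `DFKV` magic, and the `wrap`/`unwrap` names are assumptions made for illustration; the real header format lives in the C++ core and may differ.

```python
import struct
import zlib

# Hypothetical layout (not the real dfkv header):
# magic, model id, page size, dtype code, layer count, CRC32 of the payload.
_HEADER = struct.Struct("<4sIIHHI")
_MAGIC = b"DFKV"


def wrap(payload: bytes, model: int, page: int, dtype: int, layers: int) -> bytes:
    """Prefix the payload with a header describing its geometry and checksum."""
    crc = zlib.crc32(payload)
    return _HEADER.pack(_MAGIC, model, page, dtype, layers, crc) + payload


def unwrap(blob: bytes, model: int, page: int, dtype: int, layers: int):
    """Return the payload, or None when the value must be treated as a miss."""
    if len(blob) < _HEADER.size:
        return None  # truncated value
    magic, m, p, d, n_layers, crc = _HEADER.unpack_from(blob)
    if magic != _MAGIC or (m, p, d, n_layers) != (model, page, dtype, layers):
        return None  # written by a different model or page geometry
    payload = blob[_HEADER.size:]
    if zlib.crc32(payload) != crc:
        return None  # corrupted on disk or in flight
    return payload
```

Treating every mismatch as a miss fits the cache-only design: the caller recomputes the KV page instead of consuming bytes that belong to another model or were damaged.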
SGLang HiCache (zero-copy v1) → dfkv_hicache.py (ctypes) → libdfkv client
(Ketama route + header wrap/verify) → TCP/RDMA → dfkv_server (DiskCacheGroup
over N NVMe, LRU). Distributed = client-side consistent hashing; no replication
(regenerable KV → node loss = miss → recompute).
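As a rough illustration of the client-side consistent hashing described above, the following Python sketch builds a weighted Ketama-style ring. The vnode count, the MD5-based hash, and the `WeightedRing` class are assumptions for illustration; libdfkv's actual hash function and vnode layout are not specified here.

```python
import bisect
import hashlib

VNODES_PER_WEIGHT = 160  # assumption: a common Ketama default, not taken from dfkv


def _hash32(data: str) -> int:
    # First 4 bytes of MD5, read as an unsigned 32-bit ring position.
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:4], "big")


class WeightedRing:
    """Hypothetical weighted consistent-hash ring: {node_id: weight} -> owner lookup."""

    def __init__(self, members):
        points = []
        for node, weight in members.items():
            # A node with weight w contributes w times as many virtual nodes.
            for i in range(VNODES_PER_WEIGHT * weight):
                points.append((_hash32(f"{node}#{i}"), node))
        points.sort()
        self._keys = [pos for pos, _ in points]
        self._nodes = [node for _, node in points]

    def route(self, key: str):
        """Return the node owning the first ring point after hash(key), or None if empty."""
        if not self._keys:
            return None
        idx = bisect.bisect_right(self._keys, _hash32(key)) % len(self._keys)
        return self._nodes[idx]
```

With such a ring, removing a node only remaps the keys that node owned. That is what makes "node loss = miss → recompute" cheap: keys on surviving nodes keep their placement.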
Membership is managed by the MDS tier (dfkv_mds + etcd). Nodes register
with the MDS on startup and send periodic heartbeats; etcd leases (TTL 30 s)
are the liveness signal. Clients call dfkv_start_mds_discovery(c, "ep1,ep2", group, poll_ms) to poll the MDS and rebuild the weighted consistent-hash ring
whenever the epoch (etcd revision) advances. Two-layer offline detection:
layer-2 — etcd lease expiry → MDS view changes → client epoch → ring rebuild
(authoritative removal, ≤ 30 s); layer-1 — PeerHealth fast avoidance: a
peer that fails transport IO is short-circuited to miss for a cooldown period
without any ring change. The legacy static path (dfkv_open(members=...) /
dfkv_set_members) still exists for simple or single-node setups.
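The two detection layers can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `PeerHealth` and `MembershipView` classes below, the 5-second cooldown, and the injected clock are illustrative and do not mirror the C++ implementation.

```python
import time


class PeerHealth:
    """Layer 1: skip a peer for a cooldown window after a transport error."""

    def __init__(self, cooldown_s=5.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s  # assumption: the real cooldown is not stated
        self._clock = clock
        self._down_until = {}

    def mark_failed(self, peer):
        self._down_until[peer] = self._clock() + self._cooldown_s

    def is_avoided(self, peer):
        # While avoided, a request to this peer is answered as a miss without any IO.
        return self._clock() < self._down_until.get(peer, 0.0)


class MembershipView:
    """Layer 2: replace the member set only when the MDS epoch advances."""

    def __init__(self):
        self.epoch = -1
        self.members = {}
        self.rebuilds = 0

    def apply(self, epoch, members):
        if epoch <= self.epoch:
            return False  # stale or unchanged view: keep the current ring
        self.epoch = epoch
        self.members = dict(members)
        self.rebuilds += 1  # a real client would rebuild its hash ring here
        return True
```

Layer 1 reacts within one failed request and never touches the ring, so placement stays stable through brief hiccups; layer 2 removes the node authoritatively once its etcd lease has expired.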
cmake -S . -B build # add -DDFKV_STATIC_LIBSTDCXX=ON for portable binaries
cmake --build build -j
ctest --test-dir build --output-on-failure # C++ gtests + the Python plugin test
Artifacts: build/dfkv_server, build/dfkv_mds, build/libdfkv.so.
# 1. Start etcd (one or three nodes, external)
# 2. Start MDS replicas (stateless, any number)
dfkv_mds --listen 9400 --etcd 127.0.0.1:2379
# 3. On each cache node (--mds requires --id and --advertise)
dfkv_server --dir /mnt/disk1/dfkv,/mnt/disk2/dfkv,/mnt/disk3/dfkv \
--port 12000 --cap 6597069766656 \
--mds 10.0.0.1:9400,10.0.0.2:9400 \
--group default --id n1 --advertise 10.0.0.10:12000
# 4. Client: MDS-based discovery (recommended)
# dfkv_start_mds_discovery(c, "10.0.0.1:9400,10.0.0.2:9400", "default", 3000);
# OR legacy static path (single-node / simple setups)
# dfkv_open("n1=10.0.0.10:12000,...", ...)
Full rollout runbook (etcd + MDS + systemd units): docs/DEPLOY.md.
src/ portable C++ core (headers + .cc) + dfkv_server_main.cc + dfkv_mds_main.cc
python/ dfkv_hicache.py (SGLang dynamic backend plugin)
integration/lmcache/ dfkv_connector (LMCache RemoteConnector, ctypes over libdfkv.so)
tests/ gtest suites + tests/python (unittest + no-torch sglang shim)
docs/ DEPLOY.md (standalone rollout) · INTEGRATION.md (fuse into dingo-cache)
docs/hicache/ SGLang HiCache plugin docs (access_log, module README)
docs/lmcache/ LMCache connector docs (DESIGN · IMPLEMENTATION · DEPLOY)
- SGLang HiCache: python/dfkv_hicache.py; see docs/hicache/ and docs/DEPLOY.md.
- LMCache: integration/lmcache/ (dfkv_connector); see docs/lmcache/DESIGN.md, docs/lmcache/IMPLEMENTATION.md, docs/lmcache/DEPLOY.md.
- Connection pooling + keep-alive (TCP_NODELAY): ~250× lower latency vs dial-per-call.
- Batch APIs with concurrent fan-out across nodes (BatchPut/Get/Exist, C ABI + plugin).
- Connect/IO timeouts + stale-connection retry: a hung node fails fast, never hangs.
- Observability (docs/METRICS.md): opt-in embedded Prometheus /metrics on dfkv_server and dfkv_mds (--metrics-port); sampled op-latency histogram, eviction/error/per-disk/RDMA counters server-side; client-side counters (peer health, IO errors) via dfkv_stats_snapshot + a plugin poller. Opt-in and off the datapath: no --metrics-port ⇒ no listener, behavior unchanged.
- Dynamic membership: MDS discovery (dfkv_start_mds_discovery) polls the MDS tier and rebuilds the weighted Ketama ring on each etcd-epoch change. Legacy SetMembers() hot-swap and dfkv_refresh_members (single-seed query) are still supported.
- CLI tools: dfkv_smoke (roundtrip check) and dfkvctl, which offers per-node ops (put/get/exist/stat) plus cluster views: dfkvctl ring (membership + ring vnode share) and dfkvctl stat --all (per-node metrics + cluster aggregate) via MDS.
- RDMA transport (gated -DDFKV_WITH_RDMA=ON, native libibverbs RC): device selected by name (DFKV_RDMA_DEV=ib7s400p0, comma-list = multi-rail), QP bootstrapped over a tiny TCP channel so the 400G data fabric needs no IP and may be separate from the IP network. Automatic TCP fallback when no device is present or DFKV_RDMA is unset. Validated on 400G InfiniBand.
- Zero-copy GET on both ends: the server reads the block straight into the send buffer; the client scatters the payload directly into the caller's buffer (e.g. an SGLang HiCache registered host page), with no intermediate copies.
- Optional pipelining (DFKV_RDMA_DEPTH=K): K requests in flight per connection.
- NUMA-aware rail selection (DFKV_RDMA_NUMA=1): pins buffers/serve-threads to the rail's NUMA node and, with a multi-rail DFKV_RDMA_DEV, picks a NUMA-local rail per connection (falls back to round-robin over all rails when no local rail exists). Off by default; vendor-neutral (sysfs + sched_getcpu, no libnuma/CUDA).
- HiCache v2 (PoolTransfer) for multi-pool models (Mamba/SWA/DeepSeek-V4).
- Packaging: CPack (deb/rpm/tgz) + Dockerfile; graceful shutdown; leveled logging.
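The batch fan-out listed above can be approximated in Python as follows. This is only a sketch of the idea (group keys by owning node, then query the nodes concurrently); `batch_get`, `route`, and `fetch_from_node` are hypothetical names, not the BatchGet entry points of the C ABI or the plugin.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def batch_get(keys, route, fetch_from_node):
    """Fetch many keys with one concurrent request per owning node.

    route(key) -> node id; fetch_from_node(node, keys) -> {key: value} for hits.
    Both callables are placeholders supplied by the caller.
    """
    by_node = defaultdict(list)
    for key in keys:
        by_node[route(key)].append(key)

    results = {key: None for key in keys}  # None marks a miss
    if not by_node:
        return results
    with ThreadPoolExecutor(max_workers=len(by_node)) as pool:
        futures = [pool.submit(fetch_from_node, node, node_keys)
                   for node, node_keys in by_node.items()]
        for fut in futures:
            try:
                results.update(fut.result())
            except OSError:
                pass  # a failed node degrades to misses for its keys
    return results
```

Because a failed node simply yields misses, one slow or dead node does not fail the whole batch, which matches the cache-only semantics described earlier.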
TDD; 53 C++ ctest entries + 7 Python tests green, 0 warnings, ThreadSanitizer-clean.
CI: gcc/clang build+test, TSan, RDMA compile-check, static-artifact build. License: Apache-2.0.
See docs/DEPLOY.md (rollout) and the round report in the ai_david KB.