…-queue retry, wire compression, and queueHint race fix
Phase B.1 — SWIM-style indirect heartbeat probes:
- Add WithDistIndirectProbes(k, timeout) option; when a direct probe
fails, up to k random alive relays probe the target on the caller's
behalf — target is only marked suspect if every relay also fails
- Add /internal/probe HTTP endpoint and IndirectHealth() transport method
- Refuted direct failures now refresh LastSeen rather than escalating
- Expose dist.heartbeat.indirect_probe.{success,failure,refuted} metrics
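The escalation rule in Phase B.1 can be sketched as follows. This is a minimal illustration, not the project's implementation: the `prober` interface, `shouldSuspect`, and `fakeProber` are hypothetical names standing in for the real transport methods, and the relay selection simply shuffles the alive set and takes the first k.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// errUnreachable is a hypothetical error returned by the fake transport below.
var errUnreachable = errors.New("unreachable")

// prober is a hypothetical stand-in for the direct and relayed health checks.
type prober interface {
	Direct(target string) error
	Indirect(relay, target string) error
}

// shouldSuspect reports true only when the direct probe fails and every one
// of up to k randomly chosen relays also fails to reach the target.
func shouldSuspect(p prober, target string, relays []string, k int) bool {
	if p.Direct(target) == nil {
		return false
	}
	// Copy before shuffling so the caller's slice is left untouched.
	candidates := append([]string(nil), relays...)
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if k < 0 {
		k = 0
	}
	if k > len(candidates) {
		k = len(candidates)
	}
	for _, relay := range candidates[:k] {
		if p.Indirect(relay, target) == nil {
			return false // a relay reached the target, so the direct failure is refuted
		}
	}
	return true
}

// fakeProber simulates a target the caller can never reach directly.
type fakeProber struct {
	reachableVia map[string]bool // relays that can still reach the target
}

func (f fakeProber) Direct(string) error { return errUnreachable }

func (f fakeProber) Indirect(relay, _ string) error {
	if f.reachableVia[relay] {
		return nil
	}
	return errUnreachable
}

func main() {
	relays := []string{"n2", "n3", "n4"}
	partial := fakeProber{reachableVia: map[string]bool{"n3": true}}
	fmt.Println("suspect (one relay succeeds):", shouldSuspect(partial, "n5", relays, 3))
	fmt.Println("suspect (all relays fail):", shouldSuspect(fakeProber{}, "n5", relays, 3))
}
```

In this sketch a refuted failure just returns false; per the commit message, the real code additionally refreshes LastSeen and records the refuted metric.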
Phase B.2 — migration failure retry via hint queue:
- migrateIfNeeded queues a hint on ForwardSet failure instead of
logging-and-dropping silently
- replicateTo hint enqueue broadened from ErrBackendNotFound-only to
any transport error (timeouts, 5xx, connection resets)
- Fix race in queueHint: snapshot hintBytes under hintsMu before unlock
to prevent concurrent adjustHintAccounting in the replay loop from
racing the metric write
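The shape of the queueHint race fix can be illustrated with a small sketch. The type and method names here (`hintQueue`, `enqueue`, `replayed`, `gauge`) are assumptions made for the example; only the idea of copying the counter while the mutex is still held comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// hintQueue is a hypothetical reduction of the hint accounting state.
type hintQueue struct {
	mu        sync.Mutex
	hintBytes int64        // guarded by mu
	gauge     atomic.Int64 // stand-in for the exported metric
}

// enqueue adds size bytes and publishes a snapshot taken under the lock.
// Reading q.hintBytes after Unlock would race with replayed.
func (q *hintQueue) enqueue(size int64) {
	q.mu.Lock()
	q.hintBytes += size
	snapshot := q.hintBytes // copy while the lock is still held
	q.mu.Unlock()
	q.gauge.Store(snapshot)
}

// replayed models the replay loop adjusting the accounting concurrently.
func (q *hintQueue) replayed(size int64) {
	q.mu.Lock()
	q.hintBytes -= size
	snapshot := q.hintBytes
	q.mu.Unlock()
	q.gauge.Store(snapshot)
}

// total returns the current byte count under the lock.
func (q *hintQueue) total() int64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.hintBytes
}

func main() {
	var q hintQueue
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); q.enqueue(8) }()
		go func() { defer wg.Done(); q.replayed(8) }()
	}
	wg.Wait()
	fmt.Println("bytes outstanding:", q.total())
}
```

Running a program like this with `go run -race` is one way to confirm that the snapshot pattern keeps the detector quiet. The published gauge value may briefly lag the counter, but it is never read from unsynchronized memory.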
Phase B.3 — on-wire gzip compression for the dist HTTP transport:
- Add DistHTTPLimits.CompressionThreshold; ForwardSet gzip-compresses
Set request bodies exceeding the threshold; server decompresses
transparently via fiber v3 Content-Encoding auto-decoding
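A threshold-gated compression step along the lines of Phase B.3 might look like the sketch below. The helper names `maybeCompress` and `decompress` are hypothetical, and treating a non-positive threshold as "compression disabled" is an assumption of this example rather than documented behaviour.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// maybeCompress gzips body only when it exceeds threshold bytes. It returns
// the payload plus the Content-Encoding value to send ("" when uncompressed).
// Assumption: threshold <= 0 disables compression.
func maybeCompress(body []byte, threshold int) ([]byte, string, error) {
	if threshold <= 0 || len(body) <= threshold {
		return body, "", nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(body); err != nil {
		return nil, "", err
	}
	// Close flushes the gzip trailer; without it the stream is truncated.
	if err := zw.Close(); err != nil {
		return nil, "", err
	}
	return buf.Bytes(), "gzip", nil
}

// decompress mirrors what a server-side decoder does with a gzip body.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	body := bytes.Repeat([]byte("a"), 4096)
	out, enc, err := maybeCompress(body, 1024)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoding=%q original=%d sent=%d\n", enc, len(body), len(out))
}
```

Bodies at or below the threshold pass through untouched, which avoids paying gzip overhead on small Set requests.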
Refactor: extract membershipSnapshot() helper from Metrics() to keep
function length under the lint cap
Add contract tests for all three phases and the queueHint race fix
…book (Phase C.1–C.3)
Phase C.1 — Drain endpoint:
- Add DistMemory.Drain(ctx) and POST /dist/drain HTTP endpoint; marks the node for graceful shutdown in a one-way, idempotent transition
- /health returns 503 while draining so load balancers stop routing
- Set/Remove reject with sentinel.ErrDraining; Get continues to serve
- Add IsDraining() accessor and dist.drains metric (CAS ensures it fires exactly once per transition)
Phase C.2 — Cursor-based key enumeration:
- Replace the naive full-set /internal/keys response with shard-level cursor pagination (next_cursor token per page)
- Add optional ?limit=<n> param; truncated=true in the response flags a partially-read shard and returns the same cursor for re-request
- DistHTTPTransport.ListKeys now walks pages internally with a 1024-page safety cap; all existing callers (anti-entropy fallback, tests) are unchanged
- Extract listKeysPage helper and keysPageResp wire type
Phase C.3 — Operations runbook:
- Add docs/operations.md covering split-brain, hint-queue overflow, rebalance under load, and replica-loss failure modes; each mode maps to the metrics that surface it
- Document observability wiring (logger/tracer/meter), drain procedure, and capacity-planning notes
Tests: dist_drain_test.go (3 cases) and dist_keys_cursor_test.go (2 cases)
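The one-way drain transition from Phase C.1 can be sketched with a compare-and-swap on an atomic flag. Everything here is illustrative: `node`, `errDraining` (a stand-in for sentinel.ErrDraining), and the `drains` counter are hypothetical reductions of the real types.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errDraining is a hypothetical stand-in for sentinel.ErrDraining.
var errDraining = errors.New("node is draining")

// node is a hypothetical single-node cache with a drain flag.
type node struct {
	draining atomic.Bool
	drains   atomic.Int64 // stand-in for the dist.drains metric
	mu       sync.RWMutex
	data     map[string]string
}

func newNode() *node { return &node{data: make(map[string]string)} }

// Drain flips the flag once; only the call that wins the CAS bumps the metric.
func (n *node) Drain() {
	if n.draining.CompareAndSwap(false, true) {
		n.drains.Add(1)
	}
}

// Set rejects writes once the node is draining.
func (n *node) Set(key, value string) error {
	if n.draining.Load() {
		return errDraining
	}
	n.mu.Lock()
	defer n.mu.Unlock()
	n.data[key] = value
	return nil
}

// Get keeps serving reads while draining.
func (n *node) Get(key string) (string, bool) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	v, ok := n.data[key]
	return v, ok
}

func main() {
	n := newNode()
	_ = n.Set("k", "v")
	n.Drain()
	n.Drain() // idempotent: the counter is not incremented again
	fmt.Println("drains:", n.drains.Load())
	fmt.Println("set after drain:", n.Set("k2", "v2"))
	v, ok := n.Get("k")
	fmt.Println("get after drain:", v, ok)
}
```

Because the flag never flips back, repeated or concurrent Drain calls are harmless, and the metric counts transitions rather than requests.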
…r binary and Docker cluster
Critical fixes in the DistMemory layer:
- factory.go: forward cfg.DistMemoryOptions to NewDistMemory; pre-fix all WithDistNode/WithDistSeeds/WithDistReplication calls were silent no-ops, leaving every node with a standalone default configuration.
- dist_memory.go: accept `id@addr` seed syntax via parseSeedSpec so the consistent-hash ring is built with real peer IDs; pre-fix, seeds were upserted with empty IDs — every node treated itself as sole owner and writes never propagated across the cluster.
- dist_memory.go: route removeImpl to owners[0] (primary), mirroring setImpl; pre-fix, replica-initiated removes skipped the primary and the value lingered until TTL.
New features:
- Add HyperCache.DistDrain(ctx) convenience method for graceful shutdown without type-asserting through the unexported backend field.
- Add production server binary at cmd/hypercache-server with multi-stage distroless Dockerfile and 12-factor HYPERCACHE_* env configuration.
- Add 5-node Docker Compose cluster (docker-compose.cluster.yml, replication=3, host ports 8081–8085 / 9081–9085).
- Add Makefile targets: start-dev-cluster / stop-dev-cluster.
- Add integration regression test for id@addr seed-spec propagation (tests/integration/dist_seed_spec_test.go).
- Add cluster smoke-test script (scripts/tests/10-test-cluster-api.sh).
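Splitting an `id@addr` seed spec is a small parsing step, sketched below. The function name `parseSeed` and its return shape are hypothetical; the real parseSeedSpec may differ in signature and validation. The sketch assumes a bare address (no `@`) yields an empty ID, which is the legacy form the fix was written to improve on.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSeed splits "id@addr" into its parts. A spec without "@" is treated
// as a bare address with an empty ID (assumed legacy form).
func parseSeed(spec string) (id, addr string) {
	spec = strings.TrimSpace(spec)
	if before, after, found := strings.Cut(spec, "@"); found {
		return before, after
	}
	return "", spec
}

func main() {
	// Example seeds; the node names and addresses are made up.
	for _, spec := range []string{"node-2@10.0.0.2:7946", "10.0.0.3:7946"} {
		id, addr := parseSeed(spec)
		fmt.Printf("spec=%q id=%q addr=%q\n", spec, id, addr)
	}
}
```

With real IDs available at startup, the hash ring can place peers under their own identities instead of collapsing every seed onto an empty ID.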
- Add `.gitleaksconfig.toml` extending the default gitleaks ruleset with
a global allowlist for config and test shell files; wire it into the
`gitleaks.yml` workflow via `GITLEAKS_CONFIG`.
- Completely rewrite `scripts/tests/10-test-cluster-api.sh` to be a
proper regression suite for the Phase D cluster bugs:
  - Replaces raw `curl` one-liners with reusable helper functions
  (`put_value`, `expect_value`, `expect_404`, `delete_key`) that assert
  both HTTP status codes and response body fields.
  - Collects all failures before exiting (non-short-circuit) so operators
  get a full report in one run.
  - Adds configurable `PORTS`, `WRITE_PORT`, and `DELETE_PORT` env vars
  for flexible local/CI overrides.
  - Phases cover: cluster propagation, wire-encoding fidelity for
  non-owner GETs, and cross-node DELETE propagation.
Replace the custom `.gitleaksconfig.toml` (which extended the default gitleaks config and defined path-based allowlists) with a `.gitleaksignore` file that allowlists specific fingerprints for known curl auth header occurrences in docker-compose and test scripts. Remove the `GITLEAKS_CONFIG` env var from the GitHub Actions workflow, allowing gitleaks to use its built-in defaults and pick up the new `.gitleaksignore` automatically.
Add a GitHub Actions workflow, Makefile target, and supporting scripts to catch cross-node bugs that in-process unit tests miss.
- .github/workflows/cluster.yml: new CI job that boots the 5-node docker-compose stack, waits for all /healthz endpoints, runs the assertion script, and dumps container logs on failure
- Makefile: add `test-cluster` target mirroring the CI flow for local development, propagating the smoke's exit code on teardown
- scripts/tests/wait-for-cluster.sh: polling helper that blocks until every node's /healthz returns 200, configurable via PORTS / TIMEOUT_SECS / POLL_INTERVAL env vars
- CHANGELOG.md: document all additions under [Unreleased]
- cspell.config.yaml: add healthz to the word list
This specifically guards against the class of regressions that escaped Phase D review: factory dropping DistMemoryOptions, seeds without node IDs producing broken rings, and json.RawMessage mis-encoding on non-owner GET requests.
Add .github/workflows/image.yml to build and publish the hypercache-server Docker image for linux/amd64 and linux/arm64 via buildx + QEMU.
Trigger behaviour:
- pull_request: build-only (no push) to catch Dockerfile regressions
- push to main: publish :main and :sha-<short>
- semver tag push (v*.*.*): publish :v1.2.3, :1.2.3, :1.2, :1, :latest
:latest is intentionally restricted to semver tag pushes so production deployments pinning :latest always resolve to a stable release rather than an in-flight main commit. GHA layer caching keeps re-builds fast when only Go source has changed.
Also replace stdlib encoding/json with github.com/goccy/go-json in dist_memory.go and integration tests, update CHANGELOG.md, and add buildx to the cspell allow-list.
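The tag fan-out for a semver push can be made concrete with a short sketch. In the workflow itself this is presumably handled by the tagging step of the CI configuration, not by Go code; `semverTags` is a hypothetical helper that only reproduces the mapping listed above for a plain `vMAJOR.MINOR.PATCH` ref.

```go
package main

import (
	"fmt"
	"strings"
)

// allDigits reports whether s is a non-empty run of ASCII digits.
func allDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// semverTags maps a git tag such as "v1.2.3" to the image tags listed in the
// commit message. It returns nil for anything that is not vMAJOR.MINOR.PATCH,
// so branch pushes never receive :latest.
func semverTags(ref string) []string {
	if !strings.HasPrefix(ref, "v") {
		return nil
	}
	version := strings.TrimPrefix(ref, "v")
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return nil
	}
	for _, p := range parts {
		if !allDigits(p) {
			return nil
		}
	}
	return []string{
		ref,                       // v1.2.3
		version,                   // 1.2.3
		parts[0] + "." + parts[1], // 1.2
		parts[0],                  // 1
		"latest",
	}
}

func main() {
	fmt.Println(semverTags("v1.2.3"))
	fmt.Println(semverTags("main"))
}
```

Pre-release suffixes such as `-rc.1` are rejected by this sketch; whether the real workflow publishes them is not stated in the commit message.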
Introduce the standalone server binary that runs a single HyperCache node with the DistMemory backend. The server exposes three HTTP listeners (client API :8080, management :8081, dist :7946), all configurable via 12-factor environment variables.
Key components:
- Fiber-based REST API with PUT/GET/DELETE for cache keys and a /v1/owners/:key visibility endpoint
- Bearer token auth middleware mirroring the dist HTTP auth posture
- Base64-aware writeValue to fix the non-owner GET asymmetry where replicated []byte values round-trip through JSON as base64 strings
- Graceful shutdown: SIGTERM/SIGINT → drain → API shutdown → cache stop
- Multi-stage Dockerfile targeting distroless for minimal image size
- Comprehensive README covering config, API usage, and local 5-node cluster via Docker Compose
- Unit tests for writeValue codec paths and decodeBase64Bytes edge cases
Also un-ignore cmd/ and bin/ in .gitignore to track the new binary.
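The non-owner GET asymmetry comes from how Go's JSON encoders treat `[]byte`: it is emitted as a base64 string, and decoding into an untyped value gives back that string, not the bytes. The sketch below reproduces the effect with the standard library and shows one possible recovery step. `roundTrip` and `tryDecodeBase64` are hypothetical names, and the fallback heuristic is an assumption, not the server's actual writeValue logic.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// roundTrip sends value through JSON the way a replicated item might travel,
// then decodes it into an untyped value as a generic receiver would.
func roundTrip(value []byte) (any, error) {
	wire, err := json.Marshal(map[string]any{"value": value})
	if err != nil {
		return nil, err
	}
	var decoded map[string]any
	if err := json.Unmarshal(wire, &decoded); err != nil {
		return nil, err
	}
	return decoded["value"], nil
}

// tryDecodeBase64 recovers raw bytes from a string that is valid standard
// base64 and falls back to the string's own bytes otherwise. Caveat: plain
// text that happens to be valid base64 is decoded too.
func tryDecodeBase64(v any) []byte {
	s, ok := v.(string)
	if !ok {
		return nil
	}
	if raw, err := base64.StdEncoding.DecodeString(s); err == nil {
		return raw
	}
	return []byte(s)
}

func main() {
	got, err := roundTrip([]byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("after JSON round trip: %T %v\n", got, got)
	fmt.Printf("after base64 recovery: %q\n", tryDecodeBase64(got))
}
```

An owner node holding the original `[]byte` never sees this, which is why the bug only surfaced on GETs served by non-owners.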
Add curl-auth-header false positive exclusions for API usage examples in cmd/hypercache-server/README.md.