fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard) by bill-ph · Pull Request #682 · PostHog/duckgres

bill-ph · 2026-06-04T19:36:49Z

Why

On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker nodes out from under running queries. 31 in-flight queries were killed, each a median ~5s after its node was tainted while the node still had ~108s of life left — and the control plane itself canceled them (3 failed health checks → K8s worker unresponsive, deleting pod), even though the workers were draining. The dominant casualty was posthog_data_import_2's DELETE FROM ducklake.posthog.events (24 of them, 21–37 min each, 0 ultimately succeeded).

The CP drains its own sessions on a roll (900s grace, unbounded — control.go:46); workers had none of that: no terminationGracePeriodSeconds (→ 30s default), no do-not-disrupt, and the health loop canceled the query the moment it couldn't reach the (draining) worker. This PR gives a busy worker the same protection the control plane already gives itself.

What

1a — karpenter.sh/do-not-disrupt on busy workers (controlplane/worker_disruption_guard.go): the CP stamps the annotation on a worker while it serves a session and clears it when idle, so Karpenter skips a node running a query. Reconciled every 5s on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool (they share its worker map). Patches only on busy↔idle transitions. Requires the new pods: patch RBAC.
1c — worker terminationGracePeriodSeconds=600 (k8s_pool.go, was unset → 30s): a real drain window on SIGTERM.
2a — health-check guard (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is already Terminating (planned node drain) is no longer marked Lost / canceled — the loop defers to the informer-driven pod-terminated path so the worker drains. Terminating is read from the existing pod informer cache — no API call, no new RBAC.

Scope / non-goals

Protects against voluntary disruption (drift/consolidation/expiry) only. Involuntary loss (spot reclaim, node/hardware failure) is unchanged; that residual tail needs transparent statement retry for commit-safe statements — a separate change.
Disruption budgets are intentionally untouched: the investigation found workers already had Drift budget = 1 and it did not prevent the kills (it paces initiation, not query survival).
The in-memory doNotDisruptApplied dedup flag resets on CP restart; the failure direction is safe (a still-busy worker is re-patched; a worker idled across the restart leaves a stale annotation that merely defers consolidation until the idle reaper retires it — it never clears a busy worker). Documented in-code with a follow-up.

Tests

unit (controlplane/worker_disruption_guard_test.go): reconciler set/clear/idempotent/skip-exiting; workerPodTerminating cache check. Plus a grace-period assertion in TestK8sPool_SpawnWorkerCreatesCorrectPod.
manifest (tests/manifests/manifests_test.go): k8s/rbac.yaml grants pods: patch.
e2e-mw-dev (tests/e2e-mw-dev/harness.sh): regression guard — a busy worker carries do-not-disrupt and an idle worker clears it; worker terminationGracePeriodSeconds=600. (Asserting Karpenter actually defers needs a real node drain, out of scope for the in-Job harness; the annotation contract is what gates Karpenter, so that is what we assert.)

🤖 Generated with Claude Code

… + health guard) On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker nodes out from under running queries: 31 in-flight queries were killed, each a median ~5s after its node was tainted while the node still had ~108s of life left. The control plane itself canceled them (3 failed health checks → "worker unresponsive"), even though the workers were draining. The CP drains its own sessions on a roll (900s grace, unbounded — control.go:46); workers had none of that. This gives a busy worker the same protection. 1a (controlplane/worker_disruption_guard.go): the CP stamps karpenter.sh/do-not-disrupt on a worker while it serves a session and clears it when idle, so Karpenter skips a node running a query. Reconciled every 5s on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool (which share its worker map). Patches only on busy<->idle transitions. Requires the new pods:patch RBAC. 1c (k8s_pool.go): worker pods set terminationGracePeriodSeconds=600 (was unset → 30s default) — a real drain window for in-flight queries on SIGTERM. 2a (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is already Terminating (planned node drain) is no longer marked Lost / canceled; the loop defers to the informer-driven pod-terminated path so the worker drains. Terminating is read from the pod informer cache — no API call, no new RBAC. Protects against voluntary disruption only. Involuntary loss (spot reclaim, node failure) is unchanged; that residual tail needs transparent statement retry for commit-safe statements (follow-up). Disruption budgets are intentionally not touched: workers already had Drift budget=1 and it did not prevent the kills. Tests: - unit: reconciler set/clear/idempotent/skip-exiting; pod-Terminating cache check; pod-spec grace assertion (TestK8sPool_SpawnWorkerCreatesCorrectPod). - manifest: k8s/rbac.yaml now grants pods:patch. - e2e-mw-dev/harness.sh: a busy worker carries do-not-disrupt and an idle worker clears it; worker terminationGracePeriodSeconds=600. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… grace dependency Holistic review against the live Karpenter config (v1.9.0, karpenter.sh/v1) surfaced two issues: 1. 2a regression: deferring "mark Lost" whenever a worker's pod was Terminating kept the worker in the pool until the pod was actually deleted. findIdle/ leastLoaded only skip workers whose done channel is closed (pod gone), so a Terminating-but-not-deleted IDLE worker stayed acquirable — a new session could be routed to a shutting-down pod, for no benefit (no query to protect). Fix: gate the deferral on activeSessions>0. Idle workers are marked Lost promptly as before; only busy workers (a query to drain) defer. 2. The duckgres-workers / -colocated / -cp NodePools all have terminationGracePeriod: None. In Karpenter v1, do-not-disrupt then blocks Drift/expiration INDEFINITELY (no forced removal), so a long/stuck/idle-held session could wedge a security AMI roll. The do-not-disrupt change (1a) MUST ship with a NodePool terminationGracePeriod ceiling. Documented as a blocking companion infra change in the PR; not fixable in this repo (ArgoCD-managed karpenter-config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bill-ph · 2026-06-04T19:59:57Z

Holistic validation against the live Karpenter/k8s config

Checked the fix against the actual prod Karpenter (controller:1.9.0, karpenter.sh/v1) and NodePool specs, not just the code.

✅ Confirmed correct

1a mechanism: Karpenter v1 honors pod-level karpenter.sh/do-not-disrupt — a node with such a pod is excluded from voluntary disruption (Drift/consolidation/expiry) candidates. So stamping it on busy workers does block the Drift that caused this incident.
Budget interaction: live budgets are [{nodes:"1",reasons:["Drifted"]}, {nodes:"100%",reasons:["Empty"]}] (confirms the investigation). do-not-disrupt nodes are filtered from candidates before budget accounting, so a blocked busy node doesn't consume the Drift=1 slot — the roll still progresses on idle nodes and waits on busy ones. No deadlock.
consolidationPolicy: WhenEmpty (not Underutilized), so consolidation only touches empty nodes (no do-not-disrupt pods anyway).

⚠️ Fixed in `8afdf27` (2a regression)

Deferring "mark Lost" on any Terminating pod kept the worker acquirable (findIdle/leastLoaded only skip workers whose done channel is closed = pod fully gone). An idle worker being drained would stay in the pool and could receive a new session on a shutting-down pod. Now gated on activeSessions>0 — only busy workers (an actual query to drain) defer; idle workers are marked Lost promptly as before.

🚩 BLOCKING companion infra change (not fixable in this repo)

All three NodePools have terminationGracePeriod: None (+ expireAfter: 720h). In Karpenter v1, with no terminationGracePeriod, do-not-disrupt blocks Drift/expiration indefinitely — Karpenter never force-removes the node. So 1a as-is can let a long/stuck/idle-held session wedge a security AMI roll forever. The NodePools are ArgoCD-managed (karpenter-config-managed-warehouse-prod-us), so this must land there, alongside this PR:

# duckgres-workers, duckgres-workers-colocated (and optionally duckgres-cp)
spec:
  template:
    spec:
      terminationGracePeriod: 1h   # do-not-disrupt holds a busy node, but a
                                   # drift/security roll force-completes after 1h.
                                   # Tune to cover P99 query vs roll-urgency.

This is the "bounded, never blocked" ceiling: a busy worker's query gets up to ~1h of protection during a drift, then the node is force-drained (pods get their terminationGracePeriodSeconds=600 window within it). 1a should not be enabled in prod until this is set, or do-not-disrupt becomes an availability risk during the next CVE roll.

Not yet covered (as designed)

Involuntary loss (spot reclaim / node failure) still kills the in-flight query; 2a only softens it (lets it drain within grace). The durable answer is transparent statement retry for commit-safe statements — separate change.
The annotation tracks active sessions, not active queries, so an idle-held connection keeps a node protected (safe direction; bounded by the terminationGracePeriod ceiling above).

bill-ph · 2026-06-04T20:10:24Z

Blocking companion infra PR is now open: PostHog/charts#11756 — adds terminationGracePeriod: 1h to the duckgres-workers / duckgres-workers-colocated NodePools (the ceiling that bounds do-not-disrupt). Roll out together; this one should not reach prod without charts#11756, or do-not-disrupt becomes an availability risk on the next CVE/AMI roll.

… failover) The reconciler tracked applied-state in an in-memory ManagedWorker field. That loses the orphan-clear case on CP death/failover: when a CP dies, its sessions die with it (worker goes idle) but the worker pod keeps the do-not-disrupt annotation the dead CP stamped. A surviving/replacement CP adopts the worker with applied=false and activeSessions=0, so applied==busy==false and the reconciler skips it forever — leaving an orphaned annotation that suppresses Karpenter consolidation until the idle reaper happens to delete the pod. Reconcile against the pod's ACTUAL annotation read from the pod informer cache (new podHasDoNotDisrupt, no API call) instead of an in-memory flag. Any CP now self-corrects: it sees desired=idle vs current=annotated and clears the orphan. Removes the ManagedWorker field entirely (worker_mgr.go back to unchanged). Also gates the 2a health-check deferral on activeSessions>0 (prior commit), so a Terminating *idle* worker is still marked Lost promptly rather than lingering in the acquirable pool. New regression test TestReconcileDisruptionGuardsClearsStaleAnnotationAfterFailover covers the orphan path; steady-state still issues zero patches. Note (no code change): verified against Karpenter v1 docs that spec.template.spec.terminationGracePeriod DOES make Drift bypass do-not-disrupt after the grace ("a node may be disrupted via drift even if there are pods with ... the karpenter.sh/do-not-disrupt annotation"), so the charts#11756 ceiling correctly bounds the hold for AMI/CVE drift — not just expiration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bill-ph · 2026-06-04T20:33:23Z

Follow-up from review (commit 5d99d79):

CP-failover orphan (real gap, fixed). The reconciler tracked applied-state in an in-memory field. On CP death the worker's sessions die (worker goes idle) but its pod keeps the annotation; a replacement CP adopted it with applied=false, activeSessions=0 → applied==busy → skipped forever, orphaning the annotation (suppressing consolidation until the idle reaper deletes the pod). Now the reconciler compares busy against the pod's actual annotation from the informer cache (podHasDoNotDisrupt, no API call), so any CP self-corrects and clears the orphan. In-memory field removed (worker_mgr.go back to unchanged). New test ...ClearsStaleAnnotationAfterFailover.

Karpenter semantics confirmed. Per the v1 disruption docs, terminationGracePeriod makes Drift (not just expiration) bypass do-not-disrupt/PDBs after the grace — so charts#11756's 1h bounds the hold for an AMI/CVE drift roll, exactly the goal.

Loops reconcileDisruptionGuards + workerPodTerminating against 16 goroutines flipping activeSessions under the pool lock (as Acquire/ReleaseWorker do). Asserts the snapshot-under-RLock / patch-without-lock discipline is race-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lity + scheduling) Follow-up to the dual-model review of #682. BLOCKING — relabel adopted pods so the informer sees them The pod informer is label-scoped to duckgres/control-plane=<cpID> (cpID = os.Hostname(), per CP pod). adoptClaimedWorker did not relabel, so after a CP restart/rollout an adopted worker kept the dead CP's label and was invisible to the new CP's informer — defeating podHasDoNotDisrupt (stale annotation never cleared; busy adopted worker patched every 5s forever), workerPodTerminating (2a never protected adopted workers), and informer-driven cleanup. Now adoptClaimedWorker merge-patches duckgres/control-plane to this CP. should-fix — don't hand a new session to a Terminating worker isGenericSessionSchedulableWorkerLocked and OrgReservedPool. workerReadyForSchedulingLocked now exclude workers whose pod is Terminating, so the 2a deferral can't leave a shutting-down pod schedulable once its session releases. Codex P1 follow-up: the runtime-store claim path bypasses those predicates, so adoptClaimedWorker also rejects a claim whose pod is already Terminating (routes into the existing claim-retire/fallback). should-fix — shrink the reconcile race window 5s -> 2s (eager-stamp on the 0->1 session transition deferred: it touches the hot acquire path on both pools, and with the terminationGracePeriod ceiling a disruption inside the window is non-fatal anyway.) test — harness assert_drain_aware_eviction attributes the annotation to its OWN held session via a baseline diff (not "first annotated pod"), and drops $(seq) for a POSIX while-loop (#!/bin/sh + set -eu). New unit test TestRelabelAdoptedPodToThisCP; reconcile/terminating/-race tests updated. Charts rollout-ordering + 1h-semantics documented in PostHog/charts#11756. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bill-ph · 2026-06-04T23:18:28Z

Dual-model review — findings resolved (commit `295e6ef`)

Ran a second independent review (Claude + Codex). Net: 1 blocking + several should-fix, all addressed.

🔴 BLOCKING — adopted workers were invisible to the informer (Codex-found, verified)
The pod informer is label-scoped to duckgres/control-plane=<cpID> and cpID = os.Hostname() (per CP pod), but adoptClaimedWorker didn't relabel. After a CP restart/rollout, adopted workers kept the dead CP's label → the new CP's informer couldn't see them → the failover orphan-clear silently didn't work, busy adopted workers got a PATCH every 5s forever, and 2a never protected them. Fix: adoptClaimedWorker now merge-patches duckgres/control-plane to the adopting CP. (Safe during rollout overlap: the old CP observes the relabel as a delete → drops the worker via the stale-lease path, no pod delete.)

🟡 should-fix

New sessions could land on a Terminating worker → added a Terminating check to both pools' scheduler predicates, and (Codex P1) adoptClaimedWorker rejects a claim whose pod is already Terminating (the claim path bypassed the predicates).
Reconcile race window 5s→2s (full eager-stamp on 0→1 deferred — hot-path; the ceiling makes a mid-window disruption non-fatal).
Harness: assertion now attributes the annotation to its own held session (baseline diff, not "first annotated pod"); dropped $(seq) for a POSIX while-loop.

charts#11756 got a rollout-ordering note (NodeClaims snapshot terminationGracePeriod at creation → cycle worker NodeClaims before enabling this) + 1h-semantics clarification.

Verified: go build -tags kubernetes ./..., unit tests incl. a new TestRelabelAdoptedPodToThisCP and the failover regression, all green under -race; harness sh -n clean.

bill-ph and others added 2 commits June 4, 2026 15:36

bill-ph and others added 2 commits June 4, 2026 16:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard)#682

fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard)#682
bill-ph wants to merge 5 commits into
mainfrom
claude/worker-drain-aware-eviction

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bill-ph commented Jun 4, 2026

Why

What

Scope / non-goals

Tests

Uh oh!

bill-ph commented Jun 4, 2026

Holistic validation against the live Karpenter/k8s config

✅ Confirmed correct

⚠️ Fixed in 8afdf27 (2a regression)

🚩 BLOCKING companion infra change (not fixable in this repo)

Not yet covered (as designed)

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Uh oh!

bill-ph commented Jun 4, 2026

Dual-model review — findings resolved (commit 295e6ef)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚠️ Fixed in `8afdf27` (2a regression)

Dual-model review — findings resolved (commit `295e6ef`)