Skip to content

fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard)#682

Open
bill-ph wants to merge 5 commits into
mainfrom
claude/worker-drain-aware-eviction
Open

fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard)#682
bill-ph wants to merge 5 commits into
mainfrom
claude/worker-drain-aware-eviction

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented Jun 4, 2026

Why

On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker nodes out from under running queries. 31 in-flight queries were killed, each a median ~5s after its node was tainted while the node still had ~108s of life left — and the control plane itself canceled them (3 failed health checks → K8s worker unresponsive, deleting pod), even though the workers were draining. The dominant casualty was posthog_data_import_2's DELETE FROM ducklake.posthog.events (24 of them, 21–37 min each, 0 ultimately succeeded).

The CP drains its own sessions on a roll (900s grace, unbounded — control.go:46); workers had none of that: no terminationGracePeriodSeconds (→ 30s default), no do-not-disrupt, and the health loop canceled the query the moment it couldn't reach the (draining) worker. This PR gives a busy worker the same protection the control plane already gives itself.

What

  • 1a — karpenter.sh/do-not-disrupt on busy workers (controlplane/worker_disruption_guard.go): the CP stamps the annotation on a worker while it serves a session and clears it when idle, so Karpenter skips a node running a query. Reconciled every 5s on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool (they share its worker map). Patches only on busy↔idle transitions. Requires the new pods: patch RBAC.
  • 1c — worker terminationGracePeriodSeconds=600 (k8s_pool.go, was unset → 30s): a real drain window on SIGTERM.
  • 2a — health-check guard (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is already Terminating (planned node drain) is no longer marked Lost / canceled — the loop defers to the informer-driven pod-terminated path so the worker drains. Terminating is read from the existing pod informer cache — no API call, no new RBAC.

Scope / non-goals

  • Protects against voluntary disruption (drift/consolidation/expiry) only. Involuntary loss (spot reclaim, node/hardware failure) is unchanged; that residual tail needs transparent statement retry for commit-safe statements — a separate change.
  • Disruption budgets are intentionally untouched: the investigation found workers already had Drift budget = 1 and it did not prevent the kills (it paces initiation, not query survival).
  • The in-memory doNotDisruptApplied dedup flag resets on CP restart; the failure direction is safe (a still-busy worker is re-patched; a worker idled across the restart leaves a stale annotation that merely defers consolidation until the idle reaper retires it — it never clears a busy worker). Documented in-code with a follow-up.

Tests

  • unit (controlplane/worker_disruption_guard_test.go): reconciler set/clear/idempotent/skip-exiting; workerPodTerminating cache check. Plus a grace-period assertion in TestK8sPool_SpawnWorkerCreatesCorrectPod.
  • manifest (tests/manifests/manifests_test.go): k8s/rbac.yaml grants pods: patch.
  • e2e-mw-dev (tests/e2e-mw-dev/harness.sh): regression guard — a busy worker carries do-not-disrupt and an idle worker clears it; worker terminationGracePeriodSeconds=600. (Asserting Karpenter actually defers needs a real node drain, out of scope for the in-Job harness; the annotation contract is what gates Karpenter, so that is what we assert.)

🤖 Generated with Claude Code

bill-ph and others added 2 commits June 4, 2026 15:36
… + health guard)

On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker
nodes out from under running queries: 31 in-flight queries were killed, each a
median ~5s after its node was tainted while the node still had ~108s of life
left. The control plane itself canceled them (3 failed health checks → "worker
unresponsive"), even though the workers were draining. The CP drains its own
sessions on a roll (900s grace, unbounded — control.go:46); workers had none of
that. This gives a busy worker the same protection.

1a (controlplane/worker_disruption_guard.go): the CP stamps
   karpenter.sh/do-not-disrupt on a worker while it serves a session and clears
   it when idle, so Karpenter skips a node running a query. Reconciled every 5s
   on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool
   (which share its worker map). Patches only on busy<->idle transitions.
   Requires the new pods:patch RBAC.
1c (k8s_pool.go): worker pods set terminationGracePeriodSeconds=600 (was unset
   → 30s default) — a real drain window for in-flight queries on SIGTERM.
2a (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is
   already Terminating (planned node drain) is no longer marked Lost / canceled;
   the loop defers to the informer-driven pod-terminated path so the worker
   drains. Terminating is read from the pod informer cache — no API call, no new
   RBAC.

Protects against voluntary disruption only. Involuntary loss (spot reclaim, node
failure) is unchanged; that residual tail needs transparent statement retry for
commit-safe statements (follow-up). Disruption budgets are intentionally not
touched: workers already had Drift budget=1 and it did not prevent the kills.

Tests:
- unit: reconciler set/clear/idempotent/skip-exiting; pod-Terminating cache check;
  pod-spec grace assertion (TestK8sPool_SpawnWorkerCreatesCorrectPod).
- manifest: k8s/rbac.yaml now grants pods:patch.
- e2e-mw-dev/harness.sh: a busy worker carries do-not-disrupt and an idle worker
  clears it; worker terminationGracePeriodSeconds=600.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… grace dependency

Holistic review against the live Karpenter config (v1.9.0, karpenter.sh/v1)
surfaced two issues:

1. 2a regression: deferring "mark Lost" whenever a worker's pod was Terminating
   kept the worker in the pool until the pod was actually deleted. findIdle/
   leastLoaded only skip workers whose done channel is closed (pod gone), so a
   Terminating-but-not-deleted IDLE worker stayed acquirable — a new session
   could be routed to a shutting-down pod, for no benefit (no query to protect).
   Fix: gate the deferral on activeSessions>0. Idle workers are marked Lost
   promptly as before; only busy workers (a query to drain) defer.

2. The duckgres-workers / -colocated / -cp NodePools all have
   terminationGracePeriod: None. In Karpenter v1, do-not-disrupt then blocks
   Drift/expiration INDEFINITELY (no forced removal), so a long/stuck/idle-held
   session could wedge a security AMI roll. The do-not-disrupt change (1a) MUST
   ship with a NodePool terminationGracePeriod ceiling. Documented as a blocking
   companion infra change in the PR; not fixable in this repo (ArgoCD-managed
   karpenter-config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bill-ph
Copy link
Copy Markdown
Collaborator Author

bill-ph commented Jun 4, 2026

Holistic validation against the live Karpenter/k8s config

Checked the fix against the actual prod Karpenter (controller:1.9.0, karpenter.sh/v1) and NodePool specs, not just the code.

✅ Confirmed correct

  • 1a mechanism: Karpenter v1 honors pod-level karpenter.sh/do-not-disrupt — a node with such a pod is excluded from voluntary disruption (Drift/consolidation/expiry) candidates. So stamping it on busy workers does block the Drift that caused this incident.
  • Budget interaction: live budgets are [{nodes:"1",reasons:["Drifted"]}, {nodes:"100%",reasons:["Empty"]}] (confirms the investigation). do-not-disrupt nodes are filtered from candidates before budget accounting, so a blocked busy node doesn't consume the Drift=1 slot — the roll still progresses on idle nodes and waits on busy ones. No deadlock.
  • consolidationPolicy: WhenEmpty (not Underutilized), so consolidation only touches empty nodes (no do-not-disrupt pods anyway).

⚠️ Fixed in 8afdf27 (2a regression)

Deferring "mark Lost" on any Terminating pod kept the worker acquirable (findIdle/leastLoaded only skip workers whose done channel is closed = pod fully gone). An idle worker being drained would stay in the pool and could receive a new session on a shutting-down pod. Now gated on activeSessions>0 — only busy workers (an actual query to drain) defer; idle workers are marked Lost promptly as before.

🚩 BLOCKING companion infra change (not fixable in this repo)

All three NodePools have terminationGracePeriod: None (+ expireAfter: 720h). In Karpenter v1, with no terminationGracePeriod, do-not-disrupt blocks Drift/expiration indefinitely — Karpenter never force-removes the node. So 1a as-is can let a long/stuck/idle-held session wedge a security AMI roll forever. The NodePools are ArgoCD-managed (karpenter-config-managed-warehouse-prod-us), so this must land there, alongside this PR:

# duckgres-workers, duckgres-workers-colocated (and optionally duckgres-cp)
spec:
  template:
    spec:
      terminationGracePeriod: 1h   # do-not-disrupt holds a busy node, but a
                                   # drift/security roll force-completes after 1h.
                                   # Tune to cover P99 query vs roll-urgency.

This is the "bounded, never blocked" ceiling: a busy worker's query gets up to ~1h of protection during a drift, then the node is force-drained (pods get their terminationGracePeriodSeconds=600 window within it). 1a should not be enabled in prod until this is set, or do-not-disrupt becomes an availability risk during the next CVE roll.

Not yet covered (as designed)

  • Involuntary loss (spot reclaim / node failure) still kills the in-flight query; 2a only softens it (lets it drain within grace). The durable answer is transparent statement retry for commit-safe statements — separate change.
  • The annotation tracks active sessions, not active queries, so an idle-held connection keeps a node protected (safe direction; bounded by the terminationGracePeriod ceiling above).

@bill-ph
Copy link
Copy Markdown
Collaborator Author

bill-ph commented Jun 4, 2026

Blocking companion infra PR is now open: PostHog/charts#11756 — adds terminationGracePeriod: 1h to the duckgres-workers / duckgres-workers-colocated NodePools (the ceiling that bounds do-not-disrupt). Roll out together; this one should not reach prod without charts#11756, or do-not-disrupt becomes an availability risk on the next CVE/AMI roll.

… failover)

The reconciler tracked applied-state in an in-memory ManagedWorker field. That
loses the orphan-clear case on CP death/failover: when a CP dies, its sessions
die with it (worker goes idle) but the worker pod keeps the do-not-disrupt
annotation the dead CP stamped. A surviving/replacement CP adopts the worker
with applied=false and activeSessions=0, so applied==busy==false and the
reconciler skips it forever — leaving an orphaned annotation that suppresses
Karpenter consolidation until the idle reaper happens to delete the pod.

Reconcile against the pod's ACTUAL annotation read from the pod informer cache
(new podHasDoNotDisrupt, no API call) instead of an in-memory flag. Any CP now
self-corrects: it sees desired=idle vs current=annotated and clears the orphan.
Removes the ManagedWorker field entirely (worker_mgr.go back to unchanged).

Also gates the 2a health-check deferral on activeSessions>0 (prior commit), so a
Terminating *idle* worker is still marked Lost promptly rather than lingering in
the acquirable pool.

New regression test TestReconcileDisruptionGuardsClearsStaleAnnotationAfterFailover
covers the orphan path; steady-state still issues zero patches.

Note (no code change): verified against Karpenter v1 docs that
spec.template.spec.terminationGracePeriod DOES make Drift bypass do-not-disrupt
after the grace ("a node may be disrupted via drift even if there are pods with
... the karpenter.sh/do-not-disrupt annotation"), so the charts#11756 ceiling
correctly bounds the hold for AMI/CVE drift — not just expiration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bill-ph
Copy link
Copy Markdown
Collaborator Author

bill-ph commented Jun 4, 2026

Follow-up from review (commit 5d99d79):

CP-failover orphan (real gap, fixed). The reconciler tracked applied-state in an in-memory field. On CP death the worker's sessions die (worker goes idle) but its pod keeps the annotation; a replacement CP adopted it with applied=false, activeSessions=0applied==busy → skipped forever, orphaning the annotation (suppressing consolidation until the idle reaper deletes the pod). Now the reconciler compares busy against the pod's actual annotation from the informer cache (podHasDoNotDisrupt, no API call), so any CP self-corrects and clears the orphan. In-memory field removed (worker_mgr.go back to unchanged). New test ...ClearsStaleAnnotationAfterFailover.

Karpenter semantics confirmed. Per the v1 disruption docs, terminationGracePeriod makes Drift (not just expiration) bypass do-not-disrupt/PDBs after the grace — so charts#11756's 1h bounds the hold for an AMI/CVE drift roll, exactly the goal.

bill-ph and others added 2 commits June 4, 2026 16:42
Loops reconcileDisruptionGuards + workerPodTerminating against 16 goroutines
flipping activeSessions under the pool lock (as Acquire/ReleaseWorker do).
Asserts the snapshot-under-RLock / patch-without-lock discipline is race-free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lity + scheduling)

Follow-up to the dual-model review of #682.

BLOCKING — relabel adopted pods so the informer sees them
  The pod informer is label-scoped to duckgres/control-plane=<cpID> (cpID =
  os.Hostname(), per CP pod). adoptClaimedWorker did not relabel, so after a CP
  restart/rollout an adopted worker kept the dead CP's label and was invisible to
  the new CP's informer — defeating podHasDoNotDisrupt (stale annotation never
  cleared; busy adopted worker patched every 5s forever), workerPodTerminating
  (2a never protected adopted workers), and informer-driven cleanup. Now
  adoptClaimedWorker merge-patches duckgres/control-plane to this CP.

should-fix — don't hand a new session to a Terminating worker
  isGenericSessionSchedulableWorkerLocked and OrgReservedPool.
  workerReadyForSchedulingLocked now exclude workers whose pod is Terminating, so
  the 2a deferral can't leave a shutting-down pod schedulable once its session
  releases. Codex P1 follow-up: the runtime-store claim path bypasses those
  predicates, so adoptClaimedWorker also rejects a claim whose pod is already
  Terminating (routes into the existing claim-retire/fallback).

should-fix — shrink the reconcile race window 5s -> 2s
  (eager-stamp on the 0->1 session transition deferred: it touches the hot
  acquire path on both pools, and with the terminationGracePeriod ceiling a
  disruption inside the window is non-fatal anyway.)

test — harness assert_drain_aware_eviction attributes the annotation to its OWN
  held session via a baseline diff (not "first annotated pod"), and drops
  $(seq) for a POSIX while-loop (#!/bin/sh + set -eu).

New unit test TestRelabelAdoptedPodToThisCP; reconcile/terminating/-race tests
updated. Charts rollout-ordering + 1h-semantics documented in PostHog/charts#11756.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bill-ph
Copy link
Copy Markdown
Collaborator Author

bill-ph commented Jun 4, 2026

Dual-model review — findings resolved (commit 295e6ef)

Ran a second independent review (Claude + Codex). Net: 1 blocking + several should-fix, all addressed.

🔴 BLOCKING — adopted workers were invisible to the informer (Codex-found, verified)
The pod informer is label-scoped to duckgres/control-plane=<cpID> and cpID = os.Hostname() (per CP pod), but adoptClaimedWorker didn't relabel. After a CP restart/rollout, adopted workers kept the dead CP's label → the new CP's informer couldn't see them → the failover orphan-clear silently didn't work, busy adopted workers got a PATCH every 5s forever, and 2a never protected them. Fix: adoptClaimedWorker now merge-patches duckgres/control-plane to the adopting CP. (Safe during rollout overlap: the old CP observes the relabel as a delete → drops the worker via the stale-lease path, no pod delete.)

🟡 should-fix

  • New sessions could land on a Terminating worker → added a Terminating check to both pools' scheduler predicates, and (Codex P1) adoptClaimedWorker rejects a claim whose pod is already Terminating (the claim path bypassed the predicates).
  • Reconcile race window 5s→2s (full eager-stamp on 0→1 deferred — hot-path; the ceiling makes a mid-window disruption non-fatal).
  • Harness: assertion now attributes the annotation to its own held session (baseline diff, not "first annotated pod"); dropped $(seq) for a POSIX while-loop.

charts#11756 got a rollout-ordering note (NodeClaims snapshot terminationGracePeriod at creation → cycle worker NodeClaims before enabling this) + 1h-semantics clarification.

Verified: go build -tags kubernetes ./..., unit tests incl. a new TestRelabelAdoptedPodToThisCP and the failover regression, all green under -race; harness sh -n clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant