From 3ad1b83792109e3ec81922bf1230f548f5202f59 Mon Sep 17 00:00:00 2001
From: Sam Xu
Date: Sun, 31 May 2026 15:50:49 +0800
Subject: [PATCH] docs(runbooks): mark pg-pool #459 follow-ups shipped (chunk fanout + health/db)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The "Still TODO (post-#455)" section listed the summarizer-fanout
chunking and the /api/health/db probe as future work. Both shipped in
#459. Update the runbook to match, and add the probe command and the
saturation-signal semantics (503 only when waiting > 0 AND idle === 0).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/runbooks/pg-pool-exhaustion.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/docs/runbooks/pg-pool-exhaustion.md b/docs/runbooks/pg-pool-exhaustion.md
index 245cf939..6bcb083d 100644
--- a/docs/runbooks/pg-pool-exhaustion.md
+++ b/docs/runbooks/pg-pool-exhaustion.md
@@ -59,13 +59,22 @@ Applied 2026-05-26 in PR #455 (`backend/config/db-pg.ts`):
 
 With this, an exhausted pool fails fast as a 5xx instead of hanging — user sees an error, on-call sees an alert, response is actionable.
 
-## Still TODO (post-#455)
+## Follow-ups shipped in #459 (2026-05-31)
 
-- **Audit summarizer + heartbeat dispatch concurrency** — burst-rate-limit the per-event PG calls so a 60-pod fanout doesn't claim 60 slots simultaneously. Chunking by 10 with `await Promise.all` per batch is the natural shape.
-- **Add `/api/health/db` probe** that returns `pool.idleCount` + `pool.waitingCount` and alerts when waiting > 5 for >30s. Would have caught this before user impact.
+Both items below were "still TODO" after #455; PR #459 shipped them:
+
+- **Summarizer fanout chunked**: `SchedulerService.dispatchPodSummaryRequests` now enqueues in batches of `SUMMARIZER_FANOUT_BATCH_SIZE` (default 10) with a `SUMMARIZER_FANOUT_BATCH_PAUSE_MS` gap (default 500ms) between batches, instead of a bare `Promise.all` over all installations. For 60 pods that spreads the burst across ~3s, so the consumer side never claims all pool slots at once.
+- **`/api/health/db` probe added**: returns `{ pg: { max, total, idle, waiting, connectionTimeoutMillis } }` with NO `SELECT` round-trip (safe to scrape every 10s). Returns **503 only when `waiting > 0 AND idle === 0`** (true saturation); transient queueing (`waiting > 0` while `idle > 0`) returns 200 to avoid alert noise. Code: `backend/routes/health.ts`.
+
+Probe it:
+
+```bash
+kubectl exec -n commonly-dev deploy/backend -- curl -sS http://localhost:5000/api/health/db
+# → {"pg":{"status":"ok","max":50,"total":N,"idle":N,"waiting":0,...}}
+```
 
 ## Related
 
 - Incident issue: [#454](https://github.com/Team-Commonly/commonly/issues/454)
-- Fix PR: [#455](https://github.com/Team-Commonly/commonly/pull/455)
-- Code: `backend/config/db-pg.ts` (pool config), `backend/controllers/podController.ts:199-227` (getAllPods PG call site), `backend/services/summarizerService.ts` (likely surge source)
+- Fix PRs: [#455](https://github.com/Team-Commonly/commonly/pull/455) (pool ceiling), [#459](https://github.com/Team-Commonly/commonly/pull/459) (fanout chunk + health probe)
+- Code: `backend/config/db-pg.ts` (pool config), `backend/controllers/podController.ts:199-227` (getAllPods PG call site), `backend/services/schedulerService.ts` (`dispatchPodSummaryRequests`, the now-chunked surge source), `backend/routes/health.ts` (`/api/health/db`)
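
Illustrative sketch of the two behaviours the runbook text describes: the 503-only-on-saturation rule and the batched fanout with a pause between batches. This is not the code in `backend/routes/health.ts` or `backend/services/schedulerService.ts`, which this patch does not show. `PoolStats`, `healthStatusCode`, `sleep`, and `dispatchInBatches` are hypothetical names picked for the example, and the defaults of 10 and 500 simply mirror the values quoted in the runbook.

```typescript
// Hypothetical shape: a subset of the fields the runbook lists for the probe.
interface PoolStats {
  max: number;
  total: number;
  idle: number;
  waiting: number;
}

// 503 only on true saturation: callers are queued AND no connection is idle.
// Queued callers with idle connections still available count as transient.
function healthStatusCode(stats: PoolStats): 200 | 503 {
  return stats.waiting > 0 && stats.idle === 0 ? 503 : 200;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` over `items` in fixed-size batches, so at most `batchSize`
// tasks are in flight at once, with a pause between consecutive batches.
// Returns the number of batches that were dispatched.
async function dispatchInBatches<T>(
  items: readonly T[],
  task: (item: T) => Promise<void>,
  batchSize = 10,
  pauseMs = 500,
): Promise<number> {
  let batches = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    if (start > 0) await sleep(pauseMs); // gap before every batch except the first
    await Promise.all(items.slice(start, start + batchSize).map(task));
    batches += 1;
  }
  return batches;
}
```

With 60 items and these defaults the loop runs six batches separated by five 500 ms pauses, so about 2.5 s of pause plus the time the tasks themselves take, which is in line with the "~3s" figure in the runbook.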