From 3ad1b83792109e3ec81922bf1230f548f5202f59 Mon Sep 17 00:00:00 2001
From: Sam Xu <xcjsam@g.ucla.edu>
Date: Sun, 31 May 2026 15:50:49 +0800
Subject: [PATCH] docs(runbooks): mark pg-pool #459 follow-ups shipped (chunk
 fanout + health/db)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The "Still TODO (post-#455)" section listed the summarizer-fanout chunking
and the /api/health/db probe as future work. Both shipped in #459, so update
the runbook to match and add the probe command plus the saturation-signal
semantics (503 only when waiting > 0 AND idle === 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
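Reviewer note: the chunked fanout the runbook now describes can be sketched
roughly as below. This is a minimal illustration, not the code in
`SchedulerService.dispatchPodSummaryRequests`. `dispatchInBatches`, `sleep`,
and the `enqueue` callback are hypothetical names; only the two defaults
(batch size 10, pause 500ms) come from the runbook text.

```typescript
// Hypothetical helper: resolves after roughly `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical sketch of a batched fanout. Returns the number of batches sent.
async function dispatchInBatches<T>(
  items: readonly T[],
  enqueue: (item: T) => Promise<void>,
  batchSize = 10, // mirrors the SUMMARIZER_FANOUT_BATCH_SIZE default
  pauseMs = 500, // mirrors the SUMMARIZER_FANOUT_BATCH_PAUSE_MS default
): Promise<number> {
  if (batchSize < 1) throw new RangeError("batchSize must be >= 1");
  let batches = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    // Pause between batches only, never before the first one.
    if (start > 0) await sleep(pauseMs);
    const batch = items.slice(start, start + batchSize);
    // At most `batchSize` enqueue calls are in flight at any moment.
    await Promise.all(batch.map(enqueue));
    batches += 1;
  }
  return batches;
}
```

With 60 items and the defaults this yields 6 batches separated by five 500ms
pauses, so the burst is spread over about 2.5s plus enqueue time.
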
 docs/runbooks/pg-pool-exhaustion.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)
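
Reviewer note: the saturation rule documented for `/api/health/db` reduces to
a single predicate. The sketch below assumes the field names listed in the
runbook (`max`, `total`, `idle`, `waiting`). `PoolStats` and
`healthStatusCode` are hypothetical names and do not claim to match
`backend/routes/health.ts`.

```typescript
// Hypothetical shape: a subset of the fields the runbook lists under `pg`.
interface PoolStats {
  max: number;
  total: number;
  idle: number;
  waiting: number;
}

// 503 only when callers are queued AND no idle connection can serve them.
// Queued callers with idle connections still available count as transient.
function healthStatusCode(stats: PoolStats): 200 | 503 {
  const saturated = stats.waiting > 0 && stats.idle === 0;
  return saturated ? 503 : 200;
}
```

For example, `{ max: 50, total: 50, idle: 0, waiting: 3 }` maps to 503, while
the same pool with `idle: 2` maps to 200.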

diff --git a/docs/runbooks/pg-pool-exhaustion.md b/docs/runbooks/pg-pool-exhaustion.md
index 245cf939..6bcb083d 100644
--- a/docs/runbooks/pg-pool-exhaustion.md
+++ b/docs/runbooks/pg-pool-exhaustion.md
@@ -59,13 +59,22 @@ Applied 2026-05-26 in PR #455 (`backend/config/db-pg.ts`):
 
 With this, an exhausted pool fails fast as a 5xx instead of hanging — user sees an error, on-call sees an alert, response is actionable.
 
-## Still TODO (post-#455)
+## Follow-ups shipped in #459 (2026-05-31)
 
-- **Audit summarizer + heartbeat dispatch concurrency** — burst-rate-limit the per-event PG calls so a 60-pod fanout doesn't claim 60 slots simultaneously. Chunking by 10 with `await Promise.all` per batch is the natural shape.
-- **Add `/api/health/db` probe** that returns `pool.idleCount` + `pool.waitingCount` and alerts when waiting > 5 for >30s. Would have caught this before user impact.
+Both items below were "still TODO" after #455; PR #459 shipped them:
+
+- **Summarizer fanout chunked** — `SchedulerService.dispatchPodSummaryRequests` now enqueues in batches of `SUMMARIZER_FANOUT_BATCH_SIZE` (default 10) with a `SUMMARIZER_FANOUT_BATCH_PAUSE_MS` gap (default 500ms) between batches, instead of a bare `Promise.all` over all installations. For 60 pods that is 6 batches with five 500ms pauses, spreading the burst across roughly 2.5-3s so the consumer side never claims all pool slots at once.
+- **`/api/health/db` probe added** — returns `{ pg: { max, total, idle, waiting, connectionTimeoutMillis } }` with no `SELECT` round-trip, so it is safe to scrape every 10s. Returns **503 only when `waiting > 0 AND idle === 0`** (true saturation); transient queueing (`waiting > 0` while `idle > 0`) returns 200 to avoid alert noise. Code: `backend/routes/health.ts`.
+
+Probe it:
+
+```bash
+kubectl exec -n commonly-dev deploy/backend -- curl -sS http://localhost:5000/api/health/db
+# → {"pg":{"status":"ok","max":50,"total":N,"idle":N,"waiting":0,...}}
+```
 
 ## Related
 
 - Incident issue: [#454](https://github.com/Team-Commonly/commonly/issues/454)
-- Fix PR: [#455](https://github.com/Team-Commonly/commonly/pull/455)
-- Code: `backend/config/db-pg.ts` (pool config), `backend/controllers/podController.ts:199-227` (getAllPods PG call site), `backend/services/summarizerService.ts` (likely surge source)
+- Fix PRs: [#455](https://github.com/Team-Commonly/commonly/pull/455) (pool ceiling), [#459](https://github.com/Team-Commonly/commonly/pull/459) (fanout chunk + health probe)
+- Code: `backend/config/db-pg.ts` (pool config), `backend/controllers/podController.ts:199-227` (getAllPods PG call site), `backend/services/schedulerService.ts` (`dispatchPodSummaryRequests`, the surge source, now chunked), `backend/routes/health.ts` (`/api/health/db`)