fix(db-pg): bump pool.max to 50 + add connectionTimeoutMillis=5s (#454) by samxu01 · Pull Request #455 · Team-Commonly/commonly

samxu01 · 2026-05-26T11:33:00Z

Summary

Backend went unresponsive on app-dev 2026-05-26 — xcjsam couldn't load any pods, /api/pods + /api/messages hung indefinitely (60s+ TTFB, 0 bytes). Workaround was `kubectl rollout restart deploy/backend`. This PR is the structural fix.

Root cause

pg.Pool defaults to max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool. Every subsequent pool.query() — including user-facing getAllPods — waits forever on connection acquire instead of failing fast. UI shows perpetual loading with no diagnostic signal.
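
The queueing behaviour described above can be illustrated with a toy model. This is a sketch only, not the pg.Pool implementation: `SlotPoolModel` and `simulate` are hypothetical names, and the model assumes the worst case in which no burst query releases its slot during the observed minute.

```typescript
// Toy model of connection-acquire queueing. NOT the pg.Pool implementation;
// all names here are hypothetical and exist only for illustration.
type Outcome = "granted" | "waiting" | "timed_out";

interface Waiter {
  id: string;
  enqueuedAtMs: number;
}

class SlotPoolModel {
  private max: number;
  private timeoutMs: number;
  private inUse = 0;
  private waiters: Waiter[] = [];
  outcomes = new Map<string, Outcome>();

  // timeoutMs = 0 mirrors the "wait forever" default described above.
  constructor(max: number, timeoutMs: number) {
    this.max = max;
    this.timeoutMs = timeoutMs;
  }

  // Grant a slot if one is free, otherwise queue the caller.
  request(id: string, nowMs: number): void {
    if (this.inUse < this.max) {
      this.inUse += 1;
      this.outcomes.set(id, "granted");
    } else {
      this.waiters.push({ id, enqueuedAtMs: nowMs });
      this.outcomes.set(id, "waiting");
    }
  }

  // Reject every waiter that has been queued for at least timeoutMs.
  expire(nowMs: number): void {
    if (this.timeoutMs <= 0) return; // no timeout: waiters stay queued
    this.waiters = this.waiters.filter((w) => {
      const expired = nowMs - w.enqueuedAtMs >= this.timeoutMs;
      if (expired) this.outcomes.set(w.id, "timed_out");
      return !expired;
    });
  }

  get waitingCount(): number {
    return this.waiters.length;
  }
}

// Assumption: slots are never released within the observed window.
function simulate(max: number, timeoutMs: number): { user: Outcome; waiting: number } {
  const pool = new SlotPoolModel(max, timeoutMs);
  for (let i = 0; i < 60; i++) pool.request(`summary-${i}`, 0); // hourly burst
  pool.request("getAllPods", 100); // user-facing request arrives mid-burst
  pool.expire(60000); // inspect the state one minute later
  return { user: pool.outcomes.get("getAllPods") ?? "waiting", waiting: pool.waitingCount };
}
```

With `simulate(10, 0)` the user request is still queued after a minute, which matches the hang seen in the incident; with a 5000 ms timeout it is rejected instead. Note that in this worst-case model a larger `max` alone does not rescue the user request, which is why the timeout is the part that turns a hang into a visible error.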

Changes

- Default max=50: the Aiven dev plan supports 100+ connections; 50 gives the 60-event summarizer burst headroom without claiming the entire DB connection budget.
- Default connectionTimeoutMillis=5000ms: fails fast as a 5xx so the user sees an error rather than a perpetual loading state.
- Both tunable via env (PG_POOL_MAX, PG_POOL_CONNECT_TIMEOUT_MS); non-numeric / non-positive values fall through to the safe default rather than zeroing the pool.
- Two new regression tests assert max >= 50 and that connectionTimeoutMillis is finite (≤60s). They guard against a future zero-cap refactor regressing back to the hang-forever shape.
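
The env fall-through rule in the list above could look roughly like the following. This is a minimal sketch under assumptions: `positiveIntOrDefault` and `poolOptionsFromEnv` are hypothetical names, and the actual parsing in backend/config/db-pg.ts may differ (for example in how it treats fractional values).

```typescript
// Hypothetical helper: returns the parsed value only when it is a positive
// integer; anything else (unset, empty, non-numeric, zero, negative,
// fractional) falls back to the provided default.
function positiveIntOrDefault(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw); // Number(undefined) is NaN, Number("") is 0
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Builds the two pool options discussed in this PR from an env-like object.
// The env variable names and defaults (50, 5000) come from the PR text.
function poolOptionsFromEnv(env: Record<string, string | undefined>): {
  max: number;
  connectionTimeoutMillis: number;
} {
  return {
    max: positiveIntOrDefault(env.PG_POOL_MAX, 50),
    connectionTimeoutMillis: positiveIntOrDefault(env.PG_POOL_CONNECT_TIMEOUT_MS, 5000),
  };
}
```

In a Node process this would typically be called as `poolOptionsFromEnv(process.env)` and the result spread into the `Pool` constructor options. The important property is that `PG_POOL_MAX=0` or `PG_POOL_CONNECT_TIMEOUT_MS=abc` cannot silently restore the zero-cap or wait-forever behaviour.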

What this does NOT fix

Two of the four fixes from #454 are out of scope for this quick PR:

- Audit the summarizer fan-out (chunk the 60 events instead of bursting). Needs a follow-up to find the slowest per-event PG query and reduce concurrency at the source.
- Add an /api/health/db probe that exposes pool.idleCount + pool.waitingCount for alerting. Real observability work, separate PR.
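
For the deferred health probe, one possible shape is a pure classifier over the pool's counters. This is a sketch under assumptions, not the follow-up's actual design: `totalCount`, `idleCount`, and `waitingCount` are real pg.Pool properties, but `classifyPool`, the status names, and the thresholds are hypothetical.

```typescript
// Structural subset of pg.Pool's public counters, so the function can be
// tested without a database.
interface PoolCounters {
  totalCount: number; // clients currently held by the pool
  idleCount: number; // clients held but not checked out
  waitingCount: number; // callers queued waiting for a client
}

// Hypothetical status vocabulary for an alerting endpoint.
type PoolStatus = "ok" | "saturated" | "starved";

interface PoolHealth {
  status: PoolStatus;
  inUse: number;
  idle: number;
  waiting: number;
}

// Assumed rule: every slot busy with callers queued is "starved";
// every slot busy with no queue is "saturated"; anything else is "ok".
function classifyPool(counters: PoolCounters, max: number): PoolHealth {
  const inUse = counters.totalCount - counters.idleCount;
  let status: PoolStatus = "ok";
  if (inUse >= max) {
    status = counters.waitingCount > 0 ? "starved" : "saturated";
  }
  return { status, inUse, idle: counters.idleCount, waiting: counters.waitingCount };
}
```

A route handler could return this object as JSON and respond with a non-200 status when `status` is `"starved"`, which gives alerting a signal before users hit the connect timeout.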

These bumps are the highest-leverage 2-line fix that converts a silent infinite-hang into a fast-fail, which is enough to unblock users today.

Test plan

- Local: the change is a Pool constructor config tweak. Two new unit tests added.
- CI: backend test + lint pass.
- Post-deploy: verify the pool config landed via `kubectl exec deploy/backend -- node -e "const {pool} = require('./dist/config/db-pg'); console.log({max: pool.options.max, ct: pool.options.connectionTimeoutMillis})"` and expect max: 50, ct: 5000. Then re-trigger the hourly summarizer fanout and confirm /api/pods still responds.

Refs #454.

🤖 Generated with Claude Code

Backend went unresponsive on app-dev 2026-05-26: xcjsam couldn't load any pods, /api/pods + /api/messages hung indefinitely.

Root cause: pg.Pool defaults are max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool and every subsequent pool.query(), including user-facing getAllPods, waits forever on connection acquire. UI shows perpetual loading with no diagnostic signal.

- Default max=50 (Aiven dev plan supports 100+; 50 leaves headroom for the 60-event burst without claiming the entire connection budget).
- Default connectionTimeoutMillis=5000ms: fails fast as a 5xx so the user sees an error rather than a perpetual "loading" state. Without this, an exhausted pool is indistinguishable from a slow query at the client.
- Both tunable via env (PG_POOL_MAX, PG_POOL_CONNECT_TIMEOUT_MS); non-numeric / non-positive values fall through to the safe default rather than zeroing the pool.
- Two new regression tests assert the default max is ≥50 and that connectionTimeoutMillis is finite. They guard against a future zero-cap refactor regressing back to the hang-forever shape.

Issue #454 (live incident triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the 2026-05-26 incident: probe localhost → exec one-off node → read logs for surge trigger → kubectl rollout restart → structural fix in PR #455. Pairs with the bumped Pool defaults to make future repros fast to triage.

Pointer added in commonly-skills/devops/SKILL.md (separate repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The unit-test job in tests.yml doesn't set PG_HOST (only the Service Tests Tier 1 job boots real PG). Without it, `pgConfig.host ? new Pool(...) : null` returns null, so `require('pg').Pool.mock.calls` is empty and the new "Pool config" assertions throw "Cannot read properties of undefined" on `mock.calls[0][0]`.

Sets `process.env.PG_HOST = 'localhost-test'` before requiring db-pg when the env is unset. The placeholder value is intentional: the real Pool is mocked anyway; we just need the ctor path to execute.

Fixes the new tests added in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous attempt (set PG_HOST before require) still left `require('pg').Pool.mock.calls[0][0]` undefined in CI. Root cause is subtle: `require('pg')` inside this file can return a different module exports object than the one db-pg.ts saw, so the `.mock.calls` array we read may be empty even though Pool was actually constructed.

Fix: have the jest.mock factory capture the ctor args into a module-scope `let capturedPoolArgs` directly. No require indirection.

Also moves the `process.env.PG_HOST` placeholder above the const declarations (more defensive: the jest.mock factory may run on a hoisted schedule that races the env assignment otherwise).

Adds a "captures Pool ctor args (sanity guard)" assertion so the failure mode is explicit if this ever regresses again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 · 2026-05-27T01:15:57Z

Squash-merged to main per feedback-pr-merge-pattern.

…455, refs #454)

Backend went unresponsive on app-dev 2026-05-26: xcjsam couldn't load any pods, /api/pods + /api/messages hung indefinitely.

Root cause: pg.Pool defaults are max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool. Every subsequent pool.query(), including user-facing getAllPods, waits forever on connection acquire instead of failing fast. UI shows perpetual loading with no diagnostic signal because Express never times out the awaiting handler.

backend/config/db-pg.ts:
- Default max=50 (tunable via PG_POOL_MAX). Aiven dev plan supports 100+ connections; 50 leaves headroom for the 60-event summarizer burst without claiming the entire DB connection budget.
- Default connectionTimeoutMillis=5000ms (tunable via PG_POOL_CONNECT_TIMEOUT_MS). Fails fast as a 5xx so the user sees an error rather than a perpetual loading state.
- Non-numeric / non-positive env values fall through to the safe default rather than zeroing the pool.

backend/__tests__/unit/config/db-pg.test.js:
- 3 regression tests guard against zero-cap regression. Pool ctor args captured via factory closure (more robust than the Pool.mock.calls[0][0] indirection, which can race jest's module caching in CI).
- Sets PG_HOST placeholder so the ctor path runs in the unit-test environment that doesn't boot real PG.

docs/runbooks/pg-pool-exhaustion.md:
- Diagnosis flow (probe localhost → exec one-off node → check resources → read logs for surge trigger), recovery procedure (kubectl rollout restart), structural fix explanation, still-TODO list (chunk summarizer fanout, add /api/health/db probe).

Refs #454 (incident triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 and others added 4 commits May 26, 2026 04:32

samxu01 closed this May 27, 2026

This was referenced May 31, 2026

fix(#454): chunk summarizer fanout + add /api/health/db pool probe #459

Merged

docs(runbooks): mark pg-pool #459 follow-ups shipped #461

Merged

samxu01 wants to merge 4 commits into main from worktree-2026-05-26-pg-pool-config
