
fix(db-pg): bump pool.max to 50 + add connectionTimeoutMillis=5s (#454) #455

Closed
samxu01 wants to merge 4 commits into main from worktree-2026-05-26-pg-pool-config

samxu01 commented May 26, 2026

Summary

The backend on app-dev went unresponsive on 2026-05-26: xcjsam couldn't load any pods, and /api/pods and /api/messages hung indefinitely (60s+ TTFB, 0 bytes). The workaround was `kubectl rollout restart deploy/backend`. This PR is the structural fix.

Root cause

pg.Pool defaults to max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool. Every subsequent pool.query(), including the user-facing getAllPods, then waits forever to acquire a connection instead of failing fast. The UI shows a perpetual loading state with no diagnostic signal.
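The arithmetic behind the starvation can be sketched as follows. This is a simplified model, not code from this PR: `modelBurst` is a hypothetical helper, and it assumes the whole burst arrives before any connection is released back to the pool.

```typescript
interface AcquireOutcome {
  servedImmediately: number; // requests that get a connection right away
  queued: number;            // requests left waiting for a free slot
  queuedBehavior: string;    // what happens to the waiting requests
}

// Hypothetical helper: models one burst hitting an otherwise idle pool.
function modelBurst(
  burst: number,
  max: number,
  connectionTimeoutMillis: number,
): AcquireOutcome {
  const servedImmediately = Math.min(burst, max);
  const queued = burst - servedImmediately;
  let queuedBehavior: string;
  if (queued === 0) {
    queuedBehavior = "nothing queued";
  } else if (connectionTimeoutMillis === 0) {
    // 0 means "no acquire timeout" in pg.Pool
    queuedBehavior = "waits with no deadline";
  } else {
    queuedBehavior = `rejects after ${connectionTimeoutMillis}ms unless a slot frees up`;
  }
  return { servedImmediately, queued, queuedBehavior };
}

// Old defaults: 10 served, 50 queued, waits with no deadline
console.log(modelBurst(60, 10, 0));
// New defaults: 50 served, 10 queued, rejects after 5000ms unless a slot frees up
console.log(modelBurst(60, 50, 5000));
```

Under the new defaults a 60-event burst can still queue 10 requests, but the wait is now bounded, so user-facing queries behind the burst either get a slot or fail within the timeout.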

Changes

  • Default max=50: the Aiven dev plan supports 100+ connections; 50 gives the 60-event summarizer burst headroom without claiming the entire DB connection budget.
  • Default connectionTimeoutMillis=5000ms: fails fast as a 5xx so the user sees an error rather than a perpetual loading state.
  • Both are tunable via env (PG_POOL_MAX, PG_POOL_CONNECT_TIMEOUT_MS); non-numeric or non-positive values fall through to the safe default rather than zeroing the pool.
  • Two new regression tests assert that max >= 50 and that connectionTimeoutMillis is finite (≤60s). They guard against a future zero-cap refactor regressing back to the hang-forever shape.
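The env fallback rule described in the list above could look roughly like this. It is a minimal sketch, not the actual contents of db-pg.ts: `positiveIntOrDefault`, `buildPoolOptions`, and the `Env` type are hypothetical names, and the real parsing may differ (for example in how it treats fractional values).

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the parsed value only if it is a positive
// integer; unset, empty, non-numeric, zero, and negative inputs all fall
// back to the supplied default.
function positiveIntOrDefault(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw); // Number(undefined) and Number("abc") are NaN
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical builder for the two options this PR changes.
function buildPoolOptions(env: Env): { max: number; connectionTimeoutMillis: number } {
  return {
    max: positiveIntOrDefault(env["PG_POOL_MAX"], 50),
    connectionTimeoutMillis: positiveIntOrDefault(env["PG_POOL_CONNECT_TIMEOUT_MS"], 5000),
  };
}

// Unset variables use the defaults.
console.log(buildPoolOptions({}));
// Invalid values fall back instead of zeroing the pool.
console.log(buildPoolOptions({ PG_POOL_MAX: "0", PG_POOL_CONNECT_TIMEOUT_MS: "abc" }));
// Valid overrides are honored.
console.log(buildPoolOptions({ PG_POOL_MAX: "80" }));
```

In the backend the builder would presumably receive `process.env` and its result would be spread into the `Pool` constructor options. The regression checks from the last bullet then reduce to `max >= 50` and `connectionTimeoutMillis` being finite and at most 60000 for the default case.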

What this does NOT fix

Two of the four fixes from #454 are out of scope for this quick PR:

  • Audit the summarizer fan-out (chunking the 60 events instead of bursting). This needs a follow-up to find the slowest per-event PG query and reduce concurrency at the source.
  • Add a /api/health/db probe that exposes pool.idleCount and pool.waitingCount for alerting. This is real observability work and belongs in a separate PR.
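For the fan-out follow-up listed above, one possible shape is to process the events in fixed-size batches. This is only a sketch of the idea under stated assumptions: `chunk` and `runInChunks` are hypothetical helpers, the batch size of 10 is arbitrary, and the actual summarizer code is not shown in this PR.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Run `worker` over all items, but never more than `size` at a time.
// Each batch must settle before the next one starts.
async function runInChunks<T, R>(
  items: readonly T[],
  size: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, size)) {
    // Promise.all rejects on the first failure in a batch; a real
    // summarizer may prefer Promise.allSettled to keep going.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}

// 60 events in batches of 10 means 6 sequential batches.
const events = Array.from({ length: 60 }, (_, i) => i);
console.log(chunk(events, 10).length);
```

With a batch size below the pool's max, the summarizer alone could no longer occupy every connection, at the cost of a longer total run time.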

These bumps are the highest-leverage 2-line fix that converts a silent infinite-hang into a fast-fail, which is enough to unblock users today.
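The deferred /api/health/db probe could be built around a small pure function like the one below. This is a hedged sketch, not a design from #454: `summarizePool`, the `DbHealth` shape, and the `maxWaiting` threshold are assumptions. The counter names match the `totalCount`, `idleCount`, and `waitingCount` properties that pg.Pool exposes.

```typescript
// Structural subset of pg.Pool, so the function can be tested without a DB.
interface PoolCounters {
  totalCount: number;   // clients currently in the pool
  idleCount: number;    // clients not checked out
  waitingCount: number; // queued requests waiting for a client
}

interface DbHealth extends PoolCounters {
  status: "ok" | "degraded";
}

// Hypothetical rule: any queue longer than `maxWaiting` is reported as degraded.
function summarizePool(pool: PoolCounters, maxWaiting = 0): DbHealth {
  const status = pool.waitingCount > maxWaiting ? "degraded" : "ok";
  return {
    status,
    totalCount: pool.totalCount,
    idleCount: pool.idleCount,
    waitingCount: pool.waitingCount,
  };
}

// Healthy pool: idle clients available, nothing queued.
console.log(summarizePool({ totalCount: 12, idleCount: 7, waitingCount: 0 }));
// Saturated pool: every client busy and requests queued.
console.log(summarizePool({ totalCount: 50, idleCount: 0, waitingCount: 10 }));
```

A route handler could then return this object as JSON with a 200 or 503 status depending on `status`, which would give alerting a signal before user requests start timing out.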

Test plan

  • Local: the change is a Pool constructor config tweak. Two new unit tests added.
  • CI: backend test + lint pass.
  • Post-deploy: verify the pool config landed via `kubectl exec deploy/backend -- node -e "const {pool} = require('./dist/config/db-pg'); console.log({max: pool.options.max, ct: pool.options.connectionTimeoutMillis})"` and expect max: 50, ct: 5000. Then re-trigger the hourly summarizer fanout and confirm /api/pods still responds.

Refs #454.

🤖 Generated with Claude Code

samxu01 and others added 4 commits May 26, 2026 04:32
Backend went unresponsive on app-dev 2026-05-26 — xcjsam couldn't load
any pods, /api/pods + /api/messages hung indefinitely. Root cause: pg.Pool
defaults are max=10 and connectionTimeoutMillis=0 (wait forever). The
hourly summarizer fans out 60 summary.request events; with only 10 slots,
the surge starves the pool and every subsequent pool.query() — including
user-facing getAllPods — waits forever on connection acquire. UI shows
perpetual loading with no diagnostic signal.

- Default max=50 (Aiven dev plan supports 100+; 50 leaves headroom
  for the 60-event burst without claiming the entire connection budget).
- Default connectionTimeoutMillis=5000ms — fails fast as a 5xx so the
  user sees an error rather than a perpetual "loading" state. Without
  this, an exhausted pool is indistinguishable from a slow query at the
  client.
- Both tunable via env (PG_POOL_MAX, PG_POOL_CONNECT_TIMEOUT_MS);
  non-numeric / non-positive values fall through to the safe default
  rather than zeroing the pool.
- Two new regression tests assert the default max is ≥50 and that
  connectionTimeoutMillis is finite — guards against a future zero-cap
  refactor regressing back to the hang-forever shape.

Issue #454 (live incident triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds docs/runbooks/pg-pool-exhaustion.md, which captures the 2026-05-26
incident: probe localhost → exec one-off node → read logs for surge
trigger → kubectl rollout restart → structural fix in PR #455. Pairs
with the bumped Pool defaults to make future repros fast to triage.
Pointer added in commonly-skills/devops/SKILL.md (separate repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The unit-test job in tests.yml doesn't set PG_HOST (only the Service
Tests Tier 1 job boots real PG). Without it, `pgConfig.host ?
new Pool(...) : null` returns null, so `require('pg').Pool.mock.calls`
is empty and the new "Pool config" assertions throw
"Cannot read properties of undefined" on `mock.calls[0][0]`.

Sets `process.env.PG_HOST = 'localhost-test'` before requiring db-pg
when the env is unset. Placeholder value is intentional — the real
Pool is mocked anyway; we just need the ctor path to execute.

Fixes the new tests added in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous attempt (set PG_HOST before require) still left
`require('pg').Pool.mock.calls[0][0]` undefined in CI. Root cause is
subtle: `require('pg')` inside this file can return a different module
exports object than the one db-pg.ts saw, so the `.mock.calls` array we
read may be empty even though Pool was actually constructed.

Fix: have the jest.mock factory capture the ctor args into a
module-scope `let capturedPoolArgs` directly. No require indirection.

Also moves the `process.env.PG_HOST` placeholder above the const
declarations (more defensive — jest.mock factory may run on a hoisted
schedule that races the env assignment otherwise).

Adds a "captures Pool ctor args (sanity guard)" assertion so the
failure mode is explicit if this ever regresses again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
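
The closure-capture idea from the last commit can be illustrated without jest. The sketch below is an assumption-laden stand-in, not the test file from this PR: `createPgMock` and `PoolOptions` are hypothetical, and in the real test the fake `Pool` would be returned from the jest.mock factory, with the captured variable at module scope.

```typescript
interface PoolOptions {
  host?: string;
  max?: number;
  connectionTimeoutMillis?: number;
}

// Hypothetical stand-in for a mock factory: the fake Pool constructor
// writes its arguments into a variable owned by the enclosing closure,
// so the test reads them directly instead of going through a second
// require() and a .mock.calls lookup.
function createPgMock() {
  let capturedPoolArgs: PoolOptions | undefined;

  class Pool {
    constructor(options: PoolOptions) {
      capturedPoolArgs = options;
    }
  }

  return {
    exports: { Pool },
    getCapturedPoolArgs: (): PoolOptions | undefined => capturedPoolArgs,
  };
}

// Simulate what the config module does when it constructs the pool.
const pgMock = createPgMock();
new pgMock.exports.Pool({ host: "localhost-test", max: 50, connectionTimeoutMillis: 5000 });

// Sanity guard first, then the config assertions.
const captured = pgMock.getCapturedPoolArgs();
console.log(captured !== undefined, captured?.max, captured?.connectionTimeoutMillis);
```

Because the constructor and the reader share one closure, the captured arguments do not depend on which module exports object a later `require('pg')` happens to return.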

samxu01 commented May 27, 2026

Squash-merged to main per feedback-pr-merge-pattern.

samxu01 added a commit that referenced this pull request May 27, 2026
…455, refs #454)

Backend went unresponsive on app-dev 2026-05-26 — xcjsam couldn't load
any pods, /api/pods + /api/messages hung indefinitely. Root cause:
pg.Pool defaults are max=10 and connectionTimeoutMillis=0 (wait forever).
The hourly summarizer fans out 60 summary.request events; with only 10
slots, the surge starves the pool. Every subsequent pool.query() —
including user-facing getAllPods — waits forever on connection acquire
instead of failing fast. UI shows perpetual loading with no diagnostic
signal because Express never times out the awaiting handler.

backend/config/db-pg.ts:
- Default max=50 (tunable via PG_POOL_MAX). Aiven dev plan supports
  100+ connections; 50 leaves headroom for the 60-event summarizer
  burst without claiming the entire DB connection budget.
- Default connectionTimeoutMillis=5000ms (tunable via
  PG_POOL_CONNECT_TIMEOUT_MS). Fails fast as a 5xx so the user sees
  an error rather than a perpetual loading state.
- Non-numeric / non-positive env values fall through to the safe
  default rather than zeroing the pool.

backend/__tests__/unit/config/db-pg.test.js:
- 3 regression tests guard against zero-cap regression. Pool ctor
  args captured via factory closure (more robust than the
  Pool.mock.calls[0][0] indirection, which can race jest's module
  caching in CI).
- Sets PG_HOST placeholder so the ctor path runs in the unit-test
  environment that doesn't boot real PG.

docs/runbooks/pg-pool-exhaustion.md:
- Diagnosis flow (probe localhost → exec one-off node → check resources
  → read logs for surge trigger), recovery procedure (kubectl rollout
  restart), structural fix explanation, still-TODO list (chunk
  summarizer fanout, add /api/health/db probe).

Refs #454 (incident triage).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samxu01 closed this May 27, 2026