fix(db-pg): bump pool.max to 50 + add connectionTimeoutMillis=5s (#454) #455
Closed
samxu01 wants to merge 4 commits into
Backend went unresponsive on app-dev 2026-05-26: xcjsam couldn't load any pods, and /api/pods + /api/messages hung indefinitely.

Root cause: pg.Pool defaults are max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool and every subsequent pool.query(), including user-facing getAllPods, waits forever on connection acquire. UI shows perpetual loading with no diagnostic signal.

- Default max=50 (Aiven dev plan supports 100+; 50 leaves headroom for the 60-event burst without claiming the entire connection budget).
- Default connectionTimeoutMillis=5000ms: fails fast as a 5xx so the user sees an error rather than a perpetual "loading" state. Without this, an exhausted pool is indistinguishable from a slow query at the client.
- Both tunable via env (PG_POOL_MAX, PG_POOL_CONNECT_TIMEOUT_MS); non-numeric / non-positive values fall through to the safe default rather than zeroing the pool.
- Two new regression tests assert the default max is ≥50 and that connectionTimeoutMillis is finite, guarding against a future zero-cap refactor regressing back to the hang-forever shape.

Issue #454 (live incident triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
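The env fallback rule in this commit message (non-numeric or non-positive values fall through to the default) can be sketched as below. This is a minimal illustration, not the code in backend/config/db-pg.ts: the helper names `positiveIntOrDefault` and `buildPoolOptions` are hypothetical, and rejecting fractional values is an extra assumption. Only the env variable names and the defaults of 50 and 5000 come from the PR.

```typescript
// Hypothetical helper: parse an env value, falling back to a safe default
// when it is missing, empty, non-numeric, zero, negative, or fractional.
function positiveIntOrDefault(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw); // Number(undefined) is NaN, Number("") is 0
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical builder for the two pool options this PR changes.
// `env` is passed in (instead of reading process.env directly) so the
// function is easy to exercise in a unit test.
function buildPoolOptions(env: Record<string, string | undefined>): {
  max: number;
  connectionTimeoutMillis: number;
} {
  return {
    max: positiveIntOrDefault(env.PG_POOL_MAX, 50),
    connectionTimeoutMillis: positiveIntOrDefault(env.PG_POOL_CONNECT_TIMEOUT_MS, 5000),
  };
}

// A typo'd or zero value never zeroes the pool; it yields the defaults.
const fallbackExample = buildPoolOptions({
  PG_POOL_MAX: "fifty",
  PG_POOL_CONNECT_TIMEOUT_MS: "0",
});
// fallbackExample.max === 50, fallbackExample.connectionTimeoutMillis === 5000
```

In a real config module the call would presumably be `buildPoolOptions(process.env)`, with the result spread into the options passed to `new Pool(...)`.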
Captures the 2026-05-26 incident: probe localhost → exec one-off node → read logs for surge trigger → kubectl rollout restart → structural fix in PR #455. Pairs with the bumped Pool defaults to make future repros fast to triage. Pointer added in commonly-skills/devops/SKILL.md (separate repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test job in tests.yml doesn't set PG_HOST (only the Service
Tests Tier 1 job boots real PG). Without it, `pgConfig.host ?
new Pool(...) : null` returns null, so `require('pg').Pool.mock.calls`
is empty and the new "Pool config" assertions throw
"Cannot read properties of undefined" on `mock.calls[0][0]`.
Sets `process.env.PG_HOST = 'localhost-test'` before requiring db-pg
when the env is unset. Placeholder value is intentional — the real
Pool is mocked anyway; we just need the ctor path to execute.
Fixes the new tests added in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt (set PG_HOST before require) still left
`require('pg').Pool.mock.calls[0][0]` undefined in CI. Root cause is
subtle: `require('pg')` inside this file can return a different module
exports object than the one db-pg.ts saw, so the `.mock.calls` array we
read may be empty even though Pool was actually constructed.
Fix: have the jest.mock factory capture the ctor args into a
module-scope `let capturedPoolArgs` directly. No require indirection.
Also moves the `process.env.PG_HOST` placeholder above the const
declarations (more defensive — jest.mock factory may run on a hoisted
schedule that races the env assignment otherwise).
Adds a "captures Pool ctor args (sanity guard)" assertion so the
failure mode is explicit if this ever regresses again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
Squash-merged to main per feedback-pr-merge-pattern.
samxu01 added a commit that referenced this pull request on May 27, 2026
…455, refs #454)

Backend went unresponsive on app-dev 2026-05-26: xcjsam couldn't load any pods, and /api/pods + /api/messages hung indefinitely.

Root cause: pg.Pool defaults are max=10 and connectionTimeoutMillis=0 (wait forever). The hourly summarizer fans out 60 summary.request events; with only 10 slots, the surge starves the pool. Every subsequent pool.query(), including user-facing getAllPods, waits forever on connection acquire instead of failing fast. UI shows perpetual loading with no diagnostic signal because Express never times out the awaiting handler.

backend/config/db-pg.ts:
- Default max=50 (tunable via PG_POOL_MAX). Aiven dev plan supports 100+ connections; 50 leaves headroom for the 60-event summarizer burst without claiming the entire DB connection budget.
- Default connectionTimeoutMillis=5000ms (tunable via PG_POOL_CONNECT_TIMEOUT_MS). Fails fast as a 5xx so the user sees an error rather than a perpetual loading state.
- Non-numeric / non-positive env values fall through to the safe default rather than zeroing the pool.

backend/__tests__/unit/config/db-pg.test.js:
- 3 regression tests guard against zero-cap regression. Pool ctor args captured via factory closure (more robust than the Pool.mock.calls[0][0] indirection, which can race jest's module caching in CI).
- Sets PG_HOST placeholder so the ctor path runs in the unit-test environment that doesn't boot real PG.

docs/runbooks/pg-pool-exhaustion.md:
- Diagnosis flow (probe localhost → exec one-off node → check resources → read logs for surge trigger), recovery procedure (kubectl rollout restart), structural fix explanation, still-TODO list (chunk summarizer fanout, add /api/health/db probe).

Refs #454 (incident triage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
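The starvation described in this commit message can be illustrated with a toy model on a simulated clock. It is a sketch under stated assumptions, not how pg.Pool is implemented: `SimulatedPool` and `userRequestAfterBurst` are hypothetical names, and the burst queries are assumed to hold their connections for the whole observation window. It only shows why a timeout of 0 leaves a later request waiting indefinitely, while a finite timeout turns the same situation into a prompt failure.

```typescript
// Toy model only: hypothetical names, simulated time in milliseconds.
type AcquireState = "granted" | "waiting" | "timed_out";

interface PoolRequest {
  state: AcquireState;
  enqueuedAt: number;
}

class SimulatedPool {
  private inUse = 0;
  private queue: PoolRequest[] = [];
  private readonly max: number;
  private readonly connectionTimeoutMillis: number; // 0 means wait forever

  constructor(max: number, connectionTimeoutMillis: number) {
    this.max = max;
    this.connectionTimeoutMillis = connectionTimeoutMillis;
  }

  // Ask for a connection at simulated time `now`.
  acquire(now: number): PoolRequest {
    const request: PoolRequest = { state: "waiting", enqueuedAt: now };
    if (this.inUse < this.max) {
      this.inUse++;
      request.state = "granted";
    } else {
      this.queue.push(request);
    }
    return request;
  }

  // Advance the clock: waiters past their deadline fail instead of queueing on.
  advanceTo(now: number): void {
    if (this.connectionTimeoutMillis <= 0) return; // nobody ever times out
    this.queue = this.queue.filter((request) => {
      const expired = now - request.enqueuedAt >= this.connectionTimeoutMillis;
      if (expired) request.state = "timed_out";
      return !expired;
    });
  }

  // Return a connection; the oldest waiter, if any, takes the slot directly.
  release(): void {
    const next = this.queue.shift();
    if (next) {
      next.state = "granted";
    } else if (this.inUse > 0) {
      this.inUse--;
    }
  }
}

// State of one user-facing request issued right after a summarizer burst,
// observed `observeAtMs` later. Assumes the burst never releases in that window.
function userRequestAfterBurst(
  max: number,
  timeoutMs: number,
  burst: number = 60,
  observeAtMs: number = 60000,
): AcquireState {
  const pool = new SimulatedPool(max, timeoutMs);
  for (let i = 0; i < burst; i++) pool.acquire(0);
  const userRequest = pool.acquire(0);
  pool.advanceTo(observeAtMs);
  return userRequest.state;
}

userRequestAfterBurst(10, 0); // returns "waiting": the old defaults hang
userRequestAfterBurst(50, 5000); // returns "timed_out": the new defaults fail fast
```

Under this model's worst-case assumption, the larger `max` alone does not save the user request when the burst exceeds it; the finite timeout is what makes the failure visible.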
This was referenced May 31, 2026
Summary
Backend went unresponsive on `app-dev` 2026-05-26: xcjsam couldn't load any pods, and `/api/pods` + `/api/messages` hung indefinitely (60s+ TTFB, 0 bytes). Workaround was `kubectl rollout restart deploy/backend`. This PR is the structural fix.

Root cause
`pg.Pool` defaults to `max=10` and `connectionTimeoutMillis=0` (wait forever). The hourly summarizer fans out 60 `summary.request` events; with only 10 slots, the surge starves the pool. Every subsequent `pool.query()`, including user-facing `getAllPods`, waits forever on connection acquire instead of failing fast. UI shows perpetual loading with no diagnostic signal.

Changes
- `max=50`: Aiven dev plan supports 100+ connections; 50 gives the 60-event summarizer burst headroom without claiming the entire DB connection budget.
- `connectionTimeoutMillis=5000ms`: fails fast as a 5xx so the user sees an error rather than a perpetual loading state.
- Both tunable via env (`PG_POOL_MAX`, `PG_POOL_CONNECT_TIMEOUT_MS`); non-numeric / non-positive values fall through to the safe default rather than zeroing the pool.
- Regression tests assert `max >= 50` and that `connectionTimeoutMillis` is finite (≤60s). Guards against a future zero-cap refactor regressing back to the hang-forever shape.

What this does NOT fix
Two of the four fixes from #454 are out of scope for this quick PR:
- Chunking the summarizer fanout.
- `/api/health/db` probe for `pool.idleCount` + `pool.waitingCount` alerting. Real observability work, separate PR.

These bumps are the highest-leverage 2-line fix that converts a silent infinite-hang into a fast-fail, which is enough to unblock users today.
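For the deferred `/api/health/db` probe, one possible shape is a pure classifier over the pool counters. This is a sketch only: `totalCount`, `idleCount` and `waitingCount` are properties that pg.Pool exposes, but the function names, the three status labels, the thresholds and the 503 mapping are all assumptions, not anything decided in #454.

```typescript
// The three counters mirror properties on pg.Pool; everything else is hypothetical.
interface PoolCounters {
  totalCount: number; // connections currently open (idle + checked out)
  idleCount: number; // open connections not checked out
  waitingCount: number; // callers queued for a connection
}

type PoolStatus = "ok" | "saturated" | "exhausted";

// Assumed thresholds: any queued caller counts as exhausted; a full pool
// with no idle connection counts as saturated (an early warning).
function classifyPool(counters: PoolCounters, max: number): PoolStatus {
  if (counters.waitingCount > 0) return "exhausted";
  if (counters.totalCount >= max && counters.idleCount === 0) return "saturated";
  return "ok";
}

// Assumed mapping to an HTTP response: only exhaustion fails the probe.
function healthResponse(
  counters: PoolCounters,
  max: number,
): { httpStatus: number; body: { status: PoolStatus } & PoolCounters } {
  const status = classifyPool(counters, max);
  return {
    httpStatus: status === "exhausted" ? 503 : 200,
    body: { status, ...counters },
  };
}

// Pool state resembling the incident: every slot busy, requests queued.
const duringIncident = healthResponse({ totalCount: 10, idleCount: 0, waitingCount: 50 }, 10);
// duringIncident.httpStatus === 503, duringIncident.body.status === "exhausted"
```

Keeping the classifier separate from the route handler would let alert thresholds be unit-tested without booting a real pool.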
Test plan
max: 50, ct: 5000. Then re-trigger the hourly summarizer fanout and confirm /api/pods still responds.

Refs #454.
🤖 Generated with Claude Code