fix(distribution-monitor): tolerate lock timeout on boot-time index check #492

Merged
ddeboer merged 1 commit into main from fix/distribution-monitor-boot-index-lock-timeout
Jun 18, 2026
@ddeboer ddeboer commented Jun 18, 2026

Problem

The status monitor crash-loops on startup with:

Failed to start monitoring service
query: CREATE UNIQUE INDEX IF NOT EXISTS latest_observations_monitor_idx ON latest_observations (monitor)
PostgresError 55P03 (lock_not_available)

latest_observations is a materialized view, and the boot path always runs a CREATE UNIQUE INDEX IF NOT EXISTS to provide the unique index that REFRESH … CONCURRENTLY requires. Even the IF NOT EXISTS form must take a SHARE lock on the view before it can skip, and that lock conflicts with the EXCLUSIVE lock held by a REFRESH … CONCURRENTLY running in another instance.

During a rolling deploy the old and new pods overlap (default RollingUpdate, maxUnavailable: 0), so the new pod's index re-check waits on the old pod's refresh. With the per-connection lock_timeout (30s) introduced earlier, that wait now raises 55P03. Because the statement was unguarded, the error aborted startup and the monitor crash-looped indefinitely, leaving the status site without availabilities.

Fix

Wrap the index creation in ensureLatestObservationsIndex(), which swallows a 55P03 lock timeout and lets startup proceed. Whenever the view is being refreshed concurrently the index must already exist, because REFRESH … CONCURRENTLY requires it, so a failed re-check is harmless. Any other error still propagates unchanged.

This keeps RollingUpdate (two overlapping pods) working — no deployment-strategy change needed.

Changes

  • store.ts — extract ensureLatestObservationsIndex() (tolerates 55P03, re-throws everything else) and isLockNotAvailable() (detects the SQLSTATE through drizzle's wrapped cause chain).
  • store.test.ts — unit tests for both helpers (fast, no lock-wait): swallow vs. re-throw, and SQLSTATE detection across nested causes and non-object inputs.
  • vite.config.ts — autoUpdate raised the coverage thresholds to match.
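
For readers without the diff at hand, the two helpers described above could look roughly like the sketch below. This is an illustration, not the code merged in store.ts. It assumes the driver error exposes the SQLSTATE as a string `code` property and that wrappers attach the original error as `cause`. The `createIndex` callback parameter is hypothetical and stands in for whatever runs the CREATE UNIQUE INDEX IF NOT EXISTS statement.

```typescript
// Sketch only: the helper names come from this PR, the bodies are assumptions.
const LOCK_NOT_AVAILABLE = "55P03"; // SQLSTATE for lock_not_available

/**
 * Returns true if the error, or any error in its `cause` chain,
 * carries SQLSTATE 55P03. Non-object inputs are never a match.
 */
function isLockNotAvailable(error: unknown): boolean {
  const seen = new Set<unknown>(); // guards against cyclic cause chains
  let current: unknown = error;
  while (typeof current === "object" && current !== null && !seen.has(current)) {
    seen.add(current);
    if ((current as { code?: unknown }).code === LOCK_NOT_AVAILABLE) {
      return true;
    }
    current = (current as { cause?: unknown }).cause;
  }
  return false;
}

/**
 * Runs the index-creation callback and tolerates a lock timeout.
 * `createIndex` is a hypothetical stand-in for the statement execution.
 */
async function ensureLatestObservationsIndex(
  createIndex: () => Promise<unknown>,
): Promise<void> {
  try {
    await createIndex();
  } catch (error) {
    if (!isLockNotAvailable(error)) {
      throw error; // anything other than 55P03 still aborts startup
    }
    // 55P03: another instance is refreshing the view, so the index
    // already exists and startup can continue.
  }
}
```

Passing the statement as a callback keeps the sketch independent of any particular database client and makes the swallow and re-throw paths testable without a real lock wait.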

Notes

This addresses the crash-loop only. A separate, pre-existing concern remains: the observations table has no retention/pruning, and a REFRESH … CONCURRENTLY was observed wedged for hours. That is worth its own issue.

…heck

- Wrap the materialized-view unique-index creation in ensureLatestObservationsIndex, which
  swallows a 55P03 (lock_not_available) instead of aborting startup. During a rolling deploy
  a contended REFRESH ... CONCURRENTLY holds an EXCLUSIVE lock; the boot-time SHARE-locking
  re-check then times out and previously crash-looped the monitor.
- Re-throw any other error unchanged.
- Add isLockNotAvailable to detect the SQLSTATE through the wrapped driver-error cause chain.
- Unit-test both helpers; autoUpdate refreshes the coverage thresholds.
@ddeboer ddeboer merged commit fa598d7 into main Jun 18, 2026
2 checks passed
@ddeboer ddeboer deleted the fix/distribution-monitor-boot-index-lock-timeout branch June 18, 2026 11:41