fix(distribution-monitor): tolerate lock timeout on boot-time index check by ddeboer · Pull Request #492 · ldelements/lde

ddeboer · 2026-06-18T11:33:49Z

Problem

The status monitor crash-loops on startup with:

Failed to start monitoring service
query: CREATE UNIQUE INDEX IF NOT EXISTS latest_observations_monitor_idx ON latest_observations (monitor)
PostgresError 55P03 (lock_not_available)

latest_observations is a materialized view, and the boot path always runs a CREATE UNIQUE INDEX IF NOT EXISTS to provide the unique index that REFRESH … CONCURRENTLY requires. Even the IF NOT EXISTS form must take a SHARE lock on the view before it can skip, and that lock conflicts with the EXCLUSIVE lock held by a REFRESH … CONCURRENTLY running in another instance.

During a rolling deploy the old and new pods overlap (default RollingUpdate, maxUnavailable: 0), so the new pod's index re-check waits on the old pod's refresh. With the per-connection lock_timeout (30s) introduced earlier, that wait now raises 55P03. Because the statement was unguarded, the error aborted startup and the monitor crash-looped indefinitely, leaving the status site without availabilities.

Fix

Wrap the index creation in ensureLatestObservationsIndex(), which swallows a 55P03 lock timeout and lets startup proceed. REFRESH … CONCURRENTLY can only run when the unique index already exists, so whenever the re-check is blocked by a refresh the index is already in place and a failed re-check is harmless. Any other error still propagates unchanged.
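The swallow-or-re-throw behaviour described above can be sketched roughly as follows. This is an illustration, not the code from store.ts: the createIndex callback, the boolean return value, and the simplified hasLockNotAvailableCode() check (which inspects only the error and its direct cause) are assumptions made to keep the example self-contained.

```typescript
// SQLSTATE for lock_not_available, raised when lock_timeout expires.
const LOCK_NOT_AVAILABLE = '55P03';

// Hypothetical, simplified check: inspects the error and its direct cause only.
function hasLockNotAvailableCode(error: unknown): boolean {
  const codeOf = (value: unknown): unknown =>
    typeof value === 'object' && value !== null
      ? (value as { code?: unknown }).code
      : undefined;
  const cause =
    typeof error === 'object' && error !== null
      ? (error as { cause?: unknown }).cause
      : undefined;
  return (
    codeOf(error) === LOCK_NOT_AVAILABLE || codeOf(cause) === LOCK_NOT_AVAILABLE
  );
}

// `createIndex` stands in for whatever executes
// CREATE UNIQUE INDEX IF NOT EXISTS ... on the materialized view.
// The boolean result is an illustrative addition: true when the statement ran,
// false when it was skipped because of a lock timeout.
async function ensureLatestObservationsIndex(
  createIndex: () => Promise<unknown>,
): Promise<boolean> {
  try {
    await createIndex();
    return true;
  } catch (error) {
    if (hasLockNotAvailableCode(error)) {
      // A concurrent refresh holds the lock, so the index must already exist.
      return false;
    }
    // Any other failure still aborts startup.
    throw error;
  }
}
```

Passing the statement in as a callback keeps the sketch independent of any particular database client and makes the two outcomes easy to exercise without waiting on a real lock.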

This keeps RollingUpdate (two overlapping pods) working — no deployment-strategy change needed.

Changes

store.ts — extract ensureLatestObservationsIndex() (tolerates 55P03, re-throws everything else) and isLockNotAvailable() (detects the SQLSTATE through drizzle's wrapped cause chain).
store.test.ts — unit tests for both helpers (fast, no lock-wait): swallow vs. re-throw, and SQLSTATE detection across nested causes and non-object inputs.
vite.config.ts — autoUpdate raised the coverage thresholds to match.
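For illustration, detecting the SQLSTATE through a wrapped cause chain might look like the sketch below. It is not the code from store.ts: it assumes the driver error exposes the SQLSTATE as a code property and that wrappers link to it via cause, and the maxDepth guard against cyclic chains is an addition for the example.

```typescript
// SQLSTATE for lock_not_available.
const LOCK_NOT_AVAILABLE_SQLSTATE = '55P03';

// Walks error -> error.cause -> ... looking for a matching `code`.
// `maxDepth` is a hypothetical safety bound so a cyclic chain cannot loop forever.
function isLockNotAvailable(error: unknown, maxDepth = 10): boolean {
  let current: unknown = error;
  for (let depth = 0; depth < maxDepth; depth++) {
    // Non-object inputs (null, undefined, strings, numbers) carry no code.
    if (typeof current !== 'object' || current === null) {
      return false;
    }
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (code === LOCK_NOT_AVAILABLE_SQLSTATE) {
      return true;
    }
    current = cause;
  }
  return false;
}
```

The cases the unit tests are described as covering (nested causes and non-object inputs) map directly onto the two early exits in the loop.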

Notes

This addresses the crash-loop. A separate, pre-existing concern remains: the observations table has no retention/pruning, and a REFRESH … CONCURRENTLY was observed wedged for hours — worth its own issue.

…heck

- Wrap the materialized-view unique-index creation in ensureLatestObservationsIndex, which swallows a 55P03 (lock_not_available) instead of aborting startup. During a rolling deploy a contended REFRESH ... CONCURRENTLY holds an EXCLUSIVE lock; the boot-time SHARE-locking re-check then times out and previously crash-looped the monitor.
- Re-throw any other error unchanged.
- Add isLockNotAvailable to detect the SQLSTATE through the wrapped driver-error cause chain.
- Unit-test both helpers; autoUpdate refreshes the coverage thresholds.

ddeboer merged commit fa598d7 into main Jun 18, 2026
2 checks passed

ddeboer deleted the fix/distribution-monitor-boot-index-lock-timeout branch June 18, 2026 11:41

This was referenced Jun 18, 2026

build(deps): bump @lde/distribution-monitor to 0.1.14 netwerk-digitaal-erfgoed/network-of-terms#1877

Merged

Add retention/pruning for the observations table (unbounded growth) #494

Open
