feat(universaldb): graceful postgres leader handoff on shutdown by MasterPtato · Pull Request #5337 · rivet-dev/rivet

MasterPtato · 2026-06-25T20:34:04Z

No description provided.

MasterPtato · 2026-06-25T20:34:19Z

Stack for rivet-dev/rivet

Current stack:

[SLOP(claude-opus-4-8-high)] perf(universaldb): batch leader apply and fold follower commit round-trips #5342 vtrtqpyp · 2026-06-26 02:20:21
[slopfix] docs(self-hosting): promote postgres from experimental to recommended OSS multi-node backend #5341 zsmysquk · 2026-06-26 02:20:19
[slopfix] fix(self-host): repair dev compose generation, kitchen-sink serverful runner, and postgres/udb bootstrap #5340 wpvkrymu · 2026-06-25 22:53:05
[SLOP(claude-opus-4-8-high)] feat(universaldb): graceful postgres leader handoff on shutdown #5337 mvxpzxsl · 2026-06-25 20:34:04 👈
[slopfix] test(universaldb): postgres leader failover + abort resolver task on driver drop #5336 tvmntvqo · 2026-06-25 20:34:03
[SLOP(claude-opus-4-8-high)] feat(ups): table-backed postgres transport with coalesced doorbell #5335 xutxrzrv · 2026-06-25 20:34:01
[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resolver driver overhaul #5329 xrrmomwz · 2026-06-25 00:55:07
[SLOP(claude-opus-4-8)] feat(pegboard-envoy): rate-limit envoy ws ingress and cap get_pages, trim unused metrics #5327 tnzyzqrl · 2026-06-24 20:57:36
[SLOP(claude-opus-4-8)] feat(util): add rate limiter primitive and ingress throttles for actor create and gateway websocket #5326 pnyswwtm · 2026-06-24 20:57:35
[SLOP(claude-opus-4-8)] feat(rivetkit-core): add serde duration and size metrics #5322 nqkrytvr · 2026-06-23 22:49:13
Add observation for serde_bare #5321 oskoutmw · 2026-06-23 20:23:10
Add observation and ser/de metrics for serde calls #5320 ztnuwvts · 2026-06-23 20:23:09

Dependencies:

[SLOP(claude-opus-4-8-medium)] fix(rivetkit-napi): reap completed tasks from event-loop JoinSet (#5315) #5314 xxqwtrzp · 2026-06-22 22:42:01
main

Get stack: forklift get 5337
Push local edits: forklift submit
Merge when ready: forklift merge 5337


claude · 2026-06-25T20:43:14Z

PR Review: feat(universaldb): graceful postgres leader handoff on shutdown

What the PR Does

This PR adds a graceful shutdown path for the UniversalDB Postgres driver. Before this change, when the engine received SIGTERM, a standby could not take over until the Postgres leader lease had run out its full TTL (10s), turning every rolling deploy into a ~10s commit stall. After this change:

DatabaseDriver trait gains a shutdown() method with a default no-op.
PostgresDatabaseDriver::shutdown() aborts the resolver and GC tasks, then calls resolver::handoff().
resolver::handoff() runs lease::release() (UPDATE sets expires_at = now() fenced on leader_addr), then fires pg_notify on a new ELECTION_CHANNEL.
Standby nodes listen on ELECTION_CHANNEL to wake from ELECTION_RETRY sleep early, enabling near-instant takeover.
service-manager calls pools.udb()?.shutdown().await after all services stop.
A new integration test (test_postgres_graceful_handoff) asserts takeover in under 5s.

The design is correct in its overall shape. Safety properties are well thought out: release is fenced on leader_addr so a follower or post-takeover successor cannot accidentally clear a live lease; ELECTION_CHANNEL notification is best-effort with the TTL as a backstop.
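The fencing property can be illustrated with a minimal in-memory sketch. This is not the PR's code: the `Lease` struct, `LEASE_TTL` constant, and function signatures below are hypothetical stand-ins for the lease row and for the `UPDATE ... WHERE leader_addr = ...` statement described above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical TTL constant; the review states the lease TTL is 10s.
pub const LEASE_TTL: Duration = Duration::from_secs(10);

/// In-memory stand-in for the lease row (field names follow the review text).
#[derive(Debug, Clone)]
pub struct Lease {
    pub leader_addr: String,
    pub expires_at: Instant,
}

impl Lease {
    /// Creates a lease held by `addr` that is valid for one TTL from `now`.
    pub fn held_by(addr: &str, now: Instant) -> Self {
        Lease {
            leader_addr: addr.to_string(),
            expires_at: now + LEASE_TTL,
        }
    }
}

/// Models an UPDATE that sets `expires_at = now()` only where `leader_addr`
/// matches the caller. Returns true only if the caller held the lease.
pub fn release(lease: &mut Lease, caller_addr: &str, now: Instant) -> bool {
    if lease.leader_addr != caller_addr {
        // Fenced: a follower or a successor cannot expire someone else's lease.
        return false;
    }
    lease.expires_at = now;
    true
}

/// A standby may attempt takeover once the lease has expired.
pub fn can_take_over(lease: &Lease, now: Instant) -> bool {
    lease.expires_at <= now
}
```

In this model, a call to `release` from any address other than the current `leader_addr` leaves the lease untouched and returns false, which is also the condition under which the review says no election notification is sent.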

Bug: abort() Is Not a Synchronous Stop

File: engine/packages/universaldb/src/driver/postgres/database.rs

self.resolver_handle.abort();   // schedules cancellation only
self.gc_handle.abort();
resolver::handoff(&self.shared).await;  // races an in-flight renew

JoinHandle::abort() only requests cancellation; it returns immediately and does not wait for the task to reach its next await point and be dropped. If the resolver has already sent the lease::renew() query and is awaiting the Postgres ACK, dropping the future does not recall a statement the server has already received. The lease::release() docstring flags this: "Renewal must already be stopped before calling this, otherwise a racing renew could re-extend the lease."

Concrete race:

abort() fires. Resolver is in lease::renew() awaiting the Postgres response.
handoff() runs release(), setting expires_at = now().
The renew UPDATE, already in flight on its own connection, executes on the server after release() commits and sets expires_at back to now() + 10s.
Standby sees a live lease and must wait the full TTL. The feature is defeated.

Recommended fix: Introduce a CancellationToken that lead() checks before each renew() call, so renewal never begins if shutdown is in progress. A simpler bounded-wait alternative is to await the join handle with a short timeout before calling handoff().
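The ordering this fix aims for (signal, wait for the renew loop to exit, then release) can be sketched with the standard library alone. The sketch uses an `AtomicBool` and a thread join as stand-ins for a cancellation token and an awaited join handle; `lead`, `LeaseSecs`, and `shutdown_then_release` are illustrative names, not the PR's API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Stand-in for the lease row: remaining validity in seconds (0 = expired).
type LeaseSecs = Arc<Mutex<u64>>;

/// Renewal loop that checks the shutdown flag before every renew,
/// so no new renew starts once shutdown has been signalled.
fn lead(lease: LeaseSecs, shutdown: Arc<AtomicBool>) {
    while !shutdown.load(Ordering::Acquire) {
        *lease.lock().unwrap() = 10; // models lease::renew()
        thread::sleep(Duration::from_millis(5));
    }
}

/// Signals shutdown, waits for the loop to exit, and only then releases.
/// Returns the lease value observed after a short grace period.
fn shutdown_then_release(lease: LeaseSecs) -> u64 {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker = {
        let (lease, shutdown) = (lease.clone(), shutdown.clone());
        thread::spawn(move || lead(lease, shutdown))
    };

    thread::sleep(Duration::from_millis(20)); // let a few renewals happen
    shutdown.store(true, Ordering::Release);
    worker.join().expect("renew loop panicked"); // synchronous stop

    *lease.lock().unwrap() = 0; // models lease::release()
    thread::sleep(Duration::from_millis(20)); // a late renew would show up here
    let remaining = *lease.lock().unwrap();
    remaining
}
```

Because the release happens only after the join returns, no renew can land after it in this model. An async version would need the same ordering: stop issuing renews, wait for the task to finish, then release.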

Issue: No Timeout on handoff() Can Hang Shutdown

File: engine/packages/universaldb/src/driver/postgres/resolver/mod.rs

handoff() awaits lease::release() then notify_election() with no wrapping tokio::time::timeout. If Postgres is unreachable at shutdown, both pool checkouts can block indefinitely. Since service-manager calls pools.udb()?.shutdown().await, a hung handoff() stalls all downstream cleanup. Unreachable Postgres is not exotic: it can be the very reason the process is being restarted.

Recommended fix: Wrap the handoff body in a tokio::time::timeout of ~2s (well under the 10s TTL) and log a warning on timeout. This bounds shutdown time.
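A bounded wait of this kind can be sketched without an async runtime by running the handoff on a helper thread and waiting on a channel with a deadline. This is a std-only analogue of the suggested tokio::time::timeout wrapper; `handoff_with_timeout` is a hypothetical helper, not a function from the PR.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `handoff` on a helper thread and waits at most `limit` for it.
/// Returns true if the handoff finished in time, false otherwise.
pub fn handoff_with_timeout<F>(handoff: F, limit: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        handoff();
        // The receiver may already have given up; ignore a failed send.
        let _ = done_tx.send(());
    });

    match done_rx.recv_timeout(limit) {
        Ok(()) => true,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            // Shutdown proceeds; standbys fall back to the lease TTL.
            eprintln!("handoff timed out after {:?}; relying on lease TTL", limit);
            false
        }
        // The helper thread ended without reporting, e.g. the handoff panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => false,
    }
}
```

Either way the caller regains control within `limit`, so an unreachable database costs a bounded delay rather than a hung shutdown.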

Minor: release() and notify_election() Are Not Atomic

The UPDATE (expire lease) and pg_notify (wake standbys) happen on separate pool connections. If the process is killed between them, standbys wait up to ELECTION_RETRY (2s) before the next poll. The TTL guarantees correctness regardless; combining into one connection+transaction tightens the handoff.

Style / Conventions

Explicit lifetime on trait default (driver/mod.rs): The idiomatic form is fn shutdown(&self) -> BoxFut<'_, ()> (lifetime elision) rather than fn shutdown<'a>(&'a self) -> BoxFut<'a, ()>. All other DatabaseDriver methods use 'a explicitly; whichever style is chosen should be consistent throughout.
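For reference, the two spellings are equivalent to the compiler. The sketch below assumes BoxFut is a pinned, boxed, Send future alias (its real definition is not shown in this review); `shutdown_explicit` and `NoopDriver` exist only for the comparison.

```rust
use std::future::Future;
use std::pin::Pin;

/// Assumed shape of the boxed-future alias; the real definition may differ.
pub type BoxFut<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

pub trait DatabaseDriver {
    /// Elided form: `'_` ties the returned future to the `&self` borrow.
    fn shutdown(&self) -> BoxFut<'_, ()> {
        Box::pin(async {}) // default no-op
    }

    /// Explicit form with the same meaning, shown only for comparison.
    fn shutdown_explicit<'a>(&'a self) -> BoxFut<'a, ()> {
        Box::pin(async {})
    }
}

/// A driver that relies on the default no-op shutdown.
pub struct NoopDriver;

impl DatabaseDriver for NoopDriver {}
```

Since both forms type-check identically, the choice is purely about consistency with the rest of the trait.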

Double-abort in test (failover.rs): Drop fires resolver_handle.abort() again after shutdown() already aborted it. Harmless, but a comment would help future readers.

Test Quality

test_postgres_graceful_handoff is well-structured and meaningfully proves the feature. Two observations:

tokio::time::sleep(Duration::from_secs(4)) for Docker readiness pre-exists this PR but makes tests potentially flaky on slow CI runners. Not blocking.
Timing assertion (< 8s) vs. poll window (5s): These are independent; the overall assertion (well under the 10s TTL) is the meaningful check.

Security / Correctness

The fencing on leader_addr in release() is correct: a follower or new leader cannot accidentally clear the active lease. pg_notify for ELECTION_CHANNEL is sent only when release() returns true, so follower nodes do not generate spurious wakeups. The listener reconnection path on Closed re-subscribes to ELECTION_CHANNEL correctly.

Summary

The design goal is sound and the approach is well-structured. The primary concern before merge is the race caused by abort() not being a synchronous stop: under load, it could silently defeat the handoff. The missing timeout on handoff() is a secondary safety concern for edge-case shutdown hangs. The rest are polish items.

[SLOP(claude-opus-4-8-high)] feat(universaldb): graceful postgres lea…

7399a2d

…der handoff on shutdown

MasterPtato force-pushed the stack/slop-claude-opus-4-8-high-feat-universaldb-graceful-postgres-leader-handoff-on-shutdown-mvxpzxsl branch from eaee251 to 7399a2d Compare June 25, 2026 22:53

NathanFlurry changed the title ~~[SLOP(claude-opus-4-8-high)] feat(universaldb): graceful postgres leader handoff on shutdown~~ feat(universaldb): graceful postgres leader handoff on shutdown Jun 26, 2026
