Skip to content

feat(universaldb): graceful postgres leader handoff on shutdown#5337

Open
MasterPtato wants to merge 1 commit into
stack/slopfix-test-universaldb-postgres-leader-failover-abort-resolver-task-on-driver-drop-tvmntvqofrom
stack/slop-claude-opus-4-8-high-feat-universaldb-graceful-postgres-leader-handoff-on-shutdown-mvxpzxsl
Open

feat(universaldb): graceful postgres leader handoff on shutdown#5337
MasterPtato wants to merge 1 commit into
stack/slopfix-test-universaldb-postgres-leader-failover-abort-resolver-task-on-driver-drop-tvmntvqofrom
stack/slop-claude-opus-4-8-high-feat-universaldb-graceful-postgres-leader-handoff-on-shutdown-mvxpzxsl

Conversation

@MasterPtato

@MasterPtato MasterPtato commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@MasterPtato

MasterPtato commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Stack for rivet-dev/rivet

Current stack:

Dependencies:

Get stack: forklift get 5337
Push local edits: forklift submit
Merge when ready: forklift merge 5337

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

PR Review: feat(universaldb): graceful postgres leader handoff on shutdown

What the PR Does

This PR adds a graceful shutdown path for the UniversalDB Postgres driver. Before this change, when the engine received SIGTERM, the Postgres leader lease would expire after the full TTL (10s) before a standby could take over, turning every rolling deploy into a ~10s commit stall. After this change:

  1. DatabaseDriver trait gains a shutdown() method with a default no-op.
  2. PostgresDatabaseDriver::shutdown() aborts the resolver and GC tasks, then calls resolver::handoff().
  3. resolver::handoff() runs lease::release() (UPDATE sets expires_at = now() fenced on leader_addr), then fires pg_notify on a new ELECTION_CHANNEL.
  4. Standby nodes listen on ELECTION_CHANNEL to wake from ELECTION_RETRY sleep early, enabling near-instant takeover.
  5. service-manager calls pools.udb()?.shutdown().await after all services stop.
  6. A new integration test (test_postgres_graceful_handoff) asserts takeover in under 5s.

The design is correct in its overall shape. Safety properties are well thought out: release is fenced on leader_addr so a follower or post-takeover successor cannot accidentally clear a live lease; ELECTION_CHANNEL notification is best-effort with the TTL as a backstop.


Bug: abort() Is Not a Synchronous Stop

File: engine/packages/universaldb/src/driver/postgres/database.rs

self.resolver_handle.abort();   // schedules cancellation only
self.gc_handle.abort();
resolver::handoff(&self.shared).await;  // races an in-flight renew

JoinHandle::abort() schedules a cancellation but does not block until the task yields and drops. If the resolver is currently awaiting the Postgres ACK for lease::renew() (bytes already sent), Tokio cannot cancel that in-flight I/O. The lease::release() docstring flags this: "Renewal must already be stopped before calling this, otherwise a racing renew could re-extend the lease."

Concrete race:

  1. abort() fires. Resolver is in lease::renew() awaiting the Postgres response.
  2. handoff() runs release(), setting expires_at = now().
  3. Renew response arrives from Postgres; renew() completes and overwrites expires_at with now() + 10s.
  4. Standby sees a live lease and must wait the full TTL. The feature is defeated.

Recommended fix: Introduce a CancellationToken that lead() checks before each renew() call, so renewal never begins if shutdown is in progress. A simpler bounded-wait alternative is to await the join handle with a short timeout before calling handoff().


Issue: No Timeout on handoff() Can Hang Shutdown

File: engine/packages/universaldb/src/driver/postgres/resolver/mod.rs

handoff() awaits lease::release() then notify_election() with no wrapping tokio::time::timeout. If Postgres is unreachable at shutdown, both pool checkouts can block indefinitely. Since service-manager calls pools.udb()?.shutdown().await, a hung handoff() stalls all downstream cleanup. Unreachable Postgres is not exotic: it can be the very reason the process is being restarted.

Recommended fix: Wrap the handoff body in a tokio::time::timeout of ~2s (well under the 10s TTL) and log a warning on timeout. This bounds shutdown time.


Minor: release() and notify_election() Are Not Atomic

The UPDATE (expire lease) and pg_notify (wake standbys) happen on separate pool connections. If the process is killed between them, standbys wait up to ELECTION_RETRY (2s) before the next poll. The TTL guarantees correctness regardless; combining into one connection+transaction tightens the handoff.


Style / Conventions

Explicit lifetime on trait default (driver/mod.rs): The idiomatic form is fn shutdown(&self) -> BoxFut<'_, ()> (lifetime elision) rather than fn shutdown<'a>(&'a self) -> BoxFut<'a, ()>. All other DatabaseDriver methods use 'a explicitly; whichever style is chosen should be consistent throughout.

Double-abort in test (failover.rs): Drop fires resolver_handle.abort() again after shutdown() already aborted it. Harmless, but a comment would help future readers.


Test Quality

test_postgres_graceful_handoff is well-structured and meaningfully proves the feature. Two observations:

  1. tokio::time::sleep(Duration::from_secs(4)) for Docker readiness pre-exists this PR but makes tests potentially flaky on slow CI runners. Not blocking.
  2. Timing assertion (< 8s) vs. poll window (5s): These are independent; the overall assertion (well under the 10s TTL) is the meaningful check.

Security / Correctness

The fencing on leader_addr in release() is correct: a follower or new leader cannot accidentally clear the active lease. pg_notify for ELECTION_CHANNEL is sent only when release() returns true, so follower nodes do not generate spurious wakeups. The listener reconnection path on Closed re-subscribes to ELECTION_CHANNEL correctly.


Summary

The design goal is sound and the approach is well-structured. The primary concern before merge is the abort() is not synchronous stop race: under load, this could silently defeat the handoff. The missing timeout on handoff() is a secondary safety concern for edge-case shutdown hangs. The rest are polish items.

@MasterPtato MasterPtato force-pushed the stack/slop-claude-opus-4-8-high-feat-universaldb-graceful-postgres-leader-handoff-on-shutdown-mvxpzxsl branch from eaee251 to 7399a2d Compare June 25, 2026 22:53
@NathanFlurry NathanFlurry changed the title [SLOP(claude-opus-4-8-high)] feat(universaldb): graceful postgres leader handoff on shutdown feat(universaldb): graceful postgres leader handoff on shutdown Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant