feat(universaldb): postgres leader-resolver driver overhaul by MasterPtato · Pull Request #5329 · rivet-dev/rivet

MasterPtato · 2026-06-25T00:55:07Z

No description provided.

MasterPtato · 2026-06-25T00:55:16Z

Stack for rivet-dev/rivet

Current stack:

[SLOP(claude-opus-4-8-high)] perf(universaldb): batch leader apply and fold follower commit round-trips #5342 vtrtqpyp · 2026-06-26 02:20:21
[slopfix] docs(self-hosting): promote postgres from experimental to recommended OSS multi-node backend #5341 zsmysquk · 2026-06-26 02:20:19
[slopfix] fix(self-host): repair dev compose generation, kitchen-sink serverful runner, and postgres/udb bootstrap #5340 wpvkrymu · 2026-06-25 22:53:05
[SLOP(claude-opus-4-8-high)] feat(universaldb): graceful postgres leader handoff on shutdown #5337 mvxpzxsl · 2026-06-25 20:34:04
[slopfix] test(universaldb): postgres leader failover + abort resolver task on driver drop #5336 tvmntvqo · 2026-06-25 20:34:03
[SLOP(claude-opus-4-8-high)] feat(ups): table-backed postgres transport with coalesced doorbell #5335 xutxrzrv · 2026-06-25 20:34:01
[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resolver driver overhaul #5329 xrrmomwz · 2026-06-25 00:55:07 👈
[SLOP(claude-opus-4-8)] feat(pegboard-envoy): rate-limit envoy ws ingress and cap get_pages, trim unused metrics #5327 tnzyzqrl · 2026-06-24 20:57:36
[SLOP(claude-opus-4-8)] feat(util): add rate limiter primitive and ingress throttles for actor create and gateway websocket #5326 pnyswwtm · 2026-06-24 20:57:35
[SLOP(claude-opus-4-8)] feat(rivetkit-core): add serde duration and size metrics #5322 nqkrytvr · 2026-06-23 22:49:13
Add observation for serde_bare #5321 oskoutmw · 2026-06-23 20:23:10
Add observation and ser/de metrics for serde calls #5320 ztnuwvts · 2026-06-23 20:23:09

Dependencies:

[SLOP(claude-opus-4-8-medium)] fix(rivetkit-napi): reap completed tasks from event-loop JoinSet (#5315) #5314 xxqwtrzp · 2026-06-22 22:42:01
main

Get stack: forklift get 5329
Push local edits: forklift submit
Merge when ready: forklift merge 5329

claude · 2026-06-25T01:03:17Z

PR Review: Postgres Leader-Resolver Driver Overhaul

This PR replaces the old per-transaction conflict_ranges exclusion-constraint approach (btree_gist, GiST-indexed ranges, Postgres advisory locking) with a centralized leader-resolver pattern: one elected node drains a udb_commit_requests queue, resolves conflicts in memory via the shared TransactionConflictTracker, applies writes in batched Postgres transactions, and replies to followers via NOTIFY. The architecture is much cleaner and avoids the distributed-deadlock and GiST-contention issues of the old approach.

Four confirmed findings below, ranked by severity.

1. Fresh-database cold-window spuriously rejects all writes for 5 seconds

engine/packages/universaldb/src/driver/postgres/resolver/mod.rs:141

recovery_floor reads last_value from udb_version_seq to determine the highest version a prior leader could have committed. On a brand-new database where no nextval has ever been called, PostgreSQL initialises last_value = 1 (the sequence start value) with is_called = false. recovery_floor therefore returns max(durable=0, seq_high=1) = 1, setting recovery_version = 1.

During the 5-second cold window (TXN_TIMEOUT), any commit with start_version < recovery_version is rejected. Every first-boot follower opens a snapshot with read_version = durable_version = 0, so start_version = 0 < recovery_version = 1, and every write fails as a conflict and retries. On a fresh install, all write transactions therefore stall until the 5-second cold window ends.

Fix: Account for is_called:

SELECT CASE WHEN is_called THEN last_value ELSE 0 END FROM udb_version_seq
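The same rule can be sketched on the Rust side. This is a minimal model, not the driver's code: the name recovery_floor comes from the review, but the signature, the i64 types, and the rejected_in_cold_window helper are assumptions made for illustration.

```rust
/// Highest version a prior leader could have committed.
///
/// `last_value` and `is_called` mirror the columns read from
/// `udb_version_seq`; `durable` is the durable version. The signature is an
/// assumption made for this sketch.
fn recovery_floor(durable: i64, last_value: i64, is_called: bool) -> i64 {
    // A sequence that has never been advanced reports its start value in
    // `last_value` with `is_called = false`, so no version was ever issued.
    let seq_high = if is_called { last_value } else { 0 };
    durable.max(seq_high)
}

/// Cold-window check as described in the review (hypothetical helper):
/// a commit is rejected when its snapshot predates the recovery version.
fn rejected_in_cold_window(start_version: i64, recovery_version: i64) -> bool {
    start_version < recovery_version
}
```

With the is_called guard, a fresh database yields a floor of 0, so a follower with start_version = 0 is no longer rejected; once nextval has been called, the floor follows last_value as before.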

2. LISTEN race during reconnect silently drops notifications for new subscribers

engine/packages/universaldb/src/driver/postgres/listener.rs:176

In connect_and_run, the reconnect sequence is:

1. Connect
2. Snapshot registered channels via iter_async
3. Issue LISTEN for each snapshotted channel
4. Set *client.lock() = Some(new_client) ← client set last

If a concurrent listen("new_channel") call arrives between steps 2 and 4, it finds client = None and skips the LISTEN. The channel is inserted into the scc::HashMap, but only after the snapshot was taken, so the re-LISTEN loop of the current reconnect does not pick it up. No LISTEN is issued for this channel until the next connection drop.

The inline comment says "the lifecycle task re-LISTENs on reconnect" — but for channels added in this window, that reconnect already happened.

Fix: Set *client = Some(new_client) before the re-LISTEN loop. New listen() calls that arrive after that point will immediately issue their own LISTEN on the active client.
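The reordering can be modelled with a small synchronous sketch. Everything here is a stand-in: FakeClient, Listener, publish_client, and relisten_all are hypothetical names, std::sync::Mutex replaces the real locks and scc::HashMap, and no Postgres connection is involved. It relies on the fact that LISTEN on a channel the session already listens to is a no-op, so a duplicate LISTEN is harmless.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Stand-in for a Postgres connection; records the channels it LISTENs on.
#[derive(Default)]
struct FakeClient {
    listening: HashSet<String>,
}

impl FakeClient {
    fn listen(&mut self, channel: &str) {
        // Repeating LISTEN for a registered channel does nothing in
        // Postgres, which a set models directly.
        self.listening.insert(channel.to_string());
    }
}

#[derive(Default)]
struct Listener {
    channels: Mutex<HashSet<String>>,
    client: Mutex<Option<FakeClient>>,
}

impl Listener {
    /// Subscriber path: register the channel, then LISTEN if a client is live.
    fn listen(&self, channel: &str) {
        self.channels.lock().unwrap().insert(channel.to_string());
        if let Some(client) = self.client.lock().unwrap().as_mut() {
            client.listen(channel);
        }
    }

    /// Reconnect, first: publish the client so new subscribers can use it.
    fn publish_client(&self, new_client: FakeClient) {
        *self.client.lock().unwrap() = Some(new_client);
    }

    /// Reconnect, second: snapshot the registered channels and re-LISTEN.
    fn relisten_all(&self) {
        let snapshot: Vec<String> =
            self.channels.lock().unwrap().iter().cloned().collect();
        let mut guard = self.client.lock().unwrap();
        if let Some(client) = guard.as_mut() {
            for channel in &snapshot {
                client.listen(channel);
            }
        }
    }

    fn is_listening(&self, channel: &str) -> bool {
        self.client
            .lock()
            .unwrap()
            .as_ref()
            .map_or(false, |c| c.listening.contains(channel))
    }
}
```

Because each subscriber registers its channel before checking the client, a channel added before publish_client lands in the snapshot, and one added afterwards issues its own LISTEN. Either way it ends up covered on the current connection.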

3. `wait_for_leader` polls with `sleep` instead of using the existing `watch::Receiver`

engine/packages/universaldb/src/driver/postgres/commit.rs:97

async fn wait_for_leader(shared: &Arc<PostgresShared>) -> Result<LeaseInfo> {
    let deadline = Instant::now() + LEADER_WAIT_TIMEOUT;
    loop {
        if let Some(lease) = shared.current_lease() { return Ok(lease); }
        if Instant::now() >= deadline { return Err(DatabaseError::NotCommitted.into()); }
        tokio::time::sleep(RESULT_POLL_INTERVAL).await;  // ← banned pattern
    }
}

This is the loop { check; sleep } pattern that CLAUDE.md prohibits: "If it is inside a loop { check; sleep } body, it is polling and should be event-driven instead." PostgresShared already has a watch::Sender<Option<LeaseInfo>> / watch::Receiver pair (lease_tx / lease_rx) that fires whenever set_lease is called — the infrastructure for event-driven notification is in place but not exposed to the call site.

Fix: Add a method to PostgresShared to subscribe to the lease watch channel, then rewrite as:

async fn wait_for_leader(shared: &Arc<PostgresShared>) -> Result<LeaseInfo> {
    let mut lease_rx = shared.subscribe_lease();
    let deadline = tokio::time::Instant::now() + LEADER_WAIT_TIMEOUT;
    loop {
        if let Some(lease) = shared.current_lease() { return Ok(lease); }
        tokio::select! {
            res = lease_rx.changed() => { if res.is_err() { break; } }
            _ = tokio::time::sleep_until(deadline) => {
                return Err(DatabaseError::NotCommitted.into());
            }
        }
    }
    Err(DatabaseError::NotCommitted.into())
}

4. GC query deletes `pending` rows — SQL does not match the stated invariant

engine/packages/universaldb/src/driver/postgres/database.rs:219–221

"DELETE FROM udb_commit_requests
 WHERE created_at < now() - ($1::bigint * interval '1 second')"

The constant is documented as: "Terminal and orphaned commit-request rows older than this are garbage collected." But the SQL has no status filter, so it deletes every row older than the cutoff, including pending ones.

A pending row stuck for >60 seconds (e.g., pool exhaustion, leader stall) is silently deleted. The waiting follower in await_result receives Ok(None) and returns NotCommitted, causing a retry — which is safe today — but this diverges from the documented invariant and would silently break any future assertion or metric that expects live pending rows to always exist.

Fix: Either add AND status != 'pending' to the DELETE, or update the constant's doc comment to reflect the actual behaviour.
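The difference between the current query and the first option can be written as two row predicates. This is only an illustrative model: the Status enum is hypothetical (the review names only the pending status, so Terminal stands in for every other value), and ages are given in whole seconds.

```rust
/// Row status. Only `pending` is named in the review; `Terminal` is a
/// hypothetical stand-in for every non-pending status.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Pending,
    Terminal,
}

/// Models the current query: `created_at < now() - ttl`, no status filter.
fn should_gc_unfiltered(age_secs: u64, ttl_secs: u64) -> bool {
    age_secs > ttl_secs
}

/// Models the first suggested fix:
/// `created_at < now() - ttl AND status != 'pending'`.
fn should_gc(status: Status, age_secs: u64, ttl_secs: u64) -> bool {
    should_gc_unfiltered(age_secs, ttl_secs) && status != Status::Pending
}
```

With a 60-second TTL, a pending row aged 61 seconds is deleted by the unfiltered predicate and kept by the filtered one.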

MasterPtato mentioned this pull request Jun 25, 2026

feat(pegboard-envoy): rate-limit envoy ws ingress and cap get_pages, trim unused metrics #5327

Open

[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resol…

6196b67

…ver driver overhaul

MasterPtato force-pushed the stack/slop-claude-opus-4-8-high-feat-universaldb-postgres-leader-resolver-driver-overhaul-xrrmomwz branch from d270b2b to 6196b67 Compare June 25, 2026 20:33

NathanFlurry changed the title ~~[SLOP(claude-opus-4-8-high)] feat(universaldb): postgres leader-resolver driver overhaul~~ feat(universaldb): postgres leader-resolver driver overhaul Jun 26, 2026
