test(patroni): end-to-end failover, recovery & data-integrity suite by postgresql007 · Pull Request #19 · cybertec-postgresql/pg_hardstorage

postgresql007 · 2026-06-29T11:30:01Z

End-to-end Patroni coverage that previously didn't exist: real 3-node Spilo/Patroni clusters, real leader changes, exercising the product's actual gap-detection/slot-recreation logic (replication.EnsureSlot) and — most importantly — verifying the data we recover is correct.

All //go:build integration; runs in the test (integration, …) CI job and via make test-integration.

Failover — gap detection & prevention

WALContinuityEndToEnd — gap detected across one + repeated graceful switchovers; steady-state = no false alarm.
MultiSlotAndBootstrap — each of several slots recreated with its own gap; a fresh slot = no false alarm.
PermanentSlotSurvivesSwitchover — a Patroni permanent_slot is carried across the switchover (SlotFound, not recreated): gap prevention.
HardLeaderKill — SIGKILL the leader → replica promoted → slot recreated with a gap.

Recovery — the cluster healing

KilledLeaderRejoinsAsReplica — revived node rejoins as a healthy replica (3 members, one leader).
ReplicaLossKeepsPrimary — killing a replica does not fail over; primary stays writable with its slot; replica recovers.
FrozenLeaderFailoverAndRejoin — docker pause fences the leader → replica promoted → unpause → ex-leader demotes & rejoins.

Data integrity — highest priority

AcrossHardFailoverAndRejoin — a content-addressed digest is byte-identical on the promoted leader (no loss/corruption); the revived node, read directly, converges to the leader's exact digest.
AcrossSwitchover — every committed row preserved across the handoff; new leader durably holds further writes.

Bug found & fixed

The second-switchover assertion in WALContinuityEndToEnd was flaky: GapBytes could legitimately be 0 — the product is correct (a slot recreated with no WAL movement past last_confirmed has no gap). The test now advances WAL + checkpoints on the promoted leader before reconciling, so the recreate point is deterministically past last_confirmed.

Validation (local, with Docker)

Failover suite: 10/10 iterations over an hour.
Recovery + data-integrity (5 tests): 8/8 iterations over an hour — 24 data-integrity assertions, zero data loss/corruption.

Design note: drives EnsureSlot over the topology's host-reachable ConnString() (re-discovers the leader each call) rather than follower.Coordinator, whose DSNFor would need container-internal addresses a host-side test can't route. The Coordinator's REST/event plumbing stays covered by the existing httptest unit tests.

Adds the previously-missing end-to-end Patroni failover test. It brings up a real 3-node Spilo/Patroni cluster and drives the production gap-detection / slot-recreation logic (replication.EnsureSlot — the core of the WAL "gap auditor" + recreate-on-detection mechanism) across real switchovers, instead of the mocked /cluster endpoint or simulated single-node DROP the existing tests use. One cluster bring-up, several phases: - create a (non-permanent) slot on the leader, capture last_confirmed, advance WAL; - real patroni_switchover → the slot is gone on the new leader → EnsureSlot must recreate it AND report the gap (start==last_confirmed, end>last_confirmed, bytes==end-start); - steady state: re-ensure on the present slot → SlotFound, no false-alarm; - second switchover (failback) → recreate+gap holds across repeated leader changes. Drives EnsureSlot over the topology's host-reachable ConnString (which re-discovers the leader each call) rather than follower.Coordinator, whose DSNFor would need to reach Patroni-reported container-internal addresses a host-side test can't route. Build-tagged `integration`; runs in the Docker CI integration job. Compiles under `go vet -tags integration`.

… assertion Expands the end-to-end Patroni suite from one gap-detection test to a comprehensive set, all //go:build integration against a real 3-node Spilo/Patroni cluster. Failover (gap detection / prevention): - WALContinuityEndToEnd: gap detected across one and repeated graceful switchovers; steady-state reports no false alarm. - MultiSlotAndBootstrap: each of several slots recreated with its own gap; a fresh slot reports no false alarm. - PermanentSlotSurvivesSwitchover: a Patroni permanent_slot is carried across the switchover (SlotFound, not recreated) — gap prevention. - HardLeaderKill: SIGKILL the leader; a replica is promoted; the slot is recreated with a gap. Recovery (the cluster healing): - KilledLeaderRejoinsAsReplica: revived node rejoins as a healthy replica (3 members, one leader). - ReplicaLossKeepsPrimary: killing a replica does not fail over; the primary stays writable with its slot; the replica recovers. - FrozenLeaderFailoverAndRejoin: docker pause fences the leader, a replica is promoted, unpause makes the ex-leader demote and rejoin. Data integrity (highest priority — the data we recover must be correct): - AcrossHardFailoverAndRejoin: a content-addressed digest is byte- identical on the promoted leader (no loss/corruption), and the revived node, read directly, converges to the leader's exact digest. - AcrossSwitchover: every committed row preserved across the handoff; the new leader durably holds further writes. Fix: the second-switchover assertion in WALContinuityEndToEnd was flaky (GapBytes could legitimately be 0 — the product is correct: a slot recreated with no WAL movement past last_confirmed has no gap). It now drives WAL forward and checkpoints on the promoted leader before reconciling, so the recreate point is deterministically past last_confirmed. Soaked locally: failover suite 10/10 over an hour; the 5 recovery + data-integrity tests 8/8 over an hour, with 24 data-integrity assertions and zero data loss/corruption.

The per-PR integration matrix runs `go test -tags=integration` over all packages with a 10-minute budget. The new Patroni failover / recovery / data-integrity tests each stand up a real 3-node Spilo/Patroni cluster, so the topology package alone now runs well past 10 minutes and tripped `panic: test timed out after 10m0s` in CI (not a logic failure — purely wallclock). Move them behind an additional `patroni` build tag and run them in a dedicated CI job (`make test-patroni`, 45-min budget) instead of the 4-way PG matrix — they're PostgreSQL-version independent (Spilo pins its own PG), so a single lane suffices. The standard integration matrix keeps only the light lifecycle test and is fast/green again.

postgresql007 added 2 commits June 29, 2026 13:29

postgresql007 changed the title ~~test(patroni): end-to-end WAL gap-detection across real failovers~~ test(patroni): end-to-end failover, recovery & data-integrity suite Jun 29, 2026

postgresql007 merged commit 091a45d into main Jun 29, 2026
34 checks passed

postgresql007 deleted the test/patroni-failover-e2e branch June 29, 2026 19:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(patroni): end-to-end failover, recovery & data-integrity suite#19

test(patroni): end-to-end failover, recovery & data-integrity suite#19
postgresql007 merged 3 commits into
mainfrom
test/patroni-failover-e2e

postgresql007 commented Jun 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

postgresql007 commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Failover — gap detection & prevention

Recovery — the cluster healing

Data integrity — highest priority

Bug found & fixed

Validation (local, with Docker)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

postgresql007 commented Jun 29, 2026 •

edited

Loading