test(patroni): end-to-end failover, recovery & data-integrity suite#19
Merged
Conversation
Adds the previously-missing end-to-end Patroni failover test. It brings up a real 3-node Spilo/Patroni cluster and drives the production gap-detection / slot-recreation logic (replication.EnsureSlot — the core of the WAL "gap auditor" + recreate-on-detection mechanism) across real switchovers, instead of the mocked /cluster endpoint or simulated single-node DROP the existing tests use. One cluster bring-up, several phases: - create a (non-permanent) slot on the leader, capture last_confirmed, advance WAL; - real patroni_switchover → the slot is gone on the new leader → EnsureSlot must recreate it AND report the gap (start==last_confirmed, end>last_confirmed, bytes==end-start); - steady state: re-ensure on the present slot → SlotFound, no false-alarm; - second switchover (failback) → recreate+gap holds across repeated leader changes. Drives EnsureSlot over the topology's host-reachable ConnString (which re-discovers the leader each call) rather than follower.Coordinator, whose DSNFor would need to reach Patroni-reported container-internal addresses a host-side test can't route. Build-tagged `integration`; runs in the Docker CI integration job. Compiles under `go vet -tags integration`.
… assertion Expands the end-to-end Patroni suite from one gap-detection test to a comprehensive set, all //go:build integration against a real 3-node Spilo/Patroni cluster. Failover (gap detection / prevention): - WALContinuityEndToEnd: gap detected across one and repeated graceful switchovers; steady-state reports no false alarm. - MultiSlotAndBootstrap: each of several slots recreated with its own gap; a fresh slot reports no false alarm. - PermanentSlotSurvivesSwitchover: a Patroni permanent_slot is carried across the switchover (SlotFound, not recreated) — gap prevention. - HardLeaderKill: SIGKILL the leader; a replica is promoted; the slot is recreated with a gap. Recovery (the cluster healing): - KilledLeaderRejoinsAsReplica: revived node rejoins as a healthy replica (3 members, one leader). - ReplicaLossKeepsPrimary: killing a replica does not fail over; the primary stays writable with its slot; the replica recovers. - FrozenLeaderFailoverAndRejoin: docker pause fences the leader, a replica is promoted, unpause makes the ex-leader demote and rejoin. Data integrity (highest priority — the data we recover must be correct): - AcrossHardFailoverAndRejoin: a content-addressed digest is byte- identical on the promoted leader (no loss/corruption), and the revived node, read directly, converges to the leader's exact digest. - AcrossSwitchover: every committed row preserved across the handoff; the new leader durably holds further writes. Fix: the second-switchover assertion in WALContinuityEndToEnd was flaky (GapBytes could legitimately be 0 — the product is correct: a slot recreated with no WAL movement past last_confirmed has no gap). It now drives WAL forward and checkpoints on the promoted leader before reconciling, so the recreate point is deterministically past last_confirmed. Soaked locally: failover suite 10/10 over an hour; the 5 recovery + data-integrity tests 8/8 over an hour, with 24 data-integrity assertions and zero data loss/corruption.
The per-PR integration matrix runs `go test -tags=integration` over all packages with a 10-minute budget. The new Patroni failover / recovery / data-integrity tests each stand up a real 3-node Spilo/Patroni cluster, so the topology package alone now runs well past 10 minutes and tripped `panic: test timed out after 10m0s` in CI (not a logic failure — purely wallclock). Move them behind an additional `patroni` build tag and run them in a dedicated CI job (`make test-patroni`, 45-min budget) instead of the 4-way PG matrix — they're PostgreSQL-version independent (Spilo pins its own PG), so a single lane suffices. The standard integration matrix keeps only the light lifecycle test and is fast/green again.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
End-to-end Patroni coverage that previously didn't exist: real 3-node Spilo/Patroni clusters, real leader changes, exercising the product's actual gap-detection/slot-recreation logic (
replication.EnsureSlot) and — most importantly — verifying the data we recover is correct.All
//go:build integration; runs in thetest (integration, …)CI job and viamake test-integration.Failover — gap detection & prevention
permanent_slotis carried across the switchover (SlotFound, not recreated): gap prevention.Recovery — the cluster healing
docker pausefences the leader → replica promoted →unpause→ ex-leader demotes & rejoins.Data integrity — highest priority
Bug found & fixed
The second-switchover assertion in
WALContinuityEndToEndwas flaky:GapBytescould legitimately be 0 — the product is correct (a slot recreated with no WAL movement pastlast_confirmedhas no gap). The test now advances WAL + checkpoints on the promoted leader before reconciling, so the recreate point is deterministically pastlast_confirmed.Validation (local, with Docker)
Design note: drives
EnsureSlotover the topology's host-reachableConnString()(re-discovers the leader each call) rather thanfollower.Coordinator, whoseDSNForwould need container-internal addresses a host-side test can't route. The Coordinator's REST/event plumbing stays covered by the existinghttptestunit tests.