test(patroni): end-to-end failover, recovery & data-integrity suite #19

Merged
postgresql007 merged 3 commits into main from test/patroni-failover-e2e
Jun 29, 2026

@postgresql007 postgresql007 commented Jun 29, 2026

End-to-end Patroni coverage that previously didn't exist: real 3-node Spilo/Patroni clusters, real leader changes, exercising the product's actual gap-detection/slot-recreation logic (replication.EnsureSlot) and — most importantly — verifying the data we recover is correct.

All //go:build integration; runs in the test (integration, …) CI job and via make test-integration.

Failover — gap detection & prevention

  • WALContinuityEndToEnd — gap detected across one + repeated graceful switchovers; steady-state = no false alarm.
  • MultiSlotAndBootstrap — each of several slots recreated with its own gap; a fresh slot = no false alarm.
  • PermanentSlotSurvivesSwitchover — a Patroni permanent_slot is carried across the switchover (SlotFound, not recreated): gap prevention.
  • HardLeaderKill — SIGKILL the leader → replica promoted → slot recreated with a gap.

Recovery — the cluster healing

  • KilledLeaderRejoinsAsReplica — revived node rejoins as a healthy replica (3 members, one leader).
  • ReplicaLossKeepsPrimary — killing a replica does not fail over; primary stays writable with its slot; replica recovers.
  • FrozenLeaderFailoverAndRejoin: docker pause fences the leader → replica promoted → unpause → ex-leader demotes & rejoins.
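
The "3 members, one leader" condition used by the recovery tests can be checked from Patroni's REST `/cluster` response. The following is a minimal sketch, not the suite's actual helper: `checkHealthy`, the struct names and the sample payload are hypothetical, and it assumes the member fields `name`, `role` and `state`, with a replica counted as healthy when its state is `running` or `streaming`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// member holds the subset of a /cluster member entry this sketch reads.
type member struct {
	Name  string `json:"name"`
	Role  string `json:"role"`
	State string `json:"state"`
}

type cluster struct {
	Members []member `json:"members"`
}

// checkHealthy (hypothetical helper) returns nil when the payload lists
// exactly wantMembers members, exactly one leader, and no unhealthy member.
func checkHealthy(body []byte, wantMembers int) error {
	var c cluster
	if err := json.Unmarshal(body, &c); err != nil {
		return fmt.Errorf("decode /cluster: %w", err)
	}
	if len(c.Members) != wantMembers {
		return fmt.Errorf("got %d members, want %d", len(c.Members), wantMembers)
	}
	leaders := 0
	for _, m := range c.Members {
		if m.Role == "leader" {
			leaders++
		}
		// Assumption: "running" and "streaming" are the healthy states.
		if m.State != "running" && m.State != "streaming" {
			return fmt.Errorf("member %s is in state %q", m.Name, m.State)
		}
	}
	if leaders != 1 {
		return fmt.Errorf("got %d leaders, want 1", leaders)
	}
	return nil
}

func main() {
	// Illustrative payload, not captured from the suite.
	body := []byte(`{"members":[
		{"name":"node1","role":"replica","state":"streaming"},
		{"name":"node2","role":"leader","state":"running"},
		{"name":"node3","role":"replica","state":"streaming"}]}`)
	fmt.Println(checkHealthy(body, 3))
}
```

In a real test this check would typically sit inside a polling loop with a deadline, since a revived node needs time to rejoin.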

Data integrity — highest priority

  • AcrossHardFailoverAndRejoin — a content-addressed digest is byte-identical on the promoted leader (no loss/corruption); the revived node, read directly, converges to the leader's exact digest.
  • AcrossSwitchover — every committed row preserved across the handoff; new leader durably holds further writes.
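
One way to build a digest that can be compared byte for byte across nodes is sketched below. The suite's actual scheme is not shown in this PR text, so the `Row` shape, the `Digest` helper and the choice of SHA-256 over ID-sorted, length-prefixed rows are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// Row is a hypothetical test row: a primary key plus an opaque payload.
type Row struct {
	ID      int64
	Payload []byte
}

// Digest hashes rows in primary-key order so that the result depends only
// on table content, not on the order a node happens to return rows in.
// Each payload is length-prefixed so row boundaries cannot blur together.
func Digest(rows []Row) string {
	sorted := append([]Row(nil), rows...) // copy; do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	h := sha256.New()
	var buf [8]byte
	for _, r := range sorted {
		binary.BigEndian.PutUint64(buf[:], uint64(r.ID))
		h.Write(buf[:])
		binary.BigEndian.PutUint64(buf[:], uint64(len(r.Payload)))
		h.Write(buf[:])
		h.Write(r.Payload)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	leader := []Row{{ID: 1, Payload: []byte("alpha")}, {ID: 2, Payload: []byte("beta")}}
	revived := []Row{{ID: 2, Payload: []byte("beta")}, {ID: 1, Payload: []byte("alpha")}}
	fmt.Println("digests match:", Digest(leader) == Digest(revived))
}
```

Comparing one digest string per node keeps the assertion cheap while still catching a lost row, an extra row, or a single changed byte.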

Bug found & fixed

The second-switchover assertion in WALContinuityEndToEnd was flaky: GapBytes could legitimately be 0 — the product is correct (a slot recreated with no WAL movement past last_confirmed has no gap). The test now advances WAL + checkpoints on the promoted leader before reconciling, so the recreate point is deterministically past last_confirmed.

Validation (local, with Docker)

  • Failover suite: 10/10 iterations over an hour.
  • Recovery + data-integrity (5 tests): 8/8 iterations over an hour — 24 data-integrity assertions, zero data loss/corruption.

Design note: drives EnsureSlot over the topology's host-reachable ConnString() (re-discovers the leader each call) rather than follower.Coordinator, whose DSNFor would need container-internal addresses a host-side test can't route. The Coordinator's REST/event plumbing stays covered by the existing httptest unit tests.

Adds the previously-missing end-to-end Patroni failover test. It brings
up a real 3-node Spilo/Patroni cluster and drives the production
gap-detection / slot-recreation logic (replication.EnsureSlot — the core
of the WAL "gap auditor" + recreate-on-detection mechanism) across real
switchovers, instead of the mocked /cluster endpoint or simulated
single-node DROP the existing tests use.

One cluster bring-up, several phases:
- create a (non-permanent) slot on the leader, capture last_confirmed,
  advance WAL;
- real patroni_switchover → the slot is gone on the new leader →
  EnsureSlot must recreate it AND report the gap (start==last_confirmed,
  end>last_confirmed, bytes==end-start);
- steady state: re-ensure on the present slot → SlotFound, no false-alarm;
- second switchover (failback) → recreate+gap holds across repeated
  leader changes.
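
The gap assertions in the switchover phase (start==last_confirmed, end>last_confirmed, bytes==end-start) reduce to arithmetic on WAL positions. The sketch below shows that arithmetic under stated assumptions: `LSN`, `ParseLSN`, `Gap` and `CheckGap` are hypothetical names, not the types replication.EnsureSlot actually returns. The only PostgreSQL fact relied on is that a `pg_lsn` prints as two hexadecimal halves separated by a slash.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// LSN is a WAL position expressed as a 64-bit byte offset.
type LSN uint64

// ParseLSN parses the textual pg_lsn form "HI/LO", two 32-bit hex halves.
func ParseLSN(s string) (LSN, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return LSN(h<<32 | l), nil
}

// Gap is a hypothetical stand-in for the gap a slot recreation reports.
type Gap struct {
	Start, End LSN
	Bytes      uint64
}

// CheckGap applies the three assertions from the commit message.
func CheckGap(g Gap, lastConfirmed LSN) error {
	switch {
	case g.Start != lastConfirmed:
		return fmt.Errorf("gap start %d != last_confirmed %d", g.Start, lastConfirmed)
	case g.End <= lastConfirmed:
		return fmt.Errorf("gap end %d is not past last_confirmed %d", g.End, lastConfirmed)
	case g.Bytes != uint64(g.End-g.Start):
		return fmt.Errorf("gap bytes %d != end-start %d", g.Bytes, uint64(g.End-g.Start))
	}
	return nil
}

func main() {
	last, _ := ParseLSN("0/3000060")
	end, _ := ParseLSN("0/5000028")
	g := Gap{Start: last, End: end, Bytes: uint64(end - last)}
	fmt.Println("gap bytes:", g.Bytes, "check:", CheckGap(g, last))
}
```

Working on the numeric form avoids comparing LSN strings lexically, which gives wrong answers once the hex halves differ in length.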

Drives EnsureSlot over the topology's host-reachable ConnString (which
re-discovers the leader each call) rather than follower.Coordinator,
whose DSNFor would need to reach Patroni-reported container-internal
addresses a host-side test can't route. Build-tagged `integration`;
runs in the Docker CI integration job. Compiles under
`go vet -tags integration`.
… assertion

Expands the end-to-end Patroni suite from one gap-detection test to a
comprehensive set, all //go:build integration against a real 3-node
Spilo/Patroni cluster.

Failover (gap detection / prevention):
- WALContinuityEndToEnd: gap detected across one and repeated graceful
  switchovers; steady-state reports no false alarm.
- MultiSlotAndBootstrap: each of several slots recreated with its own
  gap; a fresh slot reports no false alarm.
- PermanentSlotSurvivesSwitchover: a Patroni permanent_slot is carried
  across the switchover (SlotFound, not recreated) — gap prevention.
- HardLeaderKill: SIGKILL the leader; a replica is promoted; the slot is
  recreated with a gap.

Recovery (the cluster healing):
- KilledLeaderRejoinsAsReplica: revived node rejoins as a healthy
  replica (3 members, one leader).
- ReplicaLossKeepsPrimary: killing a replica does not fail over; the
  primary stays writable with its slot; the replica recovers.
- FrozenLeaderFailoverAndRejoin: docker pause fences the leader, a
  replica is promoted, unpause makes the ex-leader demote and rejoin.

Data integrity (highest priority — the data we recover must be correct):
- AcrossHardFailoverAndRejoin: a content-addressed digest is byte-
  identical on the promoted leader (no loss/corruption), and the revived
  node, read directly, converges to the leader's exact digest.
- AcrossSwitchover: every committed row preserved across the handoff;
  the new leader durably holds further writes.

Fix: the second-switchover assertion in WALContinuityEndToEnd was flaky
(GapBytes could legitimately be 0 — the product is correct: a slot
recreated with no WAL movement past last_confirmed has no gap). It now
drives WAL forward and checkpoints on the promoted leader before
reconciling, so the recreate point is deterministically past
last_confirmed.

Soaked locally: failover suite 10/10 over an hour; the 5 recovery +
data-integrity tests 8/8 over an hour, with 24 data-integrity assertions
and zero data loss/corruption.
@postgresql007 postgresql007 changed the title from "test(patroni): end-to-end WAL gap-detection across real failovers" to "test(patroni): end-to-end failover, recovery & data-integrity suite" Jun 29, 2026
The per-PR integration matrix runs `go test -tags=integration` over all
packages with a 10-minute budget. The new Patroni failover / recovery /
data-integrity tests each stand up a real 3-node Spilo/Patroni cluster, so
the topology package alone now runs well past 10 minutes and tripped
`panic: test timed out after 10m0s` in CI (not a logic failure — purely
wallclock).

Move them behind an additional `patroni` build tag and run them in a
dedicated CI job (`make test-patroni`, 45-min budget) instead of the 4-way
PG matrix — they're PostgreSQL-version independent (Spilo pins its own PG),
so a single lane suffices. The standard integration matrix keeps only the
light lifecycle test and is fast/green again.
@postgresql007 postgresql007 merged commit 091a45d into main Jun 29, 2026
34 checks passed
@postgresql007 postgresql007 deleted the test/patroni-failover-e2e branch June 29, 2026 19:54