Description
After a failover/switchover on a running cluster (bootstrapped with initdb, not a recovery/restore), the plugin-barman-cloud sidecar on replica pods keeps failing with "WAL archive check failed: Expected empty archive". This prevents the replica from becoming fully ready (1/2 containers).
The root cause is that barman-cloud-check-wal-archive is executed on every WAL Archive gRPC call, including on replica pods, and it fails because the S3 bucket already contains WAL files from previous timelines (which is expected after a switchover).
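To illustrate why the archive is legitimately non-empty: each WAL segment name encodes its timeline in the first 8 hex digits, so after timelines 1 → 2 → 3 the bucket necessarily holds segments from earlier timelines. A minimal sketch (the function name `timelineOf` is hypothetical, not part of the plugin):

```go
package main

import (
	"fmt"
	"strconv"
)

// timelineOf extracts the timeline ID from a 24-character WAL
// segment name: the first 8 hex digits encode the timeline.
func timelineOf(walName string) (uint64, error) {
	if len(walName) != 24 {
		return 0, fmt.Errorf("not a WAL segment name: %q", walName)
	}
	return strconv.ParseUint(walName[:8], 16, 32)
}

func main() {
	// After a switchover from timeline 1 to 2, the archive holds
	// segments from both timelines, so it is legitimately non-empty.
	for _, w := range []string{
		"000000010000000000000003", // archived while on timeline 1
		"000000020000000000000004", // archived after the switchover
	} {
		tl, _ := timelineOf(w)
		fmt.Printf("%s -> timeline %d\n", w, tl)
	}
}
```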
Environment
- CloudNativePG operator: v1.28.1
- plugin-barman-cloud: v0.11.0
- PostgreSQL: 18.3 (ghcr.io/cloudnative-pg/postgresql:18.3)
- Kubernetes: v1.31
- Object storage: Ceph RGW (S3-compatible, via Rook)
Cluster configuration
The cluster is bootstrapped with initdb (no recovery, no externalClusters):
spec:
  instances: 3
  bootstrap:
    initdb:
      database: system
      encoding: UTF8
      owner: app
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      enabled: true
      isWALArchiver: true
      parameters:
        barmanObjectName: my-object-store
Steps to reproduce
- Create a 3-instance CNPG cluster with initdb bootstrap and plugin-barman-cloud with isWALArchiver: true
- Wait for all 3 pods to be 2/2 Running
- A failover or switchover occurs (automatic or manual), changing the timeline (e.g., timeline 1 → 2 → 3)
- After the failover, one or more replica pods get stuck at 1/2: the plugin-barman-cloud sidecar blocks WAL archiving
What happens
After the switchover, the former primary is demoted to replica. The instance manager detects leftover WAL files and explicitly triggers archiving on the demoted pod:
{"msg":"Detected ready WAL files in a former primary, triggering WAL archiving"}
This causes the plugin-barman-cloud sidecar to attempt WAL archiving. However, the S3 bucket legitimately contains WAL files from previous timelines (archived by this same pod when it was primary). The plugin executes barman-cloud-check-wal-archive, finds the bucket is not empty, and fails:
barman-cloud-check-wal-archive: ERROR: WAL archive check failed for server <cluster-name>: Expected empty archive
plugin-barman-cloud sidecar log (replica pod):
{"level":"info","msg":"barman-cloud-check-wal-archive checking the first wal"}
{"level":"info","logger":"barman-cloud-check-wal-archive","msg":"ERROR: WAL archive check failed for server <name>: Expected empty archive","pipe":"stderr"}
{"level":"error","msg":"Error invoking barman-cloud-check-wal-archive",
"options":["--endpoint-url","http://<ceph-rgw>","--cloud-provider","aws-s3","s3://<bucket>","<server-name>"],
"exitCode":-1,"error":"exit status 1"}
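A possible direction for a fix (a sketch only; shouldCheckEmptyArchive and its parameters are hypothetical names, not the plugin's actual API) would be to gate the empty-archive check so it only runs where "expected empty" is meaningful: the first WAL archived by a freshly bootstrapped primary on timeline 1, never on replicas or demoted former primaries:

```go
package main

import "fmt"

// shouldCheckEmptyArchive is a hypothetical guard: the
// barman-cloud-check-wal-archive "expected empty" check is only
// meaningful for the very first WAL archived by a fresh primary
// on timeline 1. Replicas, and any instance on a later timeline
// (i.e. after a failover/switchover), should skip it.
func shouldCheckEmptyArchive(isPrimary bool, timeline uint32, firstWAL bool) bool {
	return isPrimary && timeline == 1 && firstWAL
}

func main() {
	// A demoted former primary on timeline 2 re-archiving leftover
	// WALs must not require an empty bucket.
	fmt.Println(shouldCheckEmptyArchive(false, 2, false)) // false
	// A fresh primary archiving its first WAL should still be checked.
	fmt.Println(shouldCheckEmptyArchive(true, 1, true)) // true
}
```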
Expected behavior
All cluster pods return to fully ready (2/2 Running) after a switchover/failover; WAL archiving on replicas must not fail just because the archive already contains WALs from previous timelines.