Former primary stuck at 1/2 after failover due to barman-cloud-check-wal-archive "Expected empty archive" #828

@MonicaMagoniGW

Description

After a failover/switchover on a running cluster (bootstrapped with initdb, not a recovery/restore), the plugin-barman-cloud sidecar on replica pods keeps failing with "WAL archive check failed: Expected empty archive". This prevents the replica from becoming fully ready (1/2 containers).

The root cause is that barman-cloud-check-wal-archive is executed on every WAL Archive gRPC call, including on replica pods, and it fails because the S3 bucket already contains WAL files from previous timelines (which is expected after a switchover).
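The failing check can be reproduced outside the sidecar by running the same command it invokes (the options below are taken from the sidecar log further down; the endpoint, bucket, and server name are placeholders):

```shell
# Run the same check the plugin sidecar executes on every WAL archive call.
# Against a bucket that already holds WAL segments from an earlier timeline,
# this exits non-zero with "Expected empty archive".
barman-cloud-check-wal-archive \
  --endpoint-url http://<ceph-rgw> \
  --cloud-provider aws-s3 \
  s3://<bucket> <server-name>
echo "exit code: $?"
```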

Environment

  • CloudNativePG operator: v1.28.1
  • plugin-barman-cloud: v0.11.0
  • PostgreSQL: 18.3 (ghcr.io/cloudnative-pg/postgresql:18.3)
  • Kubernetes: v1.31
  • Object storage: Ceph RGW (S3-compatible, via Rook)

Cluster configuration

The cluster is bootstrapped with initdb (no recovery, no externalClusters):

spec:
  instances: 3
  bootstrap:
    initdb:
      database: system
      encoding: UTF8
      owner: app
  plugins:
    - name: barman-cloud.cloudnative-pg.io
      enabled: true
      isWALArchiver: true
      parameters:
        barmanObjectName: my-object-store

Steps to reproduce

  1. Create a 3-instance CNPG cluster with initdb bootstrap and plugin-barman-cloud with isWALArchiver: true
  2. Wait for all 3 pods to be 2/2 Running
  3. Trigger a failover or switchover (automatic or manual); each one bumps the timeline (e.g., timeline 1 → 2 → 3)
  4. After the failover, one or more replica pods get stuck at 1/2, because the plugin-barman-cloud sidecar fails the archive check and blocks WAL archiving
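For a deterministic reproduction, step 3 can be triggered manually with the cnpg kubectl plugin (assuming it is installed; the cluster and instance names below are placeholders):

```shell
# Promote a replica, demoting the current primary and bumping the timeline.
kubectl cnpg promote <cluster-name> <cluster-name>-2

# Watch the demoted pod: it stays at 1/2 Ready while the sidecar keeps
# failing the barman-cloud-check-wal-archive call.
kubectl get pods -l cnpg.io/cluster=<cluster-name> -w
```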

What happens

After the switchover, the former primary is demoted to replica. The instance manager detects leftover WAL files and explicitly triggers archiving on the demoted pod:

{"msg":"Detected ready WAL files in a former primary, triggering WAL archiving"}

This causes the plugin-barman-cloud sidecar to attempt WAL archiving. However, the S3 bucket legitimately contains WAL files from previous timelines (archived by this same pod when it was primary). The plugin executes barman-cloud-check-wal-archive, finds the bucket is not empty, and fails:

barman-cloud-check-wal-archive: ERROR: WAL archive check failed for server <cluster-name>: Expected empty archive

plugin-barman-cloud sidecar log (replica pod):

{"level":"info","msg":"barman-cloud-check-wal-archive checking the first wal"}
{"level":"info","logger":"barman-cloud-check-wal-archive","msg":"ERROR: WAL archive check failed for server <name>: Expected empty archive","pipe":"stderr"}
{"level":"error","msg":"Error invoking barman-cloud-check-wal-archive",
  "options":["--endpoint-url","http://<ceph-rgw>","--cloud-provider","aws-s3","s3://<bucket>","<server-name>"],
  "exitCode":-1,"error":"exit status 1"}

Expected behavior

All cluster pods return to 2/2 Running after a switchover/failover: the archive check on a demoted former primary should tolerate WAL files from previous timelines instead of failing with "Expected empty archive".

    Labels

    bug (Something isn't working)
