AutonomousDatabaseBackup CRs remain ACTIVE after OCI backup expires and reconcile forever with 404 #240

@cweeks72

What happened?

AutonomousDatabaseBackup CRs can remain in Kubernetes as ACTIVE after the backing OCI Autonomous Database backup has expired or been deleted by OCI retention.

When the operator reconciles those stale CRs, it calls GetAutonomousDatabaseBackup for the old .spec.autonomousDatabaseBackupOCID, receives 404 NotAuthorizedOrNotFound, emits Warning/ReconcileFailed events, and logs "Reconciler error" each time. In our environment this repeats at roughly the controller reconcile interval, creating persistent log/event churn and repeated OCI API calls for backups that can never be found again.

Why this matters

For our use case, having historical OCI Autonomous Database backups mirrored into Kubernetes is of very limited value, while stale backup CRs create a lot of churn once the OCI-side backup ages out.

The best mitigation from an operator/user perspective would be one or both of:

  • handle 404 NotAuthorizedOrNotFound for previously known OCI backups as a terminal/stale condition, for example by marking the CR stale/expired/deleted or otherwise stopping the hot reconcile loop
  • provide a documented way to disable AutonomousDatabaseBackup mirroring/reconciliation entirely

Environment

  • oracle-database-operator v2.1.0 manifests
  • image pinned to container-registry.oracle.com/database/operator@sha256:b4729063f4ccd804c8bcc48576a49624090ff4f2c1b42e81deaafb55def7c622
  • Kubernetes: OKE v1.35.2
  • AutonomousDatabaseBackup CRs are apiVersion: database.oracle.com/v4
  • OCI Go SDK reported in logs: Oracle-GoSDK/65.105.1

Reproduction pattern

  1. Let the operator mirror OCI Autonomous Database backups as AutonomousDatabaseBackup CRs.
  2. Wait for one of the backing OCI Autonomous Database backups to expire or be deleted by OCI retention.
  3. Observe that the Kubernetes AutonomousDatabaseBackup CR remains present and ACTIVE, still referencing the expired/deleted OCI backup OCID.
  4. Restart the operator pod, or wait for normal reconcile.
  5. Observe repeated GetAutonomousDatabaseBackup 404 failures for the stale CR.

Expected behavior

Once the backing OCI backup no longer exists, the operator should not keep reconciling the Kubernetes mirror object forever as ACTIVE.

Possible acceptable outcomes:

  • mark the backup CR as expired/stale/not found
  • delete or otherwise stop reconciling the mirror object
  • back off substantially instead of retrying every reconcile period
  • allow operators to disable backup mirroring/reconciliation when they do not need Kubernetes backup mirror objects

Actual behavior

The stale backup CR remains ACTIVE. The operator reconciles it repeatedly, and each failed OCI lookup produces an error log entry and a Warning/ReconcileFailed event.

Sanitized log excerpt:

ERROR Reconciler error
  controller: autonomousdatabasebackup
  controllerGroup: database.oracle.com
  controllerKind: AutonomousDatabaseBackup
  error: Error returned by Database Service. Http Status Code: 404.
         Error Code: NotAuthorizedOrNotFound.
         Message: Authorization failed or requested resource not found.
         Operation Name: GetAutonomousDatabaseBackup
         Client Version: Oracle-GoSDK/65.105.1
         Request Endpoint: GET https://database.<region>.oraclecloud.com/20160918/autonomousDatabaseBackups/<expired-backup-ocid>

DEBUG events
  type: Warning
  object:
    kind: AutonomousDatabaseBackup
    apiVersion: database.oracle.com/v4
  reason: ReconcileFailed

Notes

With v2.1.0 and matching CRDs/RBAC, restarting the operator with stale backup CRs present reproduced the 404 reconcile loop but did not immediately crash the operator; the pod stayed Running with 0 restarts during the short observation window.

We had previously seen CrashLoopBackOff with many stale backup CRs present on an older/mismatched deployment, but there was also separate CRD/RBAC drift at that time. This issue is filed for the reproducible stale backup lifecycle/log-churn problem on v2.1.0.
