What happened?
AutonomousDatabaseBackup CRs can remain in Kubernetes as ACTIVE after the backing OCI Autonomous Database backup has expired or been deleted by OCI retention.
When the operator reconciles those stale CRs, it calls GetAutonomousDatabaseBackup for the old .spec.autonomousDatabaseBackupOCID, receives 404 NotAuthorizedOrNotFound, emits Warning/ReconcileFailed events, and logs Reconciler error repeatedly. In our environment this repeats at about the controller reconcile interval, creating persistent log/event churn and repeated OCI API calls for backups that cannot recover.
Why this matters
For our use case, having historical OCI Autonomous Database backups mirrored into Kubernetes is of very limited value, while stale backup CRs create a lot of churn once the OCI-side backup ages out.
The best mitigation from an operator/user perspective would be one or both of:
- handle 404 NotAuthorizedOrNotFound for previously known OCI backups as a terminal/stale condition, for example by marking the CR stale/expired/deleted or otherwise stopping the hot reconcile loop
- provide a documented way to disable AutonomousDatabaseBackup mirroring/reconciliation entirely
Environment
- oracle-database-operator v2.1.0 manifests
- image pinned to container-registry.oracle.com/database/operator@sha256:b4729063f4ccd804c8bcc48576a49624090ff4f2c1b42e81deaafb55def7c622
- Kubernetes: OKE v1.35.2
- AutonomousDatabaseBackup CRs are apiVersion: database.oracle.com/v4
- OCI Go SDK reported in logs: Oracle-GoSDK/65.105.1
Reproduction pattern
- Let the operator mirror OCI Autonomous Database backups as AutonomousDatabaseBackup CRs.
- Wait for one of the backing OCI Autonomous Database backups to expire or be deleted by OCI retention.
- Observe that the Kubernetes AutonomousDatabaseBackup CR remains present and ACTIVE, still referencing the expired/deleted OCI backup OCID.
- Restart the operator pod, or wait for normal reconcile.
- Observe repeated GetAutonomousDatabaseBackup 404 failures for the stale CR.
Expected behavior
Once the backing OCI backup no longer exists, the operator should not keep reconciling the Kubernetes mirror object forever as ACTIVE.
Possible acceptable outcomes:
- mark the backup CR as expired/stale/not found
- delete or otherwise stop reconciling the mirror object
- back off substantially instead of retrying every reconcile period
- allow operators to disable backup mirroring/reconciliation when they do not need Kubernetes backup mirror objects
Actual behavior
The stale backup CR remains ACTIVE. The operator repeatedly reconciles it and logs/errors on each failed OCI lookup.
Sanitized log excerpt:
ERROR Reconciler error
controller: autonomousdatabasebackup
controllerGroup: database.oracle.com
controllerKind: AutonomousDatabaseBackup
error: Error returned by Database Service. Http Status Code: 404.
Error Code: NotAuthorizedOrNotFound.
Message: Authorization failed or requested resource not found.
Operation Name: GetAutonomousDatabaseBackup
Client Version: Oracle-GoSDK/65.105.1
Request Endpoint: GET https://database.<region>.oraclecloud.com/20160918/autonomousDatabaseBackups/<expired-backup-ocid>
DEBUG events
type: Warning
object:
kind: AutonomousDatabaseBackup
apiVersion: database.oracle.com/v4
reason: ReconcileFailed
Notes
With v2.1.0 and matching CRDs/RBAC, restarting the operator with stale backup CRs present reproduced the 404 reconcile loop but did not immediately crash the operator; the pod stayed Running with 0 restarts during the short observation window.
We had previously seen CrashLoopBackOff with many stale backup CRs present on an older/mismatched deployment, but there was also separate CRD/RBAC drift at that time. This issue is filed for the reproducible stale backup lifecycle/log-churn problem on v2.1.0.