OCPBUGS-42837: Do not set Degraded=True on transient errors#436
OCPBUGS-42837: Do not set Degraded=True on transient errors#436nrb wants to merge 4 commits intoopenshift:mainfrom
Conversation
CloudConfigReconciler: gate transient errors behind a 2-minute window Three related fixes to stop upgrade-time API blips from immediately setting CloudConfigControllerDegraded=True: 1. Infrastructure NotFound now calls setAvailableCondition (nil return) instead of setDegradedCondition, matching the main controller's existing behaviour. 2. Errors are classified as transient (API blips: all Get/Create/Update calls, feature-gate informer not yet synced) or permanent (config problems that won't self-heal: nil platformStatus, unsupported platform, missing user config key, nil FeatureGateAccess, transform failure). 3. handleTransientError() only sets degraded after consecutiveFailureSince has been set for longer than transientDegradedThreshold (2 min); handleDegradeError() sets degraded immediately and returns nil so controller-runtime does not requeue (existing watches re-trigger when the underlying config changes). clearFailureWindow() is called at every successful reconcile exit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
|
@nrb: This pull request references Jira Issue OCPBUGS-42387, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds per-reconciler transient failure windows and unified transient vs. immediate-degrade error handlers across multiple controllers; routes prior direct-degrade/error returns through thresholded transient handling and updates tests to assert ClusterOperator Degraded/Available transitions using a fake clock. Changes
Sequence Diagram(s)sequenceDiagram
participant Reconciler
participant FailureWindow
participant Status as ClusterOperatorStatus
participant KubeAPI as Kubernetes API
Reconciler->>Reconciler: start reconciliation
alt success
Reconciler->>FailureWindow: clearFailureWindow()
Reconciler->>Status: set Available
Reconciler->>KubeAPI: update status (Available)
else transient error
Reconciler->>FailureWindow: check consecutiveFailureSince
alt not started
FailureWindow->>FailureWindow: set start = now
Reconciler->>KubeAPI: requeue (return error)
else within threshold
Reconciler->>KubeAPI: requeue (return error)
else threshold exceeded
Reconciler->>Status: set Degraded
Reconciler->>KubeAPI: update status (Degraded)
Reconciler->>KubeAPI: no requeue (watch triggers)
end
else permanent error
Reconciler->>Status: set Degraded immediately
Reconciler->>KubeAPI: update status (Degraded)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/retitle OCPBUGS-42837: Do not set Degraded=True on transient errors |
|
/jira refresh |
|
@nrb: This pull request references Jira Issue OCPBUGS-42387, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controllers/cloud_config_sync_controller_test.go (1)
512-632: Add explicit tests for the 2-minute transient degradation window.The PR’s primary behavior change is time-gating transient degradation, but there’s no direct assertion here for “before threshold stays non-degraded” and “after threshold degrades,” or for window reset after a successful reconcile.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/cloud_config_sync_controller_test.go` around lines 512 - 632, Add explicit unit tests exercising the 2-minute transient degradation window: write tests that call reconciler.Reconcile multiple times with a missing configmap key (using makeInfrastructureResource with Spec.CloudConfig.Key="notfound") and assert that cloudConfigControllerDegradedCondition remains false before the 2-minute threshold and becomes true after advancing time past 2 minutes; also add a test that after a successful reconcile (create the expected ConfigMap and call reconciler.Reconcile) the transient window/reset clears so subsequent missing-key reconciles start the timer anew. Use reconciler.Reconcile, makeInfrastructureResource, makeInfraStatus and inspect co.Status.Conditions for cloudConfigControllerDegradedCondition to locate and assert the condition transitions. Ensure tests simulate time advancement (or inject a clock into reconciler if available) so the 2-minute boundary is deterministically tested.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/cloud_config_sync_controller.go`:
- Around line 54-56: The code currently calls r.clearFailureWindow() before
r.setAvailableCondition(ctx) and returns on setAvailableCondition errors, which
resets consecutiveFailureSince even when availability update fails and also
bypasses handleTransientError; change each success-exit branch (where you
currently call r.clearFailureWindow() then setAvailableCondition) to first call
r.setAvailableCondition(ctx) and if it returns an error pass that error into
r.handleTransientError(ctx, req, err) (so transient failures are tracked by the
2-minute policy), and only after setAvailableCondition succeeds call
r.clearFailureWindow(); update all occurrences referencing clearFailureWindow(),
setAvailableCondition(ctx), handleTransientError, and consecutiveFailureSince
accordingly.
---
Nitpick comments:
In `@pkg/controllers/cloud_config_sync_controller_test.go`:
- Around line 512-632: Add explicit unit tests exercising the 2-minute transient
degradation window: write tests that call reconciler.Reconcile multiple times
with a missing configmap key (using makeInfrastructureResource with
Spec.CloudConfig.Key="notfound") and assert that
cloudConfigControllerDegradedCondition remains false before the 2-minute
threshold and becomes true after advancing time past 2 minutes; also add a test
that after a successful reconcile (create the expected ConfigMap and call
reconciler.Reconcile) the transient window/reset clears so subsequent
missing-key reconciles start the timer anew. Use reconciler.Reconcile,
makeInfrastructureResource, makeInfraStatus and inspect co.Status.Conditions for
cloudConfigControllerDegradedCondition to locate and assert the condition
transitions. Ensure tests simulate time advancement (or inject a clock into
reconciler if available) so the 2-minute boundary is deterministically tested.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
pkg/controllers/cloud_config_sync_controller.gopkg/controllers/cloud_config_sync_controller_test.go
| r.clearFailureWindow() | ||
| if err := r.setAvailableCondition(ctx); err != nil { | ||
| return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err) |
There was a problem hiding this comment.
Move failure-window reset after successful availability update, and route availability-update failures through transient handling.
Line 54, Line 74, Line 171, and Line 183 clear consecutiveFailureSince before setAvailableCondition succeeds. If setAvailableCondition fails, the reconcile is still failed, but the failure window is reset. Also, those failures currently bypass handleTransientError, so they are not tracked by the 2-minute transient policy.
💡 Proposed fix pattern (apply to all four success-exit branches)
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
- return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+ return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err))
+ }
+ r.clearFailureWindow()Also applies to: 74-76, 171-173, 183-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/cloud_config_sync_controller.go` around lines 54 - 56, The
code currently calls r.clearFailureWindow() before r.setAvailableCondition(ctx)
and returns on setAvailableCondition errors, which resets
consecutiveFailureSince even when availability update fails and also bypasses
handleTransientError; change each success-exit branch (where you currently call
r.clearFailureWindow() then setAvailableCondition) to first call
r.setAvailableCondition(ctx) and if it returns an error pass that error into
r.handleTransientError(ctx, req, err) (so transient failures are tracked by the
2-minute policy), and only after setAvailableCondition succeeds call
r.clearFailureWindow(); update all occurrences referencing clearFailureWindow(),
setAvailableCondition(ctx), handleTransientError, and consecutiveFailureSince
accordingly.
|
/retitle OCPBUGS-42837: Do not set Degraded=True on transient errors |
|
@nrb: This pull request references Jira Issue OCPBUGS-42837, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Mirror the CloudConfigReconciler pattern: transient API errors (Proxy get, system trust bundle read, ConfigMap write) are silently requeued and only set Degraded=True after transientDegradedThreshold (2 min) has elapsed. Errors that indicate corrupt cert data (merge failures) set Degraded=True immediately and return nil so controller-runtime does not requeue; existing watches re-trigger reconciliation when the data changes. Also adds two direct unit tests that verify the threshold gating via a fake clock, without running through the manager. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
|
@nrb: This pull request references Jira Issue OCPBUGS-42837, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
pkg/controllers/cloud_config_sync_controller.go (1)
55-57:⚠️ Potential issue | 🟠 MajorStill resetting failure window before availability update succeeds (previously reported).
This same issue remains:
clearFailureWindow()is called beforesetAvailableCondition()and availability-update failures are not routed through transient handling, so failure tracking can be lost on failed reconciles.Also applies to: 75-77, 172-174, 184-186
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57, clearFailureWindow() is being invoked before setAvailableCondition(), which causes failure tracking to be reset even when the availability update fails; change the call order so that setAvailableCondition(ctx) is called first and only on success call clearFailureWindow(), and apply the same change to the other occurrences (the blocks around setAvailableCondition()/clearFailureWindow() at the other noted locations) so transient failures are preserved and routed through the existing failure-handling logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 81-84: The failure window is being cleared before the availability
status is persisted, which can reset consecutive-failure tracking if
setStatusAvailable fails; change the flow so r.clearFailureWindow() is only
called after r.setStatusAvailable(ctx, conditionOverrides) returns no error
(i.e., move the clear call to follow successful completion of
setStatusAvailable), and apply the same fix for the other occurrence where
clearFailureWindow is invoked before a call to
setStatusAvailable/setStatusUnavailable (ensure clear is only executed after the
corresponding status setter succeeds).
- Around line 95-96: provisioningAllowed() currently returns an error that the
reconciler returns directly, bypassing the centralized retry/status logic;
instead of returning "ctrl.Result{}, err" in the branch after calling
provisioningAllowed, invoke the centralized handlers (handleTransientError or
handleDegradeError) so the error flows through the new policy path and
status/retry behavior is consistent. Locate the branch where provisioningAllowed
is called and replace the direct return with a call that forwards the error to
the appropriate handler (e.g., call handleTransientError(...) for transient
failures or handleDegradeError(...) for degraded conditions), passing the same
context, instance/request and logger so those handlers can set status and
determine the ctrl.Result and requeue behavior.
In `@pkg/controllers/trusted_ca_bundle_controller.go`:
- Around line 64-67: The code is clearing the failure window
(r.clearFailureWindow()) before successfully updating availability
(r.setAvailableCondition(ctx)), which can reset consecutiveFailureSince on a
failed status write; move the call to r.clearFailureWindow() so it only runs
after r.setAvailableCondition succeeds, and for any errors returned by
r.setAvailableCondition (the paths at the current calls around Lines 64, 77, and
111) route them through the controller's transient error handling path instead
of returning plain errors—i.e., wrap/return the setAvailableCondition error via
the existing transient-handling helper used in this controller so it increments
the failure window (consecutiveFailureSince) appropriately and triggers
transient requeue behavior.
---
Duplicate comments:
In `@pkg/controllers/cloud_config_sync_controller.go`:
- Around line 55-57: clearFailureWindow() is being invoked before
setAvailableCondition(), which causes failure tracking to be reset even when the
availability update fails; change the call order so that
setAvailableCondition(ctx) is called first and only on success call
clearFailureWindow(), and apply the same change to the other occurrences (the
blocks around setAvailableCondition()/clearFailureWindow() at the other noted
locations) so transient failures are preserved and routed through the existing
failure-handling logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 06897dbd-8d86-4ff7-9f18-2668da2179ed
📒 Files selected for processing (5)
pkg/controllers/cloud_config_sync_controller.gopkg/controllers/clusteroperator_controller.gopkg/controllers/clusteroperator_controller_test.gopkg/controllers/trusted_ca_bundle_controller.gopkg/controllers/trusted_ca_bundle_controller_test.go
| r.clearFailureWindow() | ||
| if err := r.setAvailableCondition(ctx); err != nil { | ||
| return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err) | ||
| } |
There was a problem hiding this comment.
Move failure-window reset after successful availability update, and track availability-update failures as transient.
Line 64 and Line 110 clear consecutiveFailureSince before setAvailableCondition succeeds. If that status write fails, the reconcile failed but the failure window is reset. Also, the setAvailableCondition error paths at Line 65, Line 77, and Line 111 return plain errors instead of going through transient handling.
Suggested fix pattern
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
- return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+ return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %w", err))
+ }
+ r.clearFailureWindow()Also applies to: 77-79, 110-113
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controllers/trusted_ca_bundle_controller.go` around lines 64 - 67, The
code is clearing the failure window (r.clearFailureWindow()) before successfully
updating availability (r.setAvailableCondition(ctx)), which can reset
consecutiveFailureSince on a failed status write; move the call to
r.clearFailureWindow() so it only runs after r.setAvailableCondition succeeds,
and for any errors returned by r.setAvailableCondition (the paths at the current
calls around Lines 64, 77, and 111) route them through the controller's
transient error handling path instead of returning plain errors—i.e.,
wrap/return the setAvailableCondition error via the existing transient-handling
helper used in this controller so it increments the failure window
(consecutiveFailureSince) appropriately and triggers transient requeue behavior.
|
@nrb: This pull request references Jira Issue OCPBUGS-42837, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controllers/clusteroperator_controller.go (1)
118-126:⚠️ Potential issue | 🟡 MinorRoute
setStatusAvailableandclearCloudControllerOwnerConditionfailures through transient error handling.The failures at lines 120 and 125 return errors directly, bypassing
handleTransientError. These are API calls that could fail transiently and should follow the same error-handling policy as other API failures in this reconciler for consistent behavior.Suggested fix
if err := r.setStatusAvailable(ctx, conditionOverrides); err != nil { klog.Errorf("Unable to sync cluster operator status: %s", err) - return ctrl.Result{}, err + return r.handleTransientError(ctx, conditionOverrides, err) } if err := r.clearCloudControllerOwnerCondition(ctx); err != nil { klog.Errorf("Unable to clear CloudControllerOwner condition: %s", err) - return ctrl.Result{}, err + return r.handleTransientError(ctx, conditionOverrides, err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/clusteroperator_controller.go` around lines 118 - 126, The two API-call errors from setStatusAvailable and clearCloudControllerOwnerCondition should be routed through the reconciler's transient error handler rather than returned directly; replace the direct returns for errors from r.setStatusAvailable(ctx, conditionOverrides) and r.clearCloudControllerOwnerCondition(ctx) with calls to r.handleTransientError(ctx, err, "<contextual message>") (or the existing handleTransientError signature used elsewhere) so the reconciler applies the same transient retry/backoff policy as other API failures and returns the Result/Error produced by handleTransientError.
🧹 Nitpick comments (1)
pkg/controllers/clusteroperator_controller.go (1)
241-294: Consider unifying error handling inprovisioningAllowed.The
provisioningAllowedmethod internally callssetStatusDegradedbefore returning errors (e.g., lines 245-249, 277-281). These errors are then routed throughhandleTransientErrorat line 94, which may callsetStatusDegradedagain after the threshold. While not incorrect (the later call updates the condition), this creates a dual-path status-setting pattern.This is existing behavior preserved by the PR, so it's not a blocking concern—just a potential future refactoring opportunity to centralize all status updates through the new transient/degrade handlers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/clusteroperator_controller.go` around lines 241 - 294, provisioningAllowed currently calls setStatusDegraded before returning errors which duplicates status updates later in handleTransientError; remove the early calls to setStatusDegraded inside provisioningAllowed (specifically the ones after checkControllerConditions error handling and after cloudprovider.IsCloudProviderExternal error handling) so that provisioningAllowed simply returns the error, and let the caller flow (which invokes handleTransientError) centralize calling setStatusDegraded/handleTransientError. Keep the existing returns and logs but eliminate the direct setStatusDegraded invocations to avoid dual-path status updates; references: provisioningAllowed, setStatusDegraded, handleTransientError, checkControllerConditions, cloudprovider.IsCloudProviderExternal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 79-89: The early-return path when the Infrastructure object is not
found (the r.Get call using client.ObjectKey{Name: infrastructureResourceName}
into infra) currently calls setStatusAvailable and returns without clearing any
prior transient failure window; update the branch so that after
setStatusAvailable succeeds you call r.clearFailureWindow(ctx,
conditionOverrides) (or the appropriate receiver method clearFailureWindow)
before returning ctrl.Result{}, nil, ensuring the failure window is reset; keep
existing error handling that routes through handleTransientError when
setStatusAvailable fails.
---
Outside diff comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 118-126: The two API-call errors from setStatusAvailable and
clearCloudControllerOwnerCondition should be routed through the reconciler's
transient error handler rather than returned directly; replace the direct
returns for errors from r.setStatusAvailable(ctx, conditionOverrides) and
r.clearCloudControllerOwnerCondition(ctx) with calls to
r.handleTransientError(ctx, err, "<contextual message>") (or the existing
handleTransientError signature used elsewhere) so the reconciler applies the
same transient retry/backoff policy as other API failures and returns the
Result/Error produced by handleTransientError.
---
Nitpick comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 241-294: provisioningAllowed currently calls setStatusDegraded
before returning errors which duplicates status updates later in
handleTransientError; remove the early calls to setStatusDegraded inside
provisioningAllowed (specifically the ones after checkControllerConditions error
handling and after cloudprovider.IsCloudProviderExternal error handling) so that
provisioningAllowed simply returns the error, and let the caller flow (which
invokes handleTransientError) centralize calling
setStatusDegraded/handleTransientError. Keep the existing returns and logs but
eliminate the direct setStatusDegraded invocations to avoid dual-path status
updates; references: provisioningAllowed, setStatusDegraded,
handleTransientError, checkControllerConditions,
cloudprovider.IsCloudProviderExternal.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7f5730af-7843-4b8d-9a94-135e4d83673a
📒 Files selected for processing (3)
pkg/controllers/cloud_config_sync_controller.gopkg/controllers/clusteroperator_controller.gopkg/controllers/clusteroperator_controller_test.go
|
/assign @mdbooth @JoelSpeed @damdo |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Add handleTransientError/handleDegradeError methods to CloudOperatorReconciler with an aggregatedTransientDegradedThreshold of 2m30s (longer than the sub-controller threshold of 2m, to accommodate sub-controller recovery time). Fix test: handleTransientError test was stepping the clock by transientDegradedThreshold (2m, the sub-controller constant) instead of aggregatedTransientDegradedThreshold (2m30s), so the threshold was never exceeded and the degraded condition was never set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/controllers/cloud_config_sync_controller.go (1)
55-57:⚠️ Potential issue | 🟠 MajorMove failure-window reset after successful availability update, and route availability-update failures through transient handling.
Line 55, Line 75, Line 172, and Line 184 clear
consecutiveFailureSincebeforesetAvailableCondition(ctx)succeeds. If status update fails, the reconcile failed but the window is reset; those failures also bypasshandleTransientError.Suggested fix pattern (apply to all four branches)
- r.clearFailureWindow() - if err := r.setAvailableCondition(ctx); err != nil { - return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err) - } + if err := r.setAvailableCondition(ctx); err != nil { + return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err)) + } + r.clearFailureWindow()Also applies to: 75-77, 172-174, 184-186
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57, The code currently calls r.clearFailureWindow() before attempting r.setAvailableCondition(ctx), which resets consecutiveFailureSince even when the availability update fails; change each branch (the blocks around r.clearFailureWindow() / r.setAvailableCondition(ctx) at the four affected sites) so that r.clearFailureWindow() is only called after r.setAvailableCondition(ctx) returns nil, and when r.setAvailableCondition(ctx) errors, do not clear the window but instead return via r.handleTransientError(ctx, err, req) (or the existing transient handling path) so availability-update failures increment the failure window and are treated as transient errors.pkg/controllers/clusteroperator_controller.go (1)
120-127:⚠️ Potential issue | 🟠 MajorRoute final status-write failures through
handleTransientErrorfor consistent transient policy.Line 120 and Line 125 still return raw errors. These are API/write-time failures and should use transient handling; otherwise they bypass consecutive-failure tracking.
Suggested patch
if err := r.setStatusAvailable(ctx, conditionOverrides); err != nil { klog.Errorf("Unable to sync cluster operator status: %s", err) - return ctrl.Result{}, err + return r.handleTransientError(ctx, conditionOverrides, err) } if err := r.clearCloudControllerOwnerCondition(ctx); err != nil { klog.Errorf("Unable to clear CloudControllerOwner condition: %s", err) - return ctrl.Result{}, err + return r.handleTransientError(ctx, conditionOverrides, err) } // successful reconcile, make sure the failure window is cleared. r.clearFailureWindow()Also applies to: 130-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/clusteroperator_controller.go` around lines 120 - 127, Replace direct returns of raw errors from API/write operations with the controller's transient-error wrapper so consecutive-failure tracking is preserved: where r.setStatusAvailable(ctx, ...) and r.clearCloudControllerOwnerCondition(ctx) (and the similar block at lines ~130-132) currently do "return ctrl.Result{}, err", change those to call r.handleTransientError(ctx, err, "<brief operation description>") and return its result instead; ensure you pass a short descriptive string (e.g., "sync cluster operator status" or "clear CloudControllerOwner condition") so the handler has context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/controllers/cloud_config_sync_controller.go`:
- Around line 55-57: The code currently calls r.clearFailureWindow() before
attempting r.setAvailableCondition(ctx), which resets consecutiveFailureSince
even when the availability update fails; change each branch (the blocks around
r.clearFailureWindow() / r.setAvailableCondition(ctx) at the four affected
sites) so that r.clearFailureWindow() is only called after
r.setAvailableCondition(ctx) returns nil, and when r.setAvailableCondition(ctx)
errors, do not clear the window but instead return via
r.handleTransientError(ctx, err, req) (or the existing transient handling path)
so availability-update failures increment the failure window and are treated as
transient errors.
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 120-127: Replace direct returns of raw errors from API/write
operations with the controller's transient-error wrapper so consecutive-failure
tracking is preserved: where r.setStatusAvailable(ctx, ...) and
r.clearCloudControllerOwnerCondition(ctx) (and the similar block at lines
~130-132) currently do "return ctrl.Result{}, err", change those to call
r.handleTransientError(ctx, err, "<brief operation description>") and return its
result instead; ensure you pass a short descriptive string (e.g., "sync cluster
operator status" or "clear CloudControllerOwner condition") so the handler has
context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b80279f0-507b-412a-835d-139e68043d32
📒 Files selected for processing (3)
pkg/controllers/cloud_config_sync_controller.gopkg/controllers/clusteroperator_controller.gopkg/controllers/clusteroperator_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controllers/clusteroperator_controller_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/controllers/cloud_config_sync_controller.go (1)
55-57:⚠️ Potential issue | 🟠 MajorRoute availability-status write failures through transient handling, and reset the failure window only after status write succeeds.
Line 76, Line 173, and Line 185 clear
consecutiveFailureSincebeforesetAvailableCondition(ctx)succeeds. Also, errors fromsetAvailableCondition(ctx)at Line 55, Line 77, Line 174, and Line 186 return directly instead of going throughhandleTransientError, so status-write blips are not tracked by the transient policy.Suggested fix pattern
- r.clearFailureWindow() - if err := r.setAvailableCondition(ctx); err != nil { - return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err) - } + if err := r.setAvailableCondition(ctx); err != nil { + return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err)) + } + r.clearFailureWindow()Apply the same error-routing pattern to the other
setAvailableConditioncall sites that currently return plain errors.Also applies to: 76-79, 173-176, 185-188
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57, The status-write error handling must route failures through the transient policy and only reset consecutiveFailureSince after setAvailableCondition succeeds; update all call sites using setAvailableCondition(ctx) (the ones that currently return fmt.Errorf directly and the ones that clear consecutiveFailureSince before the call) to instead call handleTransientError(ctx, req, err) when setAvailableCondition returns an error, and move any assignment that clears r.consecutiveFailureSince so it executes only after setAvailableCondition(ctx) completes successfully; ensure you update all occurrences of setAvailableCondition, consecutiveFailureSince, and the direct returns to use handleTransientError consistently.pkg/controllers/trusted_ca_bundle_controller.go (1)
64-66:⚠️ Potential issue | 🟠 MajorTreat
setAvailableConditionfailures as transient and avoid pre-clearing the failure window.
setAvailableCondition(ctx)failures at Line 64, Line 78, and Line 112 still bypasshandleTransientError. In addition, Line 111 clearsconsecutiveFailureSincebefore availability status persistence succeeds.Suggested fix pattern
- r.clearFailureWindow() - if err := r.setAvailableCondition(ctx); err != nil { - return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err) - } + if err := r.setAvailableCondition(ctx); err != nil { + return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %w", err)) + } + r.clearFailureWindow()Use the same transient-routing for all
setAvailableConditionerror returns in this reconciler.Also applies to: 78-80, 111-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controllers/trusted_ca_bundle_controller.go` around lines 64 - 66, The calls to setAvailableCondition should be treated as transient failures: replace direct error returns from setAvailableCondition(ctx) in the reconciler with the transient routing via r.handleTransientError(ctx, req, err, "<contextual message>") so they go through the same transient logic; additionally, do not clear the consecutiveFailureSince field before availability status is successfully persisted—move any reset/clearing of consecutiveFailureSince to after setAvailableCondition and the status update succeed (or only clear when there was no error) to avoid pre-clearing the failure window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/controllers/cloud_config_sync_controller.go`:
- Around line 55-57: The status-write error handling must route failures through
the transient policy and only reset consecutiveFailureSince after
setAvailableCondition succeeds; update all call sites using
setAvailableCondition(ctx) (the ones that currently return fmt.Errorf directly
and the ones that clear consecutiveFailureSince before the call) to instead call
handleTransientError(ctx, req, err) when setAvailableCondition returns an error,
and move any assignment that clears r.consecutiveFailureSince so it executes
only after setAvailableCondition(ctx) completes successfully; ensure you update
all occurrences of setAvailableCondition, consecutiveFailureSince, and the
direct returns to use handleTransientError consistently.
In `@pkg/controllers/trusted_ca_bundle_controller.go`:
- Around line 64-66: The calls to setAvailableCondition should be treated as
transient failures: replace direct error returns from setAvailableCondition(ctx)
in the reconciler with the transient routing via r.handleTransientError(ctx,
req, err, "<contextual message>") so they go through the same transient logic;
additionally, do not clear the consecutiveFailureSince field before availability
status is successfully persisted—move any reset/clearing of
consecutiveFailureSince to after setAvailableCondition and the status update
succeed (or only clear when there was no error) to avoid pre-clearing the
failure window.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8d790959-33c5-4118-a06f-480421c1139e
📒 Files selected for processing (4)
pkg/controllers/cloud_config_sync_controller.gopkg/controllers/clusteroperator_controller.gopkg/controllers/clusteroperator_controller_test.gopkg/controllers/trusted_ca_bundle_controller.go
The test "config should not be updated if source and target config content are identical" called reconciler.Reconcile() directly while the manager was also running the same reconciler in a background goroutine. Both goroutines could access consecutiveFailureSince concurrently, which the Go race detector flags. Use a fresh CloudConfigReconciler instance (not registered with the manager) for the direct call. It shares the thread-safe API client but owns its own consecutiveFailureSince field, so there is no shared mutable state with the manager's copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| }, timeout).Should(Succeed()) | ||
| initialCMresourceVersion := syncedCloudConfigMap.ResourceVersion | ||
|
|
||
| // Introducing the consecutiveFailureWindow means that there's a field that could be racy |
There was a problem hiding this comment.
I don't love this; definitely open to alternatives.
There was a problem hiding this comment.
Using some sort of atomic timestamp or lock?
|
@nrb: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| } | ||
| return ctrl.Result{}, err | ||
| // Skip if the infrastructure resource doesn't exist. | ||
| r.clearFailureWindow() |
There was a problem hiding this comment.
To avoid having to set this on every possible happy exit path, what about creating named return variables and using a defer that checks if the error return is nil before calling this?
| }, timeout).Should(Succeed()) | ||
| initialCMresourceVersion := syncedCloudConfigMap.ResourceVersion | ||
|
|
||
| // Introducing the consecutiveFailureWindow means that there's a field that could be racy |
There was a problem hiding this comment.
Using some sort of atomic timestamp or lock?
| // Note: clearFailureWindow is intentionally NOT called here. This path did not | ||
| // exercise the full reconcile logic, so an ongoing transient failure window | ||
| // (set by a previous reconcile pass) should not be reset. |
There was a problem hiding this comment.
But we may not reconcile for some time after this right? There's no requeue.
Which means the next error will blow the timeout without actually seeing a transient error multiple times?
CloudConfigReconciler: gate transient errors behind a 2-minute window
Three related fixes to stop upgrade-time API blips from immediately setting CloudConfigControllerDegraded=True:
Infrastructure NotFound now calls setAvailableCondition (nil return) instead of setDegradedCondition, matching the main controller's existing behaviour.
Errors are classified as transient (API blips: all Get/Create/Update calls, feature-gate informer not yet synced) or permanent (config problems that won't self-heal: nil platformStatus, unsupported platform, missing user config key, nil FeatureGateAccess, transform failure).
handleTransientError() only sets degraded after consecutiveFailureSince has been set for longer than transientDegradedThreshold (2 min); handleDegradeError() sets degraded immediately and returns nil so controller-runtime does not requeue (existing watches re-trigger when the underlying config changes). clearFailureWindow() is called at every successful reconcile exit.
Summary by CodeRabbit
Bug Fixes
Tests