OCPBUGS-42837: Do not set Degraded=True on transient errors #436

Open
nrb wants to merge 4 commits into openshift:main from nrb:OCPBUGS-42837

Conversation

@nrb nrb commented Mar 3, 2026

CloudConfigReconciler: gate transient errors behind a 2-minute window

Three related fixes to stop upgrade-time API blips from immediately setting CloudConfigControllerDegraded=True:

  1. Infrastructure NotFound now calls setAvailableCondition (nil return) instead of setDegradedCondition, matching the main controller's existing behaviour.

  2. Errors are classified as transient (API blips: all Get/Create/Update calls, feature-gate informer not yet synced) or permanent (config problems that won't self-heal: nil platformStatus, unsupported platform, missing user config key, nil FeatureGateAccess, transform failure).

  3. handleTransientError() only sets degraded after consecutiveFailureSince has been set for longer than transientDegradedThreshold (2 min); handleDegradeError() sets degraded immediately and returns nil so controller-runtime does not requeue (existing watches re-trigger when the underlying config changes). clearFailureWindow() is called at every successful reconcile exit.
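The three helpers described in item 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `reconciler` struct, the injected `now` clock, the `degraded` flag (standing in for the ClusterOperator condition), `errAPIBlip`, and `simulate` are all hypothetical, and the `ctx` argument the real helpers take is omitted. The description does not say whether the real handleTransientError keeps returning the error once the threshold is exceeded; this sketch returns it in every case.

```go
package main

import (
	"errors"
	"time"
)

// transientDegradedThreshold mirrors the 2-minute window named in the description.
const transientDegradedThreshold = 2 * time.Minute

// errAPIBlip is a hypothetical transient error used only by simulate.
var errAPIBlip = errors.New("transient API error")

// reconciler is a hypothetical stand-in for the real controller struct.
type reconciler struct {
	now                     func() time.Time // injectable clock (assumption)
	consecutiveFailureSince *time.Time       // nil while no failure window is open
	degraded                bool             // stands in for the Degraded condition
}

// clearFailureWindow closes the failure window after a successful reconcile.
func (r *reconciler) clearFailureWindow() { r.consecutiveFailureSince = nil }

// handleTransientError opens the window on the first failure and marks the
// controller degraded only once the window has been open for longer than
// the threshold. It returns the error so the caller requeues.
func (r *reconciler) handleTransientError(err error) error {
	now := r.now()
	if r.consecutiveFailureSince == nil {
		r.consecutiveFailureSince = &now
		return err
	}
	if now.Sub(*r.consecutiveFailureSince) > transientDegradedThreshold {
		r.degraded = true
	}
	return err
}

// handleDegradeError marks the controller degraded immediately and returns
// nil, so the caller does not requeue.
func (r *reconciler) handleDegradeError(err error) error {
	_ = err // a real helper would presumably record the message in the condition
	r.degraded = true
	return nil
}

// simulate feeds one transient failure at each offset (in seconds from a
// fixed start time) and reports the degraded flag after each call.
func simulate(offsetsSeconds []int) []bool {
	start := time.Date(2026, 3, 3, 0, 0, 0, 0, time.UTC)
	current := start
	r := &reconciler{now: func() time.Time { return current }}
	out := make([]bool, 0, len(offsetsSeconds))
	for _, s := range offsetsSeconds {
		current = start.Add(time.Duration(s) * time.Second)
		_ = r.handleTransientError(errAPIBlip)
		out = append(out, r.degraded)
	}
	return out
}
```

In this sketch a failure at exactly 120 seconds does not degrade, because the comparison is strictly greater than the threshold; whether the real comparison is strict is an assumption.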

Summary by CodeRabbit

  • Bug Fixes

    • Introduced a ~2-minute transient-failure window so temporary errors no longer cause immediate degradation; persistent failures still mark operators/controllers degraded. Error-handling now yields clearer, more consistent availability/degraded status signals.
  • Tests

    • Expanded test coverage for transient vs persistent failures, threshold-driven degradation, and updated status-condition assertions to verify the new signaling behavior.

CloudConfigReconciler: gate transient errors behind a 2-minute window

Three related fixes to stop upgrade-time API blips from immediately
setting CloudConfigControllerDegraded=True:

1. Infrastructure NotFound now calls setAvailableCondition (nil return)
   instead of setDegradedCondition, matching the main controller's
   existing behaviour.

2. Errors are classified as transient (API blips: all Get/Create/Update
   calls, feature-gate informer not yet synced) or permanent (config
   problems that won't self-heal: nil platformStatus, unsupported
   platform, missing user config key, nil FeatureGateAccess, transform
   failure).

3. handleTransientError() only sets degraded after consecutiveFailureSince
   has been set for longer than transientDegradedThreshold (2 min);
   handleDegradeError() sets degraded immediately and returns nil so
   controller-runtime does not requeue (existing watches re-trigger when
   the underlying config changes).  clearFailureWindow() is called at
   every successful reconcile exit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 3, 2026
@openshift-ci-robot

@nrb: This pull request references Jira Issue OCPBUGS-42387, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.17.z" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Done-Errata) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@coderabbitai
coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Adds per-reconciler transient failure windows and unified transient vs. immediate-degrade error handlers across multiple controllers; routes prior direct-degrade/error returns through thresholded transient handling and updates tests to assert ClusterOperator Degraded/Available transitions using a fake clock.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cloud config sync controller: pkg/controllers/cloud_config_sync_controller.go, pkg/controllers/cloud_config_sync_controller_test.go | Adds a 2m transient failure threshold and consecutiveFailureSince; introduces handleTransientError, handleDegradeError, clearFailureWindow; routes error paths through these helpers and updates tests to assert ClusterOperator conditions instead of expecting immediate errors. |
| ClusterOperator controller: pkg/controllers/clusteroperator_controller.go, pkg/controllers/clusteroperator_controller_test.go | Adds consecutiveFailureSince field and aggregatedTransientDegradedThreshold; implements clearFailureWindow, handleTransientError, handleDegradeError; replaces immediate degradation with thresholded transient handling and adds tests simulating threshold crossing with a fake clock. |
| Trusted CA bundle controller: pkg/controllers/trusted_ca_bundle_controller.go, pkg/controllers/trusted_ca_bundle_controller_test.go | Adds per-reconciler failure-window state and helpers (consecutiveFailureSince, clearFailureWindow, transient/degrade handlers); replaces direct degraded-setting paths with unified handlers and adds tests verifying transient → degraded transition using a fake clock. |
| Module / Misc: go.mod | Updated module/dependency manifest lines. |

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler
    participant FailureWindow
    participant Status as ClusterOperatorStatus
    participant KubeAPI as Kubernetes API

    Reconciler->>Reconciler: start reconciliation
    alt success
        Reconciler->>FailureWindow: clearFailureWindow()
        Reconciler->>Status: set Available
        Reconciler->>KubeAPI: update status (Available)
    else transient error
        Reconciler->>FailureWindow: check consecutiveFailureSince
        alt not started
            FailureWindow->>FailureWindow: set start = now
            Reconciler->>KubeAPI: requeue (return error)
        else within threshold
            Reconciler->>KubeAPI: requeue (return error)
        else threshold exceeded
            Reconciler->>Status: set Degraded
            Reconciler->>KubeAPI: update status (Degraded)
            Reconciler->>KubeAPI: no requeue (watch triggers)
        end
    else permanent error
        Reconciler->>Status: set Degraded immediately
        Reconciler->>KubeAPI: update status (Degraded)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | Test files lack proper timeout specifications on Eventually/Consistently calls, missing resource cleanup verification in AfterEach blocks, and have assertions without meaningful failure messages. | Add explicit WithTimeout() to all Eventually calls, implement proper resource cleanup verification in AfterEach blocks, and include descriptive messages in all Expect() assertions following repository patterns. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: preventing Degraded=True from being set on transient errors, which is the primary objective across all modified controller files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the three modified controller test files use static descriptive strings without dynamic content, variable interpolation, or fmt.Sprintf calls. |

@nrb
Contributor Author

nrb commented Mar 3, 2026

/retitle OCPBUGS-42837: Do not set Degraded=True on transient errors

@openshift-ci openshift-ci bot requested review from RadekManak and damdo March 3, 2026 22:05
@nrb
Contributor Author

nrb commented Mar 3, 2026

/jira refresh

@openshift-ci-robot

@nrb: This pull request references Jira Issue OCPBUGS-42387, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.17.z" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Done-Errata) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controllers/cloud_config_sync_controller_test.go (1)

512-632: Add explicit tests for the 2-minute transient degradation window.

The PR’s primary behavior change is time-gating transient degradation, but there’s no direct assertion here for “before threshold stays non-degraded” and “after threshold degrades,” or for window reset after a successful reconcile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/cloud_config_sync_controller_test.go` around lines 512 - 632,
Add explicit unit tests exercising the 2-minute transient degradation window:
write tests that call reconciler.Reconcile multiple times with a missing
configmap key (using makeInfrastructureResource with
Spec.CloudConfig.Key="notfound") and assert that
cloudConfigControllerDegradedCondition remains false before the 2-minute
threshold and becomes true after advancing time past 2 minutes; also add a test
that after a successful reconcile (create the expected ConfigMap and call
reconciler.Reconcile) the transient window/reset clears so subsequent
missing-key reconciles start the timer anew. Use reconciler.Reconcile,
makeInfrastructureResource, makeInfraStatus and inspect co.Status.Conditions for
cloudConfigControllerDegradedCondition to locate and assert the condition
transitions. Ensure tests simulate time advancement (or inject a clock into
reconciler if available) so the 2-minute boundary is deterministically tested.
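One possible shape for the scenarios requested above (stays non-degraded before the threshold, degrades after it, and restarts the timer after a success) is sketched below. It is an assumption-laden illustration, not the repository's test code: `fakeClock`, `window`, `event`, and `runScenario` are hypothetical names, the clock is hand-rolled rather than taken from a library, and the real tests would assert on co.Status.Conditions instead of a returned bool.

```go
package main

import "time"

// threshold mirrors the 2-minute window under test.
const threshold = 2 * time.Minute

// fakeClock is a hand-rolled deterministic clock (hypothetical helper).
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) Step(d time.Duration) { c.t = c.t.Add(d) }

// window tracks the start of the current run of consecutive failures.
type window struct {
	clock *fakeClock
	since *time.Time
}

// observe records one reconcile outcome and reports whether Degraded
// would be set as a result of it.
func (w *window) observe(failed bool) bool {
	if !failed {
		w.since = nil // success resets the window
		return false
	}
	now := w.clock.Now()
	if w.since == nil {
		w.since = &now
		return false
	}
	return now.Sub(*w.since) > threshold
}

// event is one reconcile attempt, preceded by advancing the clock.
type event struct {
	stepSeconds int
	failed      bool
}

// runScenario replays events against a fresh window and returns the
// degraded decision after each one.
func runScenario(events []event) []bool {
	clk := &fakeClock{t: time.Date(2026, 3, 3, 0, 0, 0, 0, time.UTC)}
	w := &window{clock: clk}
	out := make([]bool, 0, len(events))
	for _, e := range events {
		clk.Step(time.Duration(e.stepSeconds) * time.Second)
		out = append(out, w.observe(e.failed))
	}
	return out
}
```

A reset scenario would be: fail, fail 90s later, succeed, fail 90s later, fail 121s later. Only the last attempt should report degraded, because the success in the middle restarts the timer.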

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9816344 and ba392a3.

📒 Files selected for processing (2)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/cloud_config_sync_controller_test.go

Comment on lines 54 to 56
r.clearFailureWindow()
if err := r.setAvailableCondition(ctx); err != nil {
	return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err)

⚠️ Potential issue | 🟠 Major

Move failure-window reset after successful availability update, and route availability-update failures through transient handling.

Line 54, Line 74, Line 171, and Line 183 clear consecutiveFailureSince before setAvailableCondition succeeds. If setAvailableCondition fails, the reconcile is still failed, but the failure window is reset. Also, those failures currently bypass handleTransientError, so they are not tracked by the 2-minute transient policy.

💡 Proposed fix pattern (apply to all four success-exit branches)
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
- 	return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+ 	return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err))
+ }
+ r.clearFailureWindow()

Also applies to: 74-76, 171-173, 183-185
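To make the ordering problem concrete, here is a small self-contained sketch that contrasts the two orders when the status write fails. Everything in it is hypothetical scaffolding: the `reconciler` struct, the simulated `statusWriteErr`, and `windowSurvives` are not from the PR, `ctx` is omitted, and handleTransientError is reduced to the window bookkeeping only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// reconciler is a hypothetical stand-in with just enough state to show
// what happens to the failure window.
type reconciler struct {
	now                     func() time.Time
	consecutiveFailureSince *time.Time
	statusWriteErr          error // simulated result of setAvailableCondition
}

func (r *reconciler) setAvailableCondition() error { return r.statusWriteErr }
func (r *reconciler) clearFailureWindow()          { r.consecutiveFailureSince = nil }

// handleTransientError keeps an already-open window and opens one otherwise.
func (r *reconciler) handleTransientError(err error) error {
	if r.consecutiveFailureSince == nil {
		now := r.now()
		r.consecutiveFailureSince = &now
	}
	return err
}

// successExitBefore clears the window first, as in the flagged lines.
func (r *reconciler) successExitBefore() error {
	r.clearFailureWindow()
	if err := r.setAvailableCondition(); err != nil {
		return fmt.Errorf("failed to set conditions: %w", err)
	}
	return nil
}

// successExitAfter follows the proposed pattern: write status first, route
// a failure through transient handling, clear only on success.
func (r *reconciler) successExitAfter() error {
	if err := r.setAvailableCondition(); err != nil {
		return r.handleTransientError(fmt.Errorf("failed to set conditions: %w", err))
	}
	r.clearFailureWindow()
	return nil
}

// windowSurvives reports whether a window opened earlier is still open, with
// its original start time, after a success exit whose status write fails.
func windowSurvives(useProposedOrder bool) bool {
	opened := time.Date(2026, 3, 3, 0, 0, 0, 0, time.UTC)
	r := &reconciler{
		now:                     func() time.Time { return opened.Add(time.Minute) },
		consecutiveFailureSince: &opened,
		statusWriteErr:          errors.New("status update failed"),
	}
	if useProposedOrder {
		_ = r.successExitAfter()
	} else {
		_ = r.successExitBefore()
	}
	return r.consecutiveFailureSince != nil && r.consecutiveFailureSince.Equal(opened)
}
```

With the original order the window is gone after the failed status write, so the 2-minute timer restarts on the next failure; with the proposed order the original start time is preserved.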

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/cloud_config_sync_controller.go` around lines 54 - 56, The
code currently calls r.clearFailureWindow() before r.setAvailableCondition(ctx)
and returns on setAvailableCondition errors, which resets
consecutiveFailureSince even when availability update fails and also bypasses
handleTransientError; change each success-exit branch (where you currently call
r.clearFailureWindow() then setAvailableCondition) to first call
r.setAvailableCondition(ctx) and if it returns an error pass that error into
r.handleTransientError(ctx, req, err) (so transient failures are tracked by the
2-minute policy), and only after setAvailableCondition succeeds call
r.clearFailureWindow(); update all occurrences referencing clearFailureWindow(),
setAvailableCondition(ctx), handleTransientError, and consecutiveFailureSince
accordingly.

@nrb
Contributor Author

nrb commented Mar 3, 2026

/retitle OCPBUGS-42837: Do not set Degraded=True on transient errors

@openshift-ci openshift-ci bot changed the title OCPBUGS-42387: Do not set Degraded=True on transient errors OCPBUGS-42837: Do not set Degraded=True on transient errors Mar 3, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Mar 3, 2026
@openshift-ci-robot

@nrb: This pull request references Jira Issue OCPBUGS-42837, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Mirror the CloudConfigReconciler pattern: transient API errors (Proxy
get, system trust bundle read, ConfigMap write) are silently requeued
and only set Degraded=True after transientDegradedThreshold (2 min)
has elapsed. Errors that indicate corrupt cert data (merge failures)
set Degraded=True immediately and return nil so controller-runtime does
not requeue; existing watches re-trigger reconciliation when the data
changes.

Also adds two direct unit tests that verify the threshold gating via a
fake clock, without running through the manager.
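The per-step classification described in this commit message could look roughly like the sketch below, where each step of the reconcile picks its handler at the call site. This is an illustration under assumptions rather than the controller's code: the `steps` struct, `outcome`, `reconcile`, `ok`, and `failing` are hypothetical, the handler is represented by its name instead of being called, and the `requeue` field assumes a transient failure returns a non-nil error (which controller-runtime requeues) while the other paths return nil.

```go
package main

import "errors"

// steps holds the four operations named in the commit message as
// injectable functions (hypothetical shape).
type steps struct {
	getProxy         func() error // API read: transient
	readSystemBundle func() error // system trust bundle read: transient
	mergeBundles     func() error // corrupt cert data: permanent
	writeConfigMap   func() error // API write: transient
}

// outcome names the helper a reconcile would end in and whether the
// reconcile would return an error to controller-runtime.
type outcome struct {
	handler string
	requeue bool
}

var (
	transient = outcome{handler: "handleTransientError", requeue: true}
	permanent = outcome{handler: "handleDegradeError", requeue: false}
	success   = outcome{handler: "clearFailureWindow", requeue: false}
)

// reconcile runs the steps in order and stops at the first failure,
// choosing the handler according to which step failed.
func reconcile(s steps) outcome {
	if err := s.getProxy(); err != nil {
		return transient
	}
	if err := s.readSystemBundle(); err != nil {
		return transient
	}
	if err := s.mergeBundles(); err != nil {
		return permanent
	}
	if err := s.writeConfigMap(); err != nil {
		return transient
	}
	return success
}

// ok is a step that succeeds.
func ok() error { return nil }

// failing returns a step that always fails with the given message.
func failing(msg string) func() error {
	return func() error { return errors.New(msg) }
}
```

Choosing the handler where the error originates keeps the decision next to the call that knows what kind of failure it is, at the cost of repeating the choice at each call site.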

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 4, 2026
@openshift-ci-robot

@nrb: This pull request references Jira Issue OCPBUGS-42837, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

@openshift-ci openshift-ci bot requested a review from sunzhaohua2 March 4, 2026 22:20
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
pkg/controllers/cloud_config_sync_controller.go (1)

55-57: ⚠️ Potential issue | 🟠 Major

Still resetting failure window before availability update succeeds (previously reported).

This same issue remains: clearFailureWindow() is called before setAvailableCondition() and availability-update failures are not routed through transient handling, so failure tracking can be lost on failed reconciles.

Also applies to: 75-77, 172-174, 184-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57,
clearFailureWindow() is being invoked before setAvailableCondition(), which
causes failure tracking to be reset even when the availability update fails;
change the call order so that setAvailableCondition(ctx) is called first and
only on success call clearFailureWindow(), and apply the same change to the
other occurrences (the blocks around
setAvailableCondition()/clearFailureWindow() at the other noted locations) so
transient failures are preserved and routed through the existing
failure-handling logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 81-84: The failure window is being cleared before the availability
status is persisted, which can reset consecutive-failure tracking if
setStatusAvailable fails; change the flow so r.clearFailureWindow() is only
called after r.setStatusAvailable(ctx, conditionOverrides) returns no error
(i.e., move the clear call to follow successful completion of
setStatusAvailable), and apply the same fix for the other occurrence where
clearFailureWindow is invoked before a call to
setStatusAvailable/setStatusUnavailable (ensure clear is only executed after the
corresponding status setter succeeds).
- Around line 95-96: provisioningAllowed() currently returns an error that the
reconciler returns directly, bypassing the centralized retry/status logic;
instead of returning "ctrl.Result{}, err" in the branch after calling
provisioningAllowed, invoke the centralized handlers (handleTransientError or
handleDegradeError) so the error flows through the new policy path and
status/retry behavior is consistent. Locate the branch where provisioningAllowed
is called and replace the direct return with a call that forwards the error to
the appropriate handler (e.g., call handleTransientError(...) for transient
failures or handleDegradeError(...) for degraded conditions), passing the same
context, instance/request and logger so those handlers can set status and
determine the ctrl.Result and requeue behavior.

In `@pkg/controllers/trusted_ca_bundle_controller.go`:
- Around line 64-67: The code is clearing the failure window
(r.clearFailureWindow()) before successfully updating availability
(r.setAvailableCondition(ctx)), which can reset consecutiveFailureSince on a
failed status write; move the call to r.clearFailureWindow() so it only runs
after r.setAvailableCondition succeeds, and for any errors returned by
r.setAvailableCondition (the paths at the current calls around Lines 64, 77, and
111) route them through the controller's transient error handling path instead
of returning plain errors—i.e., wrap/return the setAvailableCondition error via
the existing transient-handling helper used in this controller so it increments
the failure window (consecutiveFailureSince) appropriately and triggers
transient requeue behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06897dbd-8d86-4ff7-9f18-2668da2179ed

📥 Commits

Reviewing files that changed from the base of the PR and between ba392a3 and c6b57a9.

📒 Files selected for processing (5)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go
  • pkg/controllers/trusted_ca_bundle_controller.go
  • pkg/controllers/trusted_ca_bundle_controller_test.go

Comment on lines 64 to 67
r.clearFailureWindow()
if err := r.setAvailableCondition(ctx); err != nil {
	return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err)
}

⚠️ Potential issue | 🟠 Major

Move failure-window reset after successful availability update, and track availability-update failures as transient.

Line 64 and Line 110 clear consecutiveFailureSince before setAvailableCondition succeeds. If that status write fails, the reconcile failed but the failure window is reset. Also, the setAvailableCondition error paths at Line 65, Line 77, and Line 111 return plain errors instead of going through transient handling.

Suggested fix pattern
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
-   return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+   return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %w", err))
+ }
+ r.clearFailureWindow()

Also applies to: 77-79, 110-113

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/trusted_ca_bundle_controller.go` around lines 64 - 67, The
code is clearing the failure window (r.clearFailureWindow()) before successfully
updating availability (r.setAvailableCondition(ctx)), which can reset
consecutiveFailureSince on a failed status write; move the call to
r.clearFailureWindow() so it only runs after r.setAvailableCondition succeeds,
and for any errors returned by r.setAvailableCondition (the paths at the current
calls around Lines 64, 77, and 111) route them through the controller's
transient error handling path instead of returning plain errors—i.e.,
wrap/return the setAvailableCondition error via the existing transient-handling
helper used in this controller so it increments the failure window
(consecutiveFailureSince) appropriately and triggers transient requeue behavior.

@nrb nrb force-pushed the OCPBUGS-42837 branch from c6b57a9 to 08e6e52 Compare March 5, 2026 15:52
@openshift-ci-robot

@nrb: This pull request references Jira Issue OCPBUGS-42837, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controllers/clusteroperator_controller.go (1)

118-126: ⚠️ Potential issue | 🟡 Minor

Route setStatusAvailable and clearCloudControllerOwnerCondition failures through transient error handling.

The failures at lines 120 and 125 return errors directly, bypassing handleTransientError. These are API calls that could fail transiently and should follow the same error-handling policy as other API failures in this reconciler for consistent behavior.

Suggested fix
 	if err := r.setStatusAvailable(ctx, conditionOverrides); err != nil {
 		klog.Errorf("Unable to sync cluster operator status: %s", err)
-		return ctrl.Result{}, err
+		return r.handleTransientError(ctx, conditionOverrides, err)
 	}

 	if err := r.clearCloudControllerOwnerCondition(ctx); err != nil {
 		klog.Errorf("Unable to clear CloudControllerOwner condition: %s", err)
-		return ctrl.Result{}, err
+		return r.handleTransientError(ctx, conditionOverrides, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/clusteroperator_controller.go` around lines 118 - 126, The
two API-call errors from setStatusAvailable and
clearCloudControllerOwnerCondition should be routed through the reconciler's
transient error handler rather than returned directly; replace the direct
returns for errors from r.setStatusAvailable(ctx, conditionOverrides) and
r.clearCloudControllerOwnerCondition(ctx) with
r.handleTransientError(ctx, conditionOverrides, err), the call shown in the
suggested fix above, so the reconciler applies the same transient-failure
policy as other API failures and returns the Result/Error produced by
handleTransientError.
🧹 Nitpick comments (1)
pkg/controllers/clusteroperator_controller.go (1)

241-294: Consider unifying error handling in provisioningAllowed.

The provisioningAllowed method internally calls setStatusDegraded before returning errors (e.g., lines 245-249, 277-281). These errors are then routed through handleTransientError at line 94, which may call setStatusDegraded again after the threshold. While not incorrect (the later call updates the condition), this creates a dual-path status-setting pattern.

This is existing behavior preserved by the PR, so it's not a blocking concern—just a potential future refactoring opportunity to centralize all status updates through the new transient/degrade handlers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/clusteroperator_controller.go` around lines 241 - 294,
provisioningAllowed currently calls setStatusDegraded before returning errors
which duplicates status updates later in handleTransientError; remove the early
calls to setStatusDegraded inside provisioningAllowed (specifically the ones
after checkControllerConditions error handling and after
cloudprovider.IsCloudProviderExternal error handling) so that
provisioningAllowed simply returns the error, and let the caller flow (which
invokes handleTransientError) centralize calling
setStatusDegraded/handleTransientError. Keep the existing returns and logs but
eliminate the direct setStatusDegraded invocations to avoid dual-path status
updates; references: provisioningAllowed, setStatusDegraded,
handleTransientError, checkControllerConditions,
cloudprovider.IsCloudProviderExternal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/clusteroperator_controller.go`:
- Around line 79-89: The early-return path when the Infrastructure object is not
found (the r.Get call using client.ObjectKey{Name: infrastructureResourceName}
into infra) currently calls setStatusAvailable and returns without clearing any
prior transient failure window; update the branch so that after
setStatusAvailable succeeds you call r.clearFailureWindow() before returning
ctrl.Result{}, nil, ensuring the failure window is reset; keep existing error
handling that routes through handleTransientError when setStatusAvailable fails.


ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between c6b57a9 and 08e6e52.

📒 Files selected for processing (3)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go

@damdo
Member

damdo commented Mar 5, 2026

/assign @mdbooth @JoelSpeed @damdo

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from damdo. For more information see the Code Review Process.


Add handleTransientError/handleDegradeError methods to CloudOperatorReconciler
with an aggregatedTransientDegradedThreshold of 2m30s (longer than the
sub-controller threshold of 2m, to accommodate sub-controller recovery time).

Fix test: handleTransientError test was stepping the clock by transientDegradedThreshold
(2m, the sub-controller constant) instead of aggregatedTransientDegradedThreshold
(2m30s), so the threshold was never exceeded and the degraded condition was
never set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@nrb force-pushed the OCPBUGS-42837 branch from 0ec02f6 to 8e7f3f9 on March 5, 2026 20:19
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
pkg/controllers/cloud_config_sync_controller.go (1)

55-57: ⚠️ Potential issue | 🟠 Major

Move failure-window reset after successful availability update, and route availability-update failures through transient handling.

Line 55, Line 75, Line 172, and Line 184 clear consecutiveFailureSince before setAvailableCondition(ctx) succeeds. If status update fails, the reconcile failed but the window is reset; those failures also bypass handleTransientError.

Suggested fix pattern (apply to all four branches)
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
-   return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+   return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err))
+ }
+ r.clearFailureWindow()

Also applies to: 75-77, 172-174, 184-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57, The
code currently calls r.clearFailureWindow() before attempting
r.setAvailableCondition(ctx), which resets consecutiveFailureSince even when the
availability update fails; change each branch (the blocks around
r.clearFailureWindow() / r.setAvailableCondition(ctx) at the four affected
sites) so that r.clearFailureWindow() is only called after
r.setAvailableCondition(ctx) returns nil, and when r.setAvailableCondition(ctx)
errors, do not clear the window but instead return via
r.handleTransientError(ctx, err), as in the suggested fix pattern above, so
availability-update failures count toward the failure window and are treated as
transient errors.
pkg/controllers/clusteroperator_controller.go (1)

120-127: ⚠️ Potential issue | 🟠 Major

Route final status-write failures through handleTransientError for consistent transient policy.

Line 120 and Line 125 still return raw errors. These are API/write-time failures and should use transient handling; otherwise they bypass consecutive-failure tracking.

Suggested patch
 if err := r.setStatusAvailable(ctx, conditionOverrides); err != nil {
 	klog.Errorf("Unable to sync cluster operator status: %s", err)
-	return ctrl.Result{}, err
+	return r.handleTransientError(ctx, conditionOverrides, err)
 }

 if err := r.clearCloudControllerOwnerCondition(ctx); err != nil {
 	klog.Errorf("Unable to clear CloudControllerOwner condition: %s", err)
-	return ctrl.Result{}, err
+	return r.handleTransientError(ctx, conditionOverrides, err)
 }

 // successful reconcile, make sure the failure window is cleared.
 r.clearFailureWindow()

Also applies to: 130-132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/clusteroperator_controller.go` around lines 120 - 127,
Replace direct returns of raw errors from API/write operations with the
controller's transient-error handler so consecutive-failure tracking is
preserved: where r.setStatusAvailable(ctx, ...) and
r.clearCloudControllerOwnerCondition(ctx) (and the similar block at lines
~130-132) currently do "return ctrl.Result{}, err", change those to return
r.handleTransientError(ctx, conditionOverrides, err), as in the suggested patch
above; the existing klog.Errorf calls already record which operation failed.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 08e6e52 and 0ec02f6.

📒 Files selected for processing (3)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/clusteroperator_controller_test.go

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
pkg/controllers/cloud_config_sync_controller.go (1)

55-57: ⚠️ Potential issue | 🟠 Major

Route availability-status write failures through transient handling, and reset the failure window only after status write succeeds.

Line 76, Line 173, and Line 185 clear consecutiveFailureSince before setAvailableCondition(ctx) succeeds. Also, errors from setAvailableCondition(ctx) at Line 55, Line 77, Line 174, and Line 186 return directly instead of going through handleTransientError, so status-write blips are not tracked by the transient policy.

Suggested fix pattern
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
-   return ctrl.Result{}, fmt.Errorf("failed to set conditions for cloud config controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+   return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for cloud config controller: %w", err))
+ }
+ r.clearFailureWindow()

Apply the same error-routing pattern to the other setAvailableCondition call sites that currently return plain errors.

Also applies to: 76-79, 173-176, 185-188

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/cloud_config_sync_controller.go` around lines 55 - 57, The
status-write error handling must route failures through the transient policy and
only reset consecutiveFailureSince after setAvailableCondition succeeds; update
all call sites using setAvailableCondition(ctx) (the ones that currently return
fmt.Errorf directly and the ones that clear consecutiveFailureSince before the
call) to instead return r.handleTransientError(ctx, err), with the error wrapped
via %w as in the suggested fix pattern above, when setAvailableCondition returns
an error, and move the r.clearFailureWindow() call so it executes only after
setAvailableCondition(ctx) completes successfully; ensure every
setAvailableCondition call site uses handleTransientError consistently.
pkg/controllers/trusted_ca_bundle_controller.go (1)

64-66: ⚠️ Potential issue | 🟠 Major

Treat setAvailableCondition failures as transient and avoid pre-clearing the failure window.

setAvailableCondition(ctx) failures at Line 64, Line 78, and Line 112 still bypass handleTransientError. In addition, Line 111 clears consecutiveFailureSince before availability status persistence succeeds.

Suggested fix pattern
- r.clearFailureWindow()
- if err := r.setAvailableCondition(ctx); err != nil {
-   return ctrl.Result{}, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %v", err)
- }
+ if err := r.setAvailableCondition(ctx); err != nil {
+   return r.handleTransientError(ctx, fmt.Errorf("failed to set conditions for trusted CA bundle controller: %w", err))
+ }
+ r.clearFailureWindow()

Use the same transient-routing for all setAvailableCondition error returns in this reconciler.

Also applies to: 78-80, 111-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controllers/trusted_ca_bundle_controller.go` around lines 64 - 66, The
calls to setAvailableCondition should be treated as transient failures: replace
direct error returns from setAvailableCondition(ctx) in the reconciler with
r.handleTransientError(ctx, err), wrapping the error via %w as in the suggested
fix pattern above, so they go through the same transient logic; additionally, do
not clear the consecutiveFailureSince field before availability status is
successfully persisted. Move the r.clearFailureWindow() call to after
setAvailableCondition succeeds so the failure window is not pre-cleared.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 0ec02f6 and 8e7f3f9.

📒 Files selected for processing (4)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/clusteroperator_controller.go
  • pkg/controllers/clusteroperator_controller_test.go
  • pkg/controllers/trusted_ca_bundle_controller.go

The test "config should not be updated if source and target config
content are identical" called reconciler.Reconcile() directly while
the manager was also running the same reconciler in a background
goroutine. Both goroutines could access consecutiveFailureSince
concurrently, which the Go race detector flags.

Use a fresh CloudConfigReconciler instance (not registered with the
manager) for the direct call. It shares the thread-safe API client
but owns its own consecutiveFailureSince field, so there is no shared
mutable state with the manager's copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
}, timeout).Should(Succeed())
initialCMresourceVersion := syncedCloudConfigMap.ResourceVersion

// Introducing the consecutiveFailureWindow means that there's a field that could be racy
Contributor Author

I don't love this; definitely open to alternatives.

Contributor

Using some sort of atomic timestamp or lock?

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

@nrb: all tests passed!


}
return ctrl.Result{}, err
// Skip if the infrastructure resource doesn't exist.
r.clearFailureWindow()
Contributor

To avoid having to set this on every possible happy exit path, what about creating named return variables and using a defer that checks if the error return is nil before calling this?

Comment on lines +81 to +83
// Note: clearFailureWindow is intentionally NOT called here. This path did not
// exercise the full reconcile logic, so an ongoing transient failure window
// (set by a previous reconcile pass) should not be reset.
Contributor

But we may not reconcile for some time after this, right? There's no requeue.

Which means the next error could exceed the threshold immediately, without us having actually seen a transient error multiple times?


Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
