@ccardenosa
Contributor

Summary

This PR adds a workaround to the acm-mch step to handle a race condition in the OCM cluster-manager controller that causes MultiClusterHub deployments to fail intermittently.

Related Issues

| Issue | Repository | Status |
| --- | --- | --- |
| Upstream Fix | open-cluster-management-io/ocm#1309 | 🔄 Open |

Problem

The cluster-manager controller has a race condition where it may create CRDs (ClusterManagementAddOn, ManagedClusterAddOn) before the cert rotation controller creates the CA bundle ConfigMap. When this happens:

  1. CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of the literal string "placeholder")
  2. Webhook conversion fails with InvalidCABundle error
  3. CRDs remain in Established: False state
  4. API endpoints are not registered
  5. MCH fails with: "no matches for kind 'ClusterManagementAddOn' in version 'addon.open-cluster-management.io/v1alpha1'"
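
The failure mode above suggests a narrow detection check: compare each CRD's caBundle against the known placeholder value. The following is a minimal sketch, not the code in this PR. The function names are hypothetical, and it assumes the CA bundle sits at `.spec.conversion.webhook.clientConfig.caBundle` in the CRD.

```shell
# Hypothetical helpers; names are illustrative, not taken from acm-mch-commands.sh.

# Base64 encoding of the literal string "placeholder".
PLACEHOLDER_CA_BUNDLE="cGxhY2Vob2xkZXI="

# Succeeds (exit 0) if the given caBundle value is the placeholder.
is_placeholder_ca_bundle() {
  [ "$1" = "${PLACEHOLDER_CA_BUNDLE}" ]
}

# Prints the conversion webhook caBundle of a CRD.
# Assumption: the CRD uses a conversion webhook; prints nothing if the field is unset.
get_crd_ca_bundle() {
  oc get crd "$1" -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' 2>/dev/null
}

# Succeeds if any of the given CRDs still carries the placeholder CA bundle.
crds_have_placeholder_ca() {
  local crd
  for crd in "$@"; do
    if is_placeholder_ca_bundle "$(get_crd_ca_bundle "${crd}")"; then
      echo "DETECTED: ${crd} has placeholder CA bundle"
      return 0
    fi
  done
  return 1
}
```

Matching the exact placeholder string, rather than any non-Established CRD, keeps the check from firing on unrelated failures.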

Evidence from Failed Prow Jobs

| Job Run | Date | ACM Version | MCE Version |
| --- | --- | --- | --- |
| #2005051399989104640 | Dec 27, 2025 | 2.16.0-113 | 2.11.0-142 |
| #2005219283428184064 | Dec 28, 2025 | 2.16.0-114 | 2.11.0-143 |

Solution

This PR adds a workaround that only triggers if the initial 30-minute wait for MCH fails and the placeholder CA bundle is detected:

Normal Flow (race condition not hit, or upstream fix merged):
  Apply MCH → Wait 30min → Success ✓

Workaround Flow (race condition hit):
  Apply MCH → Wait 30min → Fail → Detect race condition → Apply workaround → Wait 30min → Success ✓
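
The two flows above can be sketched as a single function with a fallback path. This is an illustrative sketch only: `deploy_with_fallback` and its three hook arguments are hypothetical names, and the hooks are passed in by name so the logic can be exercised without a cluster.

```shell
# Hypothetical control flow; not the implementation in acm-mch-commands.sh.
#
# deploy_with_fallback WAIT_FN DETECT_FN WORKAROUND_FN
#   WAIT_FN        returns 0 once MCH is Running, non-zero on timeout
#   DETECT_FN      returns 0 only if the placeholder CA bundle is found
#   WORKAROUND_FN  applies the remediation, non-zero on failure
deploy_with_fallback() {
  local wait_fn="$1" detect_fn="$2" workaround_fn="$3"

  # Normal flow: the first wait succeeds and nothing else runs.
  if "${wait_fn}"; then
    echo "MCH reached Running status"
    return 0
  fi

  echo "MCH did not reach Running status in the first attempt"

  # Unrelated failure: do not mask it with the workaround.
  if ! "${detect_fn}"; then
    echo "Race condition not detected; failing"
    return 1
  fi

  "${workaround_fn}" || return 1

  # Second and final wait.
  if "${wait_fn}"; then
    echo "MCH reached Running status after applying workaround"
    return 0
  fi
  return 1
}
```

Because detection runs only after the first wait has failed, a healthy deployment never touches the workaround path.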

Workaround Steps

  1. Detect - Check if CRDs have the placeholder CA bundle (cGxhY2Vob2xkZXI=)
  2. Patch Services - Add service.beta.openshift.io/serving-cert-secret-name annotation to webhook services
  3. Wait for Secrets - Let service-ca-operator create TLS certificates
  4. Patch CRDs - Extract real CA bundle from secrets and update CRDs
  5. Force Reconciliation - Restart MCE operator to pick up changes
  6. Retry Wait - Wait again for MCH to reach Running status
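
Steps 2 to 4 could look roughly like the helpers below. This is a sketch under stated assumptions, not the PR's code: all function names are hypothetical, the namespace, service, secret and CRD names are left to the caller, and the JSON patch path assumes the CRD uses a conversion webhook. Extracting the CA from the secret is not shown.

```shell
# Hypothetical helpers for steps 2 to 4; resource names come from the caller.

# Step 2: ask service-ca-operator to issue a serving cert for a service.
annotate_webhook_service() {
  local ns="$1" svc="$2" secret="$3"
  oc -n "${ns}" annotate service "${svc}" --overwrite \
    "service.beta.openshift.io/serving-cert-secret-name=${secret}"
}

# Step 3: poll until the secret exists, up to TRIES attempts DELAY seconds apart.
wait_for_secret() {
  local ns="$1" secret="$2" tries="${3:-30}" delay="${4:-10}" i=0
  while [ "${i}" -lt "${tries}" ]; do
    oc -n "${ns}" get secret "${secret}" >/dev/null 2>&1 && return 0
    i=$((i + 1))
    sleep "${delay}"
  done
  return 1
}

# Step 4: build a JSON patch that replaces the conversion webhook caBundle.
# The argument must already be base64-encoded PEM.
build_ca_bundle_patch() {
  printf '[{"op":"replace","path":"/spec/conversion/webhook/clientConfig/caBundle","value":"%s"}]' "$1"
}

# Step 4: apply the patch to one CRD.
patch_crd_ca_bundle() {
  local crd="$1" ca_b64="$2"
  oc patch crd "${crd}" --type=json -p "$(build_ca_bundle_patch "${ca_b64}")"
}
```

Keeping the patch builder separate from the `oc patch` call makes the payload easy to inspect or log before it is applied.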

Design Decisions

| Decision | Rationale |
| --- | --- |
| Workaround only on failure | Doesn't add latency to normal deployments |
| Specific detection | Only triggers for this exact issue (placeholder CA bundle) |
| Dead code after fix | Once upstream PR #1309 is merged, detection returns false and workaround never runs |
| Clear documentation | Functions are well-commented with links to upstream PR |

Cleanup Path

Once ocm#1309 is merged and released in ACM/MCE:

  1. The first 30min wait should no longer fail because of this race condition
  2. The workaround functions become dead code
  3. They can be removed in a future cleanup PR

Testing

  • Bash syntax check passes
  • Workaround successfully applied manually on a live cluster (sno-vhub-0)
  • MCH reached Running status after workaround

Changes

  • ci-operator/step-registry/acm/mch/acm-mch-commands.sh
    • Added workaround functions for OCM CA bundle race condition
    • Modified main wait logic to detect and remediate the issue on failure

/cc @openshift/openshift-team-edge-ztp

@openshift-ci
Contributor

openshift-ci bot commented Dec 28, 2025

@ccardenosa: GitHub didn't allow me to request PR reviews from the following users: openshift/openshift-team-edge-ztp.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 7cf2ff7 to 915f710 Compare December 28, 2025 16:26
@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa
Contributor Author

✅ Workaround Verified Working

The rehearsal job confirms the workaround successfully resolves the OCM CA bundle race condition:

Successful run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72976/rehearse-72976-periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp/2005316456039845888

Execution Summary

MCH did not reach Running status in the first attempt.
Checking for known issues and applying workarounds if needed...

============================================================
Applying OCM CA Bundle Race Condition Workaround
Upstream fix: https://github.com/open-cluster-management-io/ocm/pull/1309
============================================================

Checking for OCM CA bundle race condition (PR #1309)...
DETECTED: clustermanagementaddons CRD has placeholder CA bundle

Step 1/6: Patching webhook services with serving-cert-secret-name annotation...
  ✓ All 3 webhook services patched

Step 2/6: Waiting for service-ca-operator to create secrets...
  ✓ All 3 secrets created by service-ca-operator

Step 3/6: Creating ca-bundle-configmap from serving cert...
  ✓ ConfigMap created with real CA bundle

Step 4/6: Patching CRDs with real CA bundles...
  (CRDs auto-updated after configmap creation)

Step 5/6: Verifying CRDs are now Established...
  ✓ clustermanagementaddons.addon.open-cluster-management.io: Established=True
  ✓ managedclusteraddons.addon.open-cluster-management.io: Established=True

Step 6/6: Restarting cluster-manager and forcing reconciliation...
  ✓ cluster-manager deployment restarted
  ✓ multicluster-engine-operator restarted
  ✓ multiclusterengine annotated for reconciliation

============================================================
Workaround applied successfully!
============================================================

multiclusterhub.operator.open-cluster-management.io/multiclusterhub condition met
MCH reached Running status after applying workaround!
Success! ACM 2.16.0-114 is Running

This workaround will be needed until the upstream fix (open-cluster-management-io/ocm#1309) is merged and released in a future ACM/MCE version.
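
The Established check reported in step 5 of the log could be sketched as follows. This is an assumption about how such a check might be written, not the script's actual code; both function names are hypothetical.

```shell
# Hypothetical verification helpers; not taken from acm-mch-commands.sh.

# Prints the status of the Established condition ("True", "False", or empty).
crd_established_status() {
  oc get crd "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Established")].status}' 2>/dev/null
}

# Reports every CRD and returns non-zero if any is not Established.
verify_crds_established() {
  local crd status rc=0
  for crd in "$@"; do
    status="$(crd_established_status "${crd}")"
    if [ "${status}" = "True" ]; then
      echo "  ✓ ${crd}: Established=True"
    else
      echo "  ✗ ${crd}: Established=${status:-unknown}"
      rc=1
    fi
  done
  return "${rc}"
}
```

Checking all CRDs before returning, instead of stopping at the first failure, gives a complete picture in the job log. `oc wait --for=condition=Established crd/<name>` is an alternative when blocking with a timeout is preferred.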

@ccardenosa ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 915f710 to 79ab2e2 Compare December 28, 2025 18:39
@ccardenosa
Contributor Author

/assign @sg-rh

Could you please review this workaround?

@ccardenosa
Contributor Author

/pj-rehearse ack

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 28, 2025
@ccardenosa
Contributor Author

/assign @vboulos

Could you please review this workaround?

@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@ccardenosa ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 79ab2e2 to d5b6d56 Compare December 29, 2025 09:57
@openshift-ci
Contributor

openshift-ci bot commented Dec 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ccardenosa
Once this PR has been reviewed and has the lgtm label, please ask for approval from sg-rh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 29, 2025
@ccardenosa
Contributor Author

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

This adds a workaround for the cluster-manager controller race condition
that causes CRDs to be created with an invalid "placeholder" CA bundle.

Upstream fix: open-cluster-management-io/ocm#1309

Problem:
The cluster-manager controller may create ClusterManagementAddOn and
ManagedClusterAddOn CRDs before the cert rotation controller creates the
CA bundle ConfigMap. When this happens, the CRDs are created with
caBundle: cGxhY2Vob2xkZXI= (base64 of "placeholder"), which causes:
  1. Webhook conversion to fail with "InvalidCABundle"
  2. CRDs to never become Established
  3. API endpoints to stay unregistered
  4. MCH to fail with: "no matches for kind ClusterManagementAddOn"

Additionally, the cluster-manager controller reads the CA from the
ca-bundle-configmap ConfigMap. If this ConfigMap doesn't exist or is empty,
the controller keeps re-applying the CRDs with the placeholder CA,
overwriting any manual patches.

Workaround (6 steps):
When MCH fails to reach Running status, detect the race condition by
checking for placeholder CA bundles in the CRDs, then:
  1. Patch webhook services with serving-cert-secret-name annotation
  2. Wait for service-ca-operator to create TLS secrets
  3. Create ca-bundle-configmap from the serving cert secret
  4. Extract real CA bundle from secrets and patch CRDs
  5. Verify CRDs become Established
  6. Restart cluster-manager and force MCE operator reconciliation

Design:
- The workaround only triggers if the first 30min wait fails
- Detection is specific: checks for the exact placeholder value
- Once upstream fix is merged, this becomes dead code (detection
  returns false) and can be removed in a future cleanup

Tested on sno-vhub-0: MCE reached Available status and MCH progressed
normally with 20/22 components ready after workaround.

Discovered in Prow jobs:
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005051399989104640
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005219283428184064

Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
@ccardenosa ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from d5b6d56 to 5b57089 Compare December 29, 2025 13:11
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@ccardenosa: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-policy-collection-main-ocp4.21-interop-opp-vsphere | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.14-lp-interop-acm-interop-aws | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.16-lp-interop-acm-interop-aws | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.17-lp-interop-acm-interop-aws | N/A | periodic | Registry content changed |
| periodic-ci-RedHatQE-interop-testing-master-acm-cnv-ocp4.19-p2p-interop-acm-cnv-p2p-aws-419 | N/A | periodic | Registry content changed |
| periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-policy-collection-main-ocp4.21-interop-opp-aws | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-policy-collection-main-ocp4.20-interop-opp-aws | N/A | periodic | Registry content changed |
| periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.15-lp-interop-acm-interop-aws | N/A | periodic | Registry content changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@ccardenosa
Contributor Author

/pj-rehearse ack

@openshift-ci-robot
Contributor

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 29, 2025