Add workaround for OCM CA bundle race condition in acm-mch step #72976

ccardenosa · 2025-12-28T13:34:22Z

Summary

This PR adds a workaround to the acm-mch step to handle a race condition in the OCM cluster-manager controller that causes MultiClusterHub deployments to fail intermittently.

Related Issues

Issue	Repository	Status
Upstream Fix	open-cluster-management-io/ocm#1309	🔄 Open

Problem

The cluster-manager controller has a race condition where it may create CRDs (ClusterManagementAddOn, ManagedClusterAddOn) before the cert rotation controller creates the CA bundle ConfigMap. When this happens:

CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of the literal string "placeholder")
Webhook conversion fails with InvalidCABundle error
CRDs remain in Established: False state
API endpoints are not registered
MCH fails with: "no matches for kind 'ClusterManagementAddOn' in version 'addon.open-cluster-management.io/v1alpha1'"
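The detection of this state can be sketched as a small shell check. This is an illustrative sketch, not the code from acm-mch-commands.sh: the function names are hypothetical, and it assumes the CA bundle lives in the CRD's standard conversion webhook field (spec.conversion.webhook.clientConfig.caBundle).

```shell
# base64 of the literal string "placeholder", as quoted in the PR description
PLACEHOLDER_CA_B64="cGxhY2Vob2xkZXI="

# Returns 0 when the given base64 caBundle value is the placeholder.
is_placeholder_ca_bundle() {
  [ "$1" = "$PLACEHOLDER_CA_B64" ]
}

# Hypothetical helper: print the caBundle of a CRD's conversion webhook.
get_crd_ca_bundle() {
  oc get crd "$1" \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
}

# Returns 0 if either addon CRD still carries the placeholder CA bundle.
crds_have_placeholder_ca() {
  for crd in \
    clustermanagementaddons.addon.open-cluster-management.io \
    managedclusteraddons.addon.open-cluster-management.io
  do
    if is_placeholder_ca_bundle "$(get_crd_ca_bundle "$crd")"; then
      echo "DETECTED: $crd has placeholder CA bundle"
      return 0
    fi
  done
  return 1
}
```

Calling crds_have_placeholder_ca against a cluster that hit the race would be expected to return 0; on a healthy cluster it returns 1 and prints nothing.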

Evidence from Failed Prow Jobs

Job Run	Date	ACM Version	MCE Version
#2005051399989104640	Dec 27, 2025	2.16.0-113	2.11.0-142
#2005219283428184064	Dec 28, 2025	2.16.0-114	2.11.0-143

Solution

This PR adds a workaround that only triggers if the initial 30-minute wait for MCH fails:

Normal Flow (upstream fix merged):
  Apply MCH → Wait 30min → Success ✓

Workaround Flow (race condition hit):
  Apply MCH → Wait 30min → Fail → Detect race condition → Apply workaround → Wait 30min → Success ✓
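The two flows above can be sketched as a single control function. This is a minimal sketch under assumptions: wait_with_workaround and its three callback arguments are hypothetical names used for illustration, and the real script may structure the logic differently.

```shell
# Usage: wait_with_workaround <wait_fn> <detect_fn> <fix_fn>
# Each argument is the name of a shell function or command:
#   wait_fn    returns 0 once MCH is Running (e.g. a 30-minute wait)
#   detect_fn  returns 0 if the placeholder CA bundle race is detected
#   fix_fn     applies the workaround and returns 0 on success
wait_with_workaround() {
  wait_fn=$1
  detect_fn=$2
  fix_fn=$3

  # Normal flow: the first wait succeeds and nothing else runs.
  if "$wait_fn"; then
    return 0
  fi

  echo "MCH did not reach Running status in the first attempt."

  # Only remediate the specific, known failure mode.
  if ! "$detect_fn"; then
    echo "Race condition not detected; not applying workaround."
    return 1
  fi

  "$fix_fn" || return 1

  # Workaround flow: wait a second time after remediation.
  "$wait_fn"
}
```

Passing the steps in as callbacks keeps the normal path free of any workaround cost, which matches the stated goal of adding no latency to deployments that succeed on the first wait.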

Workaround Steps

Detect - Check if CRDs have the placeholder CA bundle (cGxhY2Vob2xkZXI=)
Patch Services - Add service.beta.openshift.io/serving-cert-secret-name annotation to webhook services
Wait for Secrets - Let service-ca-operator create TLS certificates
Patch CRDs - Extract real CA bundle from secrets and update CRDs
Force Reconciliation - Restart MCE operator to pick up changes
Retry Wait - Wait again for MCH to reach Running status
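The "Patch Services" and "Wait for Secrets" steps above might look roughly like the following. This is a sketch under assumptions: the function names, the namespace, service, and secret arguments, and the polling defaults are all hypothetical, since the PR description does not name the webhook services. Only the annotation key is taken from the description.

```shell
# Usage: annotate_webhook_service <namespace> <service> <secret>
# Asks service-ca-operator to issue a serving cert into <secret>.
annotate_webhook_service() {
  oc annotate service "$2" -n "$1" \
    "service.beta.openshift.io/serving-cert-secret-name=$3" --overwrite
}

# Usage: wait_for_secret <namespace> <secret> [attempts] [delay_seconds]
# Polls until the secret exists; returns 1 if it never appears.
wait_for_secret() {
  ns=$1
  name=$2
  attempts=${3:-30}   # assumed default, not from the PR
  delay=${4:-10}      # assumed default, not from the PR
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if oc get secret "$name" -n "$ns" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A bounded poll is used here rather than an open-ended wait so that a cluster where service-ca-operator is itself unhealthy fails the step instead of hanging it.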

Design Decisions

Decision	Rationale
Workaround only on failure	Doesn't add latency to normal deployments
Specific detection	Only triggers for this exact issue (placeholder CA bundle)
Dead code after fix	Once upstream PR #1309 is merged and released, detection returns false and the workaround never runs
Clear documentation	Functions are well-commented with links to upstream PR

Cleanup Path

Once ocm#1309 is merged and released in ACM/MCE:

The first 30min wait should no longer fail because of this race condition
The workaround functions become dead code
They can be removed in a future cleanup PR

Testing

Bash syntax check passes
Workaround successfully applied manually on live cluster (sno-vhub-0)
MCH reached Running status after workaround

Changes

ci-operator/step-registry/acm/mch/acm-mch-commands.sh
- Added workaround functions for OCM CA bundle race condition
- Modified main wait logic to detect and remediate the issue on failure

/cc @openshift/openshift-team-edge-ztp

openshift-ci · 2025-12-28T13:34:27Z

@ccardenosa: GitHub didn't allow me to request PR reviews from the following users: openshift/openshift-team-edge-ztp.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ccardenosa · 2025-12-28T13:44:48Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp

openshift-ci-robot · 2025-12-28T13:44:51Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-28T13:45:19Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

openshift-ci-robot · 2025-12-28T13:45:22Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-28T16:28:09Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

openshift-ci-robot · 2025-12-28T16:28:12Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-28T16:31:15Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp

openshift-ci-robot · 2025-12-28T16:31:18Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-28T18:35:07Z

✅ Workaround Verified Working

The rehearsal job confirms the workaround successfully resolves the OCM CA bundle race condition:

Successful run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/72976/rehearse-72976-periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp/2005316456039845888

Execution Summary

MCH did not reach Running status in the first attempt.
Checking for known issues and applying workarounds if needed...

============================================================
Applying OCM CA Bundle Race Condition Workaround
Upstream fix: https://github.com/open-cluster-management-io/ocm/pull/1309
============================================================

Checking for OCM CA bundle race condition (PR #1309)...
DETECTED: clustermanagementaddons CRD has placeholder CA bundle

Step 1/6: Patching webhook services with serving-cert-secret-name annotation...
  ✓ All 3 webhook services patched

Step 2/6: Waiting for service-ca-operator to create secrets...
  ✓ All 3 secrets created by service-ca-operator

Step 3/6: Creating ca-bundle-configmap from serving cert...
  ✓ ConfigMap created with real CA bundle

Step 4/6: Patching CRDs with real CA bundles...
  (CRDs auto-updated after configmap creation)

Step 5/6: Verifying CRDs are now Established...
  ✓ clustermanagementaddons.addon.open-cluster-management.io: Established=True
  ✓ managedclusteraddons.addon.open-cluster-management.io: Established=True

Step 6/6: Restarting cluster-manager and forcing reconciliation...
  ✓ cluster-manager deployment restarted
  ✓ multicluster-engine-operator restarted
  ✓ multiclusterengine annotated for reconciliation

============================================================
Workaround applied successfully!
============================================================

multiclusterhub.operator.open-cluster-management.io/multiclusterhub condition met
MCH reached Running status after applying workaround!
Success! ACM 2.16.0-114 is Running

This workaround will be needed until the upstream fix (open-cluster-management-io/ocm#1309) is merged and released in a future ACM/MCE version.
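The verification shown in Step 5/6 of the log can be approximated with oc wait on the Established condition. This is a sketch, not the script's actual code: the function name and the default timeout are assumptions, and the output format only loosely mirrors the log above.

```shell
# Usage: verify_crds_established [timeout]
# Returns 0 only if both addon CRDs report the Established condition.
verify_crds_established() {
  timeout=${1:-120s}   # assumed default, not from the PR
  rc=0
  for crd in \
    clustermanagementaddons.addon.open-cluster-management.io \
    managedclusteraddons.addon.open-cluster-management.io
  do
    if oc wait --for=condition=Established "crd/$crd" \
        --timeout="$timeout" >/dev/null 2>&1; then
      echo "  OK   $crd: Established=True"
    else
      echo "  FAIL $crd: not Established within $timeout"
      rc=1
    fi
  done
  return "$rc"
}
```

Checking every CRD before returning, rather than stopping at the first failure, gives a complete picture in the job log when only one of the two CRDs is still stuck.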

ccardenosa · 2025-12-28T18:44:25Z

/assign @sg-rh

Could you please review this workaround?

ccardenosa · 2025-12-28T18:50:59Z

/pj-rehearse ack

openshift-ci-robot · 2025-12-28T18:51:01Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-28T18:53:46Z

/assign @vboulos

Could you please review this workaround?

ccardenosa · 2025-12-29T08:36:04Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

openshift-ci-robot · 2025-12-29T08:36:07Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2025-12-29T09:58:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ccardenosa
Once this PR has been reviewed and has the lgtm label, please ask for approval from sg-rh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

ci-operator/step-registry/acm/mch/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ccardenosa · 2025-12-29T09:59:11Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

openshift-ci-robot · 2025-12-29T09:59:14Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa · 2025-12-29T10:20:15Z

/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp

openshift-ci-robot · 2025-12-29T10:20:18Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

This adds a workaround for the cluster-manager controller race condition that causes CRDs to be created with an invalid "placeholder" CA bundle.

Upstream fix: open-cluster-management-io/ocm#1309

Problem:
The cluster-manager controller may create ClusterManagementAddOn and ManagedClusterAddOn CRDs before the cert rotation controller creates the CA bundle ConfigMap. When this happens, the CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of "placeholder"), causing:
1. Webhook conversion fails with "InvalidCABundle"
2. CRDs not becoming Established
3. API endpoints not registered
4. MCH fails: "no matches for kind ClusterManagementAddOn"

Additionally, the cluster-manager controller reads CA from ca-bundle-configmap. If this ConfigMap doesn't exist or is empty, it keeps re-applying CRDs with the placeholder CA, overwriting any manual patches.

Workaround (6 steps):
When MCH fails to reach Running status, detect the race condition by checking for placeholder CA bundles in the CRDs, then:
1. Patch webhook services with serving-cert-secret-name annotation
2. Wait for service-ca-operator to create TLS secrets
3. Create ca-bundle-configmap from the serving cert secret
4. Extract real CA bundle from secrets and patch CRDs
5. Verify CRDs become Established
6. Restart cluster-manager and force MCE operator reconciliation

Design:
- The workaround only triggers if the first 30min wait fails
- Detection is specific: checks for the exact placeholder value
- Once upstream fix is merged, this becomes dead code (detection returns false) and can be removed in a future cleanup

Tested on sno-vhub-0: MCE reached Available status and MCH progressed normally with 20/22 components ready after workaround.

Discovered in Prow jobs:
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005051399989104640
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005219283428184064

Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
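Because the controller re-applies the placeholder whenever ca-bundle-configmap is missing or empty, a pre-check on that ConfigMap is one way to decide whether patching the CRDs alone would stick. The following is a sketch under assumptions: the function name is hypothetical, and the namespace is passed in as an argument because the commit message does not state where the ConfigMap lives.

```shell
# Usage: ca_configmap_is_populated <namespace>
# Returns 0 if ca-bundle-configmap exists and has a non-empty data section.
ca_configmap_is_populated() {
  data=$(oc get configmap ca-bundle-configmap -n "$1" \
    -o jsonpath='{.data}' 2>/dev/null) || return 1
  # Treat a missing or empty data map as "not populated".
  [ -n "$data" ] && [ "$data" != "{}" ] && [ "$data" != "map[]" ]
}
```

If this check fails, patching the CRDs directly would likely be overwritten on the next reconcile, which is consistent with the commit creating the ConfigMap (step 3) before touching the CRDs (step 4).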

openshift-ci-robot · 2025-12-29T13:14:44Z

[REHEARSALNOTIFIER]
@ccardenosa: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp	N/A	periodic	Registry content changed
periodic-ci-stolostron-policy-collection-main-ocp4.21-interop-opp-vsphere	N/A	periodic	Registry content changed
periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.14-lp-interop-acm-interop-aws	N/A	periodic	Registry content changed
periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.16-lp-interop-acm-interop-aws	N/A	periodic	Registry content changed
periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.17-lp-interop-acm-interop-aws	N/A	periodic	Registry content changed
periodic-ci-RedHatQE-interop-testing-master-acm-cnv-ocp4.19-p2p-interop-acm-cnv-p2p-aws-419	N/A	periodic	Registry content changed
periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp	N/A	periodic	Registry content changed
periodic-ci-stolostron-policy-collection-main-ocp4.21-interop-opp-aws	N/A	periodic	Registry content changed
periodic-ci-stolostron-policy-collection-main-ocp4.20-interop-opp-aws	N/A	periodic	Registry content changed
periodic-ci-stolostron-acmqe-autotest-main-acm-ocp4.15-lp-interop-acm-interop-aws	N/A	periodic	Registry content changed

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

ccardenosa · 2025-12-29T13:41:03Z

/pj-rehearse ack

openshift-ci-robot · 2025-12-29T13:41:05Z

@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 7cf2ff7 to 915f710 Compare December 28, 2025 16:26

ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 915f710 to 79ab2e2 Compare December 28, 2025 18:39

openshift-ci bot assigned sg-rh Dec 28, 2025

openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 28, 2025

openshift-ci bot assigned vboulos Dec 28, 2025

ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from 79ab2e2 to d5b6d56 Compare December 29, 2025 09:57

openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 29, 2025

ccardenosa force-pushed the fix/clustermanager-cabundle-race-condition-workaround branch from d5b6d56 to 5b57089 Compare December 29, 2025 13:11

openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Dec 29, 2025
