Add workaround for OCM CA bundle race condition in acm-mch step #72976
|
@ccardenosa: GitHub didn't allow me to request PR reviews from the following users: openshift/openshift-team-edge-ztp. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
7cf2ff7 to
915f710
Compare
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-ztp-left-shifting-kpi-ci-4.21-telcov10n-virtualised-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
✅ Workaround Verified Working
The rehearsal job confirms the workaround successfully resolves the OCM CA bundle race condition.
Execution Summary
This workaround will be needed until the upstream fix (open-cluster-management-io/ocm#1309) is merged and released in a future ACM/MCE version. |
915f710 to
79ab2e2
Compare
|
/assign @sg-rh Could you please review this workaround? |
|
/pj-rehearse ack |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/assign @vboulos Could you please review this workaround? |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
79ab2e2 to
d5b6d56
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ccardenosa
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-kni-eco-ci-cd-main-ci-4.21-telcov10n-metal-single-node-hub-ztp |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
This adds a workaround for the cluster-manager controller race condition that causes CRDs to be created with an invalid "placeholder" CA bundle.

Upstream fix: open-cluster-management-io/ocm#1309

Problem:
The cluster-manager controller may create the ClusterManagementAddOn and ManagedClusterAddOn CRDs before the cert rotation controller creates the CA bundle ConfigMap. When this happens, the CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of "placeholder"), causing:
1. Webhook conversion fails with "InvalidCABundle"
2. CRDs do not become Established
3. API endpoints are not registered
4. MCH fails: "no matches for kind ClusterManagementAddOn"

Additionally, the cluster-manager controller reads the CA from ca-bundle-configmap. If this ConfigMap doesn't exist or is empty, the controller keeps re-applying the CRDs with the placeholder CA, overwriting any manual patches.

Workaround (6 steps):
When MCH fails to reach Running status, detect the race condition by checking for placeholder CA bundles in the CRDs, then:
1. Patch webhook services with the serving-cert-secret-name annotation
2. Wait for service-ca-operator to create TLS secrets
3. Create ca-bundle-configmap from the serving cert secret
4. Extract the real CA bundle from the secrets and patch the CRDs
5. Verify the CRDs become Established
6. Restart cluster-manager and force MCE operator reconciliation

Design:
- The workaround only triggers if the first 30-minute wait fails
- Detection is specific: it checks for the exact placeholder value
- Once the upstream fix is merged, this becomes dead code (detection returns false) and can be removed in a future cleanup

Tested on sno-vhub-0: MCE reached Available status and MCH progressed normally with 20/22 components ready after the workaround.

Discovered in Prow jobs:
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005051399989104640
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005219283428184064

Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
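The six steps listed in the commit message could look roughly like the sketch below. This is not the code from acm-mch-commands.sh. The namespace, the service and secret names passed in by the caller, the `ca-bundle.crt` ConfigMap key, and the use of `tls.crt` as the CA source are all assumptions made for illustration. Only the annotation name, the `ca-bundle-configmap` name, and the ordering of the steps come from the PR. Step 6 is left to the caller (for example via `oc rollout restart`), because the PR text does not say how MCE reconciliation is forced.

```shell
#!/usr/bin/env bash
# Sketch of workaround steps 1-5. Names marked "assumed" are illustrative.

NS="${NS:-open-cluster-management-hub}"   # assumed namespace of the webhook services
CA_CONFIGMAP="ca-bundle-configmap"        # ConfigMap name taken from the PR text

# Step 1: ask service-ca-operator to issue a serving cert for a webhook service.
annotate_webhook_service() {
  local svc="$1" secret="$2"
  oc annotate service "${svc}" -n "${NS}" --overwrite \
    "service.beta.openshift.io/serving-cert-secret-name=${secret}"
}

# Step 2: poll until the TLS secret exists (attempts x interval seconds).
wait_for_secret() {
  local secret="$1" attempts="${2:-30}" interval="${3:-10}" i=0
  while [ "${i}" -lt "${attempts}" ]; do
    if oc get secret "${secret}" -n "${NS}" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${interval}"
  done
  return 1
}

# Build the JSON patch that replaces the conversion webhook caBundle.
ca_bundle_patch() {
  printf '[{"op":"replace","path":"/spec/conversion/webhook/clientConfig/caBundle","value":"%s"}]' "$1"
}

# Steps 3-5: create the CA ConfigMap, patch each CRD, wait for Established.
# Usage: apply_ca_bundle <secret> <crd> [<crd> ...]
apply_ca_bundle() {
  local secret="$1" ca_b64 tmp crd
  shift
  # Secret data is already base64, which is the encoding caBundle expects.
  ca_b64="$(oc get secret "${secret}" -n "${NS}" -o jsonpath='{.data.tls\.crt}')"
  [ -n "${ca_b64}" ] || return 1

  # Step 3: (re)create the ConfigMap from the decoded certificate.
  tmp="$(mktemp)"
  printf '%s' "${ca_b64}" | base64 -d > "${tmp}" || { rm -f "${tmp}"; return 1; }
  oc create configmap "${CA_CONFIGMAP}" -n "${NS}" \
    --from-file="ca-bundle.crt=${tmp}" --dry-run=client -o yaml \
    | oc apply -n "${NS}" -f -
  rm -f "${tmp}"

  for crd in "$@"; do
    # Step 4: replace the placeholder CA in the CRD.
    oc patch crd "${crd}" --type=json -p "$(ca_bundle_patch "${ca_b64}")" || return 1
    # Step 5: block until the API server reports the CRD as Established.
    oc wait --for=condition=Established "crd/${crd}" --timeout=120s || return 1
  done
}
```

Creating the ConfigMap before patching the CRDs matters here: per the commit message, the controller re-applies the placeholder CA for as long as ca-bundle-configmap is missing or empty, so a CRD patch applied first would be overwritten.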
d5b6d56 to
5b57089
Compare
|
[REHEARSALNOTIFIER]
Interacting with pj-rehearse
Comment:
Once you are satisfied with the results of the rehearsals, comment: |
|
/pj-rehearse ack |
|
@ccardenosa: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Summary
This PR adds a workaround to the acm-mch step to handle a race condition in the OCM cluster-manager controller that causes MultiClusterHub deployments to fail intermittently.
Related Issues
- Upstream fix: open-cluster-management-io/ocm#1309
Problem
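The cluster-manager controller has a race condition where it may create CRDs (ClusterManagementAddOn, ManagedClusterAddOn) before the cert rotation controller creates the CA bundle ConfigMap. When this happens:
- The CRDs are created with caBundle: cGxhY2Vob2xkZXI= (base64 of literal string "placeholder")
- Webhook conversion fails with an InvalidCABundle error
- The CRDs remain in Established: False state
- MCH fails with "no matches for kind 'ClusterManagementAddOn' in version 'addon.open-cluster-management.io/v1alpha1'"
Evidence from Failed Prow Jobs
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005051399989104640
- periodic-ci-...-telcov10n-virtualised-single-node-hub-ztp/2005219283428184064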
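The placeholder check described in the Problem section can be sketched as follows. This is a minimal illustration, not the script from the PR. It assumes the CA bundle lives at `spec.conversion.webhook.clientConfig.caBundle` in each CRD and that the CRD resource names are the lowercase plurals shown below. Both are assumptions, as are the function names.

```shell
#!/usr/bin/env bash
# Sketch only: CRD names and the jsonpath below are assumptions.

PLACEHOLDER_CA="cGxhY2Vob2xkZXI="   # base64 of the literal string "placeholder"

# Print the caBundle stored in a CRD's conversion webhook config (empty if unset).
get_crd_ca_bundle() {
  oc get crd "$1" \
    -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' 2>/dev/null
}

# Return 0 if the given CRD still carries the placeholder CA bundle.
crd_has_placeholder_ca() {
  [ "$(get_crd_ca_bundle "$1")" = "${PLACEHOLDER_CA}" ]
}

# Return 0 if any of the affected add-on CRDs has the placeholder CA bundle.
race_condition_detected() {
  local crd
  for crd in \
    clustermanagementaddons.addon.open-cluster-management.io \
    managedclusteraddons.addon.open-cluster-management.io; do
    if crd_has_placeholder_ca "${crd}"; then
      return 0
    fi
  done
  return 1
}
```

Comparing against the exact encoded value keeps the check narrow: a CRD with a real CA bundle, or with no conversion webhook at all, is never flagged.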
Solution
This PR adds a workaround that only triggers if the initial 30-minute wait for MCH fails:
Workaround Steps
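1. Detect the placeholder CA bundle in the CRDs (cGxhY2Vob2xkZXI=)
2. Add the service.beta.openshift.io/serving-cert-secret-name annotation to webhook services
3. Wait for service-ca-operator to create TLS certificates
4. Create ca-bundle-configmap from the serving cert secret
5. Extract the real CA bundle from the secrets and patch the CRDs
6. Verify the CRDs become Established
7. Restart cluster-manager and force MCE operator reconciliation
Design Decisions
- The workaround only triggers if the first 30-minute wait fails
- Detection is specific: it checks for the exact placeholder value
- Once the upstream fix is merged, detection returns false and the workaround becomes dead code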
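The gating described here (wait first, run the workaround only after that wait fails and the placeholder is actually present) could be structured like the sketch below. The function names, the `MCH_NS` default, the polling defaults (60 attempts at 30 seconds, roughly 30 minutes), and the use of `.status.phase` on the MultiClusterHub resource are assumptions for illustration. The detection and workaround routines are passed in as command names so the block stands on its own.

```shell
#!/usr/bin/env bash
# Sketch of the "only after the first wait fails" gating. Names are hypothetical.

MCH_NS="${MCH_NS:-open-cluster-management}"    # assumed MCH namespace
MCH_WAIT_ATTEMPTS="${MCH_WAIT_ATTEMPTS:-60}"   # 60 x 30s is about 30 minutes
MCH_WAIT_INTERVAL="${MCH_WAIT_INTERVAL:-30}"

# Poll the first MultiClusterHub until its phase is Running.
wait_for_mch_running() {
  local i=0 phase
  while [ "${i}" -lt "${MCH_WAIT_ATTEMPTS}" ]; do
    phase="$(oc get multiclusterhub -n "${MCH_NS}" \
      -o jsonpath='{.items[0].status.phase}' 2>/dev/null)"
    if [ "${phase}" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${MCH_WAIT_INTERVAL}"
  done
  return 1
}

# Usage: ensure_mch_running <detect_cmd> <workaround_cmd>
#   detect_cmd     returns 0 when the placeholder CA bundle is present
#   workaround_cmd applies the fix and returns 0 on success
ensure_mch_running() {
  local detect_cmd="$1" workaround_cmd="$2"

  # Normal path: no workaround code runs on a healthy cluster.
  if wait_for_mch_running; then
    return 0
  fi

  # Fail fast on unrelated problems instead of masking them.
  if ! "${detect_cmd}"; then
    echo "MCH is not Running and no placeholder CA bundle was found" >&2
    return 1
  fi

  "${workaround_cmd}" || return 1
  wait_for_mch_running
}
```

With this shape, a fixed upstream release never reaches the workaround branch, since the first wait succeeds or the detection command returns non-zero.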
Cleanup Path
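Once ocm#1309 is merged and released in ACM/MCE:
- Detection returns false, so the workaround never runs
- The workaround code can be removed in a future cleanup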
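Testing
Tested on sno-vhub-0: MCE reached Available status and MCH progressed normally with 20/22 components ready after the workaround.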
Changes
ci-operator/step-registry/acm/mch/acm-mch-commands.sh
/cc @openshift/openshift-team-edge-ztp