[release-4.19] OCPBUGS-77367,OCPBUGS-77844: Fix ignition-server pod restarts#7858
@jparrill: This pull request references Jira Issue OCPBUGS-77367, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
This pull request references Jira Issue OCPBUGS-77844, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jparrill |
|
/jira refresh |
… cache

Implement a comprehensive mirror availability cache to prevent MIRRORED_RELEASE_IMAGE environment variable flapping that was causing massive deployment regeneration (observed generation: 1,068,097).

The cache prevents non-deterministic mirror selection by:
- Caching mirror availability results with differential TTL:
  * Available mirrors: 5 minutes (longer for stability)
  * Unavailable mirrors: 1 minute (shorter for faster recovery)
- Using 15-second timeout for mirror verification
- Thread-safe operations with sync.RWMutex
- Automatic cleanup of expired entries

This resolves the MIRRORED_RELEASE_IMAGE flapping issue where the SeekOverride function was performing real-time availability checks on each call, leading to inconsistent mirror selection between deployments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…ble flapping

Remove the MIRRORED_RELEASE_IMAGE environment variable from the ignition-server deployment. This env var was set at deploy-time via SeekOverride, which performs live registry connectivity checks with 15-second timeouts. Network conditions caused it to return different mirror URLs on each reconciliation, triggering non-deterministic deployment updates and pod restarts.

MIRRORED_RELEASE_IMAGE is not consumed at runtime by the ignition-server binary. Mirror resolution is already handled through a separate mechanism: ReleaseProvider.Lookup() uses OPENSHIFT_IMG_OVERRIDES at runtime with automatic fallback to the original registry when mirrors are unavailable. The release image for ignition payloads is delivered through token secrets (nodePool.Spec.Release.Image), not through deployment environment variables.

Fixes: OCPBUGS-60185

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…tation

Remove the getRegistryOverrides() function which performed live HTTP registry connectivity checks (via LookupMappedImage/GetMetadata) during every CPO reconciliation cycle. These checks returned non-deterministic results depending on network conditions, causing the --registry-overrides container argument to change between reconciliations and triggering unnecessary deployment rollouts and pod restarts.

The three per-image overrides computed by getRegistryOverrides() for cluster-config-api, machine-config-operator, and cluster-authentication-operator are redundant: the ignition-server already resolves mirrors at runtime via OPENSHIFT_IMG_OVERRIDES + ImageMetadataProvider.GetOverride()/SeekOverride() in LocalIgnitionProvider.GetPayload(). Additionally, cluster-authentication-operator is not used by the ignition-server at all.

Replace with ign.releaseProvider.GetRegistryOverrides() which returns static registry-level overrides from the HostedCluster spec, consistent with how other cpov2 components handle registry overrides.

Fixes: OCPBUGS-60185

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
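To illustrate why static overrides keep the deployment stable, here is a small Go sketch. It is an assumption-laden illustration, not the PR's code: `registryOverridesArg` is a hypothetical helper, the `source=mirror` comma-separated value format is assumed, and the example registries are made up. The point is that an argument derived only from spec data, rendered in sorted key order, is byte-identical on every reconciliation, so the pod template does not change.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registryOverridesArg renders a static source=mirror map as a single flag.
// Keys are sorted because Go map iteration order is randomized; without the
// sort the rendered argument could differ between reconciliations.
func registryOverridesArg(overrides map[string]string) string {
	if len(overrides) == 0 {
		return ""
	}
	keys := make([]string, 0, len(overrides))
	for k := range overrides {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+overrides[k])
	}
	return "--registry-overrides=" + strings.Join(pairs, ",")
}

func main() {
	// Hypothetical overrides, standing in for values taken from the HostedCluster spec.
	static := map[string]string{
		"quay.io/openshift-release-dev": "mirror.example.com/ocp",
		"registry.redhat.io":            "mirror.example.com/redhat",
	}
	first := registryOverridesArg(static)
	second := registryOverridesArg(static)
	fmt.Println(first)
	fmt.Println("stable across reconciliations:", first == second)
}
```

No network call is involved in producing the argument, which is the property the commit relies on; mirror availability is left to runtime resolution.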
Adapt the changes from the previous commits (removal of MIRRORED_RELEASE_IMAGE and non-deterministic getRegistryOverrides/LookupMappedImage logic) to the CPOv1 ignition server reconciliation path.

Changes:
- Remove MIRRORED_RELEASE_IMAGE env var from reconcileDeployment
- Remove mirroredReleaseImage parameter from ReconcileIgnitionServer and reconcileDeployment signatures
- Remove LookupMappedImage-based registry override computation that performed live HTTP registry checks causing non-deterministic results
- Use static registryOverrides from ReleaseProvider directly
- Clean up unused imports (common, registryclient)

This ensures functional parity between CPOv1 and CPOv2 for the ignition-server deployment reconciliation.

Fixes: OCPBUGS-77367

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
|
@jparrill: all tests passed! |
|
/verified by @xiuwang |
|
@xiuwang: This PR has been marked as verified by @xiuwang. |
|
// Cache miss - verify mirror availability with 15s timeout
verifyCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
defer cancel()
nit: it might make sense to make the change from #7846 so we don't keep propagating the bug.
Makes sense to me. I will propagate the fix ASAP, but we also need this in 4.18 ASAP. My plan is to push this through to 4.18 first and then backport #7846, because the customer needs these changes urgently.
|
/lgtm |
|
/jira refresh |
|
@jparrill: This pull request references Jira Issue OCPBUGS-77367, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug. A review was requested from the QA contact.
This pull request references Jira Issue OCPBUGS-77844, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug. A review was requested from the QA contact. |
Merged commit ab16d86 into openshift:release-4.19
|
@jparrill: Jira Issue OCPBUGS-77367: Some pull requests linked via external trackers have merged, and one pull request linked via external tracker has not merged. All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.
Jira Issue OCPBUGS-77367 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.
Jira Issue OCPBUGS-77844 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. |
|
/jira backport release-4.18 |
|
@jparrill: Backport issues have been created. Cherrypicks to the requested branches are queued to be created after this PR merges. |
|
@openshift-ci-robot: #7858 failed to apply on top of branch "release-4.18". |
|
Fix included in accepted release 4.19.0-0.nightly-2026-03-07-080023 |
|
/jira refresh |
|
@jparrill: Jira Issue OCPBUGS-77367: Some pull requests linked via external trackers have merged, and one pull request linked via external tracker has not merged. All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.
Jira Issue OCPBUGS-77367 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.
Jira Issue OCPBUGS-77844 is in an unrecognized state (Verified) and will not be moved to the MODIFIED state. |
Summary
- Prevent MIRRORED_RELEASE_IMAGE env var flapping that caused massive deployment regeneration
- Remove MIRRORED_RELEASE_IMAGE from the ignition-server deployment (dead code not consumed at runtime)
- Remove getRegistryOverrides(), which performed non-deterministic live HTTP registry checks during every CPO reconciliation, replacing it with static registry-level overrides from the HostedCluster spec

Fixes
Backport adaptations
The following changes were made beyond the pure cherry-picks to adapt to release-4.19:
- support/util/imagemetadata_test.go: Removed unused "fmt" and "os" imports that were residual from the original commit but not referenced in the test code.
- control-plane-operator/controllers/hostedcontrolplane/v2/ignitionserver/deployment_test.go: Replaced t.Context() with context.Background() (adding the "context" import), since t.Context() requires Go 1.24 and release-4.19 uses Go 1.23.

All merge conflicts were resolved by accepting the incoming commit version (removal of getRegistryOverrides() and the controller-runtime/pkg/client import).

Test plan
- MIRRORED_RELEASE_IMAGE env var
- go test ./support/util/... ./control-plane-operator/controllers/hostedcontrolplane/v2/ignitionserver/...

🤖 Generated with Claude Code