NO-JIRA: Fix static pod pruning logic for non-contiguous set of revisions #2060
Conversation
@grandeit: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hi @grandeit. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: grandeit

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Friendly ping to some active reviewers:
@grandeit Correct me if I'm wrong, but doesn't the static pod controller here upgrade through sequential versions? If version 5 is protected, that implies to me that some pod is still stuck on this version, while others are on a later version, say 100. That stuck pod needs to upgrade through every version from 5 to 100 to catch up. Therefore, if we were to prune the intermediate versions, wouldn't that pod become perma-stuck, since the intermediate versions it needs to iterate through would no longer exist?
Hey @JoelSpeed, the controller uses the logic in library-go/pkg/operator/staticpod/controller/installer/installer_controller.go, lines 882 to 905 (at 289839b).
In your example, the node with `LastFailedRevision: 5` would be updated directly to the latest available revision (100); it does not need to step through the intermediate revisions. I think the idea behind keeping the failed revisions is to have a debugging possibility later on.

Even before my proposed fix, the intermediate revisions are pruned as long as there is no failed revision from a very early point in time still hanging around. Such an old failed revision prevents the pruning because it (imho incorrectly) triggers the shortcut in library-go/pkg/operator/staticpod/controller/prune/prune_controller.go, lines 100 to 106 (at 289839b).
Hope this helps :)
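To make the shortcut being discussed concrete, here is a minimal, simplified sketch. It is not the actual code referenced at prune_controller.go lines 100 to 106; the names `keepAll`, `protectedRevisions`, and `latestAvailableRevision` are placeholders chosen for illustration.

```go
package main

import "fmt"

// keepAll is a simplified stand-in for the shortcut described above: if the
// (sorted) protected set starts at revision 1 and ends at the latest available
// revision, pruning is skipped entirely. Because only the first and last
// elements are inspected, a set with gaps in the middle still returns true.
func keepAll(protectedRevisions []int32, latestAvailableRevision int32) bool {
	if len(protectedRevisions) == 0 {
		return false
	}
	return protectedRevisions[0] == 1 &&
		protectedRevisions[len(protectedRevisions)-1] == latestAvailableRevision
}

func main() {
	// Despite the gap between 2 and 99, the first/last check still says "keep all".
	fmt.Println(keepAll([]int32{1, 2, 99, 100}, 100)) // true
}
```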
/ok-to-test
@grandeit: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Fix static pod pruning logic for non-contiguous set of revisions
Problem
The `PruneController` contains a logic bug in `revisionsToKeep()` that prevents pruning when the protected revision set is non-contiguous but spans from revision `1` to `LatestAvailableRevision`.

Scenario that triggers the bug:
- Node has a very old `LastFailedRevision: 5`
- Cluster is now at `LatestAvailableRevision: 100`
- Limits are `failedRevisionLimit: 5`, `succeededRevisionLimit: 5`
- Protected set becomes `{1,2,3,4,5,96,97,98,99,100}` (10 revisions)
The buggy logic sees:

- First element: `1`
- Last element: `100`
- Returns `keepAll = true` -> no pruning happens

This causes a lot of `revision-status-*` ConfigMaps (and their owned ConfigMaps) to accumulate until a later failed revision eventually removes the first revision from the set.
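As a rough illustration of the scale of the problem, the following self-contained snippet (the numbers are taken from the scenario above; everything else is hypothetical) counts how many revisions stay alive only because the first/last check fires:

```go
package main

import "fmt"

func main() {
	// Protected set from the scenario above; LatestAvailableRevision is 100.
	protected := map[int32]bool{}
	for _, r := range []int32{1, 2, 3, 4, 5, 96, 97, 98, 99, 100} {
		protected[r] = true
	}

	// Revisions that could be pruned if the keepAll shortcut did not fire.
	var prunable []int32
	for r := int32(1); r <= 100; r++ {
		if !protected[r] {
			prunable = append(prunable, r)
		}
	}
	fmt.Printf("%d prunable revisions (%d-%d) are kept alive by the shortcut\n",
		len(prunable), prunable[0], prunable[len(prunable)-1])
	// Output: 90 prunable revisions (6-95) are kept alive by the shortcut
}
```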
Solution

Check whether the set has exactly `LatestAvailableRevision` elements before triggering the `keepAll` optimization. This ensures that the set has no gaps and is in fact contiguous.
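A minimal sketch of that check, reusing the simplified `keepAll` shape from the conversation above; the names are placeholders and this is not the exact patch:

```go
package prune // hypothetical package name, for illustration only

// keepAllFixed adds the guard proposed in this PR to the simplified shortcut:
// revisions are numbered 1..latestAvailableRevision, so a sorted protected set
// (with unique elements) that starts at 1, ends at latestAvailableRevision,
// and contains exactly latestAvailableRevision elements cannot have any gaps.
func keepAllFixed(protectedRevisions []int32, latestAvailableRevision int32) bool {
	if len(protectedRevisions) == 0 {
		return false
	}
	return int32(len(protectedRevisions)) == latestAvailableRevision &&
		protectedRevisions[0] == 1 &&
		protectedRevisions[len(protectedRevisions)-1] == latestAvailableRevision
}
```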
Testing
Added test case: "prunes non-contiguous set (keeps 1-10 and 96-100, prunes 11-95)" that verifies:
- Two nodes with `LastFailedRevision: 5` and `LastFailedRevision: 10`
- `CurrentRevision: 100` on both nodes
- `LatestAvailableRevision: 100`
- Protected set: `{1,2,3,4,5,6,7,8,9,10,96,97,98,99,100}` (15 revisions)
- Revisions `11 - 95` are pruned
- The `keepAll` optimization does not trigger
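For completeness, here is a minimal, self-contained test sketch mirroring the scenario in the list above. It exercises the hypothetical `keepAllFixed` helper from the previous sketch, not the actual `PruneController` test added in this PR:

```go
package prune

import "testing"

// TestKeepAllFixedNonContiguous mirrors the scenario above: the protected set
// {1..10, 96..100} spans revision 1 to the latest available revision (100),
// but has gaps, so the keepAll optimization must not trigger.
func TestKeepAllFixedNonContiguous(t *testing.T) {
	protected := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 96, 97, 98, 99, 100}

	if keepAllFixed(protected, 100) {
		t.Fatal("expected keepAll not to trigger for a non-contiguous protected set")
	}

	// A genuinely contiguous set from 1 to the latest revision should still
	// short-circuit and keep everything.
	contiguous := make([]int32, 0, 100)
	for r := int32(1); r <= 100; r++ {
		contiguous = append(contiguous, r)
	}
	if !keepAllFixed(contiguous, 100) {
		t.Fatal("expected keepAll to trigger for a fully contiguous protected set")
	}
}
```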