
Conversation

@grandeit

Fix static pod pruning logic for non-contiguous set of revisions

Problem

The PruneController contains a logic bug in revisionsToKeep() that prevents pruning when the protected revision set is non-contiguous but spans from revision 1 to LatestAvailableRevision.

Scenario that triggers the bug:

- Node has very old LastFailedRevision: 5
- Cluster is now at LatestAvailableRevision: 100
- Limits are failedRevisionLimit: 5, succeededRevisionLimit: 5
- Protected set becomes {1,2,3,4,5,96,97,98,99,100} (10 revisions)

The buggy logic sees:

- First element: 1
- Last element: 100
- Returns keepAll = true -> no pruning happens.

This causes a lot of revision-status-* ConfigMaps (and their owned ConfigMaps) to accumulate until a later failed revision eventually removes the first revision from the set.
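Roughly sketched below (this is not the exact library code, only the shape of the check described above), the shortcut looks only at the smallest and largest protected revision, so the gap between 5 and 96 goes unnoticed:

```go
// Rough sketch of the described shortcut; not the actual revisionsToKeep code.
package prune

import "k8s.io/apimachinery/pkg/util/sets"

func keepAllShortcut(keep sets.Set[int32], latest int32) bool {
	ordered := sets.List(keep) // sorted, e.g. [1 2 3 4 5 96 97 98 99 100]
	// Only the endpoints are inspected, so the gap 6-95 is never checked.
	return len(ordered) > 0 && ordered[0] == 1 && ordered[len(ordered)-1] == latest
}
```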

Solution

Check whether the set has exactly LatestAvailableRevision elements before triggering the keepAll optimization. This ensures that the set has no gaps and is in fact contiguous.
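A minimal sketch of the proposed condition, continuing the illustration above (names are illustrative, not the exact patch):

```go
// Minimal sketch of the fixed condition; names are illustrative, not the exact patch.
func keepAllContiguous(keep sets.Set[int32], latest int32) bool {
	// All protected revisions lie in 1..latest, so only the full range
	// {1, ..., latest} can contain exactly `latest` elements. A non-contiguous
	// set like {1-5, 96-100} has 10 elements and no longer short-circuits pruning.
	return keep.Len() == int(latest) && keep.Has(1) && keep.Has(latest)
}
```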

Testing

Added test case "prunes non-contiguous set (keeps 1-10 and 96-100, prunes 11-95)", which verifies (the protected-set arithmetic is sketched below):

- Two nodes with LastFailedRevision: 5 and LastFailedRevision: 10
- CurrentRevision: 100 on both nodes
- LatestAvailableRevision: 100
- Protected set: {1,2,3,4,5,6,7,8,9,10,96,97,98,99,100} (15 revisions)
- Revisions 11-95 are pruned.
- The keepAll optimization does not trigger.
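For illustration, the protected set for this scenario can be recomputed standalone with the same limits; this is only a sketch of the expected arithmetic, not the actual unit test added in this PR:

```go
// Standalone illustration of the protected set for the test scenario above;
// not the actual PruneController unit test.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	const latest, failedLimit, succeededLimit = int32(100), 5, 5

	keep := sets.New[int32]()
	protect := func(rev int32, limit int) {
		for i := rev; i > rev-int32(limit) && i > 0; i-- {
			keep.Insert(i)
		}
	}

	// LatestAvailableRevision and the max(failedLimit, succeededLimit)-1 revisions before it.
	protect(latest, max(failedLimit, succeededLimit))
	// Two nodes, both at CurrentRevision 100, with LastFailedRevision 5 and 10.
	for _, lastFailed := range []int32{5, 10} {
		protect(lastFailed, failedLimit)
		protect(100, succeededLimit)
	}

	fmt.Println(sets.List(keep))           // [1 2 ... 10 96 ... 100], i.e. 15 revisions
	fmt.Println(keep.Len() == int(latest)) // false -> keepAll must not trigger, 11-95 get pruned
}
```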

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 27, 2025
@openshift-ci-robot

@grandeit: This pull request explicitly references no jira issue.


In response to this:

Fix static pod pruning logic for non-contiguous set of revisions


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 27, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 27, 2025

Hi @grandeit. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented Nov 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: grandeit
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@grandeit
Author

Friendly ping to some active reviewers:
@p0lyn0mial @JoelSpeed @damdo 👋

@JoelSpeed
Contributor

@grandeit Correct me if I'm wrong, but doesn't the static pod controller here upgrade through sequential versions?

Therefore if I have protected version 5, this implies to me that some pod is still stuck on this version, meanwhile others are on a later version, say 100. That stuck pod needs to upgrade through every version from 5 through 100 to catch up. Therefore, if we were to prune the intermediate versions, that pod would become perma-stuck, as we would no longer have the intermediate versions for it to iterate through?

@grandeit
Author

grandeit commented Jan 2, 2026


Hey @JoelSpeed
The static pods do not upgrade sequentially through the versions. Each node jumps directly to the target revision.

The controller uses getRevisionToStart to determine which revision to install on a given node. When the previous node (the one that was just upgraded) has a newer revision and the current node did not fail upgrading to that revision, it sets that revision as the new upgrade target (around line 900):

```go
// getRevisionToStart returns the revision we need to start or zero if none
func (c *InstallerController) getRevisionToStart(currNodeState, prevNodeState *operatorv1.NodeStatus, operatorStatus *operatorv1.StaticPodOperatorStatus) int32 {
	if prevNodeState == nil {
		currentAtLatest := currNodeState.CurrentRevision == operatorStatus.LatestAvailableRevision
		if !currentAtLatest {
			return operatorStatus.LatestAvailableRevision
		}
		return 0
	}

	prevFinished := prevNodeState.TargetRevision == 0
	prevInTransition := prevNodeState.CurrentRevision != prevNodeState.TargetRevision
	if prevInTransition && !prevFinished {
		return 0
	}

	prevAhead := prevNodeState.CurrentRevision > currNodeState.CurrentRevision
	failedAtPrev := currNodeState.LastFailedRevision == prevNodeState.CurrentRevision
	if prevAhead && !failedAtPrev {
		return prevNodeState.CurrentRevision
	}

	return 0
}
```

In your example, the node with currentRevision=5 will jump to revision 100.
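Concretely, with the values from your example (an illustrative trace, not output from the real controller):

```go
// Illustrative trace of getRevisionToStart for the example above.
//   prevNodeState: CurrentRevision=100, TargetRevision=0 (finished)
//   currNodeState: CurrentRevision=5,   LastFailedRevision=5
//
//   prevFinished     = true   (TargetRevision == 0)
//   prevInTransition = true   (100 != 0), but !prevFinished is false, so no early return
//   prevAhead        = true   (100 > 5)
//   failedAtPrev     = false  (5 != 100)
//   => returns 100: the node is told to install revision 100 directly, skipping 6-99.
```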

I think the idea behind keeping the failed revisions is to leave a way to debug them later on.

Even without my proposed fix, the intermediate revisions are pruned as long as there is no failed revision from a very early point in time still hanging around; such a leftover prevents pruning because it imho incorrectly triggers the shortcut.
There are some details of the current behaviour described in the comment of the revisionsToKeep function:

```go
// revisionsToKeep approximates the set of revisions to keep: spec.failedRevisionsLimit for failed revisions,
// spec.succeededRevisionsLimit for succeed revisions (for all nodes). The approximation goes by:
// - don't prune LatestAvailableRevision and the max(spec.failedRevisionLimit, spec.succeededRevisionLimit) - 1 revisions before it.
// - don't prune a node's CurrentRevision and the spec.succeededRevisionLimit - 1 revisions before it.
// - don't prune a node's TargetRevision and the spec.failedRevisionLimit - 1 revisions before it.
// - don't prune a node's LastFailedRevision and the spec.failedRevisionLimit - 1 revisions before it.
func (c *PruneController) revisionsToKeep(status *operatorv1.StaticPodOperatorStatus, failedLimit, succeededLimit int) (all bool, keep sets.Set[int32]) {
```

Hope this helps :)

@JoelSpeed
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 2, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 2, 2026

@grandeit: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
