fix: bug fixes in rollout controller and work-generator #379

zhiying-lin · 2025-12-15T12:47:16Z

Description of your changes

Fix the 1MB test failure: https://github.com/kubefleet-dev/kubefleet/actions/runs/20092594675/job/57649235364

We made a wrong assumption when getting the master resourceSnapshot (https://github.com/kubefleet-dev/kubefleet/blob/main/pkg/controllers/workgenerator/controller.go#L459-L471): we used the cached client, so the master resourceSnapshot was not found.

This inconsistency can happen whenever the rollout controller rolls out new changes.

if areAllWorkSynced(existingWorks, resourceBinding, resourceOverrideSnapshotHash, clusterResourceOverrideSnapshotHash) {
	klog.V(2).InfoS("All the works are synced with the resourceBinding even if the resource snapshot index is removed", "resourceBinding", resourceBindingRef)
	return true, updateAny.Load(), nil
}

The list of existing works was empty, so the check returned true and the binding was marked as available and applied, which was wrong.
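To illustrate why an empty list slips through, here is a minimal, self-contained sketch. The `work` type and the `allSynced` / `allSyncedGuarded` helpers are hypothetical stand-ins, not the real `areAllWorkSynced` implementation; the only assumption is that the check looks for a mismatching work and reports true when it finds none.

```go
package main

import "fmt"

// work is a hypothetical stand-in for a Work object; only the override
// snapshot hash matters for this illustration.
type work struct {
	name         string
	overrideHash string
}

// allSynced mimics a "no mismatch found" check. With zero works the loop
// body never runs, so the function reports true (vacuous truth).
func allSynced(works []work, wantHash string) bool {
	for _, w := range works {
		if w.overrideHash != wantHash {
			return false
		}
	}
	return true
}

// allSyncedGuarded treats an empty list as "not synced", on the assumption
// that at least one work is always expected to exist for a binding.
func allSyncedGuarded(works []work, wantHash string) bool {
	if len(works) == 0 {
		return false
	}
	return allSynced(works, wantHash)
}

func main() {
	var none []work
	fmt.Println(allSynced(none, "abc"))        // true
	fmt.Println(allSyncedGuarded(none, "abc")) // false
	fmt.Println(allSyncedGuarded([]work{{name: "w", overrideHash: "abc"}}, "abc")) // true
}
```

In this model the unguarded check accepts an empty list, which matches the symptom described above; the guarded variant only reports true when at least one work exists and all of them match.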

The fix is to use the cached client in both the rollout controller and the work-generator when querying the resourceSnapshot.
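The kind of inconsistency described here can be modelled with a small sketch. It is a simplified illustration under stated assumptions, not the controllers' code: `snapshotReader`, `liveStore`, `laggingCache`, and `handoff` are hypothetical names, with `liveStore` standing in for a direct API read and `laggingCache` for a cached client that lags behind it.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("resourceSnapshot not found")

// snapshotReader is a hypothetical read interface used by both sides.
type snapshotReader interface {
	Get(name string) (string, error)
}

// liveStore models reading straight from the API server.
type liveStore struct{ objs map[string]string }

func (s *liveStore) Get(name string) (string, error) {
	if v, ok := s.objs[name]; ok {
		return v, nil
	}
	return "", errNotFound
}

// laggingCache models a cached client: it only reflects the store after sync().
type laggingCache struct {
	src  *liveStore
	objs map[string]string
}

func (c *laggingCache) sync() {
	c.objs = make(map[string]string, len(c.src.objs))
	for k, v := range c.src.objs {
		c.objs[k] = v
	}
}

func (c *laggingCache) Get(name string) (string, error) {
	if v, ok := c.objs[name]; ok {
		return v, nil
	}
	return "", errNotFound
}

// handoff models the rollout side selecting a snapshot and the generator
// side resolving it. It returns whether the rollout happened and the error
// the generator hit, if any.
func handoff(rollout, generator snapshotReader, name string) (bool, error) {
	if _, err := rollout.Get(name); err != nil {
		return false, nil // rollout does not see the snapshot yet, so it waits
	}
	_, err := generator.Get(name)
	return true, err
}

func main() {
	store := &liveStore{objs: map[string]string{"crp-1-snapshot": "master"}}
	cache := &laggingCache{src: store} // not synced yet

	fmt.Println(handoff(store, cache, "crp-1-snapshot")) // mixed readers
	fmt.Println(handoff(cache, cache, "crp-1-snapshot")) // same reader, before sync
	cache.sync()
	fmt.Println(handoff(cache, cache, "crp-1-snapshot")) // same reader, after sync
}
```

With mixed readers the rollout proceeds while the generator gets a not-found error. With one shared reader, the rollout simply waits until the snapshot is visible to both sides, which is the consistency property this change relies on in this simplified model.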

Fixes #

I have:

Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Added unit tests and ran the e2e tests multiple times.

Special notes for your reviewer

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>

codecov · 2025-12-15T13:23:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


pkg/controllers/rollout/controller.go

ryanzhang-oss · 2025-12-16T07:01:27Z

pkg/controllers/workgenerator/controller.go

+	// Even for the case where the resource snapshot has no selected resources,
+	// there should be one work created for the empty resource list.
+	if len(existingWorks) == 0 {
+		return false
+	}


looks like this function is only called when the resource snapshot is missing.

IIRC, we only create an empty work for enveloped cases. I was thinking of removing it.

> looks like this function is only called when the resource snapshot is missing.

Yeah, it's possible when the rollout controller updates the binding with a resourceSnapshot that has already been deleted by the time the work-generator queries it.

> IIRC, we only create an empty work for enveloped cases. I was thinking of removing it.

I validated it in my fleet. We create the empty work even in the normal case:

kubectl get work crp-empty-work -n fleet-member-aks-member-5 -o yaml

apiVersion: placement.kubernetes-fleet.io/v1
kind: Work
metadata:
  annotations:
    kubernetes-fleet.io/parent-cluster-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-snapshot-name: crp-empty-0-snapshot
  creationTimestamp: "2025-12-16T08:04:43Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 1
  labels:
    kubernetes-fleet.io/parent-CRP: crp-empty
    kubernetes-fleet.io/parent-resource-binding: crp-empty-aks-member-5-30b4685b
    kubernetes-fleet.io/parent-resource-snapshot-index: "0"
  name: crp-empty-work
  namespace: fleet-member-aks-member-5
  resourceVersion: "440189780"
  uid: ce9960fc-f4e0-4b3f-b46c-846e0fb9c8ea
spec:
  applyStrategy:
    comparisonOption: PartialComparison
    type: ClientSideApply
    whenToApply: Always
    whenToTakeOver: Always
  workload: {}
status:
  conditions:
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All the specified manifests have been applied
    observedGeneration: 1
    reason: AllManifestsApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All of the applied manifests are available
    observedGeneration: 1
    reason: AllManifestsAvailable
    status: "True"
    type: Available

Yeah, we thought about removing this behavior a few times, but then we would have to handle this case specially in multiple controllers. I prefer to keep it internally.

The original complaint was that it's not obvious from the CRP condition when nothing is selected. We can improve the external user experience and messages separately.

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>

fix: bug fixes in rollout controller and work-generator

3065b39

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>

zhiying-lin marked this pull request as draft December 15, 2025 12:47

zhiying-lin marked this pull request as ready for review December 16, 2025 02:12

ryanzhang-oss reviewed Dec 16, 2025


zhiying-lin added 2 commits December 16, 2025 16:09

address comments

ca29c4b

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>

fix the comment

47fb1dd

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>
