Skip to content

Conversation

@zhiying-lin
Copy link
Collaborator

@zhiying-lin zhiying-lin commented Dec 15, 2025

Description of your changes

fix the 1MB test failure, https://github.com/kubefleet-dev/kubefleet/actions/runs/20092594675/job/57649235364

we made the wrong assumption when getting the resourceSnapshot master, https://github.com/kubefleet-dev/kubefleet/blob/main/pkg/controllers/workgenerator/controller.go#L459-L471 we used the cached client, so the master resourceSnapshot is not found.

The inconsistency could happen whenever rollout controller rollouts new changes.

if areAllWorkSynced(existingWorks, resourceBinding, resourceOverrideSnapshotHash, clusterResourceOverrideSnapshotHash) {
				klog.V(2).InfoS("All the works are synced with the resourceBinding even if the resource snapshot index is removed", "resourceBinding", resourceBindingRef)
				return true, updateAny.Load(), nil
			}

The existing work is empty. So it returned true and binding was updated as available and applied, which was wrong.

The fix is to use the cached client for both rolllout controller and work-generator when querying the resourceSnapshot.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Added unit tests and ran the e2e tests multiple times.

Special notes for your reviewer

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>
@zhiying-lin zhiying-lin marked this pull request as draft December 15, 2025 12:47
@codecov
Copy link

codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@zhiying-lin zhiying-lin marked this pull request as ready for review December 16, 2025 02:12
Comment on lines +677 to +681
// Even for the case where the resource snapshot has no selected resources,
// there should be one work created for the empty resource list.
if len(existingWorks) == 0 {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this function is only called when the resource snapshot is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we only create an empty work for eveloped cases. I was thinking of removing it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this function is only called when the resource snapshot is missing.

yeah, it's possible when rollout controller updates the binding using the resourceSnapshot while it is deleted when work-generator queries this snapshot.

IIRC, we only create an empty work for eveloped cases. I was thinking of removing it

I validated it in my fleet.
We'll create the empty work even for the normal case,

 kubectl get work crp-empty-work -n fleet-member-aks-member-5 -o yaml
apiVersion: placement.kubernetes-fleet.io/v1
kind: Work
metadata:
  annotations:
    kubernetes-fleet.io/parent-cluster-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-snapshot-name: crp-empty-0-snapshot
  creationTimestamp: "2025-12-16T08:04:43Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 1
  labels:
    kubernetes-fleet.io/parent-CRP: crp-empty
    kubernetes-fleet.io/parent-resource-binding: crp-empty-aks-member-5-30b4685b
    kubernetes-fleet.io/parent-resource-snapshot-index: "0"
  name: crp-empty-work
  namespace: fleet-member-aks-member-5
  resourceVersion: "440189780"
  uid: ce9960fc-f4e0-4b3f-b46c-846e0fb9c8ea
spec:
  applyStrategy:
    comparisonOption: PartialComparison
    type: ClientSideApply
    whenToApply: Always
    whenToTakeOver: Always
  workload: {}
status:
  conditions:
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All the specified manifests have been applied
    observedGeneration: 1
    reason: AllManifestsApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All of the applied manifests are available
    observedGeneration: 1
    reason: AllManifestsAvailable
    status: "True"
    type: Available

yeah, we was thinking of removing this behavior few times, but we have to specially handle this case in multiple controllers. prefer to keep it internally.

The original complains was that it's not obvious from the CRP condition when selecting nothing. We can improve the external user experience/messages separately.

Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>
Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants