fix: bug fixes in rollout controller and work-generator #379
Signed-off-by: Zhiying Lin <zhiyingl456@gmail.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
// Even for the case where the resource snapshot has no selected resources,
// there should be one work created for the empty resource list.
if len(existingWorks) == 0 {
	return false
}
Looks like this function is only called when the resource snapshot is missing.
IIRC, we only create an empty work for enveloped cases. I was thinking of removing it.
> looks like this function is only called when the resource snapshot is missing.
Yeah, it's possible: the rollout controller updates the binding using a resourceSnapshot, and that snapshot may already be deleted by the time the work-generator queries it.
> IIRC, we only create an empty work for enveloped cases. I was thinking of removing it
I validated it in my fleet. We create the empty work even in the normal case:
kubectl get work crp-empty-work -n fleet-member-aks-member-5 -o yaml
apiVersion: placement.kubernetes-fleet.io/v1
kind: Work
metadata:
  annotations:
    kubernetes-fleet.io/parent-cluster-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-override-snapshot-hash: 74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b
    kubernetes-fleet.io/parent-resource-snapshot-name: crp-empty-0-snapshot
  creationTimestamp: "2025-12-16T08:04:43Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 1
  labels:
    kubernetes-fleet.io/parent-CRP: crp-empty
    kubernetes-fleet.io/parent-resource-binding: crp-empty-aks-member-5-30b4685b
    kubernetes-fleet.io/parent-resource-snapshot-index: "0"
  name: crp-empty-work
  namespace: fleet-member-aks-member-5
  resourceVersion: "440189780"
  uid: ce9960fc-f4e0-4b3f-b46c-846e0fb9c8ea
spec:
  applyStrategy:
    comparisonOption: PartialComparison
    type: ClientSideApply
    whenToApply: Always
    whenToTakeOver: Always
  workload: {}
status:
  conditions:
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All the specified manifests have been applied
    observedGeneration: 1
    reason: AllManifestsApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2025-12-16T08:04:43Z"
    message: All of the applied manifests are available
    observedGeneration: 1
    reason: AllManifestsAvailable
    status: "True"
    type: Available
Yeah, we have thought about removing this behavior a few times, but then we would have to handle this case specially in multiple controllers. I prefer to keep it internally.
The original complaint was that it's not obvious from the CRP conditions when nothing is selected. We can improve the external user experience/messages separately.
Description of your changes
Fix the 1MB test failure: https://github.com/kubefleet-dev/kubefleet/actions/runs/20092594675/job/57649235364
We made a wrong assumption when getting the master resourceSnapshot (https://github.com/kubefleet-dev/kubefleet/blob/main/pkg/controllers/workgenerator/controller.go#L459-L471): we used the cached client, so the master resourceSnapshot was not found.
The inconsistency can happen whenever the rollout controller rolls out new changes.
The existing works were empty, so the check returned true and the binding was marked as available and applied, which was wrong.
The fix is to use the cached client for both the rollout controller and the work-generator when querying the resourceSnapshot.
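The failure mode described above is a vacuous-truth problem: a check that loops over zero works never finds a failing one. Below is a minimal, self-contained sketch of that idea and of the `len(existingWorks) == 0` guard from the diff. The `work` type and both function names are hypothetical stand-ins for illustration; they are not the actual work-generator code.

```go
package main

import "fmt"

// work is a hypothetical stand-in for a Work object; only the two
// condition flags matter for this sketch.
type work struct {
	name      string
	applied   bool
	available bool
}

// allReadyUnguarded reports whether every work is applied and available.
// With an empty slice the loop body never runs, so it returns true.
func allReadyUnguarded(existingWorks []work) bool {
	for _, w := range existingWorks {
		if !w.applied || !w.available {
			return false
		}
	}
	return true
}

// allReadyGuarded adds the guard shown in the diff: since at least one work
// is expected even for an empty resource list, zero works means "not ready".
func allReadyGuarded(existingWorks []work) bool {
	if len(existingWorks) == 0 {
		return false
	}
	return allReadyUnguarded(existingWorks)
}

func main() {
	var none []work
	one := []work{{name: "crp-empty-work", applied: true, available: true}}

	fmt.Println(allReadyUnguarded(none)) // true: vacuously satisfied
	fmt.Println(allReadyGuarded(none))   // false: no works found yet
	fmt.Println(allReadyGuarded(one))    // true: the single work is ready
}
```

The guard relies on the invariant validated in the review thread, namely that an empty work is created even when the snapshot selects no resources.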
Fixes #
I have:
Run make reviewable to ensure this PR is ready for review.
How has this code been tested
Added unit tests and ran the e2e tests multiple times.
Special notes for your reviewer