
Conversation

@michaelawyu
Collaborator

Description of your changes

As a performance improvement, set the work applier to skip status updates if possible.

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • Integration tests

Special notes for your reviewer

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
@michaelawyu
Collaborator Author

Note: as anticipated, after this PR the status update logic in the work applier reaches the linter's code complexity limit. To unblock progress, that specific method is now exempted from the linter; separate PRs will follow to refactor it.
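
(For reference, the exemption is just a lint directive along the lines of the sketch below; the method, type, and linter names here are placeholders, not the actual code.)

```go
package applier

// workStatusReconciler is a placeholder type for illustration only.
type workStatusReconciler struct{}

// refreshWorkStatus stands in for the exempted status update method.
//
//nolint:gocyclo // Known complexity hot spot; a refactor is planned in follow-up PRs.
func (r *workStatusReconciler) refreshWorkStatus() error {
	// ...the long status update logic lives here...
	return nil
}
```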

@michaelawyu
Collaborator Author

This PR precedes #118.

Contributor

@ryanzhang-oss left a comment


How can we make sure that a work's status in the cache is actually the same as in etcd? If we just compare against the cache and stop updating, we run the risk of the status never getting updated to the correct state (although the chance is small).

@michaelawyu
Collaborator Author

How can we make sure that a work's status in the cache is actually the same as in etcd? If we just compare against the cache and stop updating, we run the risk of the status never getting updated to the correct state (although the chance is small).

Hi Ryan! The work applier already requeues periodically (with back-off), so unless the client-side cache becomes significantly out of sync with the API server (in which case all writes would be rejected by optimistic locking anyway), any inconsistent status will be overwritten within at most 15 minutes.
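
Roughly, the requeue flow looks like the sketch below (names and constants are illustrative, not the actual work applier code):

```go
package applier

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// maxRequeueDelay caps the exponential back-off; any status left stale by a
// skipped update is re-evaluated within this window at the latest.
const maxRequeueDelay = 15 * time.Minute

// nextRequeueDelay doubles the previous delay up to the cap.
func nextRequeueDelay(lastDelay time.Duration) time.Duration {
	next := lastDelay * 2
	if next <= 0 || next > maxRequeueDelay {
		next = maxRequeueDelay
	}
	return next
}

// requeueResult is returned at the end of a successful reconciliation pass;
// because the work is always requeued, skipping a no-op status write can never
// leave the status permanently out of date.
func requeueResult(lastDelay time.Duration) ctrl.Result {
	return ctrl.Result{RequeueAfter: nextRequeueDelay(lastDelay)}
}
```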

return controller.NewAPIServerError(false, err)

// Skip the status update if no change found.
if equality.Semantic.DeepEqual(originalStatus, &appliedWork.Status) {
Contributor


Do we have an idea of how many calls this generates?

Collaborator Author


Hi Ryan! Do you mean the number of status updates this setup will skip, or the time complexity of the DeepEqual calls?

For the former: in the target perf test environment, each member agent was generating roughly 950K status updates per 24 hours (though this number might be a bit biased because the agents were occasionally restarted).

For the latter: deep-equal comparisons are usually expensive, especially when the object is complex. I don't have specific numbers for deep-equaling Work object status data right now, but I could run a mini benchmark if you are interested.
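
Such a mini benchmark would look roughly like the sketch below (the empty status fixture and the import path are placeholders; real numbers depend on how many manifest conditions a Work carries):

```go
package applier

import (
	"testing"

	"k8s.io/apimachinery/pkg/api/equality"

	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// BenchmarkWorkStatusDeepEqual measures the cost of the semantic comparison
// used to decide whether a status update can be skipped.
func BenchmarkWorkStatusDeepEqual(b *testing.B) {
	// An empty status is a placeholder; a realistic fixture would carry a
	// representative number of manifest conditions.
	original := &fleetv1beta1.WorkStatus{}
	current := original.DeepCopy()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = equality.Semantic.DeepEqual(original, current)
	}
}
```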

}

func shouldSkipStatusUpdate(isDriftedOrDiffed, isStatusBackReportingOn bool, originalStatus, currentStatus *fleetv1beta1.WorkStatus) bool {
if isDriftedOrDiffed || isStatusBackReportingOn {
Contributor


This will reduce the effectiveness of this PR by a lot; is there any way to soften the blow a bit further?

Collaborator Author

@michaelawyu Dec 19, 2025


Hi Ryan! We might need a flag to omit observation timestamps if that becomes necessary. That said, with exponential backoff, as long as the number of drifted/diffed placements plus status back-reporting placements is low and the system does not restart often, the number of calls should be relatively limited (~96 writes per placement) once everything stabilizes.
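
Roughly speaking, the helper shown in the diff decides as in the sketch below (a simplified version, not the exact merged code; only the signature and the first branch appear in the diff above):

```go
package applier

import (
	"k8s.io/apimachinery/pkg/api/equality"

	fleetv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
)

// shouldSkipStatusUpdate reports whether the status write can be dropped.
func shouldSkipStatusUpdate(isDriftedOrDiffed, isStatusBackReportingOn bool, originalStatus, currentStatus *fleetv1beta1.WorkStatus) bool {
	// Drift/diff details and back-reported status carry observation timestamps
	// that change on every pass, so those updates are never skipped.
	if isDriftedOrDiffed || isStatusBackReportingOn {
		return false
	}
	// Otherwise skip the write when the status is semantically unchanged.
	return equality.Semantic.DeepEqual(originalStatus, currentStatus)
}
```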
