fix(reporter):added idempotency gate to prevent API server flooding by LightCreator1007 · Pull Request #263 · kubernetes-sigs/node-readiness-controller

LightCreator1007 · 2026-05-20T14:45:47Z

Description

Adds an idempotency gate to updateNodeCondition in the readiness-condition-reporter.
UpdateStatus is now bypassed if the Status, Reason, and Message are unchanged, preventing unnecessary API server flooding and etcd write amplification on every tick.

Related Issue

NONE

Type of Change

/kind bug

Testing

Modified main_test.go locally to assert against the fake.Clientset's tracked actions, confirming that the UpdateStatus call is definitively bypassed when the condition state is unchanged.

Checklist

make test passes
make lint passes

Does this PR introduce a user-facing change?

NONE

netlify · 2026-05-20T14:45:56Z

✅ Deploy Preview for node-readiness-controller canceled.

Name	Link
🔨 Latest commit	`cb80ba2`
🔍 Latest deploy log	https://app.netlify.com/projects/node-readiness-controller/deploys/6a14893a0a928600084f9145

k8s-ci-robot · 2026-05-20T14:45:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LightCreator1007
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-05-20T14:45:59Z

Hi @LightCreator1007. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

sahitya-chandra · 2026-05-20T16:18:29Z

Hey @LightCreator1007, PR looks good but the diff only touches main.go, so the skip path isn't covered. Could you add a test where the node already has the matching condition and assert no update is sent?

Also, a comment noting that skipping the write stops refreshing the heartbeat (so it no longer signals the reporter is alive), would be good

LightCreator1007 · 2026-05-21T10:19:11Z

skip path isn't covered. Could you add a test where the node already has the matching condition and assert no update is sent

Yes!
I have added the same in the newer commits to cover both the idempotency and the 5 min refresh cases :)

Thanks for the help!!

Priyankasaggu11929 · 2026-05-21T13:53:32Z

I ran the reporter locally with a 10s check interval, to track how many writes actually went to the node.

from my testing, it's one write, per tick (10s interval), per node.

For a component that exists to periodically report node health, that's exactly what you would expect.

So, in my opinion, it is not a problem to optimise away with 1 write/tick. Is there a specific scale scenario motivating this?

LightCreator1007 · 2026-05-22T09:07:11Z

I ran the reporter locally with a 10s check interval, to track how many writes actually went to the node.

from my testing, it's one write, per tick (10s interval), per node.

For a component that exists to periodically report node health, that's exactly what you would expect.

So, in my opinion, it is not a problem to optimise away with 1 write/tick. Is there a specific scale scenario motivating this?

Hi @Priyankasaggu11929 , thanks for testing this!

While 1 update per 10 seconds is fine locally, the project targets a scale of say 5,000 nodes.

At 5,000 nodes, a 10-second interval would create 500 requests per second hitting the server just for health reporting. Processing this constant stream of "no change" updates may quickly overwhelm the API server.

To prevent this, the reporter should cache its state locally and only send an update to the API server when its health status actually changes.

Priyankasaggu11929 · 2026-05-25T06:31:33Z

@LightCreator1007, thanks for the response.

To be clear, I'm trying to make sure I understand the actual problem.

At 5,000 nodes, a 10-second interval would create 500 requests per second hitting the server just for health reporting. Processing this constant stream of "no change" updates may quickly overwhelm the API server.

500 req/s at 5000 nodes with 10s interval - this rate is well within normal baseline and not something I'd frame as API server flooding.

To prevent this, the reporter should cache its state locally and only send an update to the API server when its health status actually changes.

There's no local caching. Even with this PR, we are hitting apiserver with same amount of get requests every tick.
The state comparision is against the API server response, not a local cache.

The number of requests/second is not the problem actually.
It is what each write is actually adding to etcd storage. etcd stores the full node object on every write, so at scale, even updating a single condition field writes the entire node status blob.
The changes in the PR will help with this bit (although we should check what that number actually looks like in our scalability testing cc: @Karthik-K-N)

For this reason ^, defaulting to a 5-minute forced heartbeat is fine, it aligns with other controllers as well (like Node Problem Detector)

One request though - make the heartbeat period configurable.
The 5-minute default is ok, but people may still have valid reasons to sync earlier, and hardcoding it removes that option.
(For example - LastHeartbeatTime is the only signal that the reporter is still alive and so having an option to allow people to tune how long a dead reporter goes undetected, is good.)

cc: @ajaysundark

Priyankasaggu11929 · 2026-05-25T06:50:45Z

/ok-to-test

…cases

LightCreator1007 · 2026-05-25T18:24:49Z

hello @Priyankasaggu11929 ,
Thank you for taking out the time for going through my PR!

There's no local caching. Even with this PR, we are hitting apiserver with same amount of get requests every tick. The state comparision is against the API server response, not a local cache.

Thank you for correcting me there, I overlooked the fact that it still does send a GET request.

The number of requests/second is not the problem actually.
It is what each write is actually adding to etcd storage. etcd stores the full node object on every write, so at scale, even updating a single condition field writes the entire node status blob.

I was too fixated on the API request rate to realize that the actual problem would be the etcd storage, which is the bottleneck this PR will help out with. I was completely unaware of this, so thanks a lot for pointing that out :)

I have made the requested changes and have made the heartbeat time configurable with a default fallback to 5min as suggested.

ajaysundark · 2026-05-26T02:19:09Z

+	envCheckInterval       = "CHECK_INTERVAL"
+	envImpersonateNode     = "IMPERSONATE_NODE"
+	envHeartbeatPeriod     = "HEARTBEAT_PERIOD"
+	defaultCheckInterval   = 30 * time.Second


if the intent is to reduce the frequency of readiness updates, why it cannot be set with a different check interval?

Could we first clarify the concept of 'heart-beat'? Would there be a case when the component health check be checked often but not update the API server know if it degraded?

if the intent is to reduce the frequency of readiness updates, why it cannot be set with a different check interval?

A different check interval (in this case a larger one) would mean that if the node status changes we won't be able to detect it immediately. We need to check the node status frequently for fast detection, but if the state stays stable/healthy for a long time, we may want to skip writing that identical state to the API server on every tick. Skipping those redundant writes prevents etcd write amplification.

Could we first clarify the concept of 'heart-beat'?

The heartbeat here would be a liveness proof, It ensures that if the component stays perfectly healthy for a long time, we still write an update every 5 minutes just to bump the LastHeartbeatTime, proving to the API server that the reporter hasn't crashed.

Would there be a case when the component health check be checked often but not update the API server know if it degraded?

I do not think there is a case where a degraded state would be missed or delayed. If the health check degrades (Status, Reason, or Message changes), the idempotency gate would instantly open and the API server is updated on that exact tick.

fix(reporter):added idempotency gate to prevent API server flooding

0eb7ec2

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 20, 2026

k8s-ci-robot requested review from dchen1107 and tallclair May 20, 2026 14:45

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 20, 2026

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 20, 2026

AvineshTripathi reviewed May 20, 2026

View reviewed changes

Comment thread cmd/readiness-condition-reporter/main.go

LightCreator1007 added 2 commits May 21, 2026 15:34

updated idempotency gate to include 5min heartbeat

80cc5f5

test: add idempotency and heartbeat coverage for status updates

6bea061

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 21, 2026

fixed linting errors

ea637a9

Priyankasaggu11929 requested changes May 25, 2026

View reviewed changes

Comment thread cmd/readiness-condition-reporter/main.go

Comment thread cmd/readiness-condition-reporter/main_test.go Outdated

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 25, 2026

LightCreator1007 added 2 commits May 25, 2026 23:07

addedconfigurable heartbeat period and logging in case of skip scenario

4669574

test: extended the existing TestUpdateNodeCondition to cover the new …

cb80ba2

…cases

ajaysundark reviewed May 26, 2026

View reviewed changes

ajaysundark removed the request for review from dchen1107 May 26, 2026 02:20

ajaysundark removed the request for review from tallclair May 26, 2026 02:20

Conversation

LightCreator1007 commented May 20, 2026

Description

Related Issue

Type of Change

Testing

Checklist

Does this PR introduce a user-facing change?

Uh oh!

netlify Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for node-readiness-controller canceled.

Uh oh!

k8s-ci-robot commented May 20, 2026

Uh oh!

k8s-ci-robot commented May 20, 2026

Uh oh!

sahitya-chandra commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

LightCreator1007 commented May 21, 2026

Uh oh!

Priyankasaggu11929 commented May 21, 2026

Uh oh!

LightCreator1007 commented May 22, 2026

Uh oh!

Priyankasaggu11929 commented May 25, 2026

Uh oh!

Uh oh!

Uh oh!

Priyankasaggu11929 commented May 25, 2026

Uh oh!

LightCreator1007 commented May 25, 2026

Uh oh!

ajaysundark May 26, 2026

Choose a reason for hiding this comment

Uh oh!

LightCreator1007 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

netlify Bot commented May 20, 2026 •

edited

Loading

sahitya-chandra commented May 20, 2026 •

edited

Loading