fix(reporter):added idempotency gate to prevent API server flooding #263
LightCreator1007 wants to merge 6 commits into
Hey @LightCreator1007, the PR looks good, but the diff only touches Also, a comment noting that skipping the write stops refreshing the heartbeat (so it no longer signals that the reporter is alive) would be good. |
Yes! Thanks for the help!! |
|
I ran the reporter locally with a 10s check interval to track how many writes actually went to the node. From my testing, it's one write per tick (10s interval) per node. For a component that exists to periodically report node health, that's exactly what you would expect. So, in my opinion, one write per tick is not a problem that needs optimising away. Is there a specific scale scenario motivating this? |
Hi @Priyankasaggu11929, thanks for testing this! While 1 update per 10 seconds is fine locally, the project targets a scale of, say, 5,000 nodes. At 5,000 nodes, a 10-second interval would create 500 requests per second hitting the server just for health reporting. Processing this constant stream of "no change" updates may quickly overwhelm the API server. To prevent this, the reporter should cache its state locally and only send an update to the API server when its health status actually changes. |
|
@LightCreator1007, thanks for the response. To be clear, I'm trying to make sure I understand the actual problem.
500 req/s at 5000 nodes with a 10s interval: this rate is well within the normal baseline and not something I'd frame as API server flooding.
There's no local caching. Even with this PR, we are hitting the apiserver with the same number of GET requests every tick.
The number of requests/second is not actually the problem.
For this reason ^, defaulting to a 5-minute forced heartbeat is fine; it aligns with other controllers as well (like Node Problem Detector).
One request though: make the heartbeat period configurable.
cc: @ajaysundark |
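The figures in this exchange can be checked with a few lines of arithmetic. The sketch below is illustrative only: `requestsPerSecond` is a hypothetical helper, and it assumes the best case where every node stays healthy so the only writes are the forced heartbeats. It compares the write rate at one write per 10s tick against one write per 5-minute heartbeat, and notes that the GET rate is the same in both cases.

```go
package main

import "fmt"

// requestsPerSecond returns the steady-state request rate when each of
// `nodes` reporters issues one request every `periodSeconds` seconds.
// Hypothetical helper for illustration; it is not part of the PR.
func requestsPerSecond(nodes int, periodSeconds float64) float64 {
	return float64(nodes) / periodSeconds
}

func main() {
	const nodes = 5000

	// One write per node on every 10s tick (behaviour without the gate).
	perTick := requestsPerSecond(nodes, 10)

	// One write per node per 5-minute heartbeat (all nodes assumed stable).
	heartbeatOnly := requestsPerSecond(nodes, 300)

	fmt.Printf("writes/s, every tick:     %.1f\n", perTick)
	fmt.Printf("writes/s, heartbeat only: %.1f\n", heartbeatOnly)

	// The reporter still reads the node on every tick, so reads do not drop.
	fmt.Printf("GETs/s, either way:       %.1f\n", perTick)
}
```

Under these assumptions the write rate falls from 500 per second to roughly 17 per second, while the read rate stays at 500 per second, which is consistent with the point that the request rate itself is not what the gate changes.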
|
/ok-to-test |
|
Hello @Priyankasaggu11929,
Thank you for correcting me there. I overlooked the fact that it still sends a GET request.
I was too fixated on the API request rate to realize that the actual problem would be the etcd storage, which is the bottleneck this PR will help with. I was completely unaware of this, so thanks a lot for pointing that out :) I have made the requested changes and have made the heartbeat period configurable, with a default fallback of 5 minutes as suggested. |
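A minimal sketch of what a configurable heartbeat period with a 5-minute fallback could look like. Only the `HEARTBEAT_PERIOD` variable name comes from the diff; `defaultHeartbeatPeriod`, `parseHeartbeatPeriod`, the Go duration-string format (for example `90s` or `10m`), and the choice to fall back on invalid input are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	// envHeartbeatPeriod matches the constant shown in the diff.
	envHeartbeatPeriod = "HEARTBEAT_PERIOD"
	// defaultHeartbeatPeriod is a hypothetical name for the 5-minute fallback.
	defaultHeartbeatPeriod = 5 * time.Minute
)

// parseHeartbeatPeriod returns the duration encoded in raw, or the default
// when raw is empty, malformed, or not strictly positive.
func parseHeartbeatPeriod(raw string) time.Duration {
	if raw == "" {
		return defaultHeartbeatPeriod
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultHeartbeatPeriod
	}
	return d
}

func main() {
	period := parseHeartbeatPeriod(os.Getenv(envHeartbeatPeriod))
	fmt.Println("heartbeat period:", period)
}
```

Silently falling back keeps the reporter running on a bad value; a real implementation might prefer to log the parse error or exit instead.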
    envCheckInterval = "CHECK_INTERVAL"
    envImpersonateNode = "IMPERSONATE_NODE"
    envHeartbeatPeriod = "HEARTBEAT_PERIOD"
    defaultCheckInterval = 30 * time.Second
If the intent is to reduce the frequency of readiness updates, why can't that be done by setting a different check interval?
Could we first clarify the concept of 'heart-beat'? Would there be a case where the component's health is checked often but the API server is not told that it has degraded?
> If the intent is to reduce the frequency of readiness updates, why can't that be done by setting a different check interval?
A different check interval (in this case a larger one) would mean that if the node status changes, we won't be able to detect it immediately. We need to check the node status frequently for fast detection, but if the state stays stable/healthy for a long time, we may want to skip writing that identical state to the API server on every tick. Skipping those redundant writes prevents etcd write amplification.
> Could we first clarify the concept of 'heart-beat'?
The heartbeat here is a liveness proof. It ensures that if the component stays perfectly healthy for a long time, we still write an update every 5 minutes just to bump the LastHeartbeatTime, proving to the API server that the reporter hasn't crashed.
> Would there be a case where the component's health is checked often but the API server is not told that it has degraded?
I do not think there is a case where a degraded state would be missed or delayed. If the health check degrades (Status, Reason, or Message changes), the idempotency gate opens immediately and the API server is updated on that exact tick.
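The gate described in these replies can be sketched as a pure decision function. This is an illustration under stated assumptions, not the PR's code: `condition` is a simplified stand-in for the Kubernetes node condition type, and `needsUpdate`, `defaultHeartbeatPeriod`, and `demoStart` are hypothetical names. The existing condition is assumed to come from the node object fetched on each tick, which matches the earlier observation that there is no local caching.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for a node condition (assumption).
type condition struct {
	Status, Reason, Message string
	LastHeartbeatTime       time.Time
}

// Hypothetical default and a fixed start time used only by the demo.
const defaultHeartbeatPeriod = 5 * time.Minute

var demoStart = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// needsUpdate reports whether the desired condition must be written:
// when no condition exists yet, when Status, Reason, or Message differ,
// or when the last heartbeat is at least `heartbeat` old.
func needsUpdate(existing *condition, desired condition, now time.Time, heartbeat time.Duration) bool {
	if existing == nil {
		return true
	}
	if existing.Status != desired.Status ||
		existing.Reason != desired.Reason ||
		existing.Message != desired.Message {
		return true // state changed: the gate opens on this tick
	}
	// Unchanged state: write only to refresh the liveness heartbeat.
	return now.Sub(existing.LastHeartbeatTime) >= heartbeat
}

func main() {
	healthy := condition{Status: "True", Reason: "Healthy", Message: "ok", LastHeartbeatTime: demoStart}
	degraded := condition{Status: "False", Reason: "Degraded", Message: "check failed"}

	soon := demoStart.Add(30 * time.Second)
	later := demoStart.Add(defaultHeartbeatPeriod)

	fmt.Println("unchanged, 30s later:", needsUpdate(&healthy, healthy, soon, defaultHeartbeatPeriod))
	fmt.Println("degraded, 30s later: ", needsUpdate(&healthy, degraded, soon, defaultHeartbeatPeriod))
	fmt.Println("unchanged, 5m later: ", needsUpdate(&healthy, healthy, later, defaultHeartbeatPeriod))
}
```

Because the comparison covers Status, Reason, and Message, any degradation bypasses the heartbeat timer entirely, so detection latency is still bounded by the check interval rather than the heartbeat period.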
Description
Adds an idempotency gate to updateNodeCondition in the readiness-condition-reporter. UpdateStatus is now bypassed if the Status, Reason, and Message are unchanged, preventing unnecessary API server flooding and etcd write amplification on every tick.
Related Issue
NONE
Type of Change
/kind bug
Testing
main_test.go locally to assert against the fake.Clientset's tracked actions, confirming that the UpdateStatus call is definitively bypassed when the condition state is unchanged.
Checklist
make test passes
make lint passes
Does this PR introduce a user-facing change?