fix: skip NodeEvaluation upsert when evaluation did not run by sahitya-chandra · Pull Request #218 · kubernetes-sigs/node-readiness-controller

sahitya-chandra · 2026-05-06T02:30:29Z

Description

The bug: When adding or removing a node's taint fails with a non-retryable error (for example an RBAC denial or a persistent API error), evaluateRuleForNode returns early without recording an evaluation for that node. processNodeAgainstAllRules still tries to save a NodeEvaluation for it. If the node has no cached NodeEvaluation, the value it saves is empty. Every required field on that type uses omitempty, so the empty value serializes to {} and the API server rejects the entire status patch with a 422 (Required value for nodeName, conditionResults, taintStatus, and lastEvaluationTime). The FailedNodes update rides in that same patch, so it is dropped too. The failure then never appears in kubectl get nodereadinessrule -o yaml, and the only trace is a controller log line. If a cached entry does exist, the old code avoids the empty object but can still overwrite fresher persisted status with stale cached evaluation data.
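
To illustrate the serialization step described above, here is a minimal sketch. The NodeEvaluation type below is a simplified stand-in, not the project's real type: the JSON field names come from the 422 message, while the Go field types and the marshalEvaluation helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeEvaluation is a simplified stand-in for the CRD status type.
// Only the JSON names are taken from the 422 message; the Go types
// are assumptions chosen to keep the sketch self-contained.
type NodeEvaluation struct {
	NodeName           string   `json:"nodeName,omitempty"`
	ConditionResults   []string `json:"conditionResults,omitempty"`
	TaintStatus        string   `json:"taintStatus,omitempty"`
	LastEvaluationTime string   `json:"lastEvaluationTime,omitempty"`
}

// marshalEvaluation (hypothetical helper) returns the JSON that a
// status patch would carry for the given evaluation.
func marshalEvaluation(e NodeEvaluation) string {
	b, err := json.Marshal(e)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Zero value: every field is omitted, leaving an empty object.
	fmt.Println(marshalEvaluation(NodeEvaluation{}))
	// Populated value: only the non-empty fields are emitted.
	fmt.Println(marshalEvaluation(NodeEvaluation{NodeName: "node-a", TaintStatus: "Absent"}))
}
```

Because omitempty drops every zero-valued field, the empty evaluation reaches the API server as an object with none of its required fields, which is the condition the 422 reports.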

The fix: Save the NodeEvaluation only when the evaluation succeeded. On the failure path, leave any existing NodeEvaluation untouched. Since FailedNodes rides in the same Status().Patch() call, omitting the invalid empty NodeEvaluation prevents the 422 that was silently dropping it.

Related Issue

Fixes #217

Type of Change

/kind bug

Testing

make test passes locally
make lint passes locally
Regression test 1: a fake client fails Patch on the node; the test runs NodeReconciler.Reconcile and asserts that the FailedNodes entry lands, that no empty NodeEvaluation slips in, and that an unrelated pre-existing NodeEvaluation is preserved. Without the fix, the test fails on the empty-NodeName assertion.
Regression test 2 covers the stale-cache case: persisted status has TaintStatus=Absent, the cached rule snapshot has stale TaintStatus=Present, evaluation fails, and the persisted entry must stay Absent.

Checklist

make test passes
make lint passes

Does this PR introduce a user-facing change?

Fix a bug where a failed taint patch on a node could drop the corresponding FailedNodes entry from a NodeReadinessRule's status, or overwrite a fresh persisted NodeEvaluation with a stale one from the controller's rule cache.

netlify · 2026-05-06T02:30:34Z

✅ Deploy Preview for node-readiness-controller canceled.

Name	Link
🔨 Latest commit	`169d820`
🔍 Latest deploy log	https://app.netlify.com/projects/node-readiness-controller/deploys/6a03ec32018e130007a24a93

k8s-ci-robot · 2026-05-06T02:30:35Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sahitya-chandra
Once this PR has been reviewed and has the lgtm label, please assign ajaysundark for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-05-06T02:30:38Z

Hi @sahitya-chandra. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

processNodeAgainstAllRules unconditionally wrote a NodeEvaluation entry for the node it just processed, even when evaluateRuleForNode returned an error before updateNodeEvaluationStatus could populate one. The zero-value NodeEvaluation either clobbered a valid prior entry or appended one with an empty NodeName. The CRD requires NodeName MinLength=1, so the API server rejected the whole status patch with 422, and the FailedNodes update bundled into the same patch was lost along with it. Skip the upsert when the in-memory rule has no evaluation for this node, and let the FailedNodes update through on its own. Add a regression test that fails Patch on the node, asserts FailedNodes is recorded, and asserts no empty NodeEvaluation slips into status.

sahitya-chandra · 2026-05-19T14:30:21Z

/cc @AvineshTripathi

Could you please take a look at this PR when you have some free time? It has been open for quite a while now.

ajaysundark · 2026-05-21T23:48:45Z

+			// Upsert the node's evaluation only after a successful evaluation.
+			// On the failure path evaluateRuleForNode returns before recording a
+			// fresh NodeEvaluation, so this must leave any existing persisted
+			// evaluation untouched and only persist FailedNodes below.


I view this more as a UX improvement than a bug fix.

Currently, if evaluateRuleForNode fails for a node, the controller patches the rule status by overwriting the entry with NodeEvaluation{}, erasing the previous eval result if one exists. This change prevents that, but the risk is possibly leaving a stale result for that rule.

We already capture node failures in rule.status.failedNodes, which should reflect recent evaluation failures for the node.

We have discussed moving the status object to a separate 'NodeReadinessEvaluation' CRD (in v1alpha2) to capture per-rule results. This flow could look like:

processNodeAgainstAllRules { NodeReadinessEvaluationCR { // record result per rule } }

instead of the double looping and x-updates we are currently doing here.

Assuming we move to a 1:1 evaluation result for each Node evaluated by the NodeReadinessController, how would the rule evaluation be structured, and how should we record per-rule failures in it?

@Karthik-K-N I think we should prioritize evaluating this to establish cohesive observability for NRC, ideally aligning it early with the mentorship timeline, as there is some observability focus in it.

Karthik-K-N · 2026-05-22

If a node has multiple rules and one of the rule evaluations fails, we capture it in the ReadinessReport: https://github.com/kubernetes-sigs/node-readiness-controller/pull/133/changes#r2857725522

sahitya-chandra · 2026-05-22

Thanks @ajaysundark and @Karthik-K-N, I just went through #133, and the NodeReadinessRuleReport (nrrp) CRD makes sense to me. I agree a per-node report keyed by ruleName is cleaner for these results than the NodeEvaluations[] list inside the rule status, so I'm happy to align with that.

One small thing on the "UX or bug" point: I think it is a bit more than UX today. On the failure path, when the node has no prior cached NodeEvaluation (i.e. on its first evaluation), currEval is the zero value, so the patch carries NodeEvaluation{}. Its required fields are omitempty, so it serializes to {} and the apiserver rejects the whole Status().Patch with a 422. failedNodes is in that same patch, so it gets dropped too. So on that path the operator sees nothing, not even failedNodes, just a log line.

I checked this with envtest against the apiserver: on main the failed node ends up with failedNodes empty, and with this change it persists. So I feel failedNodes only really captures the failure once we stop including the empty eval in the patch. Both updates still go out in the same Status().Patch() call, but without the invalid {} entry the 422 no longer fires.

On the stale result point, I agree. The idea here is just to keep the last good eval instead of erasing it, but I'm fine with dropping the eval on failure instead if you prefer.

Either way, I'm happy to keep this as a small v1alpha1 fix and leave the bigger change to the report work :)

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2026

k8s-ci-robot requested a review from ajaysundark May 6, 2026 02:30

k8s-ci-robot requested a review from haircommander May 6, 2026 02:30

sahitya-chandra added 2 commits May 13, 2026 08:39

fix: skip stale evaluation updates on failure

169d820

sahitya-chandra force-pushed the fix/empty-nodeevaluation-status-patch branch from 339a0ac to 169d820 Compare May 13, 2026 03:12

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026

k8s-ci-robot requested a review from AvineshTripathi May 19, 2026 14:30

ajaysundark reviewed May 21, 2026

sahitya-chandra requested review from Karthik-K-N and ajaysundark May 26, 2026 03:58

sahitya-chandra wants to merge 2 commits into kubernetes-sigs:main from sahitya-chandra:fix/empty-nodeevaluation-status-patch
