fix: skip NodeEvaluation upsert when evaluation did not run #218

Open
sahitya-chandra wants to merge 2 commits into kubernetes-sigs:main from sahitya-chandra:fix/empty-nodeevaluation-status-patch

Conversation

@sahitya-chandra
Contributor

@sahitya-chandra sahitya-chandra commented May 6, 2026

Description

The bug: When adding or removing a node's taint fails with a non-retryable error (for example an RBAC denial or a persistent API error), evaluateRuleForNode returns early without recording an evaluation for that node. processNodeAgainstAllRules still tries to save a NodeEvaluation for it. If the node has no cached NodeEvaluation, the value it saves is empty. Every required field on that type uses omitempty, so the empty value serializes to {} and the API server rejects the entire status patch with a 422 (Required value for nodeName, conditionResults, taintStatus, and lastEvaluationTime). The FailedNodes update rides in that same patch, so it is dropped too. The failure then never appears in kubectl get nodereadinessrule -o yaml, and the only trace is a controller log line. If a cached entry does exist, the old code avoids the empty object but can still overwrite fresher persisted status with stale cached evaluation data.
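To illustrate the serialization step described above, here is a minimal sketch using only the Go standard library. The NodeEvaluation and ConditionResult structs are simplified stand-ins, not the project's actual API types: the JSON field names come from the 422 message quoted above, while the Go field types and the ConditionResult shape are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ConditionResult is a hypothetical, simplified condition entry.
type ConditionResult struct {
	Type   string `json:"type,omitempty"`
	Status string `json:"status,omitempty"`
}

// NodeEvaluation is a simplified stand-in for the CRD status type.
// Every field carries omitempty, mirroring the situation in the PR description.
type NodeEvaluation struct {
	NodeName           string            `json:"nodeName,omitempty"`
	ConditionResults   []ConditionResult `json:"conditionResults,omitempty"`
	TaintStatus        string            `json:"taintStatus,omitempty"`
	LastEvaluationTime *time.Time        `json:"lastEvaluationTime,omitempty"`
}

// marshalEvaluation returns the JSON form of an evaluation as a string.
func marshalEvaluation(e NodeEvaluation) string {
	b, err := json.Marshal(e)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// The zero value loses every field, so required-field validation
	// on the server side has nothing to accept.
	fmt.Println(marshalEvaluation(NodeEvaluation{})) // prints {}

	// A populated value keeps the fields that are set.
	fmt.Println(marshalEvaluation(NodeEvaluation{NodeName: "node-a", TaintStatus: "Absent"}))
}
```

Because omitempty drops empty strings, nil slices, and nil pointers, the zero value is indistinguishable on the wire from an object with no fields at all.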

The fix: Save the NodeEvaluation only when the evaluation succeeded. On the failure path, leave any existing NodeEvaluation untouched. Since FailedNodes rides in the same Status().Patch() call, omitting the invalid empty NodeEvaluation prevents the 422 that was silently dropping it.

Related Issue

Fixes #217

Type of Change

/kind bug

Testing

  • make test passes locally
  • make lint passes locally
  • Regression test 1: the test uses a fake client that fails Patch on the node, runs NodeReconciler.Reconcile, and asserts that the FailedNodes entry lands, that no empty NodeEvaluation slips in, and that an unrelated pre-existing NodeEvaluation is preserved. Without the fix, the test fails on the empty-NodeName assertion.
  • Regression test 2: covers the stale-cache case. The persisted status has TaintStatus=Absent, the cached rule snapshot has a stale TaintStatus=Present, evaluation fails, and the persisted entry must stay Absent.

Checklist

  • make test passes
  • make lint passes

Does this PR introduce a user-facing change?

Fix a bug where a non-retryable taint patch failure on a node could drop the corresponding FailedNodes entry from a NodeReadinessRule's status, or overwrite a fresh persisted NodeEvaluation with a stale one from the controller's rule cache.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2026
@netlify

netlify Bot commented May 6, 2026

Deploy Preview for node-readiness-controller canceled.

Name Link
🔨 Latest commit 169d820
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/6a03ec32018e130007a24a93

@k8s-ci-robot k8s-ci-robot requested a review from ajaysundark May 6, 2026 02:30
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sahitya-chandra
Once this PR has been reviewed and has the lgtm label, please assign ajaysundark for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from haircommander May 6, 2026 02:30
@k8s-ci-robot
Contributor

Hi @sahitya-chandra. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test, size/L, cncf-cla: yes, and needs-rebase labels May 6, 2026
processNodeAgainstAllRules unconditionally wrote a NodeEvaluation entry
for the node it just processed, even when evaluateRuleForNode returned
an error before updateNodeEvaluationStatus could populate one. The
zero-value NodeEvaluation either clobbered a valid prior entry or
appended one with an empty NodeName. The CRD requires NodeName
MinLength=1, so the API server rejected the whole status patch with
422, and the FailedNodes update bundled into the same patch was lost
along with it.

Skip the upsert when the in-memory rule has no evaluation for this
node, and let the FailedNodes update through on its own. Add a
regression test that fails Patch on the node, asserts FailedNodes is
recorded, and asserts no empty NodeEvaluation slips into status.
@sahitya-chandra sahitya-chandra force-pushed the fix/empty-nodeevaluation-status-patch branch from 339a0ac to 169d820 Compare May 13, 2026 03:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
@sahitya-chandra
Contributor Author

/cc @AvineshTripathi

Could you please take a look at this PR when you have some free time? It has been open for quite a while now.

Comment on lines +173 to +176
// Upsert the node's evaluation only after a successful evaluation.
// On the failure path evaluateRuleForNode returns before recording a
// fresh NodeEvaluation, so this must leave any existing persisted
// evaluation untouched and only persist FailedNodes below.
Contributor

I view this as more of a UX improvement than a bug fix:

  1. Currently, if evaluateRuleForNode fails for a node, the controller patches the rule status with NodeEvaluation{}, erasing the previous eval result if one exists. This change prevents that, but at the risk of leaving a stale result for that rule.
  2. We already capture node failures in rule.status.failedNodes, which should reflect recent evaluation failures for the node.

We have discussed moving the status object to a separate 'NodeReadinessEvaluation' CRD (in v1alpha2) to capture per-rule results. This flow could look like:

processNodeAgainstAllRules {
   NodeReadinessEvaluationCR {
    // record result per rule
   }
}

instead of the double looping and x-updates we are currently doing here.

Assuming we move to a 1:1 evaluation result for each Node evaluated by NodeReadinessController, how would the rule evaluation be structured, and how should we record per-rule failures in it?

@Karthik-K-N I think we should prioritize evaluating this to establish cohesive observability for NRC, ideally aligning it early with the mentorship timeline, since that has some observability focus.

Contributor

If we have multiple rules per node and one of the rule evaluations fails, we capture it in the ReadinessReport: https://github.com/kubernetes-sigs/node-readiness-controller/pull/133/changes#r2857725522

Contributor Author

@sahitya-chandra sahitya-chandra May 22, 2026

Thanks @ajaysundark and @Karthik-K-N, I just went through #133, and the NodeReadinessRuleReport (nrrp) CRD makes sense to me. I agree a per-node report keyed by ruleName is cleaner for these results than the NodeEvaluations[] list inside the rule status, so I'm happy to align with that.

One small thing on the "UX or bug" point: I think it is a bit more than UX today. On the failure path, when the node has no prior cached NodeEvaluation (i.e. on its first evaluation), currEval is the zero value, so the patch carries NodeEvaluation{}. Its required fields are omitempty, so it serializes to {} and the apiserver rejects the whole Status().Patch with a 422. failedNodes is in that same patch, so it gets dropped too. On that path the operator sees nothing, not even failedNodes, just a log line.

I checked this with envtest against the apiserver: on main the failed node ends with failedNodes empty, and with this change it persists. So I feel failedNodes only really captures the failure once we stop including the empty eval in the patch. The node evaluations and failedNodes still go out in the same Status().Patch() call, but without the invalid {} entry the 422 no longer fires.
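The all-or-nothing behavior described above can be modeled with a small stand-alone sketch. It does not talk to an apiserver: validate and applyPatch are hypothetical helpers that only mimic required-field validation on nodeName, and the types are simplified stand-ins for the rule status.

```go
package main

import "fmt"

// NodeEvaluation and RuleStatus are simplified, hypothetical stand-ins.
type NodeEvaluation struct {
	NodeName    string
	TaintStatus string
}

type RuleStatus struct {
	NodeEvaluations []NodeEvaluation
	FailedNodes     []string
}

// validate mimics required-field validation: any entry with an
// empty NodeName makes the whole status invalid.
func validate(s RuleStatus) error {
	for i, e := range s.NodeEvaluations {
		if e.NodeName == "" {
			return fmt.Errorf("nodeEvaluations[%d].nodeName: Required value", i)
		}
	}
	return nil
}

// applyPatch models an atomic status write: if validation fails,
// nothing from desired is persisted, including FailedNodes.
func applyPatch(persisted, desired RuleStatus) (RuleStatus, error) {
	if err := validate(desired); err != nil {
		return persisted, err
	}
	return desired, nil
}

func main() {
	persisted := RuleStatus{}

	// Old behavior: the empty evaluation travels with FailedNodes and sinks both.
	withEmpty := RuleStatus{
		NodeEvaluations: []NodeEvaluation{{}},
		FailedNodes:     []string{"node-a"},
	}
	got, err := applyPatch(persisted, withEmpty)
	fmt.Println(got.FailedNodes, err)

	// New behavior: the empty evaluation is left out, so FailedNodes persists.
	withoutEmpty := RuleStatus{FailedNodes: []string{"node-a"}}
	got, err = applyPatch(persisted, withoutEmpty)
	fmt.Println(got.FailedNodes, err)
}
```

In the first call the rejected write leaves FailedNodes empty, which matches the envtest observation on main; in the second, the same FailedNodes entry is persisted because nothing invalid accompanies it.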

On the stale result point, I agree. The idea here is just to keep the last good eval instead of erasing it, but I'm fine with dropping the eval on failure instead if you prefer.

Either way, I'm happy to keep this as a small v1alpha1 fix and leave the bigger change to the report work :)

Labels

  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • needs-ok-to-test - Indicates a PR that requires an org member to verify it is safe to test.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

processNodeAgainstAllRules can write an empty NodeEvaluation that fails CRD validation and drops FailedNodes updates

4 participants