Rebase to Kubernetes 1.35.1#3842
Conversation
|
/test pull-kcp-verify |
|
/retest |
a7e6911 to
bdee9e5
Compare
032aad7 to
e40997d
Compare
|
/retest |
|
/test pull-kcp-test-e2e-sharded |
1 similar comment
|
/test pull-kcp-test-e2e-sharded |
|
/test pull-kcp-test-e2e-multiple-runs |
|
/test pull-kcp-test-e2e-sharded |
1 similar comment
|
/test pull-kcp-test-e2e-sharded |
427cde2 to
4c4f54b
Compare
|
/test pull-kcp-test-e2e |
|
/retest |
|
/test pull-kcp-test-integration |
1 similar comment
|
/test pull-kcp-test-integration |
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
mjudeikis-bot
left a comment
There was a problem hiding this comment.
Review: Kubernetes 1.35.1 Rebase
Overall the rebase looks clean. Most changes are mechanical API adaptations. A few things worth calling out:
🔴 controller.go — Silent error swallowing in processLoop
The old code:
utilruntime.HandleErrorWithContext(ctx, err, "Failed to process object from queue")is removed. Non-ErrFIFOClosed errors are now silently swallowed. If the queue returns a processing error (e.g. a user-provided ProcessFunc errors out), it will be completely invisible.
The ShouldResync branch had no actual behavior (just a comment), so losing that is fine. But losing the error log is a regression — at minimum errors should still be logged.
Also: the obj == nil exit condition from the old Pop return is gone. Need to confirm upstream guarantees the new c.config.Pop path never needs the nil check (likely fine if API changed, but worth a comment).
🟡 nodedeclaredfeatures admission plugin — new default-on
nodedeclaredfeatures.PluginName added to defaultOnKubePluginsInKube. This validates node-declared feature gates. In kcp, nodes are virtual — need to confirm this plugin is a no-op when no Node objects exist, otherwise it could fail admission for node-touching requests. If kcp does not expose Node resources in virtual workspaces, this is probably fine.
🟡 processDeltasInBatch — partial state on TransactionError
err := txnStore.Transaction(txns...)
if err != nil {
for _, i := range err.SuccessfulIndices {
callbacks[i]() // only fires for successful txns
}
return fmt.Errorf(...)
}If a batch partially fails, the store has partial state but only the successful callbacks fire. Event handlers will see some adds/updates but not others. Depending on whether TransactionStore is truly atomic or partially applied, this could lead to inconsistency between store state and what handlers observed. Worth a comment or test.
✅ Things that look good
- etcd separate context (
etcdCtxin bothcmd/kcp/kcp.goandcmd/cache-server/main.go): Correct fix — etcd now outlives the apiserver during graceful shutdown, preventing CRD watch cache lazy init from blocking on a dead etcd connection. NewRealFIFOWithOptions: Mechanical API change, correct.- WatchList semantics check: Good defensive check — disables WatchList if the ListerWatcher does not support the semantics rather than silently misbehaving.
responsewriters.Forbiddenctx removal: Mechanical upstream API change, applied consistently in all call sites.MustBaseEnvSetboolean param removal: Upstream API change, correct.min-compatibility-versionflag allowlisted: New upstream flag, benign.NewServiceAccountsControllerlogger param: Upstream API change, correct.
Questions
- The
processLoopnow uses a non-blockingselect { case <-ctx.Done(): return; default: ... }— context cancellation is only checked between pops, not mid-block. Is this intentional upstream behavior, or should the queue implement context-aware blocking? - Is
nodedeclaredfeaturesconfirmed to be a no-op in kcp's virtual workspace model before merging?
/cc @mjudeikis
mjudeikis-bot
left a comment
There was a problem hiding this comment.
Inline review comments on key changes
| return | ||
| } | ||
| } | ||
| utilruntime.HandleErrorWithContext(ctx, err, "Failed to process object from queue") |
There was a problem hiding this comment.
🔴 Silent error swallowing regression
The previous code called utilruntime.HandleErrorWithContext(ctx, err, "Failed to process object from queue") here for any non-ErrFIFOClosed error. That line has been removed.
Now all non-FIFO errors (e.g. a ProcessFunc returning an error) are silently dropped with no log, no metric, no panic. This hides real problems and makes debugging very difficult.
Suggestion: add the error logging back:
if err != nil {
if errors.Is(err, cache.ErrFIFOClosed) {
return
}
utilruntime.HandleErrorWithContext(ctx, err, "Failed to process object from queue")
}There was a problem hiding this comment.
This code has been updated to match the upstream code: https://github.com/kubernetes/kubernetes/blob/9cfdbc0d6ecfaaab403c8ce34d2967751519af12/staging/src/k8s.io/client-go/tools/cache/controller.go#L236-L261
There was a problem hiding this comment.
Thanks for the link! Confirmed — aligning with upstream here is the right call. The error handling in upstream controller.go matches this pattern, so the concern is resolved. ✅
| if err != nil { | ||
| if err == cache.ErrFIFOClosed { | ||
| return | ||
| select { |
There was a problem hiding this comment.
🟡 Context cancellation is non-blocking between pops only
The select { case <-ctx.Done(): return; default: ... } check is non-blocking — it only fires at the top of the loop, not while blocked inside Pop/PopBatch. If the queue is empty and Pop is blocking, a ctx cancellation will not be noticed until the next item is popped.
This changes the shutdown behaviour compared to the old code where cancellation could theoretically be handled differently. Worth a comment explaining this is intentional and the queue is expected to unblock on close.
There was a problem hiding this comment.
This is copy code, we need to check how its aligned with upstream
There was a problem hiding this comment.
Again, this is copied code, I wouldn't touch it as of now.
There was a problem hiding this comment.
Fair enough — if it is a verbatim copy of upstream code, keeping divergence low is the right trade-off. Worth opening a tracking issue upstream if the non-blocking ctx cancellation is ever a problem in practice. ✅
| }) | ||
| } else { | ||
| txn := cache.Transaction{ | ||
| Type: cache.TransactionTypeAdd, |
There was a problem hiding this comment.
🟡 Partial state on TransactionError — potential store/handler inconsistency
When txnStore.Transaction(txns...) returns a TransactionError, only callbacks for SuccessfulIndices are fired. This is correct if the TransactionStore is truly atomic (all-or-nothing). But if it's partially applied (some items written, some not), event handlers will be notified about some items but the store has a different set — leading to inconsistency between store state and handler view.
Please add a comment clarifying the atomicity contract expected of TransactionStore.Transaction(), or add a test covering the partial-failure path.
There was a problem hiding this comment.
need to verify with upstream lib if this same issue in there
There was a problem hiding this comment.
This code has been completely copied from the upstream.
There was a problem hiding this comment.
Acknowledged — if this is a straight upstream copy, then the partial-state concern applies equally to upstream. Worth verifying via upstream issue tracker rather than diverging here. ✅
| // This prevents CRD watch cache lazy initialization from blocking on a dead etcd. | ||
| etcdCtx, etcdCancel := context.WithCancel(context.Background()) | ||
| defer etcdCancel() | ||
| if completedConfig.EmbeddedEtcd.Config != nil { |
There was a problem hiding this comment.
✅ Good fix — etcd now outlives the apiserver during graceful shutdown via a separate context.Background()-rooted context. This prevents CRD watch cache lazy initialization from blocking on a dead etcd connection during shutdown. The defer etcdCancel() correctly cleans up when the command handler returns.
Signed-off-by: Marko Mudrinić <mudrinic.mare@gmail.com>
|
/lgtm lets see if it now blow into our faces :D :D thanks @xmudrii |
|
LGTM label has been added. DetailsGit tree hash: 23e6892c6df728d5c3cb8418eacdfacf66c699d1 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mjudeikis The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
This PR rebases kcp to Kubernetes 1.35.1. The
kcp-dev/kubernetesfork has been already updated to 1.35.1. Go has been updated to 1.25 with this rebase.The rebase applied mostly cleanly. Some tests that were very close to the timeout started hitting that timeout with this rebase and Go update, so I had to increase timeouts for those tests.
TestAPIExportAPIBindingsAccesshad to be disabled because it became very flaky with this PR, however, it has been flaky before too (#3844). This test will be handled as a follow up.#3897 should be merged either before or after this PR, it includes some additional fixes.
What Type of PR Is This?
/kind feature
Related Issue(s)
xref #3813
Release Notes