Add multicluster support for the event recorder (#619)
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Controllers now obtain event recorders from the multicluster client instead of the controller-manager. A MultiClusterRecorder was added to route Eventf calls to per-cluster recorders based on the object's GVK and the client's routing logic, with tests for routing, fallback, and concurrency.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Controller as Scheduling Controller
    participant MCR as MultiClusterRecorder
    participant Client as Multicluster Client
    participant Scheme as Home Scheme
    participant Router as Cluster Router
    participant ClusterRec as Per-Cluster Recorder
    Controller->>MCR: Eventf(regarding, related, ...)
    MCR->>Scheme: resolve GVK(regarding)
    alt GVK resolved
        Scheme-->>MCR: GVK
        MCR->>Router: clusterForWrite(GVK)
        Router-->>MCR: target cluster
        MCR->>Client: lookup pre-cached recorder(target)
        alt recorder found
            MCR->>ClusterRec: Eventf(...)
            ClusterRec-->>MCR: recorded
        else fallback to home
            MCR->>ClusterRec: Eventf(...) [home recorder]
            ClusterRec-->>MCR: recorded
        end
    else resolution fails or regarding nil
        MCR->>ClusterRec: Eventf(...) [home recorder]
        ClusterRec-->>MCR: recorded
    end
    MCR-->>Controller: return
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (3)
pkg/multicluster/recorder.go (1)
22-49: Recorder cache is static; new remote clusters will silently fall back to home events
`GetEventRecorder()` pre-creates a fixed recorder map, and `recorderFor()` falls back to home when a cluster key is missing. If remotes are added after initialization, events for those objects are emitted in the wrong cluster.
♻️ Suggested refactor (lazy recorder creation on cache miss)
```diff
 import (
+	"sync"

 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/client-go/tools/events"
 	ctrl "sigs.k8s.io/controller-runtime"
 	"sigs.k8s.io/controller-runtime/pkg/cluster"
 )

 type MultiClusterRecorder struct {
 	client       *Client
+	name         string
 	homeRecorder events.EventRecorder
 	recorders    map[cluster.Cluster]events.EventRecorder
+	mu           sync.RWMutex
 }

 func (c *Client) GetEventRecorder(name string) events.EventRecorder {
 	homeRecorder := c.HomeCluster.GetEventRecorder(name)
 	recorders := make(map[cluster.Cluster]events.EventRecorder)
 	recorders[c.HomeCluster] = homeRecorder
@@
 	return &MultiClusterRecorder{
 		client:       c,
+		name:         name,
 		homeRecorder: homeRecorder,
 		recorders:    recorders,
 	}
 }
@@
-	recorder, ok := r.recorders[cl]
-	if !ok {
-		ctrl.Log.V(1).Info("multi-cluster recorder: no pre-created recorder for cluster, using home recorder", "gvk", gvk)
-		return r.homeRecorder
-	}
-
-	return recorder
+	r.mu.RLock()
+	recorder, ok := r.recorders[cl]
+	r.mu.RUnlock()
+	if ok {
+		return recorder
+	}
+
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	if recorder, ok = r.recorders[cl]; ok {
+		return recorder
+	}
+	recorder = cl.GetEventRecorder(r.name)
+	r.recorders[cl] = recorder
+	return recorder
 }
```

Also applies to: 76-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/multicluster/recorder.go` around lines 22-49: GetEventRecorder currently builds a static map of recorders so later-added remote clusters silently fall back to the home recorder in recorderFor; change to lazily create and cache per-cluster recorders on demand instead: update MultiClusterRecorder.recorderFor (and any lookup path) to, under remoteClustersMu (or a dedicated mutex on MultiClusterRecorder), check if a recorder exists for the requested cluster and if not call cluster.GetEventRecorder(name) to create it and store it in the recorders map before returning it, ensuring concurrent access is synchronized and avoiding fallback to homeRecorder for newly added clusters; keep GetEventRecorder only creating the homeRecorder and initial entries if desired but rely on recorderFor for misses.

pkg/multicluster/client_test.go (1)
79-97: Prefer interface-typed test recorder field to reduce test coupling
`fakeRecorder` can be typed as `events.EventRecorder` instead of the concrete `*fakeEventRecorder`; this keeps the fake cluster reusable and decoupled from one specific test helper type.

♻️ Minimal refactor
```diff
 type fakeCluster struct {
 	cluster.Cluster
 	fakeClient   client.Client
 	fakeCache    *fakeCache
-	fakeRecorder *fakeEventRecorder
+	fakeRecorder events.EventRecorder
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `pkg/multicluster/client_test.go` around lines 79-97: change the fakeCluster test field `fakeRecorder` from the concrete `*fakeEventRecorder` to the interface type `events.EventRecorder` and update the GetEventRecorder method to return that interface; specifically, make the fakeCluster struct declare `fakeRecorder events.EventRecorder`, have `GetEventRecorder(_ string)` return `f.fakeRecorder` when non-nil, and otherwise return a new `&fakeEventRecorder{}` (which still implements `events.EventRecorder`). This reduces coupling while preserving current behavior.

pkg/multicluster/recorder_test.go (1)
174-181: Use a single snapshot per assertion block.

These assertions call `getCalls()` multiple times, which repeatedly locks and copies slices. Taking one snapshot per recorder improves readability and avoids redundant work.

♻️ Suggested refactor
```diff
-	if len(remoteARecorder.getCalls()) != 0 {
-		t.Errorf("expected 0 calls to remote-a, got %d", len(remoteARecorder.getCalls()))
-	}
-	if len(remoteBRecorder.getCalls()) != 1 {
-		t.Fatalf("expected 1 call to remote-b, got %d", len(remoteBRecorder.getCalls()))
-	}
-	if remoteBRecorder.getCalls()[0].reason != "SchedulingFailed" {
-		t.Errorf("expected reason %q, got %q", "SchedulingFailed", remoteBRecorder.getCalls()[0].reason)
-	}
+	remoteACalls := remoteARecorder.getCalls()
+	if len(remoteACalls) != 0 {
+		t.Errorf("expected 0 calls to remote-a, got %d", len(remoteACalls))
+	}
+	remoteBCalls := remoteBRecorder.getCalls()
+	if len(remoteBCalls) != 1 {
+		t.Fatalf("expected 1 call to remote-b, got %d", len(remoteBCalls))
+	}
+	if remoteBCalls[0].reason != "SchedulingFailed" {
+		t.Errorf("expected reason %q, got %q", "SchedulingFailed", remoteBCalls[0].reason)
+	}
```

```diff
-	if len(homeRecorder.getCalls()) != goroutines {
-		t.Errorf("expected %d calls, got %d", goroutines, len(homeRecorder.getCalls()))
-	}
+	homeCalls := homeRecorder.getCalls()
+	if len(homeCalls) != goroutines {
+		t.Errorf("expected %d calls, got %d", goroutines, len(homeCalls))
+	}
```

Also applies to: 253-255
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/multicluster/recorder_test.go` around lines 174 - 181, The test repeatedly calls remoteARecorder.getCalls() and remoteBRecorder.getCalls(), causing redundant locking/copying; fix by taking a single snapshot for each recorder (e.g., callsA := remoteARecorder.getCalls() and callsB := remoteBRecorder.getCalls()) and use those snapshots for all subsequent length and content assertions (replace repeated getCalls() calls and use callsB[0].reason for the reason check); apply same change for the similar block at lines 253-255.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3a3aff50-8a0c-40d3-90df-f471a387434c
📒 Files selected for processing (9)
- internal/scheduling/cinder/filter_weigher_pipeline_controller.go
- internal/scheduling/machines/filter_weigher_pipeline_controller.go
- internal/scheduling/manila/filter_weigher_pipeline_controller.go
- internal/scheduling/nova/filter_weigher_pipeline_controller.go
- internal/scheduling/pods/filter_weigher_pipeline_controller.go
- internal/scheduling/reservations/failover/controller.go
- pkg/multicluster/client_test.go
- pkg/multicluster/recorder.go
- pkg/multicluster/recorder_test.go
Pull request overview
Adds a multi-cluster-aware Kubernetes EventRecorder that routes events to the correct cluster (home vs. matching remote) using the same routing logic as the multicluster client write path.
Changes:
- Introduce `MultiClusterRecorder` and `Client.GetEventRecorder(...)` to route events by the "regarding" object's GVK and router match.
- Update scheduling controllers to use `mcl.GetEventRecorder(...)` instead of `mgr.GetEventRecorder(...)`.
- Add unit tests covering home routing, remote routing, fallback behavior, and concurrency.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/multicluster/recorder.go | New multi-cluster event recorder + Client.GetEventRecorder implementation. |
| pkg/multicluster/recorder_test.go | New tests validating routing/fallback/concurrency of the recorder. |
| pkg/multicluster/client_test.go | Extend fake cluster.Cluster to provide event recorders for tests. |
| internal/scheduling/reservations/failover/controller.go | Switch controller recorder to multicluster-aware recorder. |
| internal/scheduling/pods/filter_weigher_pipeline_controller.go | Use multicluster-aware recorder for history events. |
| internal/scheduling/nova/filter_weigher_pipeline_controller.go | Use multicluster-aware recorder for history events. |
| internal/scheduling/manila/filter_weigher_pipeline_controller.go | Use multicluster-aware recorder for history events. |
| internal/scheduling/machines/filter_weigher_pipeline_controller.go | Use multicluster-aware recorder for history events. |
| internal/scheduling/cinder/filter_weigher_pipeline_controller.go | Use multicluster-aware recorder for history events. |
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/guides/multicluster/schedule.sh (2)
189-190: Remove trailing blank lines.

Minor style nit: the file ends with extra blank lines.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/guides/multicluster/schedule.sh` around lines 189 - 190, Remove the extra trailing blank lines at the end of the shell script (schedule.sh) so the file ends immediately after the final meaningful line (no extra newlines or empty lines at EOF); open the file, delete the blank lines after the last command/comment, and ensure the file ends with a single newline only.
174-187: Consider a longer wait or polling for async operations.

The `sleep 1` may be insufficient for the scheduler to complete processing and for events to propagate to remote clusters. Consider increasing the delay or implementing a polling mechanism that waits for the expected resources to appear.

💡 Optional: Add retry loop for history check
```diff
-sleep 1
+echo "Waiting for scheduling to complete..."
+for i in {1..10}; do
+  if kubectl --context kind-cortex-home get histories 2>/dev/null | grep -q "$INSTANCE_UUID"; then
+    break
+  fi
+  sleep 1
+done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `docs/guides/multicluster/schedule.sh` around lines 174-187: The one-second pause (sleep 1) before running the kubectl checks is likely too short; replace the single sleep with either a longer delay or, preferably, a retry/poll loop that runs the kubectl --context <context> get histories and kubectl --context <context> get events --field-selector reason=SchedulingSucceeded commands for each context (kind-cortex-home, kind-cortex-remote-az-a, kind-cortex-remote-az-b) until they return the expected resources or a timeout is reached; implement a simple backoff/retry (e.g., loop with sleep intervals and a max timeout) to ensure the scheduler has time to process and events propagate before proceeding.

docs/guides/multicluster/run.sh (1)
55-68: LGTM! Changes align with multicluster recorder routing logic.

The added `History` and `HistoryList` GVKs and the `availabilityZone` label are consistent with `HistoryResourceRouter` expectations in `pkg/multicluster/routers.go`. However, `docs/guides/multicluster/readme.md` still uses the old `az` label key at lines 110 and 118. Update it to `availabilityZone` for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/guides/multicluster/run.sh` around lines 55 - 68, The README still uses the old label key "az" which breaks compatibility with HistoryResourceRouter expecting "availabilityZone"; open the multicluster readme and replace the "az" label occurrences with "availabilityZone" in the example manifests (the two spots referenced) so the example GVK routing labels match the code and the run.sh examples that added History/HistoryList and availabilityZone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/multicluster/readme.md`:
- Around line 4-5: The README contains a truncated sentence "If you want to test
the" that should be completed or removed; edit
docs/guides/multicluster/readme.md to either finish the sentence with the
intended continuation (e.g., "If you want to test the multi-cluster setup, run
the test scripts or follow the steps below") or remove the fragment entirely so
the paragraph reads coherently, and ensure the surrounding mention of run.sh and
cleanup.sh remains accurate and links to any examples or next steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01ece521-bae4-4d04-ba03-71588d4d5e08
📒 Files selected for processing (4)
- docs/guides/multicluster/readme.md
- docs/guides/multicluster/run.sh
- docs/guides/multicluster/schedule.sh
- docs/guides/multicluster/test-pipeline.yaml
✅ Files skipped from review due to trivial changes (1)
- docs/guides/multicluster/test-pipeline.yaml
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/multicluster/schedule.sh`:
- Around line 167-173: The script currently parses RESPONSE into HTTP_CODE and
BODY but only prints the code; update the logic after extracting HTTP_CODE and
BODY to detect non-2xx responses and exit immediately with a non-zero status:
print a clear error message including "HTTP $HTTP_CODE" and the BODY
(pretty-print JSON if possible) and then call exit 1 when HTTP_CODE is not in
the 200–299 range so the script fails fast; use the existing HTTP_CODE and BODY
variables and ensure successful responses continue to print and proceed as
before.
- Around line 190-191: The script currently always blocks on the interactive
prompt (echo followed by read -r), which stalls CI/piped runs; wrap the
prompt/read so it only executes when stdin is a TTY (e.g., check [ -t 0 ] or
test for /dev/tty) and skip it in non-interactive runs, leaving the echo message
or a non-blocking log in place when skipping; update the block around the
existing echo and read -r to perform that TTY check before calling read so CI
pipelines won’t hang.
- Line 194: Replace the hardcoded history name with the INSTANCE_UUID variable
so the describe uses the current instance; update the kubectl command that
currently references nova-cortex-test-instance-001 to use the INSTANCE_UUID
shell variable (e.g., kubectl --context kind-cortex-remote-az-b describe history
$INSTANCE_UUID) so subsequent steps remain consistent when INSTANCE_UUID
changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b1a0168-f17d-461b-9d57-45dbc6e0780c
📒 Files selected for processing (2)
- docs/guides/multicluster/readme.md
- docs/guides/multicluster/schedule.sh
docs/guides/multicluster/schedule.sh (Outdated)
```sh
echo "Press enter to describe the History CRD in cortex-remote-az-b and see the details of the scheduling result"
read -r
```
Avoid blocking on prompt in non-interactive runs.
Line 191 always waits for input. In CI or piped runs, this can stall the script indefinitely.
Proposed fix

```diff
 echo "---"
 echo "Press enter to describe the History CRD in cortex-remote-az-b and see the details of the scheduling result"
-read -r
+if [[ -t 0 ]]; then
+  read -r
+fi
```
Test Coverage Report 📊: 68.5%