GCP-368: add GCP CCM v2 e2e tests#7840
GCP-368: add GCP CCM v2 e2e tests#7840openshift-merge-bot[bot] merged 4 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@cristianoveiga: This pull request references GCP-368 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds lazy-initialized guest cluster client accessor to TestContext, registers GCP cloud controller manager workload in control plane, and introduces comprehensive cloud integration test suite for GCP validating CCM functionality, node initialization, topology labeling, taint removal, and LoadBalancer provisioning. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@cristianoveiga: This pull request references GCP-368 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
@cristianoveiga: This pull request references GCP-368 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/v2/internal/test_context.go`:
- Around line 85-117: Replace the one-time initialization that uses
guestClientOnce with a mutex-based retryable init: change the field
guestClientOnce (sync.Once) to guestClientMu (sync.Mutex), then in the
GetGuestClient (method containing the closure) first return tc.guestClient if
non-nil, otherwise lock tc.guestClientMu, re-check tc.guestClient (to avoid
races), then attempt to load the hosted cluster kubeconfig, create the REST
config and crclient as before, and only set tc.guestClient when client creation
succeeds; always defer tc.guestClientMu.Unlock() after locking so failed
attempts don't permanently block retries. Use the existing symbols hc,
kubeconfigSecret, clientcmd.RESTConfigFromKubeConfig, and crclient.New to locate
the initialization logic to modify.
In `@test/e2e/v2/tests/cloud_integration_test.go`:
- Around line 130-133: Replace the fixed testServiceName constant to generate a
per-test unique name (e.g., using the test's name or a UUID) instead of
"ccm-lb-test" and use that generated serviceName variable wherever the service
is created; keep testNamespace as "default". Also update the cleanup logic to
delete the service by that exact generated serviceName so leftover resources
don't collide across retries/parallel runs. Ensure all references that
previously used testServiceName are updated to the new variable.
- Around line 65-127: The tests perform immediate assertions on node state (in
the It blocks that call testCtx.GetGuestClient(), list nodes into nodes :=
&corev1.NodeList{}, and iterate nodes.Items) which can race with CCM
convergence; change each test ("should set providerID...", "should set zone and
region...", "should remove the uninitialized taint...") to wrap the node
validations inside a Gomega Eventually that repeatedly lists nodes and asserts
all nodes satisfy the required conditions (providerID format checks referencing
hc.Spec.Platform.GCP.Project, topology.kubernetes.io/zone/region presence and
non-empty, and absence of taint key
node.cloudprovider.kubernetes.io/uninitialized) until success or timeout; ensure
the closure re-fetches nodes via guestClient.List and returns no failure until
every node passes so transient failures are retried.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (3)
test/e2e/v2/internal/test_context.gotest/e2e/v2/internal/workload_registry.gotest/e2e/v2/tests/cloud_integration_test.go
Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and add GCP CCM tests to control_plane_workloads_test.go validating node initialization (providerID, topology labels, taint removal). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/test e2e-v2-aws |
|
/test ? |
|
/test e2e-v2-gke |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
cblecker
left a comment
There was a problem hiding this comment.
Review Summary
Overall the PR is well-structured and follows established codebase patterns. The workload registry entry correctly mirrors other platform-specific CCMs (AWS, Azure, KubeVirt), and registering gcp-cloud-controller-manager automatically enables 10+ existing workload compliance tests (resource requests, pull policy, read-only root filesystem, safe-to-evict, etc.) — significant coverage gain. The 3 behavioral tests validate the core CCM node initialization contract (providerID, topology labels, taint removal). CI is passing on both AWS (properly skipped) and GKE (executed).
Items
Should fix:
GetGuestClient()docstring says "returns nil" but actually panics on most error paths — update to document actual behaviorcontext.Background()used instead oftc.Context, inconsistent withGetHostedCluster()— won't respect test timeoutshc.Spec.Platform.GCP.Projectaccessed without nil-check on theGCPpointer field — add defensive Expectworkload_registry.gofile header claims "generated, do not edit manually" but is routinely hand-edited — remove or fix the header
Suggestions:
- Repeated setup boilerplate across all 3
Itblocks could be lifted into aBeforeEach - Assertion message "guest client is required" could include diagnostic context
- providerID error message format could match the inline comment format
| // GetGuestClient returns a controller-runtime client for the guest cluster. | ||
| // It reads the kubeconfig from the secret referenced by the HostedCluster status. | ||
| // The client is lazily initialized and cached. | ||
| // Returns nil if the guest client cannot be created (e.g., HostedCluster not ready). |
There was a problem hiding this comment.
This docstring is inaccurate. The method panics on most error paths (secret fetch failure, missing kubeconfig key, REST config creation, client creation) — it only returns nil when hc == nil or hc.Status.KubeConfig == nil.
Suggest updating to match the actual behavior (and mirror the GetHostedCluster() pattern):
// Returns nil if the HostedCluster is not available or its KubeConfig status is not set.
// Panics on any other initialization failure (e.g., kubeconfig secret not found, invalid kubeconfig data).There was a problem hiding this comment.
Good catch - this was left over from the initial implementation. I will update it.
| } | ||
|
|
||
| var kubeconfigSecret corev1.Secret | ||
| err := tc.MgmtClient.Get(context.Background(), crclient.ObjectKey{ |
There was a problem hiding this comment.
GetHostedCluster() uses tc.Context for its API call (line 57), but this uses context.Background(). This means the kubeconfig secret fetch won't respect test timeout/cancellation.
Suggest using tc.Context for consistency:
err := tc.MgmtClient.Get(tc.Context, crclient.ObjectKey{There was a problem hiding this comment.
Good call - updated it.
| Expect(nodes.Items).NotTo(BeEmpty(), "cluster should have nodes") | ||
|
|
||
| hc := testCtx.GetHostedCluster() | ||
| gcpProject := hc.Spec.Platform.GCP.Project |
There was a problem hiding this comment.
GCP is a pointer field (*GCPPlatformSpec). While the BeforeEach guard checks Platform.Type == GCPPlatform, a nil GCP field would cause a raw nil pointer panic here with no useful diagnostic. Adding a defensive check produces a clear failure message:
Expect(hc.Spec.Platform.GCP).NotTo(BeNil(), "GCP platform spec must be set for GCP HostedCluster %s/%s", hc.Namespace, hc.Name)
gcpProject := hc.Spec.Platform.GCP.ProjectThere was a problem hiding this comment.
Pre-existing issue, but worth fixing in this PR since the file is being edited: the file header (lines 3-4) says "This file is generated. Do not edit manually." and references a script at /tmp/generate_workloads.go that doesn't exist in the repository. The output filename referenced (generated_workloads.go) also doesn't match the actual filename (workload_registry.go). The file has been manually edited in multiple commits including this one.
Suggest removing those two lines or replacing with something accurate like:
// This file defines the control plane workload registry.
// Add new workload entries here when onboarding new components.There was a problem hiding this comment.
Done! I'm guessing this was an "one-time" generator just to get the first version of this file in place/migrated?
There was a problem hiding this comment.
That's my suspicion too from when @csrwng created it
| Context("When nodes are initialized by the CCM", func() { | ||
| It("should set providerID on all nodes", func() { | ||
| testCtx := getTestCtx() | ||
| guestClient := testCtx.GetGuestClient() | ||
| Expect(guestClient).NotTo(BeNil(), "guest client is required") |
There was a problem hiding this comment.
suggestion: The setup block (get testCtx, get guestClient, assert not nil, list nodes, assert not empty) is repeated identically in all 3 It blocks. Consider lifting the shared setup into a BeforeEach in this Context, which is the idiomatic Ginkgo pattern used by other tests in this file (e.g., SecurityContextUIDTest).
Also, if GetGuestClient() returns nil, the assertion message "guest client is required" doesn't help diagnose why. Something like "guest client is nil; HostedCluster may not have KubeConfig status set" would save debugging time.
There was a problem hiding this comment.
Done! Moved the duplicated code into BeforeEach and improved the assertion message.
| "node %s providerID should reference project %s", node.Name, gcpProject) | ||
| parts := strings.Split(node.Spec.ProviderID, "/") | ||
| Expect(parts).To(HaveLen(5), | ||
| "node %s providerID should have format gce://project/zone/instance", node.Name) |
There was a problem hiding this comment.
nit: The error message says gce://project/zone/instance but the inline comment on line 848 uses the more precise gce://<project>/<zone>/<instance-name>. Consider matching the comment format in the error message for clarity during failure triage:
"node %s providerID should have format gce://<project>/<zone>/<instance-name>", node.Name)- Fix GetGuestClient() docstring to reflect panic behavior - Use tc.Context instead of context.Background() for consistency - Add nil check on hc.Spec.Platform.GCP before accessing Project - Remove stale "generated" file header from workload_registry.go - Lift shared test setup into BeforeEach with better error message - Fix providerID error message format to match comment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/test e2e-v2-gke |
|
Follow-up: a couple of minor pattern consistency items I noticed after looking at the updated diff more closely. 1. Labels on Context node The new test uses Context("GCP Cloud Controller Manager", Label("GCP", "CCM"), func() {No other test function in this file uses Labels on Context nodes — 2. Skip message format Existing platform-skip messages follow a consistent pattern:
The new code uses: Consider matching the existing format, e.g.: Both are minor — the functional changes all look good. |
|
One more question: the other tests registered via Was there a deliberate reason to put it in |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The labels are intentional, yes. I used them to run the tests locally (against a pre-provisioned MC that I had). They were useful to filter: --ginkgo.label-filter="GCP && CCM").
Updated the message to match the existing format.
|
|
I initially had it in a separate That said, I'm happy to move this to a new file now if you have a specific preference.
|
|
Feedback on naming: the v2 framework is a clean slate and doesn't use "guest cluster" terminology anywhere (the only "guest" reference is the AWS env var Since v2 is the chance to get this right, I'd suggest renaming:
Separately, per my earlier question about file organization — I'd recommend moving A feature-scoped file (rather than a monolithic The structure would follow the existing v2 pattern:
The GCP CCM workload registry entry should stay in |
- Rename GetGuestClient() to GetHostedClusterClient() to align with v2 framework terminology (hosted cluster, not guest cluster) - Move GCPCloudControllerManagerTest from control_plane_workloads_test.go to hosted_cluster_ccm_test.go with RegisterHostedClusterCCMTests registration pattern, separating hosted cluster validation from management cluster workload tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Both good suggestions - Implemented.
|
|
/test e2e-v2-gke |
|
/lgtm |
|
@cblecker: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cblecker, cristianoveiga The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
The HostedCluster0 conditions failure is a separate test framework issue — after all Now let me produce the final report: Test Failure Analysis Complete (Multi-Step)Job Information
Failed Step AnalysisStep:
|
| Test | Duration | Cause |
|---|---|---|
TestNodePool/HostedCluster0/Main |
0.02s | Parent of the failing subtest |
TestNodePool/HostedCluster0 |
3455.93s | Framework post-condition check failed: the EnsureHostedCluster phase expected the cluster to be in a fresh/progressing state but found it fully ready (ClusterVersionAvailable=True, ClusterVersionProgressing=False). This is a test framework expectation mismatch for 4.21 clusters, not a product bug. |
TestNodePool |
0.00s | Parent of HostedCluster0 |
Aggregated Root Cause
Failed Steps Summary
| Step | One-line Failure |
|---|---|
TestNTOPerformanceProfile |
NTO in hosted control plane failed to generate PerformanceProfile status ConfigMap within 10m timeout |
Root Cause Hypothesis
This failure is a pre-existing flaky test, unrelated to PR #7840. The Node Tuning Operator running inside the hosted control plane (e2e-clusters-b454p-node-pool-5z79c) did not create the expected PerformanceProfile status ConfigMap within 10 minutes. Contributing factors:
-
NTO processing delay: The NTO must detect the mirrored PerformanceProfile ConfigMap, process it, generate a MachineConfig, apply it to nodes, and then create a status ConfigMap reflecting the result. On a CI cluster with 20 parallel tests, this chain can be delayed by resource pressure.
-
API client rate limiting: The
client rate limiter Wait returned an error: context deadline exceededmessage indicates that the management cluster API server was under heavy load, which would affect both the test's polling and the NTO's ability to operate. -
No code change correlation: PR GCP-368: add GCP CCM v2 e2e tests #7840 adds GCP-specific CCM tests in the v2 test framework. It does not modify any controllers, APIs, or infrastructure code that could affect NTO behavior, PerformanceProfile processing, or ConfigMap creation.
Recommendations
- Retrigger the job — This is a flaky NTO timing issue, not a regression from PR GCP-368: add GCP CCM v2 e2e tests #7840.
- Consider increasing the timeout for the status ConfigMap check in
nodepool_nto_performanceprofile_test.go:159(currently 10 minutes viaEventuallyObjectsdefault). Under heavy CI load with 20 parallel tests, NTO may need more time. - Investigate NTO logs — To confirm the root cause of the NTO delay, check NTO pod logs in the HCP namespace
e2e-clusters-b454p-node-pool-5z79cfor errors during PerformanceProfile processing (available in the must-gather dump atdump-management-clusterartifacts).
Artifacts
- Test artifacts:
.work/prow-job-analyze-test-failure/2043562443174580224/logs/ - Build log:
.work/prow-job-analyze-test-failure/2043562443174580224/logs/build-log.txt - JUnit XML:
.work/prow-job-analyze-test-failure/2043562443174580224/logs/junit_operator.xml - Prow Job URL: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/7840/pull-ci-openshift-hypershift-main-e2e-aws-4-21/2043562443174580224
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7840 +/- ##
==========================================
+ Coverage 26.56% 29.74% +3.18%
==========================================
Files 1087 1099 +12
Lines 105042 108949 +3907
==========================================
+ Hits 27902 32409 +4507
+ Misses 74731 73853 -878
- Partials 2409 2687 +278 🚀 New features to boost your workflow:
|
|
@cristianoveiga: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Add capability-based Azure e2e tests to the shared test/e2e/v2/tests/ binary, following the GKE CCM pattern (PR openshift#7840). Tests self-select via Skip() based on cluster capabilities instead of using a separate binary. Three test groups with Ginkgo label filters for CI: - AzurePublicClusterTest (self-managed-azure-public): workload identity webhook mutation, KAS allowed CIDRs, ingress operator configuration - AzurePrivateTopologyTest (self-managed-azure-private): private-router internal LB annotation, PLS CR with alias, private endpoint IP, DNS zone - AzureOAuthLoadBalancerTest (self-managed-azure-oauth-lb): OAuth LB service creation and OAuth token flow validation Skip logic: - Platform type (AzurePlatform) - Azure topology (AzureTopologyPrivate for private tests) - OAuth publishing strategy (LoadBalancer for OAuth LB tests) Also registers Azure-specific env vars (AZURE_PRIVATE_NAT_SUBNET_ID, AZURE_PRIVATE_ADDITIONAL_ALLOWED_SUBSCRIPTIONS) in the shared env var registry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What this PR does / why we need it:
Adds v2 e2e tests validating GCP Cloud Controller Manager node initialization. Changes:
gcp-cloud-controller-managerdeployment so existing workload tests (resource requests, security contexts, etc.) automatically cover itGetGuestClient()toTestContextfor tests that need to inspect guest cluster stateGCPCloudControllerManagerTesttocontrol_plane_workloads_test.govalidating:gce://<project>/<zone>/<instance>)Tests are GCP-specific (skipped on other platforms via
BeforeEachguard), confirmed working on AWS CI run (properly skipped, 0 failures).Which issue(s) this PR fixes:
Fixes GCP-368
Special notes for your reviewer:
Checklist: