GCP-368: add GCP CCM v2 e2e tests by cristianoveiga · Pull Request #7840 · openshift/hypershift

cristianoveiga · 2026-03-02T16:15:44Z

What this PR does / why we need it:

Adds v2 e2e tests validating GCP Cloud Controller Manager node initialization. Changes:

Workload registry: Register gcp-cloud-controller-manager deployment so existing workload tests (resource requests, security contexts, etc.) automatically cover it
Guest cluster client: Add GetGuestClient() to TestContext for tests that need to inspect guest cluster state
GCP CCM tests: Add GCPCloudControllerManagerTest to control_plane_workloads_test.go validating:
- ProviderID assignment (gce://<project>/<zone>/<instance>)
- Zone/region topology labels
- Uninitialized taint removal

The tests are GCP-specific and are skipped on other platforms by a BeforeEach guard. An AWS CI run confirmed this: the tests were skipped as expected, with 0 failures.
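For illustration, the three node-level checks listed above can be sketched as a single pure function. This is a minimal sketch, not the PR's test code: `Node` and `Taint` are simplified stand-ins for the `corev1` types, `checkInitialized` is a hypothetical helper name, and the label and taint keys are the ones named later in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Taint and Node are simplified stand-ins for the corev1 types (an assumption
// so the sketch runs without Kubernetes dependencies).
type Taint struct{ Key string }

type Node struct {
	Name       string
	ProviderID string
	Labels     map[string]string
	Taints     []Taint
}

const (
	uninitializedTaint = "node.cloudprovider.kubernetes.io/uninitialized"
	zoneLabel          = "topology.kubernetes.io/zone"
	regionLabel        = "topology.kubernetes.io/region"
)

// checkInitialized lists the problems found on one node. An empty result
// means the node carries everything the three tests look for.
func checkInitialized(n Node, project string) []string {
	var problems []string
	if !strings.HasPrefix(n.ProviderID, "gce://"+project+"/") {
		problems = append(problems, fmt.Sprintf("node %s: providerID %q does not reference project %s", n.Name, n.ProviderID, project))
	}
	for _, key := range []string{zoneLabel, regionLabel} {
		if n.Labels[key] == "" {
			problems = append(problems, fmt.Sprintf("node %s: label %s is missing or empty", n.Name, key))
		}
	}
	for _, t := range n.Taints {
		if t.Key == uninitializedTaint {
			problems = append(problems, fmt.Sprintf("node %s: still has taint %s", n.Name, t.Key))
		}
	}
	return problems
}

func main() {
	n := Node{
		Name:       "worker-0",
		ProviderID: "gce://example-project/us-central1-a/worker-0",
		Labels:     map[string]string{zoneLabel: "us-central1-a", regionLabel: "us-central1"},
	}
	fmt.Println(len(checkInitialized(n, "example-project")), "problems found")
}
```

Returning a list of problems rather than failing on the first one is a design choice for the sketch only; it makes a single failure message show every unmet condition on a node.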

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

LoadBalancer service provisioning tests were descoped to a follow-up card

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

openshift-ci-robot · 2026-03-02T16:15:48Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-03-02T16:15:49Z

@cristianoveiga: This pull request references GCP-368 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and create cloud integration tests validating CCM node initialization and LoadBalancer provisioning.

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-03-02T16:16:07Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: fc6cf417-056c-490d-8cbb-acaf3ed2f137

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


📝 Walkthrough

Walkthrough

Adds lazy-initialized guest cluster client accessor to TestContext, registers GCP cloud controller manager workload in control plane, and introduces comprehensive cloud integration test suite for GCP validating CCM functionality, node initialization, topology labeling, taint removal, and LoadBalancer provisioning.

Changes

Cohort / File(s)	Summary
Guest Client Support `test/e2e/v2/internal/test_context.go`	Implements lazy-initialized guest cluster client accessor via `GetGuestClient()` method. Retrieves kubeconfig from HostedCluster secret, builds REST configuration, and caches client using `sync.Once` for thread-safe access. Returns nil if prerequisites unmet or on errors.
Control Plane Configuration `test/e2e/v2/internal/workload_registry.go`	Adds `gcp-cloud-controller-manager` deployment as new GCP platform control plane workload with cloud-controller-manager pod selector.
Cloud Integration Tests `test/e2e/v2/tests/cloud_integration_test.go`	Introduces comprehensive GCP cloud integration test suite validating CCM-driven functionality: providerID format/extraction (gce:// format), node topology labeling (zone/region), uninitialized taint removal, and LoadBalancer service provisioning with external IP polling and cleanup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Structure And Quality	⚠️ Warning	Test code fails quality requirements: node-state assertions lack Eventually() wrappers, LoadBalancer test uses fixed service name without unique generation, and GetGuestClient uses sync.Once causing permanent nil caching.	Wrap node validations in Eventually() blocks with timeouts, generate unique service names using timestamp/process ID, and replace sync.Once with sync.Mutex for retry support.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding GCP CCM v2 end-to-end tests, which aligns with all three modified files (guest client addition, workload registry update, and new test suite).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Stable And Deterministic Test Names	✅ Passed	All test titles in cloud_integration_test.go are stable and deterministic with only static descriptive strings; no dynamic information like timestamps, UUIDs, or IP addresses present.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


openshift-ci · 2026-03-02T16:16:41Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-03-02T16:25:16Z

@cristianoveiga: This pull request references GCP-368 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and create cloud integration tests validating CCM node initialization and LoadBalancer provisioning.

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.

Relevant issues have been referenced.

This change includes docs.

This change includes unit tests.

Summary by CodeRabbit

Release Notes

Tests

Added comprehensive end-to-end cloud integration test suite for GCP platform, including validation of Cloud Controller Manager node initialization with proper provider ID assignment, node topology label verification, automatic taint removal, and LoadBalancer external IP provisioning.

Enhanced testing infrastructure with improved guest cluster client access and cloud platform workload registry.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/v2/internal/test_context.go`:
- Around line 85-117: Replace the one-time initialization that uses
guestClientOnce with a mutex-based retryable init: change the field
guestClientOnce (sync.Once) to guestClientMu (sync.Mutex), then in the
GetGuestClient (method containing the closure) first return tc.guestClient if
non-nil, otherwise lock tc.guestClientMu, re-check tc.guestClient (to avoid
races), then attempt to load the hosted cluster kubeconfig, create the REST
config and crclient as before, and only set tc.guestClient when client creation
succeeds; always defer tc.guestClientMu.Unlock() after locking so failed
attempts don't permanently block retries. Use the existing symbols hc,
kubeconfigSecret, clientcmd.RESTConfigFromKubeConfig, and crclient.New to locate
the initialization logic to modify.
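A minimal sketch of the mutex-based, retryable initialization described above, under stated assumptions: `client`, `lazyClient`, and `flakyBuilder` are hypothetical names, and the constructor stands in for loading the kubeconfig and creating the client. Unlike the suggestion's unlocked fast path, this version takes the lock on every call, which is simpler and avoids an unsynchronized read.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// client is a stand-in for the real controller-runtime client type.
type client struct{ name string }

// lazyClient caches the result of a fallible constructor. Unlike sync.Once,
// a failed attempt is not remembered, so the next Get call retries.
type lazyClient struct {
	mu     sync.Mutex
	cached *client
	build  func() (*client, error)
}

func (l *lazyClient) Get() (*client, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cached != nil {
		return l.cached, nil
	}
	c, err := l.build()
	if err != nil {
		return nil, err // nothing is cached, so a later call can retry
	}
	l.cached = c
	return c, nil
}

// flakyBuilder returns a constructor that fails the first `failures` times.
// It counts invocations through the calls pointer.
func flakyBuilder(failures int, calls *int) func() (*client, error) {
	return func() (*client, error) {
		*calls++
		if *calls <= failures {
			return nil, errors.New("kubeconfig not ready")
		}
		return &client{name: "hosted-cluster"}, nil
	}
}

func main() {
	calls := 0
	l := &lazyClient{build: flakyBuilder(1, &calls)}
	_, err := l.Get()
	fmt.Println("first call:", err)
	c, err := l.Get()
	fmt.Println("second call:", c.name, err)
}
```

With `sync.Once`, the first failure would be the only attempt and every later call would see the same empty result.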

In `@test/e2e/v2/tests/cloud_integration_test.go`:
- Around line 130-133: Replace the fixed testServiceName constant to generate a
per-test unique name (e.g., using the test's name or a UUID) instead of
"ccm-lb-test" and use that generated serviceName variable wherever the service
is created; keep testNamespace as "default". Also update the cleanup logic to
delete the service by that exact generated serviceName so leftover resources
don't collide across retries/parallel runs. Ensure all references that
previously used testServiceName are updated to the new variable.
- Around line 65-127: The tests perform immediate assertions on node state (in
the It blocks that call testCtx.GetGuestClient(), list nodes into nodes :=
&corev1.NodeList{}, and iterate nodes.Items) which can race with CCM
convergence; change each test ("should set providerID...", "should set zone and
region...", "should remove the uninitialized taint...") to wrap the node
validations inside a Gomega Eventually that repeatedly lists nodes and asserts
all nodes satisfy the required conditions (providerID format checks referencing
hc.Spec.Platform.GCP.Project, topology.kubernetes.io/zone/region presence and
non-empty, and absence of taint key
node.cloudprovider.kubernetes.io/uninitialized) until success or timeout; ensure
the closure re-fetches nodes via guestClient.List and returns no failure until
every node passes so transient failures are retried.
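The retry-until-converged idea behind the Eventually suggestion can be sketched with the standard library alone. This is an illustration of the polling pattern, not Gomega's implementation: `pollUntil` and `simulate` are hypothetical helpers, and the check function stands in for "list nodes and assert every node passes".

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls check immediately and then once per interval until it
// returns nil or ctx is done. On timeout it wraps the last check error.
func pollUntil(ctx context.Context, interval time.Duration, check func() error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := check()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met before deadline: %w", err)
		case <-ticker.C:
		}
	}
}

// simulate polls a fake condition that starts passing on attempt succeedAt.
func simulate(succeedAt, timeoutMillis int) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	err = pollUntil(ctx, 5*time.Millisecond, func() error {
		attempts++
		if attempts < succeedAt {
			return fmt.Errorf("nodes not initialized yet (attempt %d)", attempts)
		}
		return nil
	})
	return attempts, err
}

func main() {
	attempts, err := simulate(3, 1000)
	fmt.Println(attempts, err)
}
```

The important property is that the check re-fetches state on every attempt, so a node that is still converging on the first pass does not fail the test outright.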

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cce0243 and 036769f.

📒 Files selected for processing (3)

test/e2e/v2/internal/test_context.go
test/e2e/v2/internal/workload_registry.go
test/e2e/v2/tests/cloud_integration_test.go

Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and add GCP CCM tests to control_plane_workloads_test.go validating node initialization (providerID, topology labels, taint removal).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cristianoveiga · 2026-03-27T20:05:11Z

/test e2e-v2-aws

cristianoveiga · 2026-04-09T19:30:06Z

/test ?

cristianoveiga · 2026-04-09T19:30:33Z

/test e2e-v2-gke

cristianoveiga · 2026-04-10T14:26:43Z

@coderabbitai review

coderabbitai · 2026-04-10T14:26:52Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cblecker

Review Summary

Overall the PR is well-structured and follows established codebase patterns. The workload registry entry correctly mirrors other platform-specific CCMs (AWS, Azure, KubeVirt), and registering gcp-cloud-controller-manager automatically enables 10+ existing workload compliance tests (resource requests, pull policy, read-only root filesystem, safe-to-evict, etc.) — significant coverage gain. The 3 behavioral tests validate the core CCM node initialization contract (providerID, topology labels, taint removal). CI is passing on both AWS (properly skipped) and GKE (executed).

Items

Should fix:

GetGuestClient() docstring says "returns nil" but actually panics on most error paths — update to document actual behavior
context.Background() used instead of tc.Context, inconsistent with GetHostedCluster() — won't respect test timeouts
hc.Spec.Platform.GCP.Project accessed without nil-check on the GCP pointer field — add defensive Expect
workload_registry.go file header claims "generated, do not edit manually" but is routinely hand-edited — remove or fix the header

Suggestions:

Repeated setup boilerplate across all 3 It blocks could be lifted into a BeforeEach
Assertion message "guest client is required" could include diagnostic context
providerID error message format could match the inline comment format

cblecker · 2026-04-10T18:54:31Z

+// GetGuestClient returns a controller-runtime client for the guest cluster.
+// It reads the kubeconfig from the secret referenced by the HostedCluster status.
+// The client is lazily initialized and cached.
+// Returns nil if the guest client cannot be created (e.g., HostedCluster not ready).


This docstring is inaccurate. The method panics on most error paths (secret fetch failure, missing kubeconfig key, REST config creation, client creation) — it only returns nil when hc == nil or hc.Status.KubeConfig == nil.

Suggest updating to match the actual behavior (and mirror the GetHostedCluster() pattern):

// Returns nil if the HostedCluster is not available or its KubeConfig status is not set.
// Panics on any other initialization failure (e.g., kubeconfig secret not found, invalid kubeconfig data).

Good catch - this was left over from the initial implementation. I will update it.

cblecker · 2026-04-10T18:54:32Z

+		}
+
+		var kubeconfigSecret corev1.Secret
+		err := tc.MgmtClient.Get(context.Background(), crclient.ObjectKey{


GetHostedCluster() uses tc.Context for its API call (line 57), but this uses context.Background(). This means the kubeconfig secret fetch won't respect test timeout/cancellation.

Suggest using tc.Context for consistency:

err := tc.MgmtClient.Get(tc.Context, crclient.ObjectKey{

Good call - updated it.
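To illustrate why the context choice matters, here is a small standard-library sketch. `fetchSecret` is a hypothetical stand-in for an API call such as `MgmtClient.Get`, and `fetchWithDeadline` plays the role of a call made with a test-scoped context; neither is code from the PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchSecret is a hypothetical stand-in for an API call such as
// MgmtClient.Get. It takes `delay` to finish unless ctx ends first.
func fetchSecret(ctx context.Context, delay time.Duration) error {
	select {
	case <-time.After(delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fetchWithDeadline runs fetchSecret under a context that expires after
// timeoutMillis, the way a test-scoped context would.
func fetchWithDeadline(timeoutMillis, delayMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	return fetchSecret(ctx, time.Duration(delayMillis)*time.Millisecond)
}

func main() {
	// A slow call under a short test deadline is abandoned at the deadline.
	fmt.Println("test-scoped context:", fetchWithDeadline(20, 200))
	// The same call under context.Background() has no deadline to respect.
	fmt.Println("background context:", fetchSecret(context.Background(), 200*time.Millisecond))
}
```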

cblecker · 2026-04-10T18:54:34Z

+				Expect(nodes.Items).NotTo(BeEmpty(), "cluster should have nodes")
+
+				hc := testCtx.GetHostedCluster()
+				gcpProject := hc.Spec.Platform.GCP.Project


GCP is a pointer field (*GCPPlatformSpec). While the BeforeEach guard checks Platform.Type == GCPPlatform, a nil GCP field would cause a raw nil pointer panic here with no useful diagnostic. Adding a defensive check produces a clear failure message:

Expect(hc.Spec.Platform.GCP).NotTo(BeNil(), "GCP platform spec must be set for GCP HostedCluster %s/%s", hc.Namespace, hc.Name)
gcpProject := hc.Spec.Platform.GCP.Project
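The same guard can be shown outside Gomega. In this sketch `PlatformSpec` and `GCPPlatformSpec` are cut-down stand-ins for the HyperShift API types, and `gcpProject` is a hypothetical helper; the point is only that checking the pointer first turns a raw nil dereference into a readable error.

```go
package main

import "fmt"

// GCPPlatformSpec and PlatformSpec are simplified stand-ins for the
// HyperShift API types (assumptions for this sketch).
type GCPPlatformSpec struct{ Project string }

type PlatformSpec struct {
	Type string
	GCP  *GCPPlatformSpec
}

// gcpProject returns the project, or a descriptive error when the
// platform-specific pointer is unset even though the type says GCP.
func gcpProject(p PlatformSpec) (string, error) {
	if p.GCP == nil {
		return "", fmt.Errorf("platform type is %q but the GCP platform spec is nil", p.Type)
	}
	return p.GCP.Project, nil
}

func main() {
	_, err := gcpProject(PlatformSpec{Type: "GCP"})
	fmt.Println(err)
	project, _ := gcpProject(PlatformSpec{Type: "GCP", GCP: &GCPPlatformSpec{Project: "example-project"}})
	fmt.Println(project)
}
```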

cblecker · 2026-04-10T18:54:35Z

Pre-existing issue, but worth fixing in this PR since the file is being edited: the file header (lines 3-4) says "This file is generated. Do not edit manually." and references a script at /tmp/generate_workloads.go that doesn't exist in the repository. The output filename referenced (generated_workloads.go) also doesn't match the actual filename (workload_registry.go). The file has been manually edited in multiple commits including this one.

Suggest removing those two lines or replacing with something accurate like:

// This file defines the control plane workload registry.
// Add new workload entries here when onboarding new components.

Done! I'm guessing this was a "one-time" generator just to get the first version of this file in place/migrated?

That's my suspicion too from when @csrwng created it

cblecker · 2026-04-10T18:54:37Z

+		Context("When nodes are initialized by the CCM", func() {
+			It("should set providerID on all nodes", func() {
+				testCtx := getTestCtx()
+				guestClient := testCtx.GetGuestClient()
+				Expect(guestClient).NotTo(BeNil(), "guest client is required")


suggestion: The setup block (get testCtx, get guestClient, assert not nil, list nodes, assert not empty) is repeated identically in all 3 It blocks. Consider lifting the shared setup into a BeforeEach in this Context, which is the idiomatic Ginkgo pattern used by other tests in this file (e.g., SecurityContextUIDTest).

Also, if GetGuestClient() returns nil, the assertion message "guest client is required" doesn't help diagnose why. Something like "guest client is nil; HostedCluster may not have KubeConfig status set" would save debugging time.

Done! Moved the duplicated code into BeforeEach and improved the assertion message.

cblecker · 2026-04-10T18:54:38Z

+						"node %s providerID should reference project %s", node.Name, gcpProject)
+					parts := strings.Split(node.Spec.ProviderID, "/")
+					Expect(parts).To(HaveLen(5),
+						"node %s providerID should have format gce://project/zone/instance", node.Name)


nit: The error message says gce://project/zone/instance but the inline comment on line 848 uses the more precise gce://<project>/<zone>/<instance-name>. Consider matching the comment format in the error message for clarity during failure triage:

"node %s providerID should have format gce://<project>/<zone>/<instance-name>", node.Name)

- Fix GetGuestClient() docstring to reflect panic behavior
- Use tc.Context instead of context.Background() for consistency
- Add nil check on hc.Spec.Platform.GCP before accessing Project
- Remove stale "generated" file header from workload_registry.go
- Lift shared test setup into BeforeEach with better error message
- Fix providerID error message format to match comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
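On the providerID format discussed in the review: splitting `gce://<project>/<zone>/<instance-name>` on "/" yields five parts because the `//` after the scheme produces an empty element, which is why the test asserts a length of 5. A minimal sketch of a parser built on that observation follows; `parseGCEProviderID` is a hypothetical helper, not a function from the PR or from any GCP library.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGCEProviderID splits a providerID of the form
// gce://<project>/<zone>/<instance-name> into its three components.
func parseGCEProviderID(id string) (project, zone, instance string, err error) {
	parts := strings.Split(id, "/")
	// "gce://p/z/i" splits into ["gce:", "", "p", "z", "i"], hence five parts.
	if len(parts) != 5 || parts[0] != "gce:" || parts[1] != "" {
		return "", "", "", fmt.Errorf("providerID %q should have format gce://<project>/<zone>/<instance-name>", id)
	}
	for _, p := range parts[2:] {
		if p == "" {
			return "", "", "", fmt.Errorf("providerID %q has an empty component", id)
		}
	}
	return parts[2], parts[3], parts[4], nil
}

func main() {
	project, zone, instance, err := parseGCEProviderID("gce://example-project/us-central1-a/worker-0")
	fmt.Println(project, zone, instance, err)
}
```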

cristianoveiga · 2026-04-10T19:30:39Z

/test e2e-v2-gke

cblecker · 2026-04-11T00:57:25Z

Follow-up: a couple of minor pattern consistency items I noticed after looking at the updated diff more closely.

1. Labels on Context node

The new test uses Label("GCP", "CCM") on its Context:

Context("GCP Cloud Controller Manager", Label("GCP", "CCM"), func() {

No other test function in this file uses Labels on Context nodes — PodAffinitiesAndTolerationsTest (AWS) and SecurityContextUIDTest (Azure) both use plain Context(...). Labels in this file only appear on the top-level Describe and one It block ("Informing"). Consider removing the labels for consistency, or if they're intentional for filtering, that's fine too — just flagging the deviation.

2. Skip message format

Existing platform-skip messages follow a consistent pattern:

"Pod affinities and tolerations test is only for AWS platform"
"Security context UID test is only for Azure platform"

The new code uses:

"Test requires a GCP HostedCluster"

Consider matching the existing format, e.g.: "GCP Cloud Controller Manager test is only for GCP platform"

Both are minor — the functional changes all look good.

cblecker · 2026-04-11T00:59:21Z

One more question: the other tests registered via RegisterControlPlaneWorkloadsTests all validate properties of control plane workloads (deployments/pods in the management cluster) — things like resource requests, pull policy, security contexts, etc. The GCP CCM test is different in that it validates guest cluster node state (providerID, topology labels, taints).

Was there a deliberate reason to put it in control_plane_workloads_test.go rather than a separate test file? It works fine here, just curious if you considered splitting it out since it's testing a different layer (guest cluster effects vs. management cluster workload properties).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cristianoveiga · 2026-04-11T11:14:12Z

Follow-up: a couple of minor pattern consistency items I noticed after looking at the updated diff more closely.

1. Labels on Context node

The new test uses Label("GCP", "CCM") on its Context:
Context("GCP Cloud Controller Manager", Label("GCP", "CCM"), func() {
No other test function in this file uses Labels on Context nodes — PodAffinitiesAndTolerationsTest (AWS) and SecurityContextUIDTest (Azure) both use plain Context(...). Labels in this file only appear on the top-level Describe and one It block ("Informing"). Consider removing the labels for consistency, or if they're intentional for filtering, that's fine too — just flagging the deviation.

The labels are intentional, yes. I used them to run the tests locally (against a pre-provisioned MC that I had).

They were useful for filtering with --ginkgo.label-filter="GCP && CCM".

2. Skip message format

Existing platform-skip messages follow a consistent pattern:

"Pod affinities and tolerations test is only for AWS platform"

"Security context UID test is only for Azure platform"

The new code uses:
"Test requires a GCP HostedCluster"
Consider matching the existing format, e.g.: "GCP Cloud Controller Manager test is only for GCP platform"

Updated the message to match the existing format.

Both are minor — the functional changes all look good.

cristianoveiga · 2026-04-11T11:32:43Z

I initially had it in a separate cloud_integration_test.go, but I decided to include it in the existing file because I felt we didn't have similar v2 tests yet to determine the ideal file structure for guest-cluster-level validations. So I intentionally deferred that decision until we have more tests migrated to v2.

That said, I'm happy to move this to a new file now if you have a specific preference.

cblecker · 2026-04-11T16:37:57Z

Feedback on naming: the v2 framework is a clean slate and doesn't use "guest cluster" terminology anywhere (the only "guest" reference is the AWS env var AWS_GUEST_INFRA_CREDENTIALS_FILE which comes from external convention). The v1 framework has WaitForGuestClient/guestClient heavily, but the project's preferred terminology is "hosted cluster" and "control plane" — see AGENTS.md which consistently uses these terms and never says "guest cluster."

Since v2 is the chance to get this right, I'd suggest renaming:

GetGuestClient() → GetHostedClusterClient()
guestClient / guestClientOnce fields → hostedClusterClient / hostedClusterClientOnce
The docstring: "guest cluster" → "hosted cluster"
Variable names in the tests: guestClient → hostedClusterClient

Separately, per my earlier question about file organization — I'd recommend moving GCPCloudControllerManagerTest out of control_plane_workloads_test.go into a new hosted_cluster_ccm_test.go file. The tests in control_plane_workloads_test.go all validate properties of workloads running in the control plane namespace (management cluster side), but the CCM tests validate node state on the hosted cluster — a different layer entirely.

A feature-scoped file (rather than a monolithic hosted_cluster_test.go) sets a good convention as more hosted-cluster-side tests get added. control_plane_workloads_test.go is already 850 lines with 13 test functions, and the v1 framework's approach of smaller domain-specific files (karpenter, autoscaling, OLM, etc.) has scaled better than large monoliths. If other platform CCM tests or the descoped LoadBalancer tests land later, they can share this file or get their own.

The structure would follow the existing v2 pattern:

RegisterHostedClusterCCMTests(getTestCtx) registration function
var _ = Describe("Hosted Cluster CCM", Label("hosted-cluster-ccm"), ...) top-level block
Platform-specific test functions nested inside with BeforeEach skip guards

The GCP CCM workload registry entry should stay in workload_registry.go — that's the right place for it. Only the behavioral test function would move.

- Rename GetGuestClient() to GetHostedClusterClient() to align with v2 framework terminology (hosted cluster, not guest cluster)
- Move GCPCloudControllerManagerTest from control_plane_workloads_test.go to hosted_cluster_ccm_test.go with RegisterHostedClusterCCMTests registration pattern, separating hosted cluster validation from management cluster workload tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cristianoveiga · 2026-04-12T15:08:44Z

Both good suggestions - Implemented.

cristianoveiga · 2026-04-12T15:23:19Z

/test e2e-v2-gke

cblecker · 2026-04-13T03:57:00Z

/lgtm
/approve
/verified by e2e

openshift-ci-robot · 2026-04-13T03:57:12Z

@cblecker: This PR has been marked as verified by e2e.

Details

In response to this:

/lgtm
/approve
/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-bot · 2026-04-13T03:57:14Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

openshift-ci · 2026-04-13T03:57:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cblecker, cristianoveiga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cblecker]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot · 2026-04-13T05:30:11Z

/retest-required

Remaining retests: 0 against base HEAD 783f795 and 2 for PR HEAD f150069 in total

hypershift-jira-solve-ci · 2026-04-13T08:23:37Z

The HostedCluster0 conditions failure is a separate test framework issue — after all Main tests complete, the framework checks that the cluster is in its initial "not ready" state, but by this point the cluster version 4.21.0-0.ci-2026-04-12-082909 was fully applied. This is a known test framework expectation mismatch, not a product bug.

Test Failure Analysis Complete (Multi-Step)

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-21
Build ID: 2043562443174580224
Target: e2e-aws-4-21
PR: #7840 — GCP-368: add GCP CCM v2 e2e tests
Failed Steps: 1 (hypershift-aws-run-e2e-nested — test phase)
Test Results: 474 tests, 36 skipped, 4 failures (all from 1 root test)

Failed Step Analysis

Step: `hypershift-aws-run-e2e-nested` (test phase)

Root Failing Test: `TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile`

Duration: 1113.07s (18m33s)

Error

eventually.go:259: Failed to get **v1.ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Summary

The test TestNTOPerformanceProfile creates a PerformanceProfile CR (via ConfigMap pp-test) and attaches it to a NodePool. It then verifies that the Node Tuning Operator (NTO), running inside the hosted control plane, processes the PerformanceProfile and creates a status ConfigMap with the label hypershift.openshift.io/nto-generated-performance-profile-status: "true" in the control plane namespace e2e-clusters-b454p-node-pool-5z79c.

The test proceeds in two phases:

Phase 1 — PerformanceProfile ConfigMap mirroring (PASSED in 3s): The hypershift-operator successfully mirrored the PerformanceProfile config into the HCP namespace with the label hypershift.openshift.io/performanceprofile-config: "true". This confirms the hypershift-operator nodepool controller is working correctly.
Phase 2 — NTO status ConfigMap creation (FAILED after 10m): The test waited 10 minutes for the NTO in the hosted control plane to process the PerformanceProfile and create a status ConfigMap. This ConfigMap was never created. During the wait, the API client also hit rate limiting (client rate limiter Wait returned an error: context deadline exceeded), indicating high API server load.

The root cause is that the Node Tuning Operator in the hosted control plane failed to generate the PerformanceProfile status ConfigMap within the 10-minute timeout. This is an NTO-side issue — the hypershift-operator code (in SetPerformanceProfileConditions() at nto.go:323) explicitly logs and tolerates the absence of this ConfigMap because "it might take some time for NTO to generate the ConfigMap with the PerformanceProfile status."

Evidence

Test passed Phase 1 — PerformanceProfile ConfigMap mirrored successfully:

nodepool_nto_performanceprofile_test.go:112: Successfully waited for performance profile ConfigMap
to exist with correct name labels and annotations in 3s

Test failed Phase 2 — NTO status ConfigMap never appeared:

nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap
to exist in 10m0s: context deadline exceeded

API client rate limiting during the wait:

eventually.go:259: Failed to get *v1.ConfigMap: client rate limiter Wait returned an error:
context deadline exceeded

Hosted cluster was otherwise healthy — 470 out of 474 tests passed, including other NTO-related tests (TestNTOMachineConfigAppliedInPlace passed in 663.18s).
PR #7840 (GCP-368: add GCP CCM v2 e2e tests) is not the cause. The PR modifies only 3 files (test/e2e/v2/internal/test_context.go, test/e2e/v2/internal/workload_registry.go, test/e2e/v2/tests/hosted_cluster_ccm_test.go), all of which add GCP CCM v2 tests. None of these files interact with NTO, PerformanceProfiles, ConfigMaps, or the NodePool controller.

Cascading Failures

The remaining 3 failures are structural cascades from the root TestNTOPerformanceProfile failure:

| Test | Duration | Cause |
| --- | --- | --- |
| `TestNodePool/HostedCluster0/Main` | 0.02s | Parent of the failing subtest |
| `TestNodePool/HostedCluster0` | 3455.93s | Framework post-condition check failed: the `EnsureHostedCluster` phase expected the cluster to be in a fresh/progressing state but found it fully ready (`ClusterVersionAvailable=True`, `ClusterVersionProgressing=False`). This is a test framework expectation mismatch for 4.21 clusters, not a product bug. |
| `TestNodePool` | 0.00s | Parent of HostedCluster0 |

Aggregated Root Cause

Failed Steps Summary

| Step | One-line Failure |
| --- | --- |
| `TestNTOPerformanceProfile` | NTO in hosted control plane failed to generate PerformanceProfile status ConfigMap within 10m timeout |

Root Cause Hypothesis

This failure comes from a pre-existing flaky test and is unrelated to PR #7840. The Node Tuning Operator running inside the hosted control plane (e2e-clusters-b454p-node-pool-5z79c) did not create the expected PerformanceProfile status ConfigMap within 10 minutes. Contributing factors:

- NTO processing delay: The NTO must detect the mirrored PerformanceProfile ConfigMap, process it, generate a MachineConfig, apply it to nodes, and then create a status ConfigMap reflecting the result. On a CI cluster with 20 parallel tests, this chain can be delayed by resource pressure.
- API client rate limiting: The "client rate limiter Wait returned an error: context deadline exceeded" message suggests that the management cluster API server may have been under heavy load, which would affect both the test's polling and the NTO's ability to operate. The error itself is raised on the client side when the request context expires while waiting on the rate limiter, so it is supporting evidence rather than proof of server load.
- No code change correlation: PR #7840 adds GCP-specific CCM tests in the v2 test framework. It does not modify any controllers, APIs, or infrastructure code that could affect NTO behavior, PerformanceProfile processing, or ConfigMap creation.

Recommendations

- Retrigger the job: This is a flaky NTO timing issue, not a regression from PR #7840.
- Consider increasing the timeout for the status ConfigMap check in nodepool_nto_performanceprofile_test.go:159 (currently 10 minutes via EventuallyObjects default). Under heavy CI load with 20 parallel tests, NTO may need more time.
- Investigate NTO logs: To confirm the root cause of the NTO delay, check NTO pod logs in the HCP namespace e2e-clusters-b454p-node-pool-5z79c for errors during PerformanceProfile processing (available in the must-gather dump at dump-management-cluster artifacts).

Artifacts

Test artifacts: .work/prow-job-analyze-test-failure/2043562443174580224/logs/
Build log: .work/prow-job-analyze-test-failure/2043562443174580224/logs/build-log.txt
JUnit XML: .work/prow-job-analyze-test-failure/2043562443174580224/logs/junit_operator.xml
Prow Job URL: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/7840/pull-ci-openshift-hypershift-main-e2e-aws-4-21/2043562443174580224

openshift-merge-bot · 2026-04-13T13:29:08Z

/retest-required

Remaining retests: 0 against base HEAD 72647a4 and 1 for PR HEAD f150069 in total

codecov · 2026-04-13T15:25:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.74%. Comparing base (c25481f) to head (f150069).
⚠️ Report is 189 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7840      +/-   ##
==========================================
+ Coverage   26.56%   29.74%   +3.18%     
==========================================
  Files        1087     1099      +12     
  Lines      105042   108949    +3907     
==========================================
+ Hits        27902    32409    +4507     
+ Misses      74731    73853     -878     
- Partials     2409     2687     +278

openshift-ci · 2026-04-13T15:52:08Z

@cristianoveiga: all tests passed!

Add capability-based Azure e2e tests to the shared test/e2e/v2/tests/ binary, following the GKE CCM pattern (PR openshift#7840). Tests self-select via Skip() based on cluster capabilities instead of using a separate binary.

Three test groups with Ginkgo label filters for CI:
- AzurePublicClusterTest (self-managed-azure-public): workload identity webhook mutation, KAS allowed CIDRs, ingress operator configuration
- AzurePrivateTopologyTest (self-managed-azure-private): private-router internal LB annotation, PLS CR with alias, private endpoint IP, DNS zone
- AzureOAuthLoadBalancerTest (self-managed-azure-oauth-lb): OAuth LB service creation and OAuth token flow validation

Skip logic:
- Platform type (AzurePlatform)
- Azure topology (AzureTopologyPrivate for private tests)
- OAuth publishing strategy (LoadBalancer for OAuth LB tests)

Also registers Azure-specific env vars (AZURE_PRIVATE_NAT_SUBNET_ID, AZURE_PRIVATE_ADDITIONAL_ALLOWED_SUBSCRIPTIONS) in the shared env var registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot added the jira/valid-reference label Mar 2, 2026

openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/needs-area labels Mar 2, 2026

openshift-ci bot added the area/testing label and removed the do-not-merge/needs-area label Mar 2, 2026

coderabbitai bot reviewed Mar 2, 2026

Comment thread test/e2e/v2/internal/test_context.go Outdated

Comment thread test/e2e/v2/tests/cloud_integration_test.go Outdated

Comment thread test/e2e/v2/tests/cloud_integration_test.go Outdated

cristianoveiga force-pushed the GCP-368 branch from 036769f to a0f6b2d Compare March 27, 2026 19:59

cristianoveiga changed the title ~~GCP-368: add GCP CCM v2 e2e tests (GCP-368)~~ GCP-368: feat(e2e): add GCP CCM v2 e2e tests Mar 27, 2026

cristianoveiga changed the title ~~GCP-368: feat(e2e): add GCP CCM v2 e2e tests~~ GCP-368: add GCP CCM v2 e2e tests Mar 27, 2026

cristianoveiga marked this pull request as ready for review April 10, 2026 14:54

openshift-ci bot removed the do-not-merge/work-in-progress label Apr 10, 2026

openshift-ci bot requested review from cblecker and muraee April 10, 2026 14:54

cblecker reviewed Apr 10, 2026

cristianoveiga force-pushed the GCP-368 branch from dd03e90 to f744901 Compare April 10, 2026 19:16

fix(e2e): match skip message format for GCP CCM test

e738166

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cristianoveiga force-pushed the GCP-368 branch from 2aa8a71 to f150069 Compare April 12, 2026 15:04

openshift-ci bot assigned cblecker Apr 13, 2026

openshift-ci-robot added the verified label Apr 13, 2026

openshift-ci bot added the lgtm label Apr 13, 2026

openshift-ci bot added the approved label Apr 13, 2026

openshift-merge-bot bot merged commit 36ccecd into openshift:main Apr 13, 2026
30 checks passed

hypershift-jira-solve-ci bot commented Apr 13, 2026 (edited by openshift-ci bot)

Step: `hypershift-aws-run-e2e-nested` (test phase)

Root Failing Test: `TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile`