GCP-368: add GCP CCM v2 e2e tests #7840

Merged
openshift-merge-bot[bot] merged 4 commits into openshift:main from cristianoveiga:GCP-368
Apr 13, 2026

Conversation

@cristianoveiga
Contributor

@cristianoveiga cristianoveiga commented Mar 2, 2026

What this PR does / why we need it:

Adds v2 e2e tests validating GCP Cloud Controller Manager node initialization. Changes:

  • Workload registry: Register gcp-cloud-controller-manager deployment so existing workload tests (resource requests, security contexts, etc.) automatically cover it
  • Guest cluster client: Add GetGuestClient() to TestContext for tests that need to inspect guest cluster state
  • GCP CCM tests: Add GCPCloudControllerManagerTest to control_plane_workloads_test.go validating:
    • ProviderID assignment (gce://<project>/<zone>/<instance>)
    • Zone/region topology labels
    • Uninitialized taint removal

Tests are GCP-specific and are skipped on other platforms via a BeforeEach guard. This was confirmed on an AWS CI run, where they were properly skipped with 0 failures.
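
For illustration, the providerID format listed above can be checked with a small parser. This is a minimal sketch and not the code from this PR: the type gceProviderID, the function parseGCEProviderID, and the sample project, zone, and instance names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// gceProviderID holds the three components of a GCE providerID (hypothetical type).
type gceProviderID struct {
	Project, Zone, Instance string
}

// parseGCEProviderID splits "gce://<project>/<zone>/<instance>" into its parts
// and rejects IDs with a different prefix or a missing component.
func parseGCEProviderID(id string) (gceProviderID, error) {
	const prefix = "gce://"
	if !strings.HasPrefix(id, prefix) {
		return gceProviderID{}, fmt.Errorf("providerID %q does not start with %q", id, prefix)
	}
	parts := strings.Split(strings.TrimPrefix(id, prefix), "/")
	if len(parts) != 3 {
		return gceProviderID{}, fmt.Errorf("providerID %q should have format gce://<project>/<zone>/<instance>", id)
	}
	for _, p := range parts {
		if p == "" {
			return gceProviderID{}, fmt.Errorf("providerID %q has an empty component", id)
		}
	}
	return gceProviderID{Project: parts[0], Zone: parts[1], Instance: parts[2]}, nil
}

func main() {
	// Sample values are made up for illustration only.
	id, err := parseGCEProviderID("gce://my-project/us-central1-a/worker-0")
	fmt.Println(id, err)
}
```

Trimming the prefix before splitting keeps the component count at three, which is easier to reason about than counting the empty segments produced by the "//" in the scheme.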

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

  • LoadBalancer service provisioning tests were descoped to a follow-up card

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 2, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@cristianoveiga: This pull request references GCP-368 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and create cloud integration tests validating CCM node initialization and LoadBalancer provisioning.

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: fc6cf417-056c-490d-8cbb-acaf3ed2f137

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds lazy-initialized guest cluster client accessor to TestContext, registers GCP cloud controller manager workload in control plane, and introduces comprehensive cloud integration test suite for GCP validating CCM functionality, node initialization, topology labeling, taint removal, and LoadBalancer provisioning.

Changes

Cohort / File(s): Summary

  • Guest Client Support (test/e2e/v2/internal/test_context.go): Implements lazy-initialized guest cluster client accessor via GetGuestClient() method. Retrieves kubeconfig from HostedCluster secret, builds REST configuration, and caches client using sync.Once for thread-safe access. Returns nil if prerequisites unmet or on errors.
  • Control Plane Configuration (test/e2e/v2/internal/workload_registry.go): Adds gcp-cloud-controller-manager deployment as new GCP platform control plane workload with cloud-controller-manager pod selector.
  • Cloud Integration Tests (test/e2e/v2/tests/cloud_integration_test.go): Introduces comprehensive GCP cloud integration test suite validating CCM-driven functionality: providerID format/extraction (gce:// format), node topology labeling (zone/region), uninitialized taint removal, and LoadBalancer service provisioning with external IP polling and cleanup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Test Structure And Quality (⚠️ Warning)
    • Explanation: Test code fails quality requirements: node-state assertions lack Eventually() wrappers, LoadBalancer test uses fixed service name without unique generation, and GetGuestClient uses sync.Once causing permanent nil caching.
    • Resolution: Wrap node validations in Eventually() blocks with timeouts, generate unique service names using timestamp/process ID, and replace sync.Once with sync.Mutex for retry support.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change: adding GCP CCM v2 end-to-end tests, which aligns with all three modified files (guest client addition, workload registry update, and new test suite).
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
  • Stable And Deterministic Test Names (✅ Passed): All test titles in cloud_integration_test.go are stable and deterministic with only static descriptive strings; no dynamic information like timestamps, UUIDs, or IP addresses present.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
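
The retry behavior that the warning asks for (Gomega's Eventually) can be approximated with the standard library alone. The following is a rough sketch of that idea, not the PR's code and not Gomega's implementation: pollUntil and succeedAfter are hypothetical helpers, and the timeouts are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil calls check every interval until it returns nil or the timeout
// elapses. On timeout it returns the last error from check, wrapped.
func pollUntil(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// succeedAfter returns a check that fails for its first n calls and then
// succeeds. It simulates nodes that converge after a few polls (demo helper).
func succeedAfter(n int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls <= n {
			return errors.New("node not initialized yet")
		}
		return nil
	}
}

func main() {
	calls := 0
	err := pollUntil(time.Second, 10*time.Millisecond, succeedAfter(2, &calls))
	fmt.Println("calls:", calls, "err:", err)
}
```

In a real test the check function would re-list the nodes on every call, so that each retry sees fresh state rather than a stale snapshot.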

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Mar 2, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 2, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 2, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@cristianoveiga: This pull request references GCP-368 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Register gcp-cloud-controller-manager in the v2 workload registry, add a guest cluster client to TestContext, and create cloud integration tests validating CCM node initialization and LoadBalancer provisioning.

Which issue(s) this PR fixes:

Fixes GCP-368

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

Release Notes

  • Tests
  • Added comprehensive end-to-end cloud integration test suite for GCP platform, including validation of Cloud Controller Manager node initialization with proper provider ID assignment, node topology label verification, automatic taint removal, and LoadBalancer external IP provisioning.
  • Enhanced testing infrastructure with improved guest cluster client access and cloud platform workload registry.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/v2/internal/test_context.go`:
- Around line 85-117: Replace the one-time initialization that uses
guestClientOnce with a mutex-based retryable init: change the field
guestClientOnce (sync.Once) to guestClientMu (sync.Mutex), then in the
GetGuestClient (method containing the closure) first return tc.guestClient if
non-nil, otherwise lock tc.guestClientMu, re-check tc.guestClient (to avoid
races), then attempt to load the hosted cluster kubeconfig, create the REST
config and crclient as before, and only set tc.guestClient when client creation
succeeds; always defer tc.guestClientMu.Unlock() after locking so failed
attempts don't permanently block retries. Use the existing symbols hc,
kubeconfigSecret, clientcmd.RESTConfigFromKubeConfig, and crclient.New to locate
the initialization logic to modify.

In `@test/e2e/v2/tests/cloud_integration_test.go`:
- Around line 130-133: Replace the fixed testServiceName constant to generate a
per-test unique name (e.g., using the test's name or a UUID) instead of
"ccm-lb-test" and use that generated serviceName variable wherever the service
is created; keep testNamespace as "default". Also update the cleanup logic to
delete the service by that exact generated serviceName so leftover resources
don't collide across retries/parallel runs. Ensure all references that
previously used testServiceName are updated to the new variable.
- Around line 65-127: The tests perform immediate assertions on node state (in
the It blocks that call testCtx.GetGuestClient(), list nodes into nodes :=
&corev1.NodeList{}, and iterate nodes.Items) which can race with CCM
convergence; change each test ("should set providerID...", "should set zone and
region...", "should remove the uninitialized taint...") to wrap the node
validations inside a Gomega Eventually that repeatedly lists nodes and asserts
all nodes satisfy the required conditions (providerID format checks referencing
hc.Spec.Platform.GCP.Project, topology.kubernetes.io/zone/region presence and
non-empty, and absence of taint key
node.cloudprovider.kubernetes.io/uninitialized) until success or timeout; ensure
the closure re-fetches nodes via guestClient.List and returns no failure until
every node passes so transient failures are retried.
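
The first suggestion above (a retryable initializer instead of sync.Once) could look roughly like the sketch below. It is an assumption-laden illustration, not what this PR merged: client, lazyClient, and flakyBuilder are hypothetical stand-ins, and the sketch takes the lock on every call rather than reading the cached field first, which avoids an unsynchronized read.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// client is a stand-in for the real controller-runtime client (hypothetical type).
type client struct{ name string }

// lazyClient caches a client after the first successful build and retries
// the build on later calls if earlier attempts failed.
type lazyClient struct {
	mu     sync.Mutex
	cached *client
	build  func() (*client, error)
}

// Get returns the cached client, building it under the lock if needed.
// A failed build leaves cached nil, so the next call tries again.
func (l *lazyClient) Get() (*client, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cached != nil {
		return l.cached, nil
	}
	c, err := l.build()
	if err != nil {
		return nil, err
	}
	l.cached = c
	return c, nil
}

// flakyBuilder fails for its first n calls, then succeeds (demo helper).
func flakyBuilder(n int, calls *int) func() (*client, error) {
	return func() (*client, error) {
		*calls++
		if *calls <= n {
			return nil, errors.New("kubeconfig not ready")
		}
		return &client{name: "hosted"}, nil
	}
}

func main() {
	calls := 0
	lc := &lazyClient{build: flakyBuilder(1, &calls)}
	_, err := lc.Get() // first attempt fails and nothing is cached
	fmt.Println("first:", err)
	c, err := lc.Get() // second attempt succeeds and the result is cached
	fmt.Println("second:", c.name, err)
}
```

With sync.Once, the first failed attempt would be the only attempt; here a failure costs one extra build call on the next Get.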

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cce0243 and 036769f.

📒 Files selected for processing (3)
  • test/e2e/v2/internal/test_context.go
  • test/e2e/v2/internal/workload_registry.go
  • test/e2e/v2/tests/cloud_integration_test.go

Comment thread test/e2e/v2/internal/test_context.go Outdated
Comment thread test/e2e/v2/tests/cloud_integration_test.go Outdated
Comment thread test/e2e/v2/tests/cloud_integration_test.go Outdated
Register gcp-cloud-controller-manager in the v2 workload registry,
add a guest cluster client to TestContext, and add GCP CCM tests to
control_plane_workloads_test.go validating node initialization
(providerID, topology labels, taint removal).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cristianoveiga
Contributor Author

/test e2e-v2-aws

@cristianoveiga cristianoveiga changed the title from "GCP-368: add GCP CCM v2 e2e tests (GCP-368)" to "GCP-368: feat(e2e): add GCP CCM v2 e2e tests" Mar 27, 2026
@cristianoveiga cristianoveiga changed the title from "GCP-368: feat(e2e): add GCP CCM v2 e2e tests" to "GCP-368: add GCP CCM v2 e2e tests" Mar 27, 2026
@cristianoveiga
Contributor Author

/test ?

@cristianoveiga
Contributor Author

/test e2e-v2-gke

@cristianoveiga
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristianoveiga cristianoveiga marked this pull request as ready for review April 10, 2026 14:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@openshift-ci openshift-ci bot requested review from cblecker and muraee April 10, 2026 14:54
Member

@cblecker cblecker left a comment

Review Summary

Overall the PR is well-structured and follows established codebase patterns. The workload registry entry correctly mirrors other platform-specific CCMs (AWS, Azure, KubeVirt), and registering gcp-cloud-controller-manager automatically enables 10+ existing workload compliance tests (resource requests, pull policy, read-only root filesystem, safe-to-evict, etc.) — significant coverage gain. The 3 behavioral tests validate the core CCM node initialization contract (providerID, topology labels, taint removal). CI is passing on both AWS (properly skipped) and GKE (executed).
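
As a rough illustration of the node initialization contract mentioned above, the three checks can be expressed over a simplified node record. This is a sketch under stated assumptions: the node struct is a hypothetical stand-in for corev1.Node, checkInitialized is not a function from this PR, and the label and taint keys are the ones quoted in the CodeRabbit review earlier in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	zoneLabel          = "topology.kubernetes.io/zone"
	regionLabel        = "topology.kubernetes.io/region"
	uninitializedTaint = "node.cloudprovider.kubernetes.io/uninitialized"
)

// node is a simplified stand-in for corev1.Node (hypothetical type).
type node struct {
	Name       string
	ProviderID string
	Labels     map[string]string
	TaintKeys  []string
}

// checkInitialized returns the first violation of the CCM node
// initialization contract for n, or nil if the node looks initialized.
func checkInitialized(n node, project string) error {
	if !strings.HasPrefix(n.ProviderID, "gce://"+project+"/") {
		return fmt.Errorf("node %s providerID %q should reference project %s", n.Name, n.ProviderID, project)
	}
	for _, key := range []string{zoneLabel, regionLabel} {
		if n.Labels[key] == "" {
			return fmt.Errorf("node %s is missing label %s", n.Name, key)
		}
	}
	for _, key := range n.TaintKeys {
		if key == uninitializedTaint {
			return fmt.Errorf("node %s still has taint %s", n.Name, key)
		}
	}
	return nil
}

func main() {
	// Sample values are made up for illustration only.
	n := node{
		Name:       "worker-0",
		ProviderID: "gce://my-project/us-central1-a/worker-0",
		Labels:     map[string]string{zoneLabel: "us-central1-a", regionLabel: "us-central1"},
	}
	fmt.Println(checkInitialized(n, "my-project"))
}
```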

Items

Should fix:

  • GetGuestClient() docstring says "returns nil" but actually panics on most error paths — update to document actual behavior
  • context.Background() used instead of tc.Context, inconsistent with GetHostedCluster() — won't respect test timeouts
  • hc.Spec.Platform.GCP.Project accessed without nil-check on the GCP pointer field — add defensive Expect
  • workload_registry.go file header claims "generated, do not edit manually" but is routinely hand-edited — remove or fix the header

Suggestions:

  • Repeated setup boilerplate across all 3 It blocks could be lifted into a BeforeEach
  • Assertion message "guest client is required" could include diagnostic context
  • providerID error message format could match the inline comment format

Comment thread test/e2e/v2/internal/test_context.go Outdated
// GetGuestClient returns a controller-runtime client for the guest cluster.
// It reads the kubeconfig from the secret referenced by the HostedCluster status.
// The client is lazily initialized and cached.
// Returns nil if the guest client cannot be created (e.g., HostedCluster not ready).

Member

This docstring is inaccurate. The method panics on most error paths (secret fetch failure, missing kubeconfig key, REST config creation, client creation) — it only returns nil when hc == nil or hc.Status.KubeConfig == nil.

Suggest updating to match the actual behavior (and mirror the GetHostedCluster() pattern):

// Returns nil if the HostedCluster is not available or its KubeConfig status is not set.
// Panics on any other initialization failure (e.g., kubeconfig secret not found, invalid kubeconfig data).

Contributor Author

Good catch - this was left over from the initial implementation. I will update it.

Contributor Author

Done!

Comment thread test/e2e/v2/internal/test_context.go Outdated
}

var kubeconfigSecret corev1.Secret
err := tc.MgmtClient.Get(context.Background(), crclient.ObjectKey{

Member

GetHostedCluster() uses tc.Context for its API call (line 57), but this uses context.Background(). This means the kubeconfig secret fetch won't respect test timeout/cancellation.

Suggest using tc.Context for consistency:

err := tc.MgmtClient.Get(tc.Context, crclient.ObjectKey{
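
To illustrate why the choice of context matters, here is a small standard-library sketch. It does not use the real test framework: fetchSecret simulates a slow API call, and the context created in demo is a hypothetical stand-in for tc.Context with an arbitrary 20 ms timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchSecret simulates a slow API call that honors context cancellation.
func fetchSecret(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timedOut reports whether err was caused by a context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// demo runs the same slow call with a bounded context and with
// context.Background(), and returns both results.
func demo() (bounded, unbounded error) {
	// Stand-in for tc.Context: a context limited by a test timeout.
	testCtx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()

	bounded = fetchSecret(testCtx, 100*time.Millisecond)
	unbounded = fetchSecret(context.Background(), 100*time.Millisecond)
	return bounded, unbounded
}

func main() {
	bounded, unbounded := demo()
	fmt.Println("bounded context timed out:", timedOut(bounded))
	fmt.Println("background context error:", unbounded)
}
```

The call made with the bounded context is cut off at the deadline, while the call made with context.Background() runs to completion regardless of any test timeout.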

Contributor Author

Good call - updated it.

Expect(nodes.Items).NotTo(BeEmpty(), "cluster should have nodes")

hc := testCtx.GetHostedCluster()
gcpProject := hc.Spec.Platform.GCP.Project

Member

GCP is a pointer field (*GCPPlatformSpec). While the BeforeEach guard checks Platform.Type == GCPPlatform, a nil GCP field would cause a raw nil pointer panic here with no useful diagnostic. Adding a defensive check produces a clear failure message:

Expect(hc.Spec.Platform.GCP).NotTo(BeNil(), "GCP platform spec must be set for GCP HostedCluster %s/%s", hc.Namespace, hc.Name)
gcpProject := hc.Spec.Platform.GCP.Project
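
The same defensive pattern, shown outside Gomega as a plain Go sketch. The types below are simplified, hypothetical stand-ins for the HostedCluster platform spec, and gcpProject is not a function from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the HostedCluster platform types (hypothetical shapes).
type gcpPlatformSpec struct{ Project string }

type platformSpec struct {
	Type string
	GCP  *gcpPlatformSpec // pointer field: may be nil even when Type is "GCP"
}

var errMissingGCPSpec = errors.New("GCP platform spec must be set for a GCP HostedCluster")

// gcpProject returns the project name, or a descriptive error instead of
// letting a nil pointer dereference panic without context.
func gcpProject(p platformSpec) (string, error) {
	if p.GCP == nil {
		return "", errMissingGCPSpec
	}
	return p.GCP.Project, nil
}

func main() {
	_, err := gcpProject(platformSpec{Type: "GCP"})
	fmt.Println("nil spec:", err)

	// The project name is made up for illustration only.
	project, _ := gcpProject(platformSpec{Type: "GCP", GCP: &gcpPlatformSpec{Project: "my-project"}})
	fmt.Println("project:", project)
}
```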

Contributor Author

addressed

Member

Pre-existing issue, but worth fixing in this PR since the file is being edited: the file header (lines 3-4) says "This file is generated. Do not edit manually." and references a script at /tmp/generate_workloads.go that doesn't exist in the repository. The output filename referenced (generated_workloads.go) also doesn't match the actual filename (workload_registry.go). The file has been manually edited in multiple commits including this one.

Suggest removing those two lines or replacing with something accurate like:

// This file defines the control plane workload registry.
// Add new workload entries here when onboarding new components.

Contributor Author

Done! I'm guessing this was a "one-time" generator just to get the first version of this file in place/migrated?

Member

That's my suspicion too from when @csrwng created it

Comment on lines +832 to +836
Context("When nodes are initialized by the CCM", func() {
It("should set providerID on all nodes", func() {
testCtx := getTestCtx()
guestClient := testCtx.GetGuestClient()
Expect(guestClient).NotTo(BeNil(), "guest client is required")

Member

suggestion: The setup block (get testCtx, get guestClient, assert not nil, list nodes, assert not empty) is repeated identically in all 3 It blocks. Consider lifting the shared setup into a BeforeEach in this Context, which is the idiomatic Ginkgo pattern used by other tests in this file (e.g., SecurityContextUIDTest).

Also, if GetGuestClient() returns nil, the assertion message "guest client is required" doesn't help diagnose why. Something like "guest client is nil; HostedCluster may not have KubeConfig status set" would save debugging time.

Contributor Author

Done! Moved the duplicated code into BeforeEach and improved the assertion message.

"node %s providerID should reference project %s", node.Name, gcpProject)
parts := strings.Split(node.Spec.ProviderID, "/")
Expect(parts).To(HaveLen(5),
"node %s providerID should have format gce://project/zone/instance", node.Name)

Member

nit: The error message says gce://project/zone/instance but the inline comment on line 848 uses the more precise gce://<project>/<zone>/<instance-name>. Consider matching the comment format in the error message for clarity during failure triage:

"node %s providerID should have format gce://<project>/<zone>/<instance-name>", node.Name)

Contributor Author

Done!

- Fix GetGuestClient() docstring to reflect panic behavior
- Use tc.Context instead of context.Background() for consistency
- Add nil check on hc.Spec.Platform.GCP before accessing Project
- Remove stale "generated" file header from workload_registry.go
- Lift shared test setup into BeforeEach with better error message
- Fix providerID error message format to match comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cristianoveiga
Contributor Author

/test e2e-v2-gke

@cblecker
Member

Follow-up: a couple of minor pattern consistency items I noticed after looking at the updated diff more closely.

1. Labels on Context node

The new test uses Label("GCP", "CCM") on its Context:

Context("GCP Cloud Controller Manager", Label("GCP", "CCM"), func() {

No other test function in this file uses Labels on Context nodes — PodAffinitiesAndTolerationsTest (AWS) and SecurityContextUIDTest (Azure) both use plain Context(...). Labels in this file only appear on the top-level Describe and one It block ("Informing"). Consider removing the labels for consistency, or if they're intentional for filtering, that's fine too — just flagging the deviation.

2. Skip message format

Existing platform-skip messages follow a consistent pattern:

  • "Pod affinities and tolerations test is only for AWS platform"
  • "Security context UID test is only for Azure platform"

The new code uses:

"Test requires a GCP HostedCluster"

Consider matching the existing format, e.g.: "GCP Cloud Controller Manager test is only for GCP platform"

Both are minor — the functional changes all look good.

@cblecker
Member

One more question: the other tests registered via RegisterControlPlaneWorkloadsTests all validate properties of control plane workloads (deployments/pods in the management cluster) — things like resource requests, pull policy, security contexts, etc. The GCP CCM test is different in that it validates guest cluster node state (providerID, topology labels, taints).

Was there a deliberate reason to put it in control_plane_workloads_test.go rather than a separate test file? It works fine here, just curious if you considered splitting it out since it's testing a different layer (guest cluster effects vs. management cluster workload properties).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cristianoveiga
Contributor Author

> Follow-up: a couple of minor pattern consistency items I noticed after looking at the updated diff more closely.
>
> 1. Labels on Context node
>
> The new test uses Label("GCP", "CCM") on its Context:
>
> Context("GCP Cloud Controller Manager", Label("GCP", "CCM"), func() {
>
> No other test function in this file uses Labels on Context nodes — PodAffinitiesAndTolerationsTest (AWS) and SecurityContextUIDTest (Azure) both use plain Context(...). Labels in this file only appear on the top-level Describe and one It block ("Informing"). Consider removing the labels for consistency, or if they're intentional for filtering, that's fine too — just flagging the deviation.

The labels are intentional, yes. I used them to run the tests locally (against a pre-provisioned MC that I had).

They were useful for filtering: --ginkgo.label-filter="GCP && CCM".

> 2. Skip message format
>
> Existing platform-skip messages follow a consistent pattern:
>
>   • "Pod affinities and tolerations test is only for AWS platform"
>   • "Security context UID test is only for Azure platform"
>
> The new code uses:
>
> "Test requires a GCP HostedCluster"
>
> Consider matching the existing format, e.g.: "GCP Cloud Controller Manager test is only for GCP platform"

Updated the message to match the existing format.

> Both are minor — the functional changes all look good.

@cristianoveiga
Contributor Author

I initially had it in a separate cloud_integration_test.go, but I decided to include it in the existing file because I felt we didn't have similar v2 tests yet to determine the ideal file structure for guest-cluster-level validations. So I intentionally deferred that decision until we have more tests migrated to v2.

That said, I'm happy to move this to a new file now if you have a specific preference.


@cblecker
Member

Feedback on naming: the v2 framework is a clean slate and doesn't use "guest cluster" terminology anywhere (the only "guest" reference is the AWS env var AWS_GUEST_INFRA_CREDENTIALS_FILE which comes from external convention). The v1 framework has WaitForGuestClient/guestClient heavily, but the project's preferred terminology is "hosted cluster" and "control plane" — see AGENTS.md which consistently uses these terms and never says "guest cluster."

Since v2 is the chance to get this right, I'd suggest renaming:

  • GetGuestClient() → GetHostedClusterClient()
  • guestClient / guestClientOnce fields → hostedClusterClient / hostedClusterClientOnce
  • The docstring: "guest cluster" → "hosted cluster"
  • Variable names in the tests: guestClient → hostedClusterClient

Separately, per my earlier question about file organization — I'd recommend moving GCPCloudControllerManagerTest out of control_plane_workloads_test.go into a new hosted_cluster_ccm_test.go file. The tests in control_plane_workloads_test.go all validate properties of workloads running in the control plane namespace (management cluster side), but the CCM tests validate node state on the hosted cluster — a different layer entirely.

A feature-scoped file (rather than a monolithic hosted_cluster_test.go) sets a good convention as more hosted-cluster-side tests get added. control_plane_workloads_test.go is already 850 lines with 13 test functions, and the v1 framework's approach of smaller domain-specific files (karpenter, autoscaling, OLM, etc.) has scaled better than large monoliths. If other platform CCM tests or the descoped LoadBalancer tests land later, they can share this file or get their own.

The structure would follow the existing v2 pattern:

  • RegisterHostedClusterCCMTests(getTestCtx) registration function
  • var _ = Describe("Hosted Cluster CCM", Label("hosted-cluster-ccm"), ...) top-level block
  • Platform-specific test functions nested inside with BeforeEach skip guards

The GCP CCM workload registry entry should stay in workload_registry.go — that's the right place for it. Only the behavioral test function would move.

- Rename GetGuestClient() to GetHostedClusterClient() to align with
  v2 framework terminology (hosted cluster, not guest cluster)
- Move GCPCloudControllerManagerTest from control_plane_workloads_test.go
  to hosted_cluster_ccm_test.go with RegisterHostedClusterCCMTests
  registration pattern, separating hosted cluster validation from
  management cluster workload tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cristianoveiga
Contributor Author

Both good suggestions - Implemented.

Feedback on naming: the v2 framework is a clean slate and doesn't use "guest cluster" terminology anywhere (the only "guest" reference is the AWS env var AWS_GUEST_INFRA_CREDENTIALS_FILE which comes from external convention). The v1 framework has WaitForGuestClient/guestClient heavily, but the project's preferred terminology is "hosted cluster" and "control plane" — see AGENTS.md which consistently uses these terms and never says "guest cluster."

Since v2 is the chance to get this right, I'd suggest renaming:

  • GetGuestClient()GetHostedClusterClient()
  • guestClient / guestClientOnce fields → hostedClusterClient / hostedClusterClientOnce
  • The docstring: "guest cluster" → "hosted cluster"
  • Variable names in the tests: guestClienthostedClusterClient

Separately, per my earlier question about file organization — I'd recommend moving GCPCloudControllerManagerTest out of control_plane_workloads_test.go into a new hosted_cluster_ccm_test.go file. The tests in control_plane_workloads_test.go all validate properties of workloads running in the control plane namespace (management cluster side), but the CCM tests validate node state on the hosted cluster — a different layer entirely.

A feature-scoped file (rather than a monolithic hosted_cluster_test.go) sets a good convention as more hosted-cluster-side tests get added. control_plane_workloads_test.go is already 850 lines with 13 test functions, and the v1 framework's approach of smaller domain-specific files (karpenter, autoscaling, OLM, etc.) has scaled better than large monoliths. If other platform CCM tests or the descoped LoadBalancer tests land later, they can share this file or get their own.

The structure would follow the existing v2 pattern:

  • RegisterHostedClusterCCMTests(getTestCtx) registration function
  • var _ = Describe("Hosted Cluster CCM", Label("hosted-cluster-ccm"), ...) top-level block
  • Platform-specific test functions nested inside with BeforeEach skip guards
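
The registration-plus-skip-guard shape can be sketched without Ginkgo as follows. In the real suite this would be expressed with Describe, BeforeEach and Skip; the platformTest type, runForPlatform, the lowercase registerHostedClusterCCMTests and the Azure placeholder entry are hypothetical and only illustrate the control flow.

```go
package main

import "fmt"

// platformTest is a hypothetical registry entry: a test plus the
// platform it applies to.
type platformTest struct {
	name     string
	platform string
	run      func() error
}

// runForPlatform runs the entries that match clusterPlatform and records
// the rest as skipped, which is the role a BeforeEach skip guard plays.
func runForPlatform(clusterPlatform string, tests []platformTest) (ran, skipped []string, err error) {
	for _, t := range tests {
		if t.platform != clusterPlatform {
			skipped = append(skipped, t.name)
			continue
		}
		if err := t.run(); err != nil {
			return ran, skipped, fmt.Errorf("%s: %w", t.name, err)
		}
		ran = append(ran, t.name)
	}
	return ran, skipped, nil
}

// registerHostedClusterCCMTests returns the platform-specific entries.
// The Azure entry is a placeholder showing where another platform would go.
func registerHostedClusterCCMTests() []platformTest {
	return []platformTest{
		{name: "GCP CCM initializes nodes", platform: "GCP", run: func() error { return nil }},
		{name: "Azure CCM placeholder", platform: "Azure", run: func() error { return nil }},
	}
}

func main() {
	// On an AWS cluster every entry is skipped and nothing fails.
	ran, skipped, err := runForPlatform("AWS", registerHostedClusterCCMTests())
	fmt.Println(len(ran), len(skipped), err)
}
```

This mirrors the behavior reported for the AWS CI run in the PR description: the GCP tests are skipped rather than failed.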

The GCP CCM workload registry entry should stay in workload_registry.go — that's the right place for it. Only the behavioral test function would move.

@cristianoveiga
Contributor Author

/test e2e-v2-gke

@cblecker
Member

/lgtm
/approve
/verified by e2e

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 13, 2026
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2026
@openshift-ci-robot

@cblecker: This PR has been marked as verified by e2e.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cblecker, cristianoveiga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2026
@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 783f795 and 2 for PR HEAD f150069 in total

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 13, 2026

The HostedCluster0 conditions failure is a separate test framework issue — after all Main tests complete, the framework checks that the cluster is in its initial "not ready" state, but by this point the cluster version 4.21.0-0.ci-2026-04-12-082909 was fully applied. This is a known test framework expectation mismatch, not a product bug.


Test Failure Analysis Complete (Multi-Step)

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-21
  • Build ID: 2043562443174580224
  • Target: e2e-aws-4-21
  • PR: #7840 (GCP-368: add GCP CCM v2 e2e tests)
  • Failed Steps: 1 (hypershift-aws-run-e2e-nested — test phase)
  • Test Results: 474 tests, 36 skipped, 4 failures (all from 1 root test)

Failed Step Analysis

Step: hypershift-aws-run-e2e-nested (test phase)

Root Failing Test: TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile

Duration: 1113.07s (18m33s)

Error

eventually.go:259: Failed to get **v1.ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Summary

The test TestNTOPerformanceProfile creates a PerformanceProfile CR (via ConfigMap pp-test) and attaches it to a NodePool. It then verifies that the Node Tuning Operator (NTO), running inside the hosted control plane, processes the PerformanceProfile and creates a status ConfigMap with the label hypershift.openshift.io/nto-generated-performance-profile-status: "true" in the control plane namespace e2e-clusters-b454p-node-pool-5z79c.

The test proceeds in two phases:

  1. Phase 1 — PerformanceProfile ConfigMap mirroring (PASSED in 3s): The hypershift-operator successfully mirrored the PerformanceProfile config into the HCP namespace with the label hypershift.openshift.io/performanceprofile-config: "true". This confirms the hypershift-operator nodepool controller is working correctly.
  2. Phase 2 — NTO status ConfigMap creation (FAILED after 10m): The test waited 10 minutes for the NTO in the hosted control plane to process the PerformanceProfile and create a status ConfigMap. This ConfigMap was never created. During the wait, the API client also hit rate limiting (client rate limiter Wait returned an error: context deadline exceeded), indicating high API server load.
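
One way an error of the form "rate limiter Wait ... context deadline exceeded" can arise is sketched below: a client-side limiter makes the caller wait for its next slot, and the caller's context expires before that slot arrives. The intervalLimiter type and the drain helper are hypothetical, written only for illustration; they are not the limiter used by the test framework, and the sketch is not safe for concurrent use.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// intervalLimiter allows one request per interval (hypothetical, not
// goroutine-safe).
type intervalLimiter struct {
	interval time.Duration
	next     time.Time
}

// Wait blocks until the next slot is available or ctx is done.
func (l *intervalLimiter) Wait(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("rate limiter wait: %w", err)
	}
	now := time.Now()
	if l.next.Before(now) {
		l.next = now
	}
	delay := l.next.Sub(now)
	l.next = l.next.Add(l.interval)

	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		// The deadline hit while we were still queued behind the limiter.
		return fmt.Errorf("rate limiter wait: %w", ctx.Err())
	}
}

// drain issues up to n waits under a single timeout and reports how many
// completed and whether the run ended on a deadline.
func drain(n int, interval, timeout time.Duration) (completed int, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	limiter := &intervalLimiter{interval: interval}
	for i := 0; i < n; i++ {
		if err := limiter.Wait(ctx); err != nil {
			return completed, errors.Is(err, context.DeadlineExceeded)
		}
		completed++
	}
	return completed, false
}

func main() {
	completed, timedOut := drain(5, 200*time.Millisecond, 50*time.Millisecond)
	fmt.Println(completed, timedOut)
}
```

In this sketch the deadline error is produced entirely on the client side, so by itself the message is consistent with server load but does not prove it.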

The root cause is that the Node Tuning Operator in the hosted control plane failed to generate the PerformanceProfile status ConfigMap within the 10-minute timeout. This is an NTO-side issue — the hypershift-operator code (in SetPerformanceProfileConditions() at nto.go:323) explicitly logs and tolerates the absence of this ConfigMap because "it might take some time for NTO to generate the ConfigMap with the PerformanceProfile status."

Evidence

  1. Test passed Phase 1 — PerformanceProfile ConfigMap mirrored successfully:

    nodepool_nto_performanceprofile_test.go:112: Successfully waited for performance profile ConfigMap
    to exist with correct name labels and annotations in 3s
    
  2. Test failed Phase 2 — NTO status ConfigMap never appeared:

    nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap
    to exist in 10m0s: context deadline exceeded
    
  3. API client rate limiting during the wait:

    eventually.go:259: Failed to get **v1.ConfigMap: client rate limiter Wait returned an error:
    context deadline exceeded
    
  4. Hosted cluster was otherwise healthy — 470 out of 474 tests passed, including other NTO-related tests (TestNTOMachineConfigAppliedInPlace passed in 663.18s).

  5. PR #7840 (GCP-368: add GCP CCM v2 e2e tests) is not the cause: it modifies only 3 files (test/e2e/v2/internal/test_context.go, test/e2e/v2/internal/workload_registry.go, test/e2e/v2/tests/hosted_cluster_ccm_test.go), which add GCP CCM v2 tests. None of these files interact with NTO, PerformanceProfiles, ConfigMaps, or the NodePool controller.

Cascading Failures

The remaining 3 failures are structural cascades from the root TestNTOPerformanceProfile failure:

| Test | Duration | Cause |
| --- | --- | --- |
| TestNodePool/HostedCluster0/Main | 0.02s | Parent of the failing subtest |
| TestNodePool/HostedCluster0 | 3455.93s | Framework post-condition check failed: the EnsureHostedCluster phase expected the cluster to be in a fresh/progressing state but found it fully ready (ClusterVersionAvailable=True, ClusterVersionProgressing=False). This is a test framework expectation mismatch for 4.21 clusters, not a product bug. |
| TestNodePool | 0.00s | Parent of HostedCluster0 |

Aggregated Root Cause

Failed Steps Summary

| Step | One-line Failure |
| --- | --- |
| TestNTOPerformanceProfile | NTO in hosted control plane failed to generate PerformanceProfile status ConfigMap within 10m timeout |

Root Cause Hypothesis

This failure is a pre-existing flaky test, unrelated to PR #7840. The Node Tuning Operator running inside the hosted control plane (e2e-clusters-b454p-node-pool-5z79c) did not create the expected PerformanceProfile status ConfigMap within 10 minutes. Contributing factors:

  1. NTO processing delay: The NTO must detect the mirrored PerformanceProfile ConfigMap, process it, generate a MachineConfig, apply it to nodes, and then create a status ConfigMap reflecting the result. On a CI cluster with 20 parallel tests, this chain can be delayed by resource pressure.

  2. API client rate limiting: The client rate limiter Wait returned an error: context deadline exceeded message indicates that the management cluster API server was under heavy load, which would affect both the test's polling and the NTO's ability to operate.

  3. No code change correlation: PR #7840 (GCP-368: add GCP CCM v2 e2e tests) adds GCP-specific CCM tests in the v2 test framework. It does not modify any controllers, APIs, or infrastructure code that could affect NTO behavior, PerformanceProfile processing, or ConfigMap creation.

Recommendations

  • Retrigger the job: this is a flaky NTO timing issue, not a regression from PR #7840 (GCP-368: add GCP CCM v2 e2e tests).
  • Consider increasing the timeout for the status ConfigMap check in nodepool_nto_performanceprofile_test.go:159 (currently 10 minutes via EventuallyObjects default). Under heavy CI load with 20 parallel tests, NTO may need more time.
  • Investigate NTO logs — To confirm the root cause of the NTO delay, check NTO pod logs in the HCP namespace e2e-clusters-b454p-node-pool-5z79c for errors during PerformanceProfile processing (available in the must-gather dump at dump-management-cluster artifacts).
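
For the timeout recommendation, the general shape of a poll-until-deadline loop is sketched below, with the timeout as an explicit parameter. pollUntil and waitForAttempts are hypothetical helpers written for this illustration; they are not the EventuallyObjects implementation, whose options are not shown in this thread.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls check immediately and then once per interval until it
// reports done, returns an error, or ctx expires.
func pollUntil(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met before deadline: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// waitForAttempts simulates a resource that only appears on the
// readyAfter-th check, polled under the given timeout.
func waitForAttempts(readyAfter int, interval, timeout time.Duration) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = pollUntil(ctx, interval, func() (bool, error) {
		attempts++
		return attempts >= readyAfter, nil
	})
	return attempts, err
}

func main() {
	// Generous timeout: the condition is met on the third check.
	attempts, err := waitForAttempts(3, time.Millisecond, time.Second)
	fmt.Println(attempts, err)

	// Tight timeout: the condition never becomes true in time.
	_, err = waitForAttempts(1_000_000, 5*time.Millisecond, 30*time.Millisecond)
	fmt.Println(err != nil)
}
```

A longer timeout only helps if the NTO eventually produces the ConfigMap. If it never does, the test simply fails later, which is why checking the NTO logs is still worthwhile.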

Artifacts


@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 72647a4 and 1 for PR HEAD f150069 in total

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.74%. Comparing base (c25481f) to head (f150069).
⚠️ Report is 189 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7840      +/-   ##
==========================================
+ Coverage   26.56%   29.74%   +3.18%     
==========================================
  Files        1087     1099      +12     
  Lines      105042   108949    +3907     
==========================================
+ Hits        27902    32409    +4507     
+ Misses      74731    73853     -878     
- Partials     2409     2687     +278     

@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2026

@cristianoveiga: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 36ccecd into openshift:main Apr 13, 2026
30 checks passed
bryan-cox added a commit to bryan-cox/hypershift that referenced this pull request Apr 13, 2026
Add capability-based Azure e2e tests to the shared test/e2e/v2/tests/
binary, following the GKE CCM pattern (PR openshift#7840). Tests self-select via
Skip() based on cluster capabilities instead of using a separate binary.

Three test groups with Ginkgo label filters for CI:
- AzurePublicClusterTest (self-managed-azure-public): workload identity
  webhook mutation, KAS allowed CIDRs, ingress operator configuration
- AzurePrivateTopologyTest (self-managed-azure-private): private-router
  internal LB annotation, PLS CR with alias, private endpoint IP, DNS zone
- AzureOAuthLoadBalancerTest (self-managed-azure-oauth-lb): OAuth LB
  service creation and OAuth token flow validation

Skip logic:
- Platform type (AzurePlatform)
- Azure topology (AzureTopologyPrivate for private tests)
- OAuth publishing strategy (LoadBalancer for OAuth LB tests)

Also registers Azure-specific env vars (AZURE_PRIVATE_NAT_SUBNET_ID,
AZURE_PRIVATE_ADDITIONAL_ALLOWED_SUBSCRIPTIONS) in the shared env var
registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
