Skip to content

[WIP]test: add NodePool Windows ImageType e2e test#7429

Closed
wangke19 wants to merge 2 commits intoopenshift:mainfrom
wangke19:test-imagetype-scaling
Closed

[WIP]test: add NodePool Windows ImageType e2e test#7429
wangke19 wants to merge 2 commits intoopenshift:mainfrom
wangke19:test-imagetype-scaling

Conversation

@wangke19
Copy link
Copy Markdown
Contributor

@wangke19 wangke19 commented Jan 6, 2026

Summary

Add e2e test to validate that NodePool ImageType=Windows correctly provisions Windows nodes. This ensures Windows ImageType configuration works as expected for ROSA-HCP.

Related JIRA

OCPSTRAT-1949 - Windows License-Included (LI) AMI support for ROSA-HCP

What This Tests

Validates that setting ImageType=Windows on a NodePool results in Windows nodes:

  1. ✅ Create NodePool with ImageType=Windows
  2. ✅ Framework provisions nodes and waits for Ready
  3. ✅ Verify node.Status.NodeInfo.OperatingSystem == "windows"

Test Details

  • Platform: AWS only (ImageType is AWS-specific)
  • Version: 4.19+ (when ImageType was introduced)
  • Pattern: Follows standard e2e test framework (similar to NodePoolArm64CreateTest)
  • Framework: Uses BuildNodePoolManifest - framework handles all node provisioning and waiting
  • Duration: ~10-15 minutes (single Windows node provisioning)

Example Output

✓ NodePool created with ImageType=Windows, Replicas=1
✓ Node ip-10-0-1-23 is running Windows OS
✓ NodePool ImageType test passed successfully

Files Changed

  • Modified: test/e2e/nodepool_imagetype_test.go - Simplified test implementation (75 lines)

Why This Approach

Focused Validation: Tests only the ImageType functionality - that ImageType=Windows provisions Windows nodes.

Leverages Framework: Uses standard e2e framework patterns:

  • BuildNodePoolManifest() creates the NodePool configuration
  • Framework handles node provisioning and Ready waiting
  • Run() validates the Windows OS in node status

Not Testing Scaling: NodePool scaling is orthogonal to ImageType validation. The test creates a single replica NodePool to validate ImageType works correctly.

Checklist

  • Platform-specific with appropriate skip conditions
  • Version-gated for ImageType support (4.19+)
  • Follows established e2e test patterns
  • Leverages framework for node provisioning/waiting
  • Focused test scope (ImageType → Windows OS validation)
  • Simplified and maintainable (~75 lines vs 220 lines in previous approach)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 6, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Jan 6, 2026

@wangke19: This pull request references CNTRLPLANE-753 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add comprehensive e2e test to validate that NodePool ImageType persists correctly through various scaling operations, with proper timeout handling and node readiness validation for Windows nodes.

The test validates ImageType persistence through:

  1. Scale down to 0 replicas - verify ImageType preserved
  2. Scale up to 2 replicas - verify ImageType preserved + nodes ready
  3. Scale down to 1 replica - verify ImageType preserved + node ready
  4. Restore original replicas - verify ImageType preserved + nodes ready

Key improvements for Windows node support:

  • Extended timeout from 5min to 30min for scale-up operations (Windows nodes require 22+ minutes to provision based on test observations)
  • Added explicit node readiness checks using WaitForNReadyNodesWithOptions to ensure nodes are fully provisioned, not just replica counts matching
  • Validates both spec.replicas and status.replicas match expected values

This ensures ImageType field is immutable during scaling and Windows nodes have adequate provisioning time.

Related: CNTRLPLANE-753, OCPSTRAT-1949

Commit-Message-Assisted-by: Claude (via Claude Code)

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 6, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review

Walkthrough

Adds an orchestrated NodePool scaling test that verifies ImageType (Windows) persistence across a sequence of replica changes, updates NodePool manifest initialization and imports, and introduces helper routines for scaling, verification, and event collection.

Changes

Cohort / File(s) Summary
NodePool image-type E2E test
test/e2e/nodepool_imagetype_test.go
Removes _ "embed" and v1 alias; switches to metav1 and updated imports. Reworks BuildNodePoolManifest to set replicas=1, ImageTypeWindows, and clear creation-scoped metadata. Adds testImageTypePersistenceThroughScaling, scaleAndVerifyImageType, and checkScalingEvents. Implements scaling workflow (0 → 2 → 1 → 0 → 1) with dynamic timeouts, waiting for node readiness (WaitForReadyNodesByNodePool), predicate-based verification of spec/status replicas and ImageType, and event collection/logging for scaling-related events. Replaces prior direct status checks with repeated verification pattern and expanded logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jparrill and muraee January 6, 2026 11:45
@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label Jan 6, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jan 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wangke19
Once this PR has been reviewed and has the lgtm label, please assign bryan-cox for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
test/e2e/nodepool_imagetype_test.go (1)

75-209: Well-structured test with comprehensive scaling coverage.

The test thoroughly validates ImageType persistence across multiple scaling operations and includes proper node readiness checks. The explicit checkpoint logging aids debugging.

Optional: Extract repeated predicate into helper.

The predicate logic for verifying replicas + imageType is duplicated 4 times. Consider extracting it to improve maintainability:

🔎 Suggested helper function
func makeReplicaAndImageTypePredicate(wantReplicas int32, wantImageType hyperv1.AWSImageType) e2eutil.Predicate[*hyperv1.NodePool] {
	return func(np *hyperv1.NodePool) (done bool, reasons string, err error) {
		gotSpecReplicas := int32(0)
		if np.Spec.Replicas != nil {
			gotSpecReplicas = *np.Spec.Replicas
		}
		gotStatusReplicas := np.Status.Replicas
		gotImageType := np.Spec.Platform.AWS.ImageType
		replicasMatch := wantReplicas == gotSpecReplicas && wantReplicas == gotStatusReplicas
		imageTypeMatch := wantImageType == gotImageType
		return replicasMatch && imageTypeMatch, fmt.Sprintf("expected spec.replicas=%d status.replicas=%d imageType=%s, got spec.replicas=%d status.replicas=%d imageType=%s",
			wantReplicas, wantReplicas, wantImageType, gotSpecReplicas, gotStatusReplicas, gotImageType), nil
	}
}

Usage would then simplify each call site to:

[]e2eutil.Predicate[*hyperv1.NodePool]{
    makeReplicaAndImageTypePredicate(zeroReplicas, originalImageType),
},
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 941f016 and 3c08ddb.

📒 Files selected for processing (1)
  • test/e2e/nodepool_imagetype_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • test/e2e/nodepool_imagetype_test.go
🧬 Code graph analysis (1)
test/e2e/nodepool_imagetype_test.go (2)
test/e2e/util/eventually.go (3)
  • EventuallyObject (84-164)
  • WithInterval (49-53)
  • WithTimeout (56-60)
test/e2e/util/util.go (2)
  • WaitForNReadyNodesWithOptions (538-588)
  • WithClientOptions (526-530)

@wangke19 wangke19 changed the title CNTRLPLANE-753: test: add NodePool ImageType persistence through scaling e2e test [WIP]CNTRLPLANE-753: test: add NodePool ImageType persistence through scaling e2e test Jan 6, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@wangke19 wangke19 force-pushed the test-imagetype-scaling branch 4 times, most recently from 384896c to 9663698 Compare January 7, 2026 12:04
@wangke19 wangke19 changed the title [WIP]CNTRLPLANE-753: test: add NodePool ImageType persistence through scaling e2e test [WIP]CNTRLPLANE-2288: test: add NodePool ImageType persistence through scaling e2e test Jan 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Jan 7, 2026

@wangke19: This pull request references CNTRLPLANE-2288 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add comprehensive e2e test to validate that NodePool ImageType persists correctly through various scaling operations, with proper timeout handling and node readiness validation for Windows nodes.

The test validates ImageType persistence through:

  1. Scale down to 0 replicas - verify ImageType preserved
  2. Scale up to 2 replicas - verify ImageType preserved + nodes ready
  3. Scale down to 1 replica - verify ImageType preserved + node ready
  4. Restore original replicas - verify ImageType preserved + nodes ready

Key improvements for Windows node support:

  • Extended timeout from 5min to 30min for scale-up operations (Windows nodes require 22+ minutes to provision based on test observations)
  • Added explicit node readiness checks using WaitForNReadyNodesWithOptions to ensure nodes are fully provisioned, not just replica counts matching
  • Validates both spec.replicas and status.replicas match expected values

This ensures ImageType field is immutable during scaling and Windows nodes have adequate provisioning time.

Related: CNTRLPLANE-753, OCPSTRAT-1949

Commit-Message-Assisted-by: Claude (via Claude Code)

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wangke19 wangke19 force-pushed the test-imagetype-scaling branch 4 times, most recently from 6d96549 to 5b1aded Compare January 8, 2026 08:48
@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels Jan 9, 2026
@wangke19
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@wangke19 wangke19 force-pushed the test-imagetype-scaling branch 3 times, most recently from 248bdc9 to 91e2114 Compare January 16, 2026 04:47
Copy link
Copy Markdown
Contributor

@jparrill jparrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Question, You mentioned in the PR description those cases:

The test validates ImageType persistence through:

Scale down to 0 replicas - verify ImageType preserved
Scale up to 2 replicas - verify ImageType preserved + nodes ready
Scale down to 1 replica - verify ImageType preserved + node ready
Restore original replicas - verify ImageType preserved + nodes ready

But there is not scale events, could you please clarify the PR expectations?

Comment thread test/e2e/nodepool_imagetype_test.go Outdated
e2eutil.EventuallyObject(t, ctx, fmt.Sprintf("wait for nodepool %s/%s to have a populated PlatformImage status condition", nodePool.Namespace, nodePool.Name),
// This test uses the existing default NodePool to test ImageType updates
// Following the manual test case from CNTRLPLANE-2277
t.Log("Test Case: Update Existing NodePool ImageType (CNTRLPLANE-2277)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Jira ID needed in the log entry?

var defaultNodePool hyperv1.NodePool
err := it.mgmtClient.Get(ctx, crclient.ObjectKey{
Namespace: it.hostedCluster.Namespace,
Name: it.hostedCluster.Name,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the nodepool it's not already at nodePool var?

Comment thread test/e2e/nodepool_imagetype_test.go Outdated
t.Logf("✓ Checkpoint 1: Default NodePool '%s' has ImageType: %s", defaultNodePool.Name, originalImageType)

// Test 1: Update ImageType from Linux to Windows
t.Log("Test 1: Updating ImageType from Linux to Windows")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's there a possiblity the image could be originalImageType == hyperv1.ImageTypeWindows?

Comment thread test/e2e/nodepool_imagetype_test.go Outdated
g.Expect(err).NotTo(HaveOccurred(), "failed to get NodePool")

currentImageType := nodePool.Spec.Platform.AWS.ImageType
if currentImageType == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant check with the above one

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jparrill Thank you for your review, the PR still is in development, I need more tests, also I will check your comments.

@wangke19 wangke19 force-pushed the test-imagetype-scaling branch from 91e2114 to 1e39c00 Compare January 26, 2026 08:41
@wangke19
Copy link
Copy Markdown
Contributor Author

The test ran passed on both 4.21 and 4.22 e2e-aws CI jobs.
It's ready for review. @jparrill Please take a look.

Comment thread test/e2e/nodepool_imagetype_test.go Outdated
// Create a new NodePool with Windows ImageType for scaling tests
nodePool := defaultNodepool.DeepCopy()
nodePool.ObjectMeta.Name = it.hostedCluster.Name + "-test-imagetype"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these changes? does the current pattern breaks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right - this doesn't follow the established pattern. I'll fix it.

it.scaleAndVerifyImageType(t, g, ctx, nodePool, 1, hyperv1.ImageTypeWindows)
}

func (it *NodePoolImageTypeTest) scaleAndVerifyImageType(t *testing.T, g *WithT, ctx context.Context, nodePool *hyperv1.NodePool, targetReplicas int32, expectedImageType hyperv1.ImageType) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what value is this test adding? there's nothing about scaling intrinsic to the ImageType and scaling per se is tested in multiple other tests.
When you implement BuildNodePoolManifest, the e2e wait for that nodepool to give ready nodes. That seems enough to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback! I understand the concern about test redundancy and CI time. Let me clarify the specific value this test adds that isn't covered by existing tests, and offer a simplified option.


What Makes This Test Different from Existing Tests

1. create_cluster_test.go - API Validation Only (NOT Runtime Persistence)

Location: test/e2e/create_cluster_test.go:2240-2304

What it tests:

{
    name: "should pass when imageType is Windows with amd64 architecture",
    mutateInput: func(np *hyperv1.NodePool) {
        np.Spec.Arch = hyperv1.ArchitectureAMD64
        np.Spec.Platform.Type = hyperv1.AWSPlatform
        np.Spec.Platform.AWS.InstanceType = "m6i.large"
        np.Spec.Platform.AWS.ImageType = hyperv1.ImageTypeWindows  // ← Sets ImageType
    },
    expectedErrorSubstring: "", // ← Validates API accepts it
}

What it does NOT test:

  • ❌ NodePool creation with actual AWS resources
  • ❌ Windows nodes provisioning
  • ❌ ImageType persisting through any operations
  • ❌ Controller behavior during spec updates

Why: This is a dry-run validation test using ValidateCreate. It validates API schema rules (e.g., "Windows requires amd64"), but doesn't create real NodePools or verify runtime behavior.


2. Framework BuildNodePoolManifest - Single Creation Path Only

Location: Framework in test/e2e/nodepool_test.go and test/e2e/util/hypershift_framework.go

What the framework does:

// 1. BuildNodePoolManifest creates NodePool spec
nodePool, err := test.BuildNodePoolManifest(defaultNodepool)

// 2. Framework creates NodePool in cluster
err = mgtClient.Create(ctx, nodePool)

// 3. WaitForNodePoolCorrectStatus validates it becomes healthy
WaitForNodePoolCorrectStatus(...)

// 4. test.Run() is called with the healthy NodePool
test.Run(t, *nodePool, nodes)

What it validates:

  • ✅ NodePool with ImageType=Windows can be created
  • ✅ Windows nodes provision successfully
  • ✅ NodePool reaches healthy state

What it does NOT validate:

  • ❌ ImageType persisting through any subsequent operations
  • ❌ Scaling behavior (spec.replicas changes)
  • ❌ Controller reconciliation after modifications
  • ❌ Field stability during lifecycle operations

Why: The framework tests the creation path only. Once Run() is called, the test decides what additional operations to perform. Without our test, no test validates ImageType through scaling.


3. autoscaling_test.go - Tests Autoscaling, NOT Manual Scaling or Windows

Location: test/e2e/autoscaling_test.go

What it tests:

Test 1: TestAutoscaling (lines 85-161)

// Sets up autoscaling (cluster-autoscaler driven)
nodepool.Spec.AutoScaling = &hyperv1.NodePoolAutoScaling{
    Min: ptr.To[int32](numNodes),
    Max: max,
}
nodepool.Spec.Replicas = nil  // ← Replicas managed by autoscaler, not manual scaling
  • ✅ Cluster-autoscaler scales NodePools based on pod pressure
  • ❌ Does NOT test manual spec.replicas changes
  • ❌ Does NOT use ImageType=Windows (uses default Linux)

Test 2: TestNodePoolAutoscalingScaleFromZero (lines 390-657)

scaleFromZeroNP.Spec.Replicas = nil
scaleFromZeroNP.Spec.AutoScaling = &hyperv1.NodePoolAutoScaling{
    Min: ptr.To[int32](0),
    Max: 2,
}
// Autoscaler scales from 0→N based on workload
  • ✅ Tests autoscaler scale-from-zero functionality
  • ❌ Does NOT test manual spec.replicas updates
  • ❌ Does NOT use ImageType=Windows

Why this matters: Autoscaling (spec.autoScaling) and manual scaling (spec.replicas) are different code paths:

  • Autoscaling: Cluster-autoscaler modifies Machine/MachineDeployment directly
  • Manual scaling: NodePool controller reconciles spec.replicas → updates MachineDeployment

Our test validates the manual scaling controller path with Windows ImageType.


4. Other NodePool Tests - Establishes Pattern for AWS-Specific Fields

Example: nodepool_day2_tags_test.go (lines 47-118)

What it tests:

func (ar *NodePoolDay2TagsTest) Run(t *testing.T, nodePool hyperv1.NodePool, nodes []corev1.Node) {
    // Updates AWS resource tags after NodePool creation
    err := e2eutil.UpdateObject(t, ar.ctx, ar.mgmtClient, &nodePool, func(nodePool *hyperv1.NodePool) {
        nodePool.Spec.Platform.AWS.ResourceTags = append(...)  // ← Updates tags
    })

    // Validates tags are applied to EC2 instances
    // Validates MachineDeployment generation doesn't change (no rolling update)
}

Pattern: Tests that AWS-specific fields (ResourceTags) persist and behave correctly during updates.

Our test follows the same pattern: Tests that ImageType (another AWS-specific field) persists correctly during scaling updates.


What This Test UNIQUELY Validates

Our test is the ONLY test that validates:

  1. Manual scaling code path (spec.replicas updates) with Windows ImageType
  2. ImageType persistence through controller reconciliation loops during scaling
  3. Scale-to-zero behavior for Windows NodePools (critical for cost optimization)
  4. Scale-from-zero behavior for Windows NodePools (ensuring metadata persists when recreating nodes)
  5. Windows node provisioning after scaling operations (not just initial creation)

Why This Matters for OCPSTRAT-1949

Customer Impact

From JIRA OCPSTRAT-1949:

"Customers running Windows VMs within OCP-Virt on ROSA-HCP, are running machine-pool nodes that include the Windows LI metadata in order to be compliant with AWS EC2 LI terms."

What happens if ImageType doesn't persist:

  1. Customer creates Windows NodePool with ImageType=Windows
  2. Customer scales NodePool to 0 to save costs 💰
  3. Customer scales back to 2 for production workload 📈
  4. BUG: ImageType reverts to default (Linux) ❌
  5. Result: Linux AMI used instead of Windows LI AMI
  6. Customer violation: AWS charges for Windows licenses but nodes are Linux
  7. Compliance issue: Customer violates Windows LI licensing terms

This is NOT theoretical:

  • NodePool controller performs complex reconciliation during scaling
  • spec.replicas changes trigger MachineDeployment updates
  • Without this test, we have zero validation that ImageType survives these updates

Two Options: Full Test or Simplified Test

I can offer two approaches based on your preference:


Simplified Test

Test Duration: ~20-25 minutes (50% reduction)
Scaling Pattern: 1 → 0 → 1 (2 critical transitions)

Code change:

func (it *NodePoolImageTypeTest) testImageTypePersistenceThroughScaling(t *testing.T, g *WithT, ctx context.Context, nodePool *hyperv1.NodePool) {
	// Test critical scale-to-zero path (cost optimization)
	t.Log("Test 1: Scaling NodePool to 0 replicas (scale-to-zero)")
	it.scaleAndVerifyImageType(t, g, ctx, nodePool, 0, hyperv1.ImageTypeWindows)

	// Test critical scale-from-zero path (provisioning after scale-to-zero)
	t.Log("Test 2: Scaling NodePool back to 1 replica (scale-from-zero)")
	it.scaleAndVerifyImageType(t, g, ctx, nodePool, 1, hyperv1.ImageTypeWindows)
}

What it validates:

  • Scale-to-zero: Most common customer operation (cost savings)
  • Scale-from-zero: ImageType persists when recreating nodes
  • Framework validation: Test ends with healthy state
  • ✅ Still unique (no other test covers this)

Why these two transitions are most critical:

1→0 (Scale-to-zero):

  • Most common customer operation for cost savings
  • NodePool controller deallocates all machines
  • Tests that ImageType field isn't lost when status.replicas=0

0→1 (Scale-from-zero):

  • Tests that ImageType is used when recreating nodes after scale-to-zero
  • Validates Windows AMI selection after MachineDeployment recreates machines
  • Proves Windows licensing metadata is preserved

Pros:

  • ✅ 50% time reduction (~25 minutes vs ~45 minutes)
  • ✅ Validates most critical customer paths
  • ✅ Still provides unique value (no other test covers manual scaling with ImageType)
  • ✅ Maintains compliance protection

Cons:

  • ⚠️ Less comprehensive than full test
  • ⚠️ Doesn't test scale to non-zero values >1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my comment still stands. NodePool Scaling is completely orthogonal to this. Why does this test need to exercise it at all?
afaic all you need is to implement BuildNodePoolManifest and the fraemwork will wait for those nodes to be ready. Then at most you can check in the nodes status if they has the right os

@wangke19
Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@wangke19
Copy link
Copy Markdown
Contributor Author

/retest

Comment thread test/e2e/nodepool_imagetype_test.go Outdated
nodePool.Spec.Platform.AWS.InstanceType = "m5.metal"

// Start with 1 replica and Windows ImageType
replicas := int32(1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change, what was wrong with using oneReplicas

@wangke19 wangke19 force-pushed the test-imagetype-scaling branch from a449bec to 8ee4241 Compare February 2, 2026 10:21
@wangke19 wangke19 changed the title test: add NodePool ImageType persistence through scaling e2e test test: add NodePool Windows ImageType e2e test Feb 2, 2026
@wangke19 wangke19 marked this pull request as draft February 3, 2026 03:40
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2026
Add comprehensive e2e test to validate that NodePool ImageType persists
correctly through various scaling operations, with proper timeout handling
and node readiness validation for Windows nodes.

The test validates ImageType persistence through:
1. Scale down to 0 replicas - verify ImageType preserved
2. Scale up to 2 replicas - verify ImageType preserved + nodes ready
3. Scale down to 1 replica - verify ImageType preserved + node ready
4. Restore original replicas - verify ImageType preserved + nodes ready

Key improvements for Windows node support:
- Extended timeout from 5min to 30min for scale-up operations (Windows
  nodes require 18+ minutes to provision based on test observations)
- Added explicit node readiness checks using WaitForNReadyNodesWithOptions
  to ensure nodes are fully provisioned, not just replica counts matching
- Validates both spec.replicas and status.replicas match expected values

This ensures ImageType field is immutable during scaling and Windows
nodes have adequate provisioning time.

Related: CNTRLPLANE-753, OCPSTRAT-1949
@wangke19 wangke19 force-pushed the test-imagetype-scaling branch from 4504af2 to 334e6be Compare February 3, 2026 04:00
@wangke19 wangke19 marked this pull request as ready for review February 3, 2026 04:08
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2026
@wangke19 wangke19 changed the title test: add NodePool Windows ImageType e2e test [WIP]test: add NodePool Windows ImageType e2e test Feb 3, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and csrwng February 3, 2026 04:09
@wangke19 wangke19 force-pushed the test-imagetype-scaling branch from 334e6be to 5246d63 Compare February 3, 2026 05:19
Add debugging information to check actual node Operating System after
scaling operations. This will help us verify whether setting ImageType=Windows
actually provisions Windows nodes or if it only persists the field value.

The debug output includes:
- Node OS (linux/windows)
- OS Image details
- Kernel version
- Comparison between expected and actual OS based on ImageType

This is investigative work to determine if we need to add OS verification
to the test assertions or if the test should only verify field persistence.
@wangke19 wangke19 force-pushed the test-imagetype-scaling branch from 5246d63 to ecc43f2 Compare February 3, 2026 09:28
@wangke19
Copy link
Copy Markdown
Contributor Author

wangke19 commented Feb 3, 2026

/test e2e-aws

@wangke19
Copy link
Copy Markdown
Contributor Author

wangke19 commented Feb 4, 2026

/test e2e-aws-4-21

@wangke19
Copy link
Copy Markdown
Contributor Author

wangke19 commented Feb 5, 2026

/test e2e-aws

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 5, 2026

@wangke19: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-4-21 ecc43f2 link true /test e2e-aws-4-21
ci/prow/e2e-aks-4-21 ecc43f2 link true /test e2e-aks-4-21
ci/prow/e2e-aws ecc43f2 link true /test e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wangke19
Copy link
Copy Markdown
Contributor Author

wangke19 commented Feb 9, 2026

We need a clean PR to complete validation of #7429 (comment)

@wangke19
Copy link
Copy Markdown
Contributor Author

wangke19 commented Feb 9, 2026

/close

@openshift-ci openshift-ci Bot closed this Feb 9, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 9, 2026

@wangke19: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants