[WIP] test: add NodePool Windows ImageType e2e test #7429
wangke19 wants to merge 2 commits into openshift:main
Conversation
@wangke19: This pull request references CNTRLPLANE-753 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Walkthrough: Adds an orchestrated NodePool scaling test that verifies ImageType (Windows) persistence across a sequence of replica changes, updates NodePool manifest initialization and imports, and introduces helper routines for scaling, verification, and event collection.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
[APPROVAL NOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: wangke19. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Actionable comments posted: 0
🧹 Nitpick comments (1)
test/e2e/nodepool_imagetype_test.go (1)
75-209: Well-structured test with comprehensive scaling coverage. The test thoroughly validates ImageType persistence across multiple scaling operations and includes proper node readiness checks. The explicit checkpoint logging aids debugging.
Optional: Extract repeated predicate into helper.
The predicate logic for verifying replicas + imageType is duplicated 4 times. Consider extracting it to improve maintainability:
🔎 Suggested helper function
```go
func makeReplicaAndImageTypePredicate(wantReplicas int32, wantImageType hyperv1.AWSImageType) e2eutil.Predicate[*hyperv1.NodePool] {
	return func(np *hyperv1.NodePool) (done bool, reasons string, err error) {
		gotSpecReplicas := int32(0)
		if np.Spec.Replicas != nil {
			gotSpecReplicas = *np.Spec.Replicas
		}
		gotStatusReplicas := np.Status.Replicas
		gotImageType := np.Spec.Platform.AWS.ImageType
		replicasMatch := wantReplicas == gotSpecReplicas && wantReplicas == gotStatusReplicas
		imageTypeMatch := wantImageType == gotImageType
		return replicasMatch && imageTypeMatch,
			fmt.Sprintf("expected spec.replicas=%d status.replicas=%d imageType=%s, got spec.replicas=%d status.replicas=%d imageType=%s",
				wantReplicas, wantReplicas, wantImageType, gotSpecReplicas, gotStatusReplicas, gotImageType),
			nil
	}
}
```

Usage would then simplify each call site to:

```go
[]e2eutil.Predicate[*hyperv1.NodePool]{
	makeReplicaAndImageTypePredicate(zeroReplicas, originalImageType),
},
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
test/e2e/nodepool_imagetype_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
test/e2e/nodepool_imagetype_test.go
🧬 Code graph analysis (1)
test/e2e/nodepool_imagetype_test.go (2)
test/e2e/util/eventually.go (3)
- EventuallyObject (84-164)
- WithInterval (49-53)
- WithTimeout (56-60)

test/e2e/util/util.go (2)
- WaitForNReadyNodesWithOptions (538-588)
- WithClientOptions (526-530)
Force-pushed from 384896c to 9663698.
@wangke19: This pull request references CNTRLPLANE-2288 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Force-pushed from 6d96549 to 5b1aded.
/test e2e-aws
Force-pushed from 248bdc9 to 91e2114.
jparrill left a comment:
Dropped some comments. Question: you mentioned these cases in the PR description:
The test validates ImageType persistence through:
Scale down to 0 replicas - verify ImageType preserved
Scale up to 2 replicas - verify ImageType preserved + nodes ready
Scale down to 1 replica - verify ImageType preserved + node ready
Restore original replicas - verify ImageType preserved + nodes ready
But there are no scale events; could you please clarify the PR's expectations?
```go
e2eutil.EventuallyObject(t, ctx, fmt.Sprintf("wait for nodepool %s/%s to have a populated PlatformImage status condition", nodePool.Namespace, nodePool.Name),

// This test uses the existing default NodePool to test ImageType updates
// Following the manual test case from CNTRLPLANE-2277
t.Log("Test Case: Update Existing NodePool ImageType (CNTRLPLANE-2277)")
```
Is the Jira ID needed in the log entry?
```go
var defaultNodePool hyperv1.NodePool
err := it.mgmtClient.Get(ctx, crclient.ObjectKey{
	Namespace: it.hostedCluster.Namespace,
	Name:      it.hostedCluster.Name,
```
Isn't the name of the NodePool already in the nodePool var?
```go
t.Logf("✓ Checkpoint 1: Default NodePool '%s' has ImageType: %s", defaultNodePool.Name, originalImageType)

// Test 1: Update ImageType from Linux to Windows
t.Log("Test 1: Updating ImageType from Linux to Windows")
```
Is there a possibility that originalImageType == hyperv1.ImageTypeWindows already?
```go
g.Expect(err).NotTo(HaveOccurred(), "failed to get NodePool")

currentImageType := nodePool.Spec.Platform.AWS.ImageType
if currentImageType == "" {
```
Redundant check with the one above.
@jparrill Thank you for your review. The PR is still in development; I need more tests, and I will also check your comments.
Force-pushed from 91e2114 to 1e39c00.
The test passed on both the 4.21 and 4.22 e2e-aws CI jobs.
```go
// Create a new NodePool with Windows ImageType for scaling tests
nodePool := defaultNodepool.DeepCopy()
nodePool.ObjectMeta.Name = it.hostedCluster.Name + "-test-imagetype"
```
Why these changes? Does the current pattern break?
You're absolutely right - this doesn't follow the established pattern. I'll fix it.
```go
	it.scaleAndVerifyImageType(t, g, ctx, nodePool, 1, hyperv1.ImageTypeWindows)
}

func (it *NodePoolImageTypeTest) scaleAndVerifyImageType(t *testing.T, g *WithT, ctx context.Context, nodePool *hyperv1.NodePool, targetReplicas int32, expectedImageType hyperv1.ImageType) {
```
What value is this test adding? There's nothing about scaling intrinsic to the ImageType, and scaling per se is tested in multiple other tests.
When you implement BuildNodePoolManifest, the e2e framework waits for that NodePool to produce ready nodes. That seems enough to me.
Thanks for your feedback! I understand the concern about test redundancy and CI time. Let me clarify the specific value this test adds that isn't covered by existing tests, and offer a simplified option.
What Makes This Test Different from Existing Tests
1. create_cluster_test.go - API Validation Only (NOT Runtime Persistence)
Location: test/e2e/create_cluster_test.go:2240-2304
What it tests:
```go
{
	name: "should pass when imageType is Windows with amd64 architecture",
	mutateInput: func(np *hyperv1.NodePool) {
		np.Spec.Arch = hyperv1.ArchitectureAMD64
		np.Spec.Platform.Type = hyperv1.AWSPlatform
		np.Spec.Platform.AWS.InstanceType = "m6i.large"
		np.Spec.Platform.AWS.ImageType = hyperv1.ImageTypeWindows // ← Sets ImageType
	},
	expectedErrorSubstring: "", // ← Validates API accepts it
}
```

What it does NOT test:
- ❌ NodePool creation with actual AWS resources
- ❌ Windows nodes provisioning
- ❌ ImageType persisting through any operations
- ❌ Controller behavior during spec updates
Why: This is a dry-run validation test using ValidateCreate. It validates API schema rules (e.g., "Windows requires amd64"), but doesn't create real NodePools or verify runtime behavior.
2. Framework BuildNodePoolManifest - Single Creation Path Only
Location: Framework in test/e2e/nodepool_test.go and test/e2e/util/hypershift_framework.go
What the framework does:
```go
// 1. BuildNodePoolManifest creates NodePool spec
nodePool, err := test.BuildNodePoolManifest(defaultNodepool)
// 2. Framework creates NodePool in cluster
err = mgtClient.Create(ctx, nodePool)
// 3. WaitForNodePoolCorrectStatus validates it becomes healthy
WaitForNodePoolCorrectStatus(...)
// 4. test.Run() is called with the healthy NodePool
test.Run(t, *nodePool, nodes)
```

What it validates:
- ✅ NodePool with ImageType=Windows can be created
- ✅ Windows nodes provision successfully
- ✅ NodePool reaches healthy state
What it does NOT validate:
- ❌ ImageType persisting through any subsequent operations
- ❌ Scaling behavior (spec.replicas changes)
- ❌ Controller reconciliation after modifications
- ❌ Field stability during lifecycle operations
Why: The framework tests the creation path only. Once Run() is called, the test decides what additional operations to perform. Without our test, no test validates ImageType through scaling.
3. autoscaling_test.go - Tests Autoscaling, NOT Manual Scaling or Windows
Location: test/e2e/autoscaling_test.go
What it tests:
Test 1: TestAutoscaling (lines 85-161)

```go
// Sets up autoscaling (cluster-autoscaler driven)
nodepool.Spec.AutoScaling = &hyperv1.NodePoolAutoScaling{
	Min: ptr.To[int32](numNodes),
	Max: max,
}
nodepool.Spec.Replicas = nil // ← Replicas managed by autoscaler, not manual scaling
```

- ✅ Cluster-autoscaler scales NodePools based on pod pressure
- ❌ Does NOT test manual spec.replicas changes
- ❌ Does NOT use ImageType=Windows (uses default Linux)
Test 2: TestNodePoolAutoscalingScaleFromZero (lines 390-657)

```go
scaleFromZeroNP.Spec.Replicas = nil
scaleFromZeroNP.Spec.AutoScaling = &hyperv1.NodePoolAutoScaling{
	Min: ptr.To[int32](0),
	Max: 2,
}
// Autoscaler scales from 0→N based on workload
```

- ✅ Tests autoscaler scale-from-zero functionality
- ❌ Does NOT test manual spec.replicas updates
- ❌ Does NOT use ImageType=Windows
Why this matters: Autoscaling (spec.autoScaling) and manual scaling (spec.replicas) are different code paths:
- Autoscaling: Cluster-autoscaler modifies Machine/MachineDeployment directly
- Manual scaling: NodePool controller reconciles spec.replicas → updates MachineDeployment

Our test validates the manual scaling controller path with Windows ImageType.
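The two code paths can be contrasted with a toy model. All types below are local stand-ins for illustration only (the real types are hyperv1.NodePool in hypershift and MachineDeployment in Cluster API); the point is the invariant under test: a reconcile triggered by a replicas change copies replicas to the MachineDeployment without touching ImageType.

```go
package main

import "fmt"

// Local stand-ins, not the real hypershift/Cluster API types.
type machineDeployment struct{ Replicas int32 }

type nodePool struct {
	SpecReplicas int32
	ImageType    string
}

// reconcile models the manual-scaling path: spec.replicas is propagated to
// the MachineDeployment, and ImageType is deliberately left untouched. That
// "left untouched" property is what the e2e test asserts after each scale.
func reconcile(np *nodePool, md *machineDeployment) {
	md.Replicas = np.SpecReplicas
}

func main() {
	np := &nodePool{SpecReplicas: 0, ImageType: "Windows"} // scale-to-zero
	md := &machineDeployment{Replicas: 2}
	reconcile(np, md)
	fmt.Println(md.Replicas, np.ImageType)
}
```

If a buggy reconcile reset ImageType to a default during scaling, this is exactly the regression the test is meant to catch.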
4. Other NodePool Tests - Establishes Pattern for AWS-Specific Fields
Example: nodepool_day2_tags_test.go (lines 47-118)
What it tests:
```go
func (ar *NodePoolDay2TagsTest) Run(t *testing.T, nodePool hyperv1.NodePool, nodes []corev1.Node) {
	// Updates AWS resource tags after NodePool creation
	err := e2eutil.UpdateObject(t, ar.ctx, ar.mgmtClient, &nodePool, func(nodePool *hyperv1.NodePool) {
		nodePool.Spec.Platform.AWS.ResourceTags = append(...) // ← Updates tags
	})
	// Validates tags are applied to EC2 instances
	// Validates MachineDeployment generation doesn't change (no rolling update)
}
```

Pattern: Tests that AWS-specific fields (ResourceTags) persist and behave correctly during updates.
Our test follows the same pattern: Tests that ImageType (another AWS-specific field) persists correctly during scaling updates.
What This Test UNIQUELY Validates
Our test is the ONLY test that validates:
1. Manual scaling code path (spec.replicas updates) with Windows ImageType
2. ImageType persistence through controller reconciliation loops during scaling
3. Scale-to-zero behavior for Windows NodePools (critical for cost optimization)
4. Scale-from-zero behavior for Windows NodePools (ensuring metadata persists when recreating nodes)
5. Windows node provisioning after scaling operations (not just initial creation)
Why This Matters for OCPSTRAT-1949
Customer Impact
From JIRA OCPSTRAT-1949:
"Customers running Windows VMs within OCP-Virt on ROSA-HCP, are running machine-pool nodes that include the Windows LI metadata in order to be compliant with AWS EC2 LI terms."
What happens if ImageType doesn't persist:
1. Customer creates Windows NodePool with ImageType=Windows ✅
2. Customer scales NodePool to 0 to save costs 💰
3. Customer scales back to 2 for production workload 📈
4. BUG: ImageType reverts to default (Linux) ❌
5. Result: Linux AMI used instead of Windows LI AMI
6. Customer violation: AWS charges for Windows licenses but nodes are Linux
7. Compliance issue: Customer violates Windows LI licensing terms
This is NOT theoretical:
- The NodePool controller performs complex reconciliation during scaling
- spec.replicas changes trigger MachineDeployment updates
- Without this test, we have zero validation that ImageType survives these updates
Two Options: Full Test or Simplified Test
I can offer two approaches based on your preference:
Simplified Test
Test Duration: ~20-25 minutes (50% reduction)
Scaling Pattern: 1 → 0 → 1 (2 critical transitions)
Code change:
```go
func (it *NodePoolImageTypeTest) testImageTypePersistenceThroughScaling(t *testing.T, g *WithT, ctx context.Context, nodePool *hyperv1.NodePool) {
	// Test critical scale-to-zero path (cost optimization)
	t.Log("Test 1: Scaling NodePool to 0 replicas (scale-to-zero)")
	it.scaleAndVerifyImageType(t, g, ctx, nodePool, 0, hyperv1.ImageTypeWindows)

	// Test critical scale-from-zero path (provisioning after scale-to-zero)
	t.Log("Test 2: Scaling NodePool back to 1 replica (scale-from-zero)")
	it.scaleAndVerifyImageType(t, g, ctx, nodePool, 1, hyperv1.ImageTypeWindows)
}
```

What it validates:
- ✅ Scale-to-zero: Most common customer operation (cost savings)
- ✅ Scale-from-zero: ImageType persists when recreating nodes
- ✅ Framework validation: Test ends with healthy state
- ✅ Still unique (no other test covers this)
Why these two transitions are most critical:
1→0 (Scale-to-zero):
- Most common customer operation for cost savings
- NodePool controller deallocates all machines
- Tests that ImageType field isn't lost when status.replicas=0
0→1 (Scale-from-zero):
- Tests that ImageType is used when recreating nodes after scale-to-zero
- Validates Windows AMI selection after MachineDeployment recreates machines
- Proves Windows licensing metadata is preserved
Pros:
- ✅ 50% time reduction (~25 minutes vs ~45 minutes)
- ✅ Validates most critical customer paths
- ✅ Still provides unique value (no other test covers manual scaling with ImageType)
- ✅ Maintains compliance protection
Cons:
- ⚠️ Less comprehensive than the full test
- ⚠️ Doesn't test scaling to non-zero values >1
My comment still stands. NodePool scaling is completely orthogonal to this. Why does this test need to exercise it at all?
AFAIC, all you need is to implement BuildNodePoolManifest, and the framework will wait for those nodes to be ready. Then at most you can check in the node status whether they have the right OS.
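As a minimal sketch of the OS check suggested here: the real test would read node.Status.NodeInfo.OperatingSystem from k8s.io/api/core/v1; the types below are local stand-ins so the logic is self-contained.

```go
package main

import "fmt"

// Local stand-ins for the corev1.Node fields involved in the check.
type nodeSystemInfo struct{ OperatingSystem string }
type nodeStatus struct{ NodeInfo nodeSystemInfo }
type node struct {
	Name   string
	Status nodeStatus
}

// verifyNodesOS returns an error naming the first node whose reported OS
// does not match the expected value ("windows" when ImageType=Windows).
func verifyNodesOS(nodes []node, wantOS string) error {
	for _, n := range nodes {
		if got := n.Status.NodeInfo.OperatingSystem; got != wantOS {
			return fmt.Errorf("node %s: expected OS %q, got %q", n.Name, wantOS, got)
		}
	}
	return nil
}

func main() {
	nodes := []node{{Name: "ip-10-0-0-1", Status: nodeStatus{NodeInfo: nodeSystemInfo{OperatingSystem: "windows"}}}}
	fmt.Println(verifyNodesOS(nodes, "windows")) // nil error when every node matches
}
```

In the real test this loop would run inside Run() over the nodes the framework hands back, with g.Expect on the returned error.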
/retest

1 similar comment

/retest
```go
nodePool.Spec.Platform.AWS.InstanceType = "m5.metal"

// Start with 1 replica and Windows ImageType
replicas := int32(1)
```
Why this change? What was wrong with using oneReplicas?
Force-pushed from a449bec to 8ee4241.
Add comprehensive e2e test to validate that NodePool ImageType persists correctly through various scaling operations, with proper timeout handling and node readiness validation for Windows nodes.

The test validates ImageType persistence through:
1. Scale down to 0 replicas - verify ImageType preserved
2. Scale up to 2 replicas - verify ImageType preserved + nodes ready
3. Scale down to 1 replica - verify ImageType preserved + node ready
4. Restore original replicas - verify ImageType preserved + nodes ready

Key improvements for Windows node support:
- Extended timeout from 5min to 30min for scale-up operations (Windows nodes require 18+ minutes to provision based on test observations)
- Added explicit node readiness checks using WaitForNReadyNodesWithOptions to ensure nodes are fully provisioned, not just that replica counts match
- Validates that both spec.replicas and status.replicas match expected values

This ensures the ImageType field is immutable during scaling and Windows nodes have adequate provisioning time.

Related: CNTRLPLANE-753, OCPSTRAT-1949
Force-pushed from 4504af2 to 334e6be.
Force-pushed from 334e6be to 5246d63.
Add debugging information to check the actual node Operating System after scaling operations. This will help us verify whether setting ImageType=Windows actually provisions Windows nodes or only persists the field value.

The debug output includes:
- Node OS (linux/windows)
- OS Image details
- Kernel version
- Comparison between expected and actual OS based on ImageType

This is investigative work to determine whether we need to add OS verification to the test assertions or whether the test should only verify field persistence.
Force-pushed from 5246d63 to ecc43f2.
/test e2e-aws

/test e2e-aws-4-21

/test e2e-aws
@wangke19: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
We need a clean PR to complete validation of #7429 (comment).
/close
@wangke19: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Summary
Add e2e test to validate that NodePool ImageType=Windows correctly provisions Windows nodes. This ensures the Windows ImageType configuration works as expected for ROSA-HCP.

Related JIRA

OCPSTRAT-1949 - Windows License-Included (LI) AMI support for ROSA-HCP
What This Tests

Validates that setting ImageType=Windows on a NodePool results in Windows nodes: the test creates a NodePool with ImageType=Windows and verifies node.Status.NodeInfo.OperatingSystem == "windows".

Test Details
- Follows the standard e2e test pattern (like NodePoolArm64CreateTest)
- Implements BuildNodePoolManifest - the framework handles all node provisioning and waiting

Example Output
Files Changed

- test/e2e/nodepool_imagetype_test.go - Simplified test implementation (75 lines)

Why This Approach

Focused Validation: Tests only the ImageType functionality - that ImageType=Windows provisions Windows nodes.

Leverages Framework: Uses standard e2e framework patterns:
- BuildNodePoolManifest() creates the NodePool configuration
- Run() validates the Windows OS in node status

Not Testing Scaling: NodePool scaling is orthogonal to ImageType validation. The test creates a single-replica NodePool to validate that ImageType works correctly.
Checklist