Add AWS spot instance support for EC2 hypervisors#63
Add AWS spot instance support for EC2 hypervisors#63gamado wants to merge 2 commits intoopenshift-eng:mainfrom
Conversation
Spot instances can reduce EC2 costs by up to 90%. Uses persistent spot requests with stop (not terminate) interruption behavior, so instances can be restarted after capacity is restored. - Add USE_SPOT_INSTANCE and AVAILABILITY_ZONE options to instance.env - Create a dedicated spot launch template in CloudFormation - Make spot and capacity reservations mutually exclusive - Skip capacity reservation API calls for spot instances on start - Detect spot lifecycle in start.sh to avoid unsupported operations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: gamado The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @gamado. Thanks for your PR. I'm waiting for a openshift-eng member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 46 minutes and 50 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
WalkthroughThe pull request adds AWS Spot Instance support to the hypervisor deployment infrastructure. It introduces environment variables for spot instance configuration and availability zone override, updates deployment scripts to enforce mutual exclusion between spot instances and capacity reservations, and modifies CloudFormation templates to conditionally configure spot instance purchasing and interruption behavior. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant DeployScript as Deployment Script<br/>(create.sh)
participant Env as Environment Config<br/>(common.sh)
participant CloudFormation as CloudFormation
participant EC2 as EC2 API
User->>DeployScript: Execute deployment with USE_SPOT_INSTANCE
DeployScript->>Env: Read USE_SPOT_INSTANCE & AVAILABILITY_ZONE
alt Spot Instance Enabled
DeployScript->>DeployScript: Set ENABLE_CAPACITY_RESERVATION=false
DeployScript->>CloudFormation: Pass UseSpot="Yes"
else Spot Instance Disabled
DeployScript->>CloudFormation: Pass UseSpot="No"<br/>& capacity reservation config
end
CloudFormation->>CloudFormation: Select launch template<br/>(spot or standard)
CloudFormation->>EC2: Create instance with<br/>selected template
EC2-->>CloudFormation: Instance created
CloudFormation-->>DeployScript: Stack creation complete
sequenceDiagram
participant StartScript as Start Script<br/>(start.sh)
participant EC2 as EC2 API
participant StartScript2 as Start Script<br/>(continued)
StartScript->>EC2: Query instance state &<br/>InstanceLifecycle
EC2-->>StartScript: Return state & lifecycle
alt Instance is Spot
StartScript->>StartScript2: Log lifecycle="spot"
alt Instance Stopped/Stopping
StartScript2->>StartScript2: Skip capacity<br/>reservation preference
end
else Instance is On-Demand
alt Instance Stopped/Stopping
StartScript2->>StartScript2: Ensure capacity<br/>reservation preference
end
end
StartScript2->>EC2: Start instance
EC2-->>StartScript2: Instance started
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/aws-hypervisor/scripts/start.sh (1)
55-73:⚠️ Potential issue | 🟡 MinorSpot instance in
stoppedstate may fail to start due to capacity.Persistent spot instances stopped via AWS interruption can fail
ec2 start-instanceswithInsufficientInstanceCapacityuntil capacity returns. AWS normally auto-restarts persistent spot instances when capacity is available, so manually invoking start on such an instance may race with that behavior or fail outright with no retry/backoff here.Consider:
- Detecting
StateReason.Code == "Server.SpotInstanceShutdown"/Client.InstanceInitiatedShutdownfor spot and either waiting for AWS to auto-restart, or retrying the start on capacity errors with backoff.- Surfacing a clearer error message to the user when a spot start fails due to capacity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/aws-hypervisor/scripts/start.sh` around lines 55 - 73, The spot instance start path may fail with InsufficientInstanceCapacity when the stopped instance was a persistent spot; update the "stopped" case around the aws ec2 start-instances call to detect spot shutdowns and handle capacity errors: first query the instance StateReason.Code (use INSTANCE_ID/REGION and existing ensure_open_capacity_preference helper) and if it equals "Server.SpotInstanceShutdown" or "Client.InstanceInitiatedShutdown" either wait for AWS to auto-restart instead of calling start or implement a retry loop that calls aws ec2 start-instances and on InsufficientInstanceCapacity errors performs exponential backoff and retries a bounded number of times; ensure you log a clear, actionable error when retries are exhausted or when start fails due to capacity so the user knows it’s a spot capacity issue.
🧹 Nitpick comments (3)
deploy/aws-hypervisor/scripts/create.sh (1)
60-65: Consider warning (not just info) when overriding an explicit capacity reservation setting.If a user explicitly sets
ENABLE_CAPACITY_RESERVATION=trueininstance.envalongsideUSE_SPOT_INSTANCE=true, the script silently downgrades it with only an info-level message. Consider promoting this tomsg_warningso the conflict is more visible, or erroring out so users are forced to resolve the conflict deliberately.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/aws-hypervisor/scripts/create.sh` around lines 60 - 65, The script currently logs a silent info when USE_SPOT (from USE_SPOT_INSTANCE) forces ENABLE_CAPACITY_RESERVATION="false"; change the behavior in the block where USE_SPOT is checked (symbols: USE_SPOT, USE_SPOT_INSTANCE, ENABLE_CAPACITY_RESERVATION, msg_info) to use msg_warning instead of msg_info and include text that clearly states the user explicitly set ENABLE_CAPACITY_RESERVATION and it is being overridden by spot selection; alternatively, if you prefer fail-fast behavior, replace the msg_info with an error/exit path that prints the conflict and aborts so the user must resolve it.deploy/aws-hypervisor/scripts/start.sh (1)
44-49: Collapse two describe-instances calls into one.Lines 44 and 45 issue two separate
describe-instancesAPI calls for the same instance. You can fetch both fields in a single call to reduce API churn and the chance of race conditions between the two queries.♻️ Proposed refactor
-INSTANCE_STATE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].State.Name' --output text --no-cli-pager) -INSTANCE_LIFECYCLE=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" --query 'Reservations[0].Instances[0].InstanceLifecycle' --output text --no-cli-pager || echo "unknown") +INSTANCE_INFO=$(aws --region "${REGION}" ec2 describe-instances --instance-ids "${INSTANCE_ID}" \ + --query 'Reservations[0].Instances[0].[State.Name, InstanceLifecycle]' \ + --output text --no-cli-pager || echo "unknown unknown") +INSTANCE_STATE=$(echo "${INSTANCE_INFO}" | awk '{print $1}') +INSTANCE_LIFECYCLE=$(echo "${INSTANCE_INFO}" | awk '{print $2}')Note:
InstanceLifecycleis returned as the literal stringNone(not empty) for on-demand instances, so the!= "spot"guard on lines 57 and 71 is correct.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/aws-hypervisor/scripts/start.sh` around lines 44 - 49, The script currently calls aws ec2 describe-instances twice to populate INSTANCE_STATE and INSTANCE_LIFECYCLE; combine into a single call that queries both Reservations[0].Instances[0].State.Name and Reservations[0].Instances[0].InstanceLifecycle and capture the two outputs (e.g., with --query returning a list or JSON then parse into INSTANCE_STATE and INSTANCE_LIFECYCLE). Update the variables INSTANCE_STATE and INSTANCE_LIFECYCLE assignment to use that single aws invocation (keep REGION and INSTANCE_ID usage), preserve the existing logic that treats InstanceLifecycle == "spot" and the msg_info call, and ensure fallback/unknown handling remains when InstanceLifecycle is missing or "None".deploy/aws-hypervisor/instance.env.template (1)
9-20: Document spot-mode limitations.The spot comments explain cost savings and interruption behavior but omit a few operational caveats worth surfacing here so users aren't surprised:
- Spot instances may experience an unbounded stopped duration until capacity returns;
start.shdoes not currently retry onInsufficientInstanceCapacity.c5n.metal(the defaultEC2_INSTANCE_TYPE) is a bare-metal type and spot availability for bare-metal is historically much more constrained than for virtualized types — users defaulting to metal may rarely see capacity.- Setting
AVAILABILITY_ZONEwith spot pins capacity to a single AZ; leaving it empty lets AWS pick.A short note on each will save downstream debugging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/aws-hypervisor/instance.env.template` around lines 9 - 20, Update the spot instance comments to document operational caveats: mention that start.sh does not retry on InsufficientInstanceCapacity so stopped durations can be unbounded, warn that the default EC2_INSTANCE_TYPE (c5n.metal) is a bare-metal type with much more constrained spot availability, and note that setting AVAILABILITY_ZONE pins capacity to a single AZ (leaving it empty lets AWS pick); reference USE_SPOT_INSTANCE as the toggle to enable spot behavior and keep the notes brief and adjacent to the existing Spot Instance Settings block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/aws-hypervisor/templates/rhel-instance.yaml`:
- Around line 249-260: The teardown lacks cancellation of persistent Spot
Instance Requests created by the RHELSpotLaunchTemplate (SpotOptions ->
SpotInstanceType: persistent); update the destroy.sh script to locate the Spot
Request ID recorded when the instance was launched (or query for requests tied
to the CloudFormation stack/instance tags) and call aws ec2
cancel-spot-instance-requests --spot-instance-request-ids <ids> as part of
teardown (before/after stack delete) to ensure persistent requests are removed
and no replacement instances are launched.
---
Outside diff comments:
In `@deploy/aws-hypervisor/scripts/start.sh`:
- Around line 55-73: The spot instance start path may fail with
InsufficientInstanceCapacity when the stopped instance was a persistent spot;
update the "stopped" case around the aws ec2 start-instances call to detect spot
shutdowns and handle capacity errors: first query the instance StateReason.Code
(use INSTANCE_ID/REGION and existing ensure_open_capacity_preference helper) and
if it equals "Server.SpotInstanceShutdown" or "Client.InstanceInitiatedShutdown"
either wait for AWS to auto-restart instead of calling start or implement a
retry loop that calls aws ec2 start-instances and on
InsufficientInstanceCapacity errors performs exponential backoff and retries a
bounded number of times; ensure you log a clear, actionable error when retries
are exhausted or when start fails due to capacity so the user knows it’s a spot
capacity issue.
---
Nitpick comments:
In `@deploy/aws-hypervisor/instance.env.template`:
- Around line 9-20: Update the spot instance comments to document operational
caveats: mention that start.sh does not retry on InsufficientInstanceCapacity so
stopped durations can be unbounded, warn that the default EC2_INSTANCE_TYPE
(c5n.metal) is a bare-metal type with much more constrained spot availability,
and note that setting AVAILABILITY_ZONE pins capacity to a single AZ (leaving it
empty lets AWS pick); reference USE_SPOT_INSTANCE as the toggle to enable spot
behavior and keep the notes brief and adjacent to the existing Spot Instance
Settings block.
In `@deploy/aws-hypervisor/scripts/create.sh`:
- Around line 60-65: The script currently logs a silent info when USE_SPOT (from
USE_SPOT_INSTANCE) forces ENABLE_CAPACITY_RESERVATION="false"; change the
behavior in the block where USE_SPOT is checked (symbols: USE_SPOT,
USE_SPOT_INSTANCE, ENABLE_CAPACITY_RESERVATION, msg_info) to use msg_warning
instead of msg_info and include text that clearly states the user explicitly set
ENABLE_CAPACITY_RESERVATION and it is being overridden by spot selection;
alternatively, if you prefer fail-fast behavior, replace the msg_info with an
error/exit path that prints the conflict and aborts so the user must resolve it.
In `@deploy/aws-hypervisor/scripts/start.sh`:
- Around line 44-49: The script currently calls aws ec2 describe-instances twice
to populate INSTANCE_STATE and INSTANCE_LIFECYCLE; combine into a single call
that queries both Reservations[0].Instances[0].State.Name and
Reservations[0].Instances[0].InstanceLifecycle and capture the two outputs
(e.g., with --query returning a list or JSON then parse into INSTANCE_STATE and
INSTANCE_LIFECYCLE). Update the variables INSTANCE_STATE and INSTANCE_LIFECYCLE
assignment to use that single aws invocation (keep REGION and INSTANCE_ID
usage), preserve the existing logic that treats InstanceLifecycle == "spot" and
the msg_info call, and ensure fallback/unknown handling remains when
InstanceLifecycle is missing or "None".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 6b8ce601-caee-443b-b63c-c99f86fccb1c
📒 Files selected for processing (5)
deploy/aws-hypervisor/instance.env.templatedeploy/aws-hypervisor/scripts/common.shdeploy/aws-hypervisor/scripts/create.shdeploy/aws-hypervisor/scripts/start.shdeploy/aws-hypervisor/templates/rhel-instance.yaml
| # Launch Template for Spot Instances (only created when spot is used) | ||
| RHELSpotLaunchTemplate: | ||
| Type: AWS::EC2::LaunchTemplate | ||
| Condition: UseSpotInstance | ||
| Properties: | ||
| LaunchTemplateData: | ||
| InstanceMarketOptions: | ||
| MarketType: spot | ||
| SpotOptions: | ||
| SpotInstanceType: persistent | ||
| InstanceInterruptionBehavior: stop | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Look for any spot-request cancellation logic across the deploy tree
rg -nP -C2 'cancel-spot-instance-requests|describe-spot-instance-requests|SpotInstanceRequestId'
fd -t f 'destroy|delete|cleanup|teardown' deploy/aws-hypervisor | xargs -I{} sh -c 'echo "=== {} ==="; cat "{}"'Repository: openshift-eng/two-node-toolbox
Length of output: 2729
🌐 Web query:
When an EC2 instance is launched via CloudFormation AWS::EC2::Instance using a LaunchTemplate with InstanceMarketOptions MarketType spot and SpotInstanceType persistent, is the associated spot instance request cancelled automatically when the CloudFormation stack is deleted?
💡 Result:
No, the associated Spot Instance request is not cancelled automatically when the CloudFormation stack is deleted. CloudFormation deletes the EC2 instance by terminating it, but for persistent Spot requests, termination causes the request to become open again, potentially launching a new instance. Multiple user reports confirm the Spot request persists after stack deletion, requiring manual cancellation.
Citations:
- 1: https://stackoverflow.com/questions/63230849/spot-requests-not-canceled-after-deleting-cloudformation-stack
- 2: https://www.reddit.com/r/aws/comments/i3k0b8/cloudformation_persistent_spot_instance/
- 3: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-request.html
Add spot instance request cancellation to teardown process.
The destroy.sh script handles capacity reservation cleanup but does not cancel the persistent Spot Instance Request created by SpotInstanceType: persistent. When the CloudFormation stack deletes the EC2 instance, the persistent Spot request remains open and can relaunch a replacement instance, leading to orphaned instances and continued billing.
The teardown script must explicitly cancel all Spot Instance Requests associated with the stack before or after CloudFormation stack deletion. Reference the Spot request ID stored during instance launch and use aws ec2 cancel-spot-instance-requests to clean it up during the destroy phase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deploy/aws-hypervisor/templates/rhel-instance.yaml` around lines 249 - 260,
The teardown lacks cancellation of persistent Spot Instance Requests created by
the RHELSpotLaunchTemplate (SpotOptions -> SpotInstanceType: persistent); update
the destroy.sh script to locate the Spot Request ID recorded when the instance
was launched (or query for requests tied to the CloudFormation stack/instance
tags) and call aws ec2 cancel-spot-instance-requests --spot-instance-request-ids
<ids> as part of teardown (before/after stack delete) to ensure persistent
requests are removed and no replacement instances are launched.
Persistent spot requests survive instance termination and could launch a new instance. Cancel the spot request explicitly before deleting the CloudFormation stack to prevent orphaned resources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hey Gal, thanks for putting this together — spot support for TNT is a great idea and the cost savings are real. I've gone through the implementation and have some feedback. The core direction is sound but I think the persistent spot request lifecycle needs more attention before this is ready. Architecture: spot request as a managed resourceThe existing codebase treats capacity reservations as a first-class managed resource — created, persisted to a file, cleaned up in both success and error paths. The spot request that I'd suggest following the same pattern as capacity reservations:
This makes the spot lifecycle explicit, file-driven, and consistent with how capacity reservations are already managed. Blocking:
|
Spot instances can reduce EC2 costs by up to 90%. Uses persistent spot requests with stop (not terminate) interruption behavior, so instances can be restarted after capacity is restored.