Fix TOCTOU race condition in scale-down that terminates busy runners #6

Closed

JVenberg wants to merge 1 commit into main from fix-scale-down-race

Conversation

@JVenberg

Summary

  • Fixes a race condition in the scale-down lambda where a job could be assigned to a runner between the busy check and EC2 termination, killing in-flight jobs (e.g., Helm deploys)
  • Adds a post-deregistration busy re-check: after deregistering the runner from GitHub (which prevents new job assignment), we re-check busy state to catch jobs that snuck in during the race window
  • If the runner became busy during the race window, the instance is left running so the job can complete, and is cleaned up as an orphan on a subsequent cycle

Context

On 2026-03-27, the `deploy-ssr-rover` job in the Webapp Deploy workflow was killed mid-Helm-deploy when the scale-down lambda terminated instance `i-06d1f1037840a1952` while it was actively running a job. CloudTrail confirmed the `github-actions-aws-4-ubuntu-scale-down` lambda terminated the instance at exactly the moment the runner received a shutdown signal.

The root cause was a TOCTOU (time-of-check to time-of-use) race:

  1. Scale-down checks GitHub API: "Is this runner busy?" -> No
  2. A job gets assigned to the runner
  3. Scale-down deregisters and terminates the instance
  4. The in-flight job is killed
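The lambda source is not included in this PR, but the racy flow above can be sketched in Python with hypothetical helper names (`is_runner_busy`, `deregister_runner`, `terminate_instance`) and in-memory fakes; this is an illustration of the TOCTOU pattern, not the actual lambda code:

```python
from dataclasses import dataclass

@dataclass
class Runner:
    id: str
    instance_id: str

class FakeGitHub:
    """Stub GitHub API; simulates a job being assigned to the runner
    immediately after the busy check (the race window)."""
    def __init__(self):
        self.busy = False
        self.deregistered = False

    def is_runner_busy(self, runner_id):
        result = self.busy
        self.busy = True  # a job arrives right after the check
        return result

    def deregister_runner(self, runner_id):
        self.deregistered = True

class FakeEC2:
    def __init__(self):
        self.terminated = []

    def terminate_instance(self, instance_id):
        self.terminated.append(instance_id)

def scale_down_racy(runner, github, ec2):
    # Time-of-check: ask GitHub whether the runner is busy -> No.
    if github.is_runner_busy(runner.id):
        return "skipped"
    # RACE WINDOW: a job can be assigned here, after the check
    # but before termination.
    github.deregister_runner(runner.id)
    # Time-of-use: terminate the instance, killing any job that
    # snuck in during the window above.
    ec2.terminate_instance(runner.instance_id)
    return "terminated"
```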

How the fix works

The new flow:

  1. Check busy (fast-path to skip obviously busy runners)
  2. Deregister from GitHub (prevents new job assignment server-side)
  3. Re-check busy (now stable, since no new jobs can be assigned after deregistration)
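The three steps above can be sketched as follows. This is a hedged Python illustration with hypothetical names and in-memory fakes (the real lambda code is not shown in this PR); `is_runner_busy` returning `None` models GitHub answering 404 because the runner was fully removed after deregistration:

```python
class FakeGitHub:
    """Stub GitHub API. busy_after_check simulates a job assigned
    right after the initial (step-1) busy check."""
    def __init__(self, busy_after_check=False):
        self.busy = False
        self.deregistered = False
        self._race = busy_after_check

    def is_runner_busy(self, runner_id):
        if self.deregistered and not self.busy:
            return None  # models GitHub's 404: runner fully removed
        result = self.busy
        if self._race:
            self.busy, self._race = True, False  # the racing job arrives
        return result

    def deregister_runner(self, runner_id):
        self.deregistered = True

def scale_down_fixed(runner_id, instance_id, github, terminated):
    # 1. Fast-path: skip obviously busy runners.
    if github.is_runner_busy(runner_id):
        return "skipped"
    # 2. Deregister first: GitHub can no longer assign new jobs.
    github.deregister_runner(runner_id)
    # 3. Stable re-check. None (404) means the runner is gone from
    #    GitHub, so terminating is safe. True means a job snuck in
    #    during the race window: leave the instance running so the
    #    job completes; orphan cleanup reclaims it later.
    if github.is_runner_busy(runner_id):
        return "left running for orphan cleanup"
    terminated.append(instance_id)
    return "terminated"
```

Deregistering before the final check is what closes the race: once GitHub can no longer hand the runner new work, the busy state can only go from busy to idle, never the reverse.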

If the re-check finds the runner busy, we leave the instance running. The in-flight job continues to completion because the runner worker uses job-scoped OAuth credentials, not the runner registration:

// JobRunner.cs - Worker creates its own connection using job message credentials
var jobServer = HostContext.GetService<IJobServer>();
// Credentials come from the job message's system connection,
// not from the runner's registration with GitHub.
VssCredentials jobServerCredential = VssUtil.GetVssCredential(systemConnection);
Uri jobServerUrl = systemConnection.Url;
VssConnection jobConnection = VssUtil.CreateConnection(jobServerUrl, jobServerCredential, delegatingHandlers);
await jobServer.ConnectAsync(jobConnection);

The worker never checks runner registration status during job execution. Deregistration only affects the listener loop (which polls for new messages), not the worker process running the current job.

Test plan

  • All 68 existing scale-down tests pass
  • New test: runner that becomes busy between deregister and re-check is NOT terminated
  • New test: runner that returns 404 on post-deregister busy check IS terminated (runner fully removed from GitHub)
  • Deploy to test environment and verify scale-down behavior with idle and busy runners
  • Monitor orphan cleanup cycle to ensure instances left running are properly cleaned up
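The two new race tests could be sketched like this. All names here (`StubGitHub`, `scale_down`, the test functions) are hypothetical illustrations of the behaviors listed above, not excerpts from the actual suite:

```python
class StubGitHub:
    """Minimal in-memory stand-in for the GitHub runners API."""
    def __init__(self, job_arrives_during_race=False):
        self.busy = False
        self.deregistered = False
        self._race = job_arrives_during_race

    def is_runner_busy(self, runner_id):
        if self.deregistered and not self.busy:
            return None  # models a 404: runner fully removed from GitHub
        was_busy = self.busy
        if self._race:
            self.busy, self._race = True, False
        return was_busy

    def deregister_runner(self, runner_id):
        self.deregistered = True

def scale_down(runner_id, instance_id, github, terminated):
    if github.is_runner_busy(runner_id):
        return "skipped"
    github.deregister_runner(runner_id)
    if github.is_runner_busy(runner_id):
        return "left running"
    terminated.append(instance_id)
    return "terminated"

def test_busy_after_deregister_is_not_terminated():
    # A job assigned between deregister and re-check must survive.
    gh, terminated = StubGitHub(job_arrives_during_race=True), []
    assert scale_down("r", "i-1", gh, terminated) == "left running"
    assert terminated == []

def test_404_after_deregister_is_terminated():
    # Runner fully removed from GitHub: safe to terminate.
    gh, terminated = StubGitHub(), []
    assert scale_down("r", "i-1", gh, terminated) == "terminated"
    assert terminated == ["i-1"]
```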

The scale-down lambda had a race condition where a job could be assigned
to a runner between checking its busy state and terminating the EC2
instance. This caused in-flight jobs (e.g., Helm deploys) to be killed
mid-execution when the instance was terminated.

The fix adds a post-deregistration busy re-check:

1. Check busy state (fast-path to skip busy runners)
2. Deregister the runner from GitHub (prevents new job assignment)
3. Re-check busy state after deregistration

After deregistration, GitHub cannot assign new jobs to the runner, so the
re-check is stable (no more race window). If the runner became busy
between step 1 and step 2, the in-flight job will complete using its
job-scoped OAuth token and the instance is left for orphan cleanup.
@JVenberg JVenberg closed this Mar 31, 2026
