Fix TOCTOU race condition in scale-down that terminates busy runners #6

Closed

JVenberg wants to merge 1 commit into main from fix-scale-down-race

Conversation

@JVenberg

Summary

  • Fixes a race condition in the scale-down lambda where a job could be assigned to a runner between the busy check and EC2 termination, killing in-flight jobs (e.g., Helm deploys)
  • Adds a post-deregistration busy re-check: after deregistering the runner from GitHub (which prevents new job assignment), we re-check busy state to catch jobs that snuck in during the race window
  • If the runner became busy during the race window, the instance is left running so the job can complete, and is cleaned up as an orphan on a subsequent cycle

Context

On 2026-03-27, the `deploy-ssr-rover` job in the Webapp Deploy workflow was killed mid-Helm-deploy when the scale-down lambda terminated instance `i-06d1f1037840a1952` while it was actively running a job. CloudTrail confirmed the `github-actions-aws-4-ubuntu-scale-down` lambda terminated the instance at exactly the moment the runner received a shutdown signal.

The root cause was a TOCTOU (time-of-check to time-of-use) race:

  1. Scale-down checks GitHub API: "Is this runner busy?" -> No
  2. A job gets assigned to the runner
  3. Scale-down deregisters and terminates the instance
  4. The in-flight job is killed
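The lambda source is not included in this PR, but the racy flow above can be sketched in Python with hypothetical helper names (`is_runner_busy`, `deregister_runner`, `terminate_instance`) and in-memory fakes; this is an illustration of the TOCTOU pattern, not the actual lambda code:

```python
from dataclasses import dataclass

@dataclass
class Runner:
    id: str
    instance_id: str

class FakeGitHub:
    """Stub GitHub API; simulates a job being assigned to the runner
    immediately after the busy check (the race window)."""
    def __init__(self):
        self.busy = False
        self.deregistered = False

    def is_runner_busy(self, runner_id):
        result = self.busy
        self.busy = True  # a job arrives right after the check
        return result

    def deregister_runner(self, runner_id):
        self.deregistered = True

class FakeEC2:
    def __init__(self):
        self.terminated = []

    def terminate_instance(self, instance_id):
        self.terminated.append(instance_id)

def scale_down_racy(runner, github, ec2):
    # Time-of-check: ask GitHub whether the runner is busy -> No.
    if github.is_runner_busy(runner.id):
        return "skipped"
    # RACE WINDOW: a job can be assigned here, after the check
    # but before termination.
    github.deregister_runner(runner.id)
    # Time-of-use: terminate the instance, killing any job that
    # snuck in during the window above.
    ec2.terminate_instance(runner.instance_id)
    return "terminated"
```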

How the fix works

The new flow:

  1. Check busy (fast-path to skip obviously busy runners)
  2. Deregister from GitHub (prevents new job assignment server-side)
  3. Re-check busy (now stable, since no new jobs can be assigned after deregistration)
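The three steps above can be sketched as follows. This is a hedged Python illustration with hypothetical names and in-memory fakes (the real lambda code is not shown in this PR); `is_runner_busy` returning `None` models GitHub answering 404 because the runner was fully removed after deregistration:

```python
class FakeGitHub:
    """Stub GitHub API. busy_after_check simulates a job assigned
    right after the initial (step-1) busy check."""
    def __init__(self, busy_after_check=False):
        self.busy = False
        self.deregistered = False
        self._race = busy_after_check

    def is_runner_busy(self, runner_id):
        if self.deregistered and not self.busy:
            return None  # models GitHub's 404: runner fully removed
        result = self.busy
        if self._race:
            self.busy, self._race = True, False  # the racing job arrives
        return result

    def deregister_runner(self, runner_id):
        self.deregistered = True

def scale_down_fixed(runner_id, instance_id, github, terminated):
    # 1. Fast-path: skip obviously busy runners.
    if github.is_runner_busy(runner_id):
        return "skipped"
    # 2. Deregister first: GitHub can no longer assign new jobs.
    github.deregister_runner(runner_id)
    # 3. Stable re-check. None (404) means the runner is gone from
    #    GitHub, so terminating is safe. True means a job snuck in
    #    during the race window: leave the instance running so the
    #    job completes; orphan cleanup reclaims it later.
    if github.is_runner_busy(runner_id):
        return "left running for orphan cleanup"
    terminated.append(instance_id)
    return "terminated"
```

Deregistering before the final check is what closes the race: once GitHub can no longer hand the runner new work, the busy state can only go from busy to idle, never the reverse.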

If the re-check finds the runner busy, we leave the instance running. The in-flight job continues to completion because the runner worker uses job-scoped OAuth credentials, not the runner registration:

// JobRunner.cs - Worker creates its own connection using job message credentials
var jobServer = HostContext.GetService<IJobServer>();
// Credentials come from the job message's system connection,
// not from the runner's registration with GitHub.
VssCredentials jobServerCredential = VssUtil.GetVssCredential(systemConnection);
Uri jobServerUrl = systemConnection.Url;
VssConnection jobConnection = VssUtil.CreateConnection(jobServerUrl, jobServerCredential, delegatingHandlers);
await jobServer.ConnectAsync(jobConnection);

The worker never checks runner registration status during job execution. Deregistration only affects the listener loop (which polls for new messages), not the worker process running the current job.

Test plan

  • All 68 existing scale-down tests pass
  • New test: runner that becomes busy between deregister and re-check is NOT terminated
  • New test: runner that returns 404 on post-deregister busy check IS terminated (runner fully removed from GitHub)
  • Deploy to test environment and verify scale-down behavior with idle and busy runners
  • Monitor orphan cleanup cycle to ensure instances left running are properly cleaned up
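The two new race tests could be sketched like this. All names here (`StubGitHub`, `scale_down`, the test functions) are hypothetical illustrations of the behaviors listed above, not excerpts from the actual suite:

```python
class StubGitHub:
    """Minimal in-memory stand-in for the GitHub runners API."""
    def __init__(self, job_arrives_during_race=False):
        self.busy = False
        self.deregistered = False
        self._race = job_arrives_during_race

    def is_runner_busy(self, runner_id):
        if self.deregistered and not self.busy:
            return None  # models a 404: runner fully removed from GitHub
        was_busy = self.busy
        if self._race:
            self.busy, self._race = True, False
        return was_busy

    def deregister_runner(self, runner_id):
        self.deregistered = True

def scale_down(runner_id, instance_id, github, terminated):
    if github.is_runner_busy(runner_id):
        return "skipped"
    github.deregister_runner(runner_id)
    if github.is_runner_busy(runner_id):
        return "left running"
    terminated.append(instance_id)
    return "terminated"

def test_busy_after_deregister_is_not_terminated():
    # A job assigned between deregister and re-check must survive.
    gh, terminated = StubGitHub(job_arrives_during_race=True), []
    assert scale_down("r", "i-1", gh, terminated) == "left running"
    assert terminated == []

def test_404_after_deregister_is_terminated():
    # Runner fully removed from GitHub: safe to terminate.
    gh, terminated = StubGitHub(), []
    assert scale_down("r", "i-1", gh, terminated) == "terminated"
    assert terminated == ["i-1"]
```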

The scale-down lambda had a race condition where a job could be assigned
to a runner between checking its busy state and terminating the EC2
instance. This caused in-flight jobs (e.g., Helm deploys) to be killed
mid-execution when the instance was terminated.

The fix adds a post-deregistration busy re-check:

1. Check busy state (fast-path to skip busy runners)
2. Deregister the runner from GitHub (prevents new job assignment)
3. Re-check busy state after deregistration

After deregistration, GitHub cannot assign new jobs to the runner, so the
re-check is stable (no more race window). If the runner became busy
between step 1 and step 2, the in-flight job will complete using its
job-scoped OAuth token and the instance is left for orphan cleanup.
@JVenberg JVenberg closed this Mar 31, 2026
