Summary
Warden check runs (both the core warden check and the per-skill warden: <skill> checks) can get permanently stuck as in_progress when a GitHub Actions runner is cancelled, times out, is OOM-killed, or otherwise exits before the completing checks.update calls run. A later successful Warden run creates new check run IDs and completes those, but never touches the orphaned ones from the dead runner, so they remain visible on the PR as perpetually in progress.
Observed on getsentry/sentry-python#6508: check run 79732958848 (warden + warden: find-bugs) stuck in_progress for 3+ days. A separate successful run created 79733025770 on the same day and completed normally.
Root Cause
The code flow in pr-workflow.ts:
1. setupGitHubState calls createCoreCheck → status: in_progress
2. executeTrigger calls createSkillCheck per skill → status: in_progress
3. On success: updateSkillCheck / updateCoreCheck → status: completed
4. On error: failSkillCheck → status: completed, conclusion: failure
Steps 3 and 4 only run if the process lives long enough to reach them. If GitHub Actions kills the job first (concurrency cancellation, job timeout, OOM, runner loss), those API calls never happen. GitHub does not auto-complete stale in_progress Check Runs.
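The failure mode can be reproduced with a small in-memory model. The sketch below is not Warden's code: InMemoryChecks and runSkillCheck are hypothetical stand-ins for the Checks API and for the createSkillCheck / updateSkillCheck / failSkillCheck flow. It only illustrates that the completing update sits after the awaited work, so a job that dies mid-work never reaches it.

```typescript
type CheckStatus = "in_progress" | "completed";
type Conclusion = "success" | "failure";

interface CheckRecord {
  name: string;
  status: CheckStatus;
  conclusion?: Conclusion;
}

// Hypothetical in-memory stand-in for the GitHub Checks API.
class InMemoryChecks {
  private nextId = 1;
  readonly runs = new Map<number, CheckRecord>();

  create(name: string): number {
    const id = this.nextId++;
    this.runs.set(id, { name, status: "in_progress" });
    return id;
  }

  complete(id: number, conclusion: Conclusion): void {
    const run = this.runs.get(id);
    if (run) {
      run.status = "completed";
      run.conclusion = conclusion;
    }
  }
}

// Mirrors the flow above: create the check, do the work, then complete it.
async function runSkillCheck(
  checks: InMemoryChecks,
  name: string,
  work: () => Promise<void>,
): Promise<number> {
  const id = checks.create(name); // status: in_progress
  try {
    await work();
    checks.complete(id, "success"); // only reached if work() returns
  } catch {
    checks.complete(id, "failure"); // only reached if work() throws
  }
  return id;
}

// A killed runner neither returns nor throws; model it as work that never settles.
const checks = new InMemoryChecks();
void runSkillCheck(checks, "warden: find-bugs", () => new Promise<void>(() => {}));
```

After the last line, the only record in checks.runs is still in_progress, and nothing in this process will ever change that.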
There is no durable cleanup path today:
- No startup reconciliation: warden never queries existing in_progress checks before starting a new run
- No SIGTERM handler: signals.ts only handles SIGINT for the CLI path; run.ts has no teardown hook
- Stale handling in stale.ts / resolveStaleComments operates on PR review comments, not Check Runs
Proposed Fix
Primary: startup reconciliation (durable)
At the start of each PR run, after resolving the PR head SHA and before creating new Warden check runs:
- Call checks.listForRef for the PR head SHA with status: in_progress and filter: all
- Filter to Warden-owned check runs: name === 'warden' || name.startsWith('warden: ')
- Mark stale matches as completed / cancelled with a summary explaining they were orphaned by an interrupted previous run
- Safety: skip checks with started_at less than ~30 minutes ago (avoid cancelling a concurrent active Warden run) and/or set external_id to ${GITHUB_RUN_ID}:${GITHUB_RUN_ATTEMPT} on creation to identify checks belonging to the current process
This is the right durable fix because it handles runs killed by cancellation, timeout, OOM, VM loss, or any exit path where signal handlers never run.
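A minimal sketch of the reconciliation step, under stated assumptions. ChecksClient is a hypothetical narrow interface modelled on the parameters of Octokit's checks.listForRef and checks.update; reconcileOrphanedChecks and STALE_AFTER_MS are illustrative names, not existing Warden code. The sketch applies both safety rules (age and external_id) and reads only the first page of results.

```typescript
// Minimal shapes modelled on the GitHub Checks REST API; only the fields used here.
interface CheckRun {
  id: number;
  name: string;
  status: string;
  started_at: string | null;
  external_id: string | null;
}

// Hypothetical narrow client. With Octokit these would presumably map to
// octokit.rest.checks.listForRef and octokit.rest.checks.update.
interface ChecksClient {
  listForRef(params: {
    owner: string; repo: string; ref: string;
    status: "in_progress"; filter: "all"; per_page: number;
  }): Promise<{ data: { check_runs: CheckRun[] } }>;
  update(params: {
    owner: string; repo: string; check_run_id: number;
    status: "completed"; conclusion: "cancelled"; completed_at: string;
    output: { title: string; summary: string };
  }): Promise<unknown>;
}

const STALE_AFTER_MS = 30 * 60 * 1000; // assumed, from the ~30 minute suggestion

async function reconcileOrphanedChecks(
  client: ChecksClient,
  repo: { owner: string; repo: string },
  headSha: string,
  currentExternalId: string,
  now: Date = new Date(),
): Promise<number[]> {
  // Pagination is omitted: this reads only the first page.
  const { data } = await client.listForRef({
    ...repo, ref: headSha, status: "in_progress", filter: "all", per_page: 100,
  });

  const orphans = data.check_runs.filter((run) => {
    const isWarden = run.name === "warden" || run.name.startsWith("warden: ");
    const isOurs = run.external_id === currentExternalId;
    const startedMs = run.started_at ? Date.parse(run.started_at) : NaN;
    // Comparisons with NaN are false, so checks of unknown age are skipped.
    const isOld = now.getTime() - startedMs >= STALE_AFTER_MS;
    return isWarden && !isOurs && isOld;
  });

  for (const run of orphans) {
    await client.update({
      ...repo, check_run_id: run.id,
      status: "completed", conclusion: "cancelled",
      completed_at: now.toISOString(),
      output: {
        title: "Cancelled: orphaned check run",
        summary: "This check was left in progress by an interrupted previous Warden run.",
      },
    });
  }
  return orphans.map((run) => run.id);
}
```

Returning the cancelled IDs keeps the function easy to log and to test against a fake client.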
Secondary: SIGTERM/process teardown cleanup (best-effort)
Register a SIGTERM handler in run.ts that tracks all check run IDs created by the current process and attempts to cancel them before exit. This complements the startup cleanup but should not be the only fix — SIGKILL and OOM bypass signal handlers entirely, and Actions cancellation may not leave enough time for async API calls.
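A rough sketch of the best-effort teardown. All names here (CheckRunTracker, registerTeardown, CancelCheck, SignalSource) are hypothetical; SignalSource is a minimal slice of Node's process object so that the handler can be exercised with a fake, and CancelCheck stands in for whatever call marks a check run as cancelled.

```typescript
// Hypothetical callback that marks one check run as cancelled (e.g. via checks.update).
type CancelCheck = (checkRunId: number) => Promise<void>;

// Minimal slice of Node's `process` used by this sketch.
interface SignalSource {
  once(signal: "SIGTERM", handler: () => void): unknown;
}

class CheckRunTracker {
  private readonly open = new Set<number>();

  track(id: number): void {
    this.open.add(id);
  }

  // Call after a check completes normally so teardown leaves it alone.
  settle(id: number): void {
    this.open.delete(id);
  }

  // Best effort: attempt every cancellation and swallow individual failures.
  async cancelAll(cancel: CancelCheck): Promise<number[]> {
    const ids = Array.from(this.open);
    const cancelled: number[] = [];
    await Promise.all(
      ids.map(async (id) => {
        try {
          await cancel(id);
          this.open.delete(id);
          cancelled.push(id);
        } catch {
          // Ignored: startup reconciliation is the durable fallback.
        }
      }),
    );
    return cancelled;
  }
}

function registerTeardown(
  source: SignalSource,
  tracker: CheckRunTracker,
  cancel: CancelCheck,
  exit: (code: number) => void,
): void {
  source.once("SIGTERM", () => {
    // 143 = 128 + 15, the conventional exit code for SIGTERM.
    void tracker.cancelAll(cancel).then(() => exit(143));
  });
}
```

In run.ts, the intent would be to pass process as the source and process.exit as exit. The sketch sets no deadline on the API calls, so a real handler would likely also want a short timeout before exiting.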
Key Gotcha: Avoid Cancelling a Live Concurrent Run
The startup cleanup must not blindly cancel all in_progress warden* checks. If two Warden jobs start concurrently for the same commit, they could interfere with each other.
Mitigations:
- Age threshold: only cancel checks with started_at older than 30–60 minutes
- external_id tagging: skip checks whose external_id contains the current GITHUB_RUN_ID
- Acceptance criterion: startup cleanup must not cancel a concurrently running Warden job on the same SHA
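The two mitigations could be combined into a single predicate, roughly as below. buildExternalId, isSafeToCancel, and MIN_AGE_MS are hypothetical names, and the 30 minute default is just the lower end of the range above. One design choice in this sketch: it compares the run ID segment of external_id exactly instead of using a substring test, so a shorter run ID cannot match inside a longer one.

```typescript
interface CheckRunInfo {
  started_at: string | null;
  external_id: string | null;
}

// Builds the `${GITHUB_RUN_ID}:${GITHUB_RUN_ATTEMPT}` tag from the environment.
function buildExternalId(env: Record<string, string | undefined>): string | undefined {
  const runId = env.GITHUB_RUN_ID;
  const attempt = env.GITHUB_RUN_ATTEMPT;
  return runId && attempt ? `${runId}:${attempt}` : undefined;
}

const MIN_AGE_MS = 30 * 60 * 1000; // assumed lower bound of the 30-60 minute range

function isSafeToCancel(
  run: CheckRunInfo,
  currentRunId: string | undefined,
  now: Date,
  minAgeMs: number = MIN_AGE_MS,
): boolean {
  // Never cancel checks tagged with the current workflow run.
  if (currentRunId && run.external_id) {
    const [taggedRunId] = run.external_id.split(":");
    if (taggedRunId === currentRunId) return false;
  }
  // Without a start time the age is unknown, so leave the check alone.
  if (!run.started_at) return false;
  // An unparsable timestamp yields NaN, and NaN >= minAgeMs is false.
  const ageMs = now.getTime() - Date.parse(run.started_at);
  return ageMs >= minAgeMs;
}
```

Treating unknown or unparsable start times as unsafe errs toward leaving a stale check in place, which is the cheaper mistake compared with cancelling a live run.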
Name Matching
Use exact matching to avoid false positives:
name === 'warden' || name.startsWith('warden: ')
Do not use a looser prefix such as name.startsWith('warden'), which would also match warden-prod, warden-lint, etc.
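As a quick illustration of the difference, the matcher can be wrapped in a helper and run against a few names. isWardenCheckName is a hypothetical helper name, and the candidate list is made up for the example.

```typescript
// Matches the core check exactly, or any per-skill check with the "warden: " prefix.
function isWardenCheckName(name: string): boolean {
  return name === "warden" || name.startsWith("warden: ");
}

// Illustrative names only; the last three must not be treated as Warden-owned.
const candidates = ["warden", "warden: find-bugs", "warden-prod", "warden-lint", "Warden"];
const owned = candidates.filter(isWardenCheckName);
// owned is ["warden", "warden: find-bugs"]
```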
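Acceptance Criteria
- Startup cleanup completes stale in_progress warden check runs from prior interrupted runs on the same head SHA
- Cleanup needs checks: write (handle permission error gracefully)
Reported via sentry-python#6508. Action taken on behalf of immutable dcramer.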