fix: stale in_progress check runs left behind when runner is cancelled or killed #402

@sentry-junior

Summary

Warden check runs (both the core warden check and the per-skill warden: <skill> checks) can get permanently stuck as in_progress when a GitHub Actions runner is cancelled, timed out, OOM-killed, or otherwise exits before the completing checks.update calls run. A later successful Warden run creates new check run IDs and completes those, but never touches the orphaned ones from the dead runner, so they stay visible on the PR as perpetually in progress.

Observed on getsentry/sentry-python#6508: check run 79732958848 (warden + warden: find-bugs) stuck in_progress for 3+ days. A separate successful run created 79733025770 on the same day and completed normally.

Root Cause

The code flow in pr-workflow.ts:

  1. setupGitHubState calls createCoreCheck → status: in_progress
  2. executeTrigger calls createSkillCheck per skill → status: in_progress
  3. On success: updateSkillCheck / updateCoreCheck → status: completed
  4. On error: failSkillCheck → status: completed, conclusion: failure

Steps 3 and 4 only run if the process lives long enough to reach them. If GitHub Actions kills the job first (concurrency cancellation, job timeout, OOM, runner loss), those API calls never happen. GitHub does not auto-complete stale in_progress Check Runs.

There is no durable cleanup path today:

  • No startup reconciliation: warden never queries existing in_progress checks before starting a new run
  • No SIGTERM handler: signals.ts only handles SIGINT for the CLI path; run.ts has no teardown hook
  • Stale handling in stale.ts / resolveStaleComments operates on PR review comments, not Check Runs

Proposed Fix

Primary: startup reconciliation (durable)

At the start of each PR run, after resolving the PR head SHA and before creating new Warden check runs:

  1. Call checks.listForRef for the PR head SHA with status: in_progress and filter: all
  2. Filter to Warden-owned check runs: name === 'warden' || name.startsWith('warden: ')
  3. Mark stale matches as completed / cancelled with a summary explaining they were orphaned by an interrupted previous run
  4. Safety: skip checks whose started_at is less than ~30 minutes old (to avoid cancelling a concurrent active Warden run), and/or set external_id to ${GITHUB_RUN_ID}:${GITHUB_RUN_ATTEMPT} on creation so that checks belonging to the current process can be identified and skipped
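
The selection logic in steps 2 and 4 can be sketched as a pure function over the listForRef results. This is only an illustration under stated assumptions: CheckRunLike, selectStaleWardenChecks and STALE_AFTER_MS are hypothetical names, the 30-minute threshold is the lower bound suggested above, and the sketch applies both safety checks together even though the proposal leaves that as "and/or".

```typescript
// Only the check run fields this sketch needs (hypothetical type name).
interface CheckRunLike {
  id: number;
  name: string;
  status: string;
  started_at: string | null;
  external_id: string | null;
}

// Assumed age threshold: 30 minutes, per the safety step above.
const STALE_AFTER_MS = 30 * 60 * 1000;

// Returns the Warden-owned in_progress runs that look orphaned.
function selectStaleWardenChecks(
  runs: CheckRunLike[],
  currentExternalId: string,
  now: Date = new Date(),
): CheckRunLike[] {
  return runs.filter((run) => {
    if (run.status !== 'in_progress') return false;
    // Exact name or exact "warden: " prefix only.
    const isWarden = run.name === 'warden' || run.name.startsWith('warden: ');
    if (!isWarden) return false;
    // Never touch checks created by the current process.
    if (run.external_id === currentExternalId) return false;
    // Without a start time the age is unknown, so be conservative and skip.
    if (run.started_at === null) return false;
    return now.getTime() - Date.parse(run.started_at) >= STALE_AFTER_MS;
  });
}
```

Each run returned here would then be passed to checks.update with status completed and conclusion cancelled, as described in step 3. Keeping the selection free of API calls makes the concurrency rule easy to unit test.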

This is the right durable fix because it handles runs killed by cancellation, timeout, OOM, VM loss, or any exit path where signal handlers never run.

Secondary: SIGTERM/process teardown cleanup (best-effort)

Register a SIGTERM handler in run.ts that tracks all check run IDs created by the current process and attempts to cancel them before exit. This complements the startup cleanup but should not be the only fix — SIGKILL and OOM bypass signal handlers entirely, and Actions cancellation may not leave enough time for async API calls.
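
A minimal sketch of that teardown path, assuming Node's process.once for signal registration. CheckRunTracker, installSigtermCleanup and the 5-second budget are hypothetical; the cancel callback stands in for whatever wraps checks.update in the real code.

```typescript
// Callback that cancels one check run (hypothetical; would wrap checks.update).
type CancelFn = (checkRunId: number) => Promise<void>;

// Tracks check runs created by this process that are not yet completed.
class CheckRunTracker {
  private readonly open = new Set<number>();

  track(id: number): void {
    this.open.add(id);
  }

  complete(id: number): void {
    this.open.delete(id);
  }

  pending(): number[] {
    return Array.from(this.open);
  }

  // Best effort: try to cancel everything still open, but stop waiting
  // after timeoutMs. Individual failures are swallowed.
  async cancelAll(cancel: CancelFn, timeoutMs: number): Promise<void> {
    const work = Promise.all(
      this.pending().map((id) =>
        Promise.resolve()
          .then(() => cancel(id))
          .then(
            () => {
              this.open.delete(id);
            },
            () => {
              // Ignore: startup reconciliation will pick this one up later.
            },
          ),
      ),
    );
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, timeoutMs);
    });
    await Promise.race([work, deadline]);
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Registers a one-shot SIGTERM handler; 143 is the conventional 128 + 15 exit code.
function installSigtermCleanup(tracker: CheckRunTracker, cancel: CancelFn, timeoutMs = 5000): void {
  process.once('SIGTERM', () => {
    void tracker.cancelAll(cancel, timeoutMs).finally(() => process.exit(143));
  });
}
```

The timeout reflects the caveat above: the process may be killed before the API calls finish, so anything left in the tracker still depends on startup reconciliation.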

Key Gotcha: Avoid Cancelling a Live Concurrent Run

The startup cleanup must not blindly cancel all in_progress warden* checks. If two Warden jobs start concurrently for the same commit, they could interfere with each other.

Mitigations:

  • Age threshold: only cancel checks with started_at older than 30–60 minutes
  • external_id tagging: skip checks whose external_id contains the current GITHUB_RUN_ID
  • Acceptance criterion: startup cleanup must not cancel a concurrently running Warden job on the same SHA
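
The external_id mitigation could look roughly like the sketch below. The helper names are hypothetical. GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are the standard Actions environment variables; defaulting the attempt to 1 when it is unset is an assumption of this sketch.

```typescript
// Builds the tag to store in external_id when a check run is created.
// Returns null outside GitHub Actions, where GITHUB_RUN_ID is not set.
function currentRunExternalId(env: Record<string, string | undefined>): string | null {
  const runId = env.GITHUB_RUN_ID;
  if (!runId) return null;
  const attempt = env.GITHUB_RUN_ATTEMPT ?? '1'; // assumed default
  return `${runId}:${attempt}`;
}

// True when a check's external_id was written by the given workflow run.
// Compares the whole run-id segment rather than a substring, so run 123
// does not match a tag written by run 1234.
function belongsToRun(externalId: string | null, runId: string): boolean {
  if (externalId === null) return false;
  return externalId.split(':')[0] === runId;
}
```

Comparing the full segment is slightly stricter than "contains the current GITHUB_RUN_ID", which could otherwise match a longer run ID that happens to include the current one.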

Name Matching

Match the exact name or the exact "warden: " prefix to avoid false positives:

name === 'warden' || name.startsWith('warden: ')

Avoid a looser prefix like name.startsWith('warden'), which could also match warden-prod, warden-lint, etc.
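
As a small illustration, the predicate can be wrapped in a helper (isWardenCheckName is a hypothetical name) so the cleanup and any tests share one definition:

```typescript
// True only for the core check ("warden") and per-skill checks ("warden: <skill>").
function isWardenCheckName(name: string): boolean {
  return name === 'warden' || name.startsWith('warden: ');
}

// Sample names, including the look-alikes mentioned above.
const sampleNames = ['warden', 'warden: find-bugs', 'warden-prod', 'warden-lint', 'Warden'];
const wardenOwned = sampleNames.filter(isWardenCheckName);
console.log(wardenOwned);
```

Only the first two sample names pass the filter. The comparison is case-sensitive, so a check named Warden would not be treated as Warden-owned.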

Acceptance Criteria

  • A new warden run on a PR cancels any stale in_progress warden check runs from prior interrupted runs on the same head SHA
  • A concurrent active warden run on the same SHA is not accidentally cancelled
  • Startup cleanup is skipped if the GitHub token lacks checks: write (handle permission error gracefully)
  • Best-effort SIGTERM handler cancels check runs if the process is given time to clean up
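
For the permission criterion, one possible shape is a wrapper that treats an HTTP 403 from the cleanup as "skip" rather than failing the run. This is a sketch under assumptions: runCleanupBestEffort is a hypothetical name, and it relies on the thrown error carrying a numeric status property, as Octokit request errors do.

```typescript
// Runs the startup cleanup and returns how many checks it cancelled.
// A 403 (token lacks checks: write) is logged and treated as "nothing cleaned";
// any other error is rethrown so real failures stay visible.
async function runCleanupBestEffort(
  cleanup: () => Promise<number>,
  log: (message: string) => void,
): Promise<number> {
  try {
    return await cleanup();
  } catch (err) {
    const status =
      typeof err === 'object' && err !== null && 'status' in err
        ? (err as { status: unknown }).status
        : undefined;
    if (status === 403) {
      log('Skipping stale check cleanup: token cannot write check runs');
      return 0;
    }
    throw err;
  }
}
```

Returning 0 instead of throwing keeps a read-only token from turning an otherwise healthy Warden run into a failure.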

Reported via sentry-python#6508. Action taken on behalf of immutable dcramer.

