fix: stale in_progress check runs left behind when runner is cancelled or killed

## Summary

Warden check runs (both `warden` core and `warden: <skill>` per-skill) can get permanently stuck as `in_progress` when a GitHub Actions runner is cancelled, timed out, OOM-killed, or otherwise exits before the completion `checks.update` calls run. A later successful Warden run creates and completes new check runs under new IDs, but never touches the orphaned ones from the dead runner, so they stay visible on the PR as perpetually in progress.

Observed on [getsentry/sentry-python#6508](https://github.com/getsentry/sentry-python/pull/6508): check run `79732958848` (`warden` + `warden: find-bugs`) stuck in_progress for 3+ days. A separate successful run created `79733025770` on the same day and completed normally.

## Root Cause

The code flow in `pr-workflow.ts`:

1. `setupGitHubState` calls `createCoreCheck` → `status: in_progress`
2. `executeTrigger` calls `createSkillCheck` per skill → `status: in_progress`
3. On success: `updateSkillCheck` / `updateCoreCheck` → `status: completed`
4. On error: `failSkillCheck` → `status: completed, conclusion: failure`

Steps 3 and 4 only run if the process lives long enough to reach them. If GitHub Actions kills the job first (concurrency cancellation, job timeout, OOM, runner loss), those API calls never happen. GitHub does not auto-complete stale `in_progress` Check Runs.

There is no durable cleanup path today:
- No startup reconciliation: warden never queries existing `in_progress` checks before starting a new run
- No SIGTERM handler: `signals.ts` only handles SIGINT for the CLI path; `run.ts` has no teardown hook
- Stale handling in `stale.ts` / `resolveStaleComments` operates on PR review comments, not Check Runs

## Proposed Fix

### Primary: startup reconciliation (durable)

At the start of each PR run, after resolving the PR head SHA and before creating new Warden check runs:

1. Call `checks.listForRef` for the PR head SHA with `status: in_progress` and `filter: all`
2. Filter to Warden-owned check runs: `name === 'warden' || name.startsWith('warden: ')`
3. Mark stale matches as `completed / cancelled` with a summary explaining they were orphaned by an interrupted previous run
4. **Safety** (applied before step 3): skip checks whose `started_at` is less than ~30 minutes old, so a concurrent active Warden run is not cancelled. In addition, or as an alternative, set `external_id` to `${GITHUB_RUN_ID}:${GITHUB_RUN_ATTEMPT}` on creation so that checks belonging to the current process can be identified and skipped.

This is the right durable fix because it handles runs killed by cancellation, timeout, OOM, VM loss, or any exit path where signal handlers never run.
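The steps above could be sketched roughly as follows. This is a minimal illustration, not Warden's actual code: `ChecksClient`, `selectStaleWardenChecks`, `reconcileStaleChecks`, and the summary text are hypothetical names, and the client is a narrow stand-in for whatever wraps `checks.listForRef` and `checks.update`. Only the age-threshold guard is shown here.

```typescript
// Minimal shape of a check run; field names follow the GitHub Checks API.
interface CheckRun {
  id: number;
  name: string;
  status: 'queued' | 'in_progress' | 'completed';
  started_at: string | null;
}

// Hypothetical narrow client so the sketch stays testable. In Warden this
// would presumably be backed by checks.listForRef and checks.update.
interface ChecksClient {
  listInProgress(ref: string): Promise<CheckRun[]>;
  cancel(checkRunId: number, summary: string): Promise<void>;
}

// Assumed wording; the issue only asks for a summary explaining the orphaning.
const ORPHAN_SUMMARY = 'Cancelled: orphaned by an interrupted previous Warden run.';

function isWardenCheck(name: string): boolean {
  return name === 'warden' || name.startsWith('warden: ');
}

// Pure selection step: Warden-owned, still in progress, and old enough that
// it is unlikely to belong to a concurrently running job.
function selectStaleWardenChecks(
  runs: CheckRun[],
  nowMs: number,
  minAgeMs: number,
): CheckRun[] {
  return runs.filter((run) => {
    if (run.status !== 'in_progress' || !isWardenCheck(run.name)) return false;
    if (run.started_at === null) return false; // unknown age: leave it alone
    const startedMs = Date.parse(run.started_at);
    return !isNaN(startedMs) && nowMs - startedMs >= minAgeMs;
  });
}

// Lists in-progress checks for the head SHA and cancels the stale ones.
// Returns the IDs that were cancelled.
async function reconcileStaleChecks(
  client: ChecksClient,
  headSha: string,
  nowMs: number = Date.now(),
  minAgeMs: number = 30 * 60 * 1000, // ~30 minutes, per the safety step above
): Promise<number[]> {
  const runs = await client.listInProgress(headSha);
  const stale = selectStaleWardenChecks(runs, nowMs, minAgeMs);
  for (const run of stale) {
    await client.cancel(run.id, ORPHAN_SUMMARY);
  }
  return stale.map((run) => run.id);
}
```

Keeping the selection step pure makes the age and name rules easy to unit-test without touching the GitHub API.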

### Secondary: SIGTERM/process teardown cleanup (best-effort)

Register a SIGTERM handler in `run.ts` that tracks all check run IDs created by the current process and attempts to cancel them before exit. This complements the startup cleanup but should not be the only fix — SIGKILL and OOM bypass signal handlers entirely, and Actions cancellation may not leave enough time for async API calls.
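A best-effort handler might look roughly like the sketch below. All names here (`CheckRunTracker`, `cancelOpenChecks`, `installSigtermCleanup`, `SignalSource`) are hypothetical, as is the 5-second default deadline; the signal source and exit function are injected so the logic can be exercised without killing a real process.

```typescript
// Hypothetical tracker for check runs created by the current process.
class CheckRunTracker {
  private readonly open = new Set<number>();
  track(id: number): void { this.open.add(id); }
  complete(id: number): void { this.open.delete(id); }
  openIds(): number[] { return Array.from(this.open); }
}

type CancelFn = (checkRunId: number) => Promise<void>;

// Attempts to cancel every still-open check, but resolves after timeoutMs
// at the latest. Individual failures are swallowed: this is best-effort.
function cancelOpenChecks(
  tracker: CheckRunTracker,
  cancel: CancelFn,
  timeoutMs: number,
): Promise<void> {
  const attempts = Promise.all(
    tracker.openIds().map((id) =>
      cancel(id).then(() => tracker.complete(id)).catch(() => undefined),
    ),
  );
  return new Promise<void>((resolve) => {
    const timer = setTimeout(resolve, timeoutMs);
    attempts.then(() => {
      clearTimeout(timer);
      resolve();
    });
  });
}

// Narrow view of an event emitter; Node's `process` is expected to fit it.
interface SignalSource {
  once(signal: 'SIGTERM', handler: () => void): unknown;
}

function installSigtermCleanup(
  source: SignalSource,
  tracker: CheckRunTracker,
  cancel: CancelFn,
  exit: (code: number) => void,
  timeoutMs: number = 5000, // assumed budget; Actions may allow less
): void {
  source.once('SIGTERM', () => {
    // 143 = 128 + 15, the conventional exit code for termination by SIGTERM.
    void cancelOpenChecks(tracker, cancel, timeoutMs).then(() => exit(143));
  });
}
```

In `run.ts` this would presumably be wired up as `installSigtermCleanup(process, tracker, cancel, (code) => process.exit(code))`, with `tracker.track(id)` called after each check run is created and `tracker.complete(id)` after each one is completed.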

## Key Gotcha: Avoid Cancelling a Live Concurrent Run

The startup cleanup must not blindly cancel all `in_progress` `warden*` checks. If two Warden jobs start concurrently for the same commit, each job's startup cleanup could cancel the other's live checks.

Mitigations:
- Age threshold: only cancel checks with `started_at` older than 30–60 minutes
- `external_id` tagging: skip checks whose `external_id` contains the current `GITHUB_RUN_ID`
- Acceptance criterion: startup cleanup must not cancel a concurrently running Warden job on the same SHA
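The two mitigations could be combined into one guard, sketched below under stated assumptions. `TaggedCheckRun`, `buildExternalId`, `belongsToRun`, and `isSafeToCancel` are hypothetical names. The sketch compares the run ID portion of `external_id` exactly rather than with a substring test, on the assumption that a plain "contains" check could confuse one run ID with a longer one that embeds it.

```typescript
// Subset of check run fields this guard needs (names per the Checks API).
interface TaggedCheckRun {
  id: number;
  started_at: string | null;
  external_id: string | null;
}

// Tag applied when a check run is created: `${GITHUB_RUN_ID}:${GITHUB_RUN_ATTEMPT}`.
function buildExternalId(runId: string, runAttempt: string): string {
  return `${runId}:${runAttempt}`;
}

// True when the check was tagged by the given workflow run (any attempt).
function belongsToRun(externalId: string | null, runId: string): boolean {
  if (externalId === null) return false; // untagged checks have no owner info
  return externalId.split(':')[0] === runId;
}

// A check may be cancelled only if it is not ours and is older than the threshold.
function isSafeToCancel(
  run: TaggedCheckRun,
  currentRunId: string,
  nowMs: number,
  minAgeMs: number,
): boolean {
  if (belongsToRun(run.external_id, currentRunId)) return false;
  if (run.started_at === null) return false; // unknown age: do not touch
  const startedMs = Date.parse(run.started_at);
  return !isNaN(startedMs) && nowMs - startedMs >= minAgeMs;
}
```

Note that matching on the run ID alone also skips checks left behind by an earlier attempt of the same workflow run; whether those should instead be cancelled (by comparing the attempt as well) is a design decision this sketch leaves open.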

## Name Matching

Match the exact core name or the exact `warden: ` prefix (including the trailing space) to avoid false positives:
```ts
name === 'warden' || name.startsWith('warden: ')
```

Avoid a looser prefix such as `name.startsWith('warden')`, which would also match unrelated checks like `warden-prod` or `warden-lint`.
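As a quick illustration of the difference, the predicate can be wrapped in a helper (the name `isWardenCheckName` is hypothetical) and run against a few sample names:

```typescript
function isWardenCheckName(name: string): boolean {
  return name === 'warden' || name.startsWith('warden: ');
}

const sampleNames = [
  'warden',
  'warden: find-bugs',
  'warden-prod',
  'warden-lint',
  'warden:find-bugs', // no space after the colon
];

// Only the first two entries pass the strict predicate.
const strictMatches = sampleNames.filter(isWardenCheckName);

// The looser prefix accepts all five, including the unrelated checks.
const looseMatches = sampleNames.filter((name) => name.startsWith('warden'));
```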

## Acceptance Criteria

- [ ] A new warden run on a PR cancels any stale `in_progress` warden check runs from prior interrupted runs on the same head SHA
- [ ] A concurrent active warden run on the same SHA is not accidentally cancelled
- [ ] Startup cleanup is skipped if the GitHub token lacks `checks: write` (handle permission error gracefully)
- [ ] Best-effort SIGTERM handler cancels check runs if the process is given time to clean up
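For the permission criterion, one possible shape is a wrapper that treats an HTTP 403 from the cleanup as "skip and continue" while still surfacing other failures. This is a sketch under assumptions: `isPermissionError` and `runCleanupSafely` are hypothetical names, and it assumes the thrown error exposes the HTTP status on a `status` property and that a missing `checks: write` permission surfaces as a 403.

```typescript
// Assumes the API client attaches the HTTP status code as `status`.
function isPermissionError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    'status' in err &&
    (err as { status: unknown }).status === 403
  );
}

// Runs the startup cleanup; returns false if it was skipped for lack of
// permission, true if it completed. Any other error is rethrown.
async function runCleanupSafely(
  cleanup: () => Promise<void>,
  warn: (message: string) => void,
): Promise<boolean> {
  try {
    await cleanup();
    return true;
  } catch (err) {
    if (isPermissionError(err)) {
      warn('Skipping stale check cleanup: token lacks checks: write.');
      return false;
    }
    throw err;
  }
}
```

Rethrowing non-permission errors keeps real bugs visible; whether the cleanup should instead swallow all errors so it can never block a Warden run is a separate decision.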

---
*Reported via [sentry-python#6508](https://github.com/getsentry/sentry-python/pull/6508). Action taken on behalf of immutable dcramer.*

---
[View Session in Sentry](https://sentry.sentry.io/traces/?project=4510944073809921&query=gen_ai.conversation.id%3A%22slack%3AC0ACFA5JBDX%3A1780917589.932679%22)
