Phase 6 follow-up: heartbeat infra for runtime-check-private-freshness.yml #204

@cbeaulieu-gt

Summary

Phase 6 (#144) ships runtime-check-private-freshness.yml, an alarm scheduled for Mondays at 08:00 UTC that opens an issue when path-scoped drift exists between the manifest's pinned private.ref and claude-configs/main. The alarm emits a signal only when there is drift, so "no issue" can mean the config is genuinely fresh, OR auth is broken, OR the cron failed to fire. From the outside all three look identical.

This issue tracks building a heartbeat trip-wire so that "the alarm is silent" becomes a detectable state distinct from "everything is fresh."

Hard SLA — blocks Phase 7

This issue must land before Phase 7 begins. The freshness alarm is informational on day 1 (it ships in Phase 6 as a baseline) but cannot be treated as load-bearing for any operational decision until heartbeat detection exists. Without that, operators have no way to distinguish a quiet alarm from a broken alarm. Phase 7 work that depends on claude-configs freshness assumptions (image rebuilds triggered by config drift, auto-bumps of pinned private.ref, etc.) cannot proceed safely until heartbeat is in place.

Motivation

Inquisitor Pass 2 on Phase 6's plan (2026-05-05) flagged Charge 4: the heartbeat step in Step 6.5.1 is a stub. It reads issues with a fixed label via gh api and pipes the result to head -1 || true, so it produces no observable output that outlives the workflow run. The accompanying comment in the plan explicitly says "Heartbeat detection is a follow-up; this stub captures the intent." The stub was retained in Phase 6 to acknowledge the silent-failure surface; this issue commits to making it real.

Three resolutions were considered (Pass 2 triage):

Acceptance Criteria

  • runtime-check-private-freshness.yml writes a deterministic timestamp on every run (success OR failure) — either:
    • (a) updates a fixed "heartbeat anchor" issue with a label freshness-alarm-heartbeat, OR
    • (b) updates a status check on a known commit (runtime-build/freshness-heartbeat), OR
    • (c) writes a /heartbeats/freshness.json file to a dedicated repo / a release asset / a GHCR latest-tagged image label
  • An external monitor (a separate workflow scheduled at a different cron offset, OR a GitHub status check freshness rule, OR a manual operator runbook) detects when the heartbeat hasn't updated in the past 8 days (the cron is weekly, so 8 days = one missed cycle plus a day of slack)
  • The detection step opens a NEW issue or fires a NEW alarm — distinct from the drift-detected alarm — when stale
  • CLAUDE.md "CI Runtime" section documents the heartbeat anchor, where to look, and the diagnostic runbook for "alarm appears silent"
  • At least one deliberate-failure rehearsal: disable the workflow's auth (e.g., temporarily revoke its App installation), run a cron cycle, confirm the heartbeat-detection step fires within 8-16 days
  • The "Self-test — assert workflow ran (heartbeat)" stub step in runtime-check-private-freshness.yml is removed as part of this issue's PR
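
The staleness rule in the criteria above can be sketched as a small pure function. This is a minimal illustration, not the monitor's implementation: the names parse_heartbeat, is_stale, and STALE_AFTER are hypothetical, the timestamp format is assumed to be ISO-8601 UTC, and how the timestamp is read from the chosen anchor (issue, status check, or file) is left out.

```python
from datetime import datetime, timedelta

# Assumption: the 8-day window from the acceptance criteria.
STALE_AFTER = timedelta(days=8)


def parse_heartbeat(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as '2026-05-04T08:00:00Z'."""
    # Replace the trailing 'Z' so older Python versions can parse it too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_stale(last_heartbeat: datetime, now: datetime,
             threshold: timedelta = STALE_AFTER) -> bool:
    """Return True when the last heartbeat is older than the threshold."""
    return now - last_heartbeat > threshold


# Last heartbeat written on a Monday at 08:00 UTC.
last = parse_heartbeat("2026-05-04T08:00:00Z")
# One week and one hour later: still inside the window.
print(is_stale(last, parse_heartbeat("2026-05-11T09:00:00Z")))  # False
# Eight days and one hour later: the Monday run never refreshed the anchor.
print(is_stale(last, parse_heartbeat("2026-05-12T09:00:00Z")))  # True
```

Keeping the comparison free of I/O makes it easy to unit-test separately from whichever anchor option is chosen.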

Technical Notes

  • Anchor selection. Option (a) is simplest but couples to GitHub Issues (which can be wiped, archived, or have search behavior change). Option (b) is the most native — status checks have first-class observability in GitHub's UI. Option (c) requires more infra. Recommend (b) if the deployment target is GitHub-only; (a) if you want zero new dependencies.
  • Detection cadence. The external monitor needs to run more frequently than the alarm itself; otherwise it can't catch "alarm missed its slot." A separate GHA cron offset from the alarm's Monday slot (e.g., Tuesdays + Fridays) works, provided consecutive monitor runs are spaced closely enough that a heartbeat crossing the 8-day staleness threshold is noticed promptly.
  • Distinguish from drift signal. The drift-detected alarm and the heartbeat-stale alarm should be visibly different — different issue labels, different titles ("Phase 6: private-config drift" vs "Phase 6: freshness alarm itself appears stale"). An operator skimming the inbox should not confuse "alarm fired" with "alarm broken."
  • Auth-failure self-disclosure. If the workflow can't even start (App permissions revoked, secret missing), GHA's normal failure-notification path catches it via run-fail emails. The heartbeat catches the harder case: workflow runs successfully BUT silently doesn't do its job (e.g., the path-scope filter accidentally skips everything, or git log returns 0 results due to a bad ref).
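
One way to make the "runs successfully but silently does nothing" case visible is to have the heartbeat carry evidence of work done, not just a timestamp. The sketch below is an assumption-heavy illustration: the Heartbeat fields (outcome, ref_resolved, scope_patterns), build_heartbeat, and looks_hollow are all hypothetical, and the issue does not define a heartbeat schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Heartbeat:
    # All field names are hypothetical; no schema is defined in the issue.
    written_at: str      # ISO-8601 UTC timestamp of this run
    outcome: str         # "success" or "failure"
    ref_resolved: bool   # did the pinned private.ref resolve to a commit?
    scope_patterns: int  # number of path-scope patterns the filter loaded


def build_heartbeat(now: datetime, outcome: str,
                    ref_resolved: bool, scope_patterns: int) -> str:
    """Serialize one heartbeat record as JSON with a stable key order."""
    hb = Heartbeat(
        written_at=now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        outcome=outcome,
        ref_resolved=ref_resolved,
        scope_patterns=scope_patterns,
    )
    return json.dumps(asdict(hb), sort_keys=True)


def looks_hollow(record: dict) -> bool:
    """True when a run reported success but evidently examined nothing."""
    return record["outcome"] == "success" and (
        not record["ref_resolved"] or record["scope_patterns"] == 0
    )


now = datetime(2026, 5, 4, 8, 0, tzinfo=timezone.utc)
# A "successful" run whose filter loaded no patterns had nothing to compare.
record = json.loads(build_heartbeat(now, "success", True, 0))
print(looks_hollow(record))  # True
```

With a record like this, the external monitor could treat a hollow heartbeat the same way as a stale one, though which fields reliably indicate "did its job" depends on how the drift check is actually written.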

Out of Scope

  • Phase 6 MVP (Phase 6: digest-bump-PR automation + rollback.yml + freshness alarm + prune-pending.yml #144): ships the alarm with the stub heartbeat retained; this issue replaces the stub with a real one
  • General observability for OTHER Phase 6 workflows (runtime-prune-pending.yml, runtime-rollback.yml, STAGE 5) — those have their own observability surfaces and aren't blocked on this work
  • Cron-firing infrastructure validation (whether GHA itself reliably fires scheduled workflows at the requested cadence) — pre-existing concern outside this issue's scope

🤖 Generated by Claude Code on behalf of @cbeaulieu-gt
