Add severity-tiered incident-response escalation ladder #280

@kai-linux

Goal

Add a severity-tiered incident-response escalation ladder so urgent failures route differently from routine retries. Today the only escalation path is a Telegram alert with /retry and /close buttons; severity is undifferentiated.

Success Criteria

  • New orchestrator/incident_router.py defines severity tiers (sev1 immediate page, sev2 next-business-hour, sev3 regular digest) and a routing config in config.yaml (e.g. incident_routing: { sev1: { telegram_chat: ops, snooze_minutes: 0 }, sev2: { ... } })
  • Existing escalation sites (pr_monitor red-CI, queue exhausted-fallback, agent_scorer SLO burn, deploy_watchdog regression) classify their event into a severity and route via incident_router.escalate(severity, event) instead of calling _send_telegram directly
  • sev1 events bypass the kill-switch (they ARE the kill signal) and include a generated runbook link or an inline checklist
  • sev2/sev3 events are deduplicated within a configurable window so the same recurring incident does not page repeatedly
  • All routed incidents persisted to runtime/incidents/incidents.jsonl with {id, sev, source, event, ack_at, resolved_at} so /ack <id> and /resolve <id> Telegram commands can close them
  • Regression test: a synthetic sev1 routes immediately and bypasses dedup; a synthetic sev3 is deduplicated within the window
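
A minimal sketch of how the routing, dedup, and persistence criteria above could fit together. This is an illustration under stated assumptions, not the intended implementation: the `IncidentRouter` class, the injected `send` callable (standing in for the Telegram sender), the `fingerprint` event key, and the 900-second default window are all hypothetical. Only `escalate(severity, event)` and the record fields come from this issue. Runbook generation and the kill-switch bypass are left out.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional

SEVERITIES = ("sev1", "sev2", "sev3")


@dataclass
class IncidentRouter:
    """Hypothetical router: dedups sev2/sev3 and persists every routed incident."""

    log_path: Path
    send: Callable[[str, dict], None]        # stand-in for the Telegram sender
    dedup_window_s: float = 900.0            # assumed default; meant to be configurable
    clock: Callable[[], float] = time.time   # injectable so tests can control time
    _last_seen: dict = field(default_factory=dict)

    def escalate(self, severity: str, event: dict) -> Optional[str]:
        """Route an event; return the incident id, or None if deduplicated."""
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        now = self.clock()
        # "fingerprint" is an assumed metadata key identifying a recurring incident.
        key = (severity, event.get("source"), event.get("fingerprint"))
        # sev1 always routes; sev2/sev3 are suppressed inside the dedup window.
        if severity != "sev1":
            last = self._last_seen.get(key)
            if last is not None and now - last < self.dedup_window_s:
                return None
        self._last_seen[key] = now
        incident = {
            "id": uuid.uuid4().hex[:8],
            "sev": severity,
            "source": event.get("source"),
            "event": event,
            "ack_at": None,
            "resolved_at": None,
        }
        # Append one JSON object per line so /ack and /resolve can look ids up later.
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
        with self.log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(incident) + "\n")
        self.send(severity, incident)
        return incident["id"]
```

Injecting `clock` and `send` keeps the regression test free of real time and network calls. The window here is measured from the last routed occurrence, which is one possible reading of "within a configurable window".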

Constraints

  • Severity classification must be deterministic from event metadata — no LLM call inside the router (latency/cost)
  • Existing /retry and /close button flows must continue to work for backwards compatibility during migration
  • Default config should be conservative: nothing is sev1 unless a repo explicitly opts in
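
One way to satisfy the determinism and conservative-default constraints is a plain table lookup with an explicit opt-in list for sev1. The sketch below is only an illustration: the `classify` function, the `kind` metadata key, the `sev1_opt_in` config key, and the default severities assigned to each source are assumptions, not decisions recorded in this issue.

```python
from typing import Mapping

# Hypothetical rule table: (source, kind) -> severity. The severities are
# illustrative defaults; none of them is sev1.
DEFAULT_RULES = {
    ("deploy_watchdog", "regression"): "sev2",
    ("agent_scorer", "slo_burn"): "sev2",
    ("pr_monitor", "red_ci"): "sev3",
    ("queue", "exhausted_fallback"): "sev3",
}


def classify(event: Mapping, repo_config: Mapping) -> str:
    """Map event metadata to a severity using table lookups only (no LLM call)."""
    key = (event.get("source"), event.get("kind"))
    severity = DEFAULT_RULES.get(key, "sev3")
    # sev1 is reachable only when the repo config lists this (source, kind) pair.
    opt_in = {tuple(pair) for pair in repo_config.get("sev1_opt_in", [])}
    if key in opt_in:
        severity = "sev1"
    return severity
```

With an empty repo config, `classify({"source": "deploy_watchdog", "kind": "regression"}, {})` returns `"sev2"`. The same event returns `"sev1"` only when the repo lists `["deploy_watchdog", "regression"]` under `sev1_opt_in`.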

Task Type

architecture

Why

Today every escalation looks the same: a Telegram message. A genuinely urgent regression and a routine missing-context blocker arrive on the same channel with the same priority. As more agents come online (deploy watchdog, SLO tracker, dependency watcher), the operator inbox will become unscannable without an escalation tier model.

Re-queued Context

Last agent summary

Rendered prompt is 134078 bytes, exceeding the 100000-byte ceiling.

Blockers

  • Prompt size 134078 bytes exceeds 100000-byte limit.
  • Retrying with more prior-attempt context will not help; the task body itself must be trimmed.

Files changed

  • None

Metadata

  • Assignees: No one assigned
  • Labels: No labels
  • Project status: Backlog
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests