Goal
Add a severity-tiered incident-response escalation ladder so urgent failures route differently from routine retries. Today the only escalation path is a Telegram alert with /retry and /close buttons; severity is undifferentiated.
Success Criteria
- New orchestrator/incident_router.py defines severity tiers (sev1 immediate page, sev2 next-business-hour, sev3 regular digest) and a routing config in config.yaml (e.g. incident_routing: { sev1: { telegram_chat: ops, snooze_minutes: 0 }, sev2: { ... } })
- Existing escalation sites (pr_monitor red-CI, queue exhausted-fallback, agent_scorer SLO burn, deploy_watchdog regression) classify their event into a severity and route via incident_router.escalate(severity, event) instead of calling _send_telegram directly
- sev1 events bypass the kill-switch (they ARE the kill signal) and include a generated runbook link or inline checklist
- sev2/sev3 events are deduplicated within a configurable window so the same recurring incident does not page repeatedly
- All routed incidents are persisted to runtime/incidents/incidents.jsonl with {id, sev, source, event, ack_at, resolved_at} so /ack <id> and /resolve <id> Telegram commands can close them
- Regression test: synthetic sev1 routes immediately and bypasses dedup; synthetic sev3 dedups within window
Constraints
- Severity classification must be deterministic from event metadata — no LLM call inside the router (latency/cost)
- Existing /retry and /close button flows must continue to work for backwards compatibility during migration
- Default config should be conservative: nothing is sev1 unless a repo explicitly opts in
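A conservative default could look like this in config.yaml. Only the incident_routing.sev1 shape appears in the criteria above; default_severity, dedup_window_minutes, and the per-repo allow_sev1 opt-in key are illustrative names:

```yaml
incident_routing:
  default_severity: sev3        # nothing pages unless a repo opts in
  dedup_window_minutes: 15      # applies to sev2/sev3 only
  sev1:
    telegram_chat: ops
    snooze_minutes: 0
  sev2:
    telegram_chat: ops
    snooze_minutes: 60          # next-business-hour batching
  sev3:
    telegram_chat: digest
    snooze_minutes: 1440        # daily digest
repos:
  example-critical-repo:        # hypothetical repo name
    allow_sev1: true            # explicit opt-in per the constraint above
```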
Task Type
architecture
Why
Today every escalation looks the same: a Telegram message. A genuinely urgent regression and a routine missing-context blocker arrive on the same channel with the same priority. As more agents come online (deploy watchdog, SLO tracker, dependency watcher), the operator inbox will become unscannable without an escalation tier model.
Re-queued Context
Last agent summary
Rendered prompt is 134078 bytes, exceeding the 100000-byte ceiling.
Blockers
- Prompt size 134078 bytes exceeds 100000-byte limit.
- Retrying with more prior-attempt context will not help; the task body itself must be trimmed.
Files changed