Problem
#162 / PR #173 fixed the cron's maxDuration alignment and added poison-query backoff. Good. But the cron still has no surfaced telemetry — a hung outbound fetch, a 500 from /api/agent, a Socrata 5xx, or a markAttemptFailed storm all live only in Vercel function logs (which nobody tails) and as a row that quietly stays pending in the demand queue.
For a system whose marketing claim is "ask a question, the agent answers it on its own schedule," the right SLO surface is "what % of cron picks ran to phase=done in the last 24h, and which queries are stuck in poison-backoff." We don't have that.
Fix
Two additions in app/lib/demand.ts + app/api/cron/run-top-query/route.ts:
- Per-run record — when the cron picks a query, INSERT a
cron_runs row with (query_hash, started_at, status='running'). On success: status='done', finished_at, duration_ms. On markAttemptFailed: status='failed', error_excerpt, finished_at.
- Admin endpoint —
GET /api/admin/cron-runs?limit=50 returning the newest rows, plus ?status=failed filter. Add a small panel to app/admin/AdminConsole.tsx showing the last 24h success rate + the 5 most-recent failures with their error excerpts.
Bonus: a single counter cron_runs_failed_24h exposed via /api/cache-stats (or a sibling endpoint) so the existing watchdog can alert when it crosses a threshold.
Code pointers
app/api/cron/run-top-query/route.ts:1-40 — entry, no telemetry
app/lib/demand.ts — add recordCronRun(), finishCronRun(), listCronRuns()
app/api/admin/runs/route.ts — pattern to copy for the new admin endpoint
app/admin/AdminConsole.tsx — where the new panel lands
scripts/watchdog.sh — could optionally curl the new counter
Acceptance
Problem
#162/ PR#173fixed the cron's maxDuration alignment and added poison-query backoff. Good. But the cron still has no surfaced telemetry — a hung outbound fetch, a 500 from/api/agent, a Socrata 5xx, or amarkAttemptFailedstorm all live only in Vercel function logs (which nobody tails) and as a row that quietly stayspendingin the demand queue.For a system whose marketing claim is "ask a question, the agent answers it on its own schedule," the right SLO surface is "what % of cron picks ran to phase=done in the last 24h, and which queries are stuck in poison-backoff." We don't have that.
Fix
Two additions in
app/lib/demand.ts+app/api/cron/run-top-query/route.ts:cron_runsrow with(query_hash, started_at, status='running'). On success:status='done', finished_at, duration_ms. OnmarkAttemptFailed:status='failed', error_excerpt, finished_at.GET /api/admin/cron-runs?limit=50returning the newest rows, plus?status=failedfilter. Add a small panel toapp/admin/AdminConsole.tsxshowing the last 24h success rate + the 5 most-recent failures with their error excerpts.Bonus: a single counter
cron_runs_failed_24hexposed via/api/cache-stats(or a sibling endpoint) so the existing watchdog can alert when it crosses a threshold.Code pointers
app/api/cron/run-top-query/route.ts:1-40— entry, no telemetryapp/lib/demand.ts— addrecordCronRun(),finishCronRun(),listCronRuns()app/api/admin/runs/route.ts— pattern to copy for the new admin endpointapp/admin/AdminConsole.tsx— where the new panel landsscripts/watchdog.sh— could optionally curl the new counterAcceptance
cron_runstable created via the existing migration mechanism/api/admin/cron-runsreturns the last 50 with status filter