cron: run-top-query has no observability — failed runs are invisible until the demand row stays pending #178

@jravinder

Problem

#162 / PR #173 fixed the cron's maxDuration alignment and added poison-query backoff. Good. But the cron still has no surfaced telemetry — a hung outbound fetch, a 500 from /api/agent, a Socrata 5xx, or a markAttemptFailed storm all live only in Vercel function logs (which nobody tails) and as a row that quietly stays pending in the demand queue.

For a system whose marketing claim is "ask a question, the agent answers it on its own schedule," the right SLO surface is "what % of cron picks ran to phase=done in the last 24h, and which queries are stuck in poison-backoff." We don't have that.

Fix

Two additions in app/lib/demand.ts + app/api/cron/run-top-query/route.ts:

  1. Per-run record — when the cron picks a query, INSERT a cron_runs row with (query_hash, started_at, status='running'). On success: status='done', finished_at, duration_ms. On markAttemptFailed: status='failed', error_excerpt, finished_at.
  2. Admin endpoint: GET /api/admin/cron-runs?limit=50 returning the newest rows, plus a ?status=failed filter. Add a small panel to app/admin/AdminConsole.tsx showing the last-24h success rate and the 5 most recent failures with their error excerpts.
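
A minimal sketch of the per-run lifecycle described above, using an in-memory array as a stand-in for the proposed cron_runs table. The function names come from the code pointers in this issue; the field names, the signatures, and the 200-character excerpt cap are assumptions for illustration, not the existing demand.ts API.

```typescript
// Hypothetical in-memory stand-in for the proposed cron_runs table.
type CronRunStatus = "running" | "done" | "failed";

interface CronRun {
  id: number;
  queryHash: string;
  startedAt: number; // epoch ms
  status: CronRunStatus;
  finishedAt?: number;
  durationMs?: number;
  errorExcerpt?: string;
}

const cronRuns: CronRun[] = [];
let nextId = 1;

// Called when the cron picks a query: the INSERT with status='running'.
function recordCronRun(queryHash: string, now: number = Date.now()): number {
  const id = nextId++;
  cronRuns.push({ id, queryHash, startedAt: now, status: "running" });
  return id;
}

// Called on success, or from the markAttemptFailed path on failure.
function finishCronRun(
  id: number,
  outcome: { ok: true } | { ok: false; error: string },
  now: number = Date.now(),
): void {
  const run = cronRuns.find((r) => r.id === id);
  if (!run) throw new Error(`unknown cron run ${id}`);
  run.finishedAt = now;
  // Recorded for both outcomes here; the issue only requires it on success.
  run.durationMs = now - run.startedAt;
  if (outcome.ok) {
    run.status = "done";
  } else {
    run.status = "failed";
    run.errorExcerpt = outcome.error.slice(0, 200); // assumed excerpt cap
  }
}

// Newest first with an optional status filter, mirroring
// GET /api/admin/cron-runs?limit=50&status=failed.
function listCronRuns(limit: number = 50, status?: CronRunStatus): CronRun[] {
  return cronRuns
    .filter((r) => status === undefined || r.status === status)
    .sort((a, b) => b.startedAt - a.startedAt || b.id - a.id)
    .slice(0, limit);
}
```

In the real change the array would be replaced by queries against cron_runs, and the admin route would pass its limit and status query parameters straight through to listCronRuns().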

Bonus: a single counter cron_runs_failed_24h exposed via /api/cache-stats (or a sibling endpoint) so the existing watchdog can alert when it crosses a threshold.
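
The counter and the 24h success rate could be derived from the same rows. The sketch below assumes a row shape of (status, startedAt) and takes all picks in the window as the denominator, so a run stuck in 'running' counts against the rate. Only cron_runs_failed_24h is named in this issue; the other field names and the threshold parameter are hypothetical.

```typescript
type Status = "running" | "done" | "failed";

interface RunRow {
  status: Status;
  startedAt: number; // epoch ms
}

const DAY_MS = 24 * 60 * 60 * 1000;

interface CronStats24h {
  cron_runs_total_24h: number; // hypothetical field
  cron_runs_failed_24h: number; // counter named in the issue
  success_rate_24h: number | null; // null when there were no picks
}

// Aggregate the rows whose pick happened within the last 24h.
function cronStats24h(
  runs: readonly RunRow[],
  now: number = Date.now(),
): CronStats24h {
  const recent = runs.filter(
    (r) => r.startedAt > now - DAY_MS && r.startedAt <= now,
  );
  const done = recent.filter((r) => r.status === "done").length;
  const failed = recent.filter((r) => r.status === "failed").length;
  return {
    cron_runs_total_24h: recent.length,
    cron_runs_failed_24h: failed,
    success_rate_24h: recent.length === 0 ? null : done / recent.length,
  };
}

// Watchdog-side check: alert once failures exceed the threshold.
function shouldAlert(stats: CronStats24h, failedThreshold: number): boolean {
  return stats.cron_runs_failed_24h > failedThreshold;
}
```

Returning null for an empty window keeps "no picks" distinguishable from "0% success", which matters if the panel or watchdog treats a silent cron as its own failure mode.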

Code pointers

  • app/api/cron/run-top-query/route.ts:1-40 — entry, no telemetry
  • app/lib/demand.ts — add recordCronRun(), finishCronRun(), listCronRuns()
  • app/api/admin/runs/route.ts — pattern to copy for the new admin endpoint
  • app/admin/AdminConsole.tsx — where the new panel lands
  • scripts/watchdog.sh — could optionally curl the new counter
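
One way the route entry could guarantee a start row and a finish row per tick is to wrap the existing work in a small helper. This is a sketch under assumptions: runWithTelemetry, the CronRunStore interface, and the 200-character excerpt cap are hypothetical, and the real route would still call markAttemptFailed where it does today.

```typescript
type TickResult =
  | { status: "done" }
  | { status: "failed"; errorExcerpt: string };

// Hypothetical store interface; the issue proposes these helpers live in
// app/lib/demand.ts.
interface CronRunStore {
  recordCronRun(queryHash: string): Promise<number>;
  finishCronRun(id: number, result: TickResult): Promise<void>;
}

// Wraps one cron pick so that a row is opened before the work starts and
// closed whether the work resolves or throws.
async function runWithTelemetry(
  store: CronRunStore,
  queryHash: string,
  work: () => Promise<void>, // e.g. the call to /api/agent
): Promise<TickResult> {
  const id = await store.recordCronRun(queryHash);
  let result: TickResult;
  try {
    await work();
    result = { status: "done" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    result = { status: "failed", errorExcerpt: message.slice(0, 200) };
  }
  await store.finishCronRun(id, result);
  return result;
}
```

A run that is killed by the platform before the catch block executes would still leave a 'running' row behind, which is itself a useful signal for the admin panel.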

Acceptance

  • cron_runs table created via the existing migration mechanism
  • Every cron tick writes one row with start + finish + status
  • /api/admin/cron-runs returns the last 50 with status filter
  • AdminConsole panel renders 24h success rate + 5 most recent failures
  • One smoke test covers the row write on a successful + a failed path

Metadata

Assignees

No one assigned

    Labels

    area:agent (Orchestrator, planner, executor, doom-loop)
    auto-research:2026-05-16 (Filed by auto-research skill on 2026-05-16)
    enhancement (New feature or request)
    fleet:ready (Internal fleet queue, for pickup-next)
