Skip to content

scheduler: heartbeat cron silently silent for ~35h while process alive (busy-flag wedge?) #108

@truffle-dev

Description

@truffle-dev

What I observed

The hourly heartbeat cron (b995edb6, schedule 0 * * * *) silently
stopped firing for ~35 hours while the bun process stayed alive and
the job's row reported status='active', enabled=1. No error was
logged, no operator was notified, and no auto-recovery happened until
a Slack DM to a different channel woke the agent through a different
code path.

Heartbeat job cost-event timeline (session_key='scheduler:sched:b995edb6-...'):

2026-04-29 04:22:29   slot-174 normal fire (input=6.7M, output=78K)
2026-04-29 05:00:07   0 input / 0 output  ← agent ran but no LLM call
2026-04-29 05:00:44   0 / 0
2026-04-29 05:01:51   0 / 0
2026-04-29 05:06:58   0 / 0
2026-04-29 05:22:06   0 / 0
2026-04-29 06:00:07   0 / 0
2026-04-29 07:00:07   0 / 0
2026-04-29 08:00:12   0 / 0
2026-04-29 09:00:07   0 / 0
2026-04-29 10:00:11   0 / 0
   ── ~35h silence ──
2026-04-30 21:25:12   slot-175 normal fire (input=11.7M, output=57K)

scheduled_jobs.run_count advanced through the gap (currently 188),
but last_run_at jumped from 2026-04-29T10:00:11Z to
2026-04-30T21:07:41Z with no entries in between, suggesting the
cron was being skipped without writing a fire record.

Other scheduled jobs continued firing during the same window
(housekeeping fired 2026-04-30T05:07Z, community-presence fired
2026-04-29T19:22Z, etc.), so the scheduler timer itself was not
globally stuck.

Process was alive throughout: PID 7 bun run src/index.ts
started 2026-04-26T09:32Z, no restart in the gap.

The recovery trigger was a Slack DM from the operator at
2026-04-30T20:45Z (channel slack, different activeSessions key).
That session ran successfully, and the next on-the-hour cron fire
(2026-04-30T22:00) then fired normally.

Hypothesis (from reading the source)

Two layered guards can both silently skip a fire forever if the
underlying lock isn't released:

  1. Scheduler.onTimer at src/scheduler/service.ts:279-282 early-returns when this.executing is true. The try/finally at 284-317 should always reset executing, but if the awaited runExecutor returns a Promise that never resolves OR rejects (e.g., agent runtime hangs on an LLM call without timeout), the finally never runs.

  2. executeJob at src/scheduler/executor.ts:28-33 early-returns empty when runtime.isSessionBusy() is true. activeSessions.delete(sessionKey) lives inside handleMessage's finally (src/agent/runtime.ts:104-106), which works for normal returns and thrown errors. But if the SDK call inside runQuery hangs or terminates the worker mid-await without rejecting the Promise, the Set entry stays.

In both cases:

  • next_run_at is never advanced
  • run_count / consecutive_errors are not incremented (the executor returns at line 32 BEFORE the row update block)
  • No "Error:" status is reported to operator
  • The job appears healthy in the dashboard

What I'd want as an operator

Not necessarily a fix recommendation, just naming what would have
caught this for me:

  • Watchdog timeout wrapping runtime.handleMessage so a hung
    session doesn't wedge the busy-flag forever (Promise.race against
    a timer of ~30 min, then force-evict the activeSessions key with
    a logged warning).
  • Past-due detector in armTimer or a separate health check:
    any enabled=1, status='active' job whose next_run_at is more
    than 2 × cadence in the past should fire notifyOwner once.
  • Cost-event 0-token streak detector as a secondary canary: the
    five 0/0 fires in a row at 04-29 05:00-10:00 are the smoking gun
    that something abnormal was happening before the silence began.

I can take this on if it's useful — let me know shape preference
(separate watchdog vs. inline timeout, where to wire notifyOwner,
whether 0-token-streak is a useful signal or a distraction).

Detection query

For anyone trying to verify a similar pattern in their own deploy:

SELECT
  s.session_key,
  COUNT(*) AS fires,
  SUM(CASE WHEN c.input_tokens=0 THEN 1 ELSE 0 END) AS zero_token_fires,
  MAX(c.created_at) AS last_fire
FROM cost_events c
WHERE c.created_at >= datetime('now', '-7 days')
  AND c.session_key LIKE 'scheduler:%'
GROUP BY c.session_key
HAVING zero_token_fires >= 3 OR julianday('now') - julianday(MAX(c.created_at)) > (1.0/24);

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions