feat: replace agent heartbeat with Castellarius-driven liveness detection #506

Merged
MichielDean merged 1 commit into main from remove-agent-heartbeat on May 11, 2026

@MichielDean
Owner

Summary

  • Remove agent-driven heartbeat (ct droplet heartbeat) and replace with Castellarius-driven liveness detection using session log mtime
  • Agents no longer need to call any heartbeat command — the Castellarius passively checks whether the tmux pipe-pane session log has been written to recently
  • This removes context-wasting LLM calls and eliminates an entire class of bugs where agents forgot or mis-called heartbeat

What changed

Removed:

  • ct droplet heartbeat <id> CLI command
  • POST /api/droplets/{id}/heartbeat API endpoint
  • Client.Heartbeat() method and CisternClient.Heartbeat() interface method
  • last_heartbeat_at database column (migration 020 drops it)
  • EventHeartbeat event type and displayInfoHeartbeat() function
  • Heartbeat instruction from agent prompt in session.go
  • heartbeat field from stall event payload
  • HeartbeatInterval config field (renamed to LivenessInterval)
  • heartbeatInterval / heartbeatInProgress / heartbeatRepo (renamed to livenessInterval / livenessCheck / livenessCheckRepo)

Added:

  • sessionLogMtime() function that checks ~/.cistern/session-logs/<repo>-<worker>.log modification time
  • Stall detection now uses session log mtime (with fallback to updated_at for orphans)
  • 16 liveness regression tests covering exit detection, stall detection, orphan recovery, DB integration, and error fallbacks
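As an illustration of the mtime check listed above, here is a minimal Go sketch. It is not the code from this PR: the signature of `sessionLogMtime`, the `sessionLogPath` helper, and the explicit `dir` parameter are assumptions made so the example runs against a temporary directory rather than `~/.cistern/session-logs`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// sessionLogPath builds <dir>/<repo>-<worker>.log.
// Hypothetical helper: the real code may resolve the path differently.
func sessionLogPath(dir, repo, worker string) string {
	return filepath.Join(dir, repo+"-"+worker+".log")
}

// sessionLogMtime returns the modification time of the session log.
// The error from os.Stat is passed through so the caller can decide
// on a fallback (for example when the log does not exist).
func sessionLogMtime(dir, repo, worker string) (time.Time, error) {
	info, err := os.Stat(sessionLogPath(dir, repo, worker))
	if err != nil {
		return time.Time{}, err
	}
	return info.ModTime(), nil
}

func main() {
	// Use a throwaway directory in place of ~/.cistern/session-logs.
	dir, err := os.MkdirTemp("", "session-logs")
	if err != nil {
		fmt.Println("mkdir:", err)
		return
	}
	defer os.RemoveAll(dir)

	// Simulate the agent writing output through tmux pipe-pane.
	logPath := sessionLogPath(dir, "myrepo", "worker1")
	if err := os.WriteFile(logPath, []byte("agent output\n"), 0o644); err != nil {
		fmt.Println("write:", err)
		return
	}

	mtime, err := sessionLogMtime(dir, "myrepo", "worker1")
	if err != nil {
		fmt.Println("stat:", err)
		return
	}
	fmt.Println("written within the last minute:", time.Since(mtime) < time.Minute)
}
```

Returning the `os.Stat` error instead of a zero time keeps the "log missing" case distinguishable from "log is old", which matters for the orphan fallback.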

Why

Agent heartbeats were added when reading agent output was unreliable. Those bugs are now solved, and the heartbeat mechanism:

  • Wastes LLM context (every 60s the agent calls a CLI command)
  • Adds complexity (agent must remember to call it, Castellarius must track the timestamp)
  • Creates event noise (60 events/hour/droplet)
  • Doesn't actually distinguish "alive" from "stuck" better than checking whether the agent's session log is still being written to

The session log mtime is a passive signal — the agent is already writing to it via tmux pipe-pane, so no agent cooperation is needed.
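To make the passive signal concrete, the stall decision could look roughly like the sketch below: prefer the session log mtime, and fall back to `updated_at` when the mtime cannot be read (the orphan case). The function names, the sentinel error, and the 10-minute threshold are all hypothetical; the PR does not state the actual threshold or code structure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoSessionLog stands in for whatever error a failed mtime lookup returns.
var errNoSessionLog = errors.New("session log not found")

// lastActivity picks the timestamp used to judge liveness: the session log
// mtime when it is available, otherwise the droplet's updated_at value.
func lastActivity(logMtime time.Time, mtimeErr error, updatedAt time.Time) time.Time {
	if mtimeErr != nil {
		return updatedAt
	}
	return logMtime
}

// isStalled reports whether the last activity is older than the threshold.
func isStalled(now, last time.Time, threshold time.Duration) bool {
	return now.Sub(last) > threshold
}

// exampleChecks evaluates two scenarios against a fixed clock:
// an agent that wrote to its log 30 seconds ago, and an orphan with
// no readable log whose updated_at is an hour old.
func exampleChecks() (activeStalled, orphanStalled bool) {
	now := time.Date(2026, 5, 11, 12, 0, 0, 0, time.UTC)
	threshold := 10 * time.Minute // hypothetical value

	active := lastActivity(now.Add(-30*time.Second), nil, now.Add(-time.Hour))
	orphan := lastActivity(time.Time{}, errNoSessionLog, now.Add(-time.Hour))

	return isStalled(now, active, threshold), isStalled(now, orphan, threshold)
}

func main() {
	active, orphan := exampleChecks()
	fmt.Println("active agent stalled:", active) // false
	fmt.Println("orphan stalled:", orphan)       // true
}
```

Note that the agent does nothing in either scenario: both inputs are read by the scheduler side only.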

Testing

  • All 13 packages pass (including 16 new liveness regression tests)
  • Existing castellarius, client, and CLI tests updated
  • Migration 020 verified in TestEndToEndSchemaVerification

…tion

Remove the agent-driven heartbeat mechanism (ct droplet heartbeat) and
replace it with Castellarius-driven liveness detection using session log
mtime. Agents no longer need to call any heartbeat command — the
Castellarius checks whether the tmux pipe-pane session log file has been
written to recently, which is a passive signal that requires zero agent
cooperation.

Changes:
- Remove ct droplet heartbeat CLI command and HTTP API endpoint
- Remove Heartbeat() from Client and CisternClient interface
- Remove last_heartbeat_at column (migration 020 drops it; schema.sql updated)
- Remove EventHeartbeat event type and display function
- Replace stall detection: session log mtime replaces LastHeartbeatAt
- Rename heartbeat → liveness throughout (interval, goroutine, config field)
- Remove heartbeat instruction from agent prompt in session.go
- Add sessionLogMtime() function alongside isTmuxAlive()
- Add migration 020 to drop last_heartbeat_at column
- Update README, troubleshooting docs, commands docs
- Add 16 liveness regression tests covering exit detection, stall
  detection, orphan recovery, DB integration, error fallbacks
@MichielDean MichielDean merged commit 7fa94dd into main May 11, 2026
3 checks passed
@MichielDean MichielDean deleted the remove-agent-heartbeat branch May 11, 2026 22:07
MichielDean added a commit that referenced this pull request May 11, 2026
…tion (#507)

## Summary

Extract a shared `internal/sessionlog` package so the session log path
(`~/.cistern/session-logs/<id>.log`) is resolved in one place instead of
three.

**Before:** Hard-coded `filepath.Join(home, ".cistern", "session-logs",
id+".log")` in:
- `internal/cataractae/session.go` — spawn (writes the log)
- `internal/castellarius/scheduler.go` — liveness (reads mtime)
- `cmd/ct/cistern.go` — peek `--raw` (reads content)

**After:** All three use `sessionlog.Path()`, `sessionlog.Mtime()`,
`sessionlog.Read()`, and `sessionlog.EnsureDir()`.

## What changed

- New `internal/sessionlog` package with `Path()`, `Mtime()`, `Read()`,
`EnsureDir()`
- `LogDirFn` and `MtimeFn` are exported for test overrides (same pattern
as `isTmuxAliveFn`)
- Scheduler's `sessionLogMtimeFn` now delegates to `sessionlog.MtimeFn`
- CLI peek's `--raw` mode uses `sessionlog.Read()` instead of manual
`os.Open`
- Cataractae spawn uses `sessionlog.Path()` and `sessionlog.EnsureDir()`
- 6 unit tests for the new package
- Removed `sessionLogDir` variable from CLI peek tests (uses
`sessionlog.LogDirFn` instead)
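
The override pattern described in this list can be sketched as follows. This is an illustrative approximation only: the names `LogDirFn`, `MtimeFn`, `Path()`, and `Mtime()` come from the description, but their signatures here are assumptions, and the code is written as a single `main` package so it runs standalone.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// LogDirFn resolves the session log directory. It is a package-level
// variable so tests can swap it out (assumed signature).
var LogDirFn = func() (string, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".cistern", "session-logs"), nil
}

// Path returns the log path for a session id (assumed signature).
func Path(id string) (string, error) {
	dir, err := LogDirFn()
	if err != nil {
		return "", err
	}
	return filepath.Join(dir, id+".log"), nil
}

// MtimeFn reads the log's modification time; also overridable in tests.
var MtimeFn = func(id string) (time.Time, error) {
	p, err := Path(id)
	if err != nil {
		return time.Time{}, err
	}
	info, err := os.Stat(p)
	if err != nil {
		return time.Time{}, err
	}
	return info.ModTime(), nil
}

// Mtime delegates to MtimeFn so callers never touch the variable directly.
func Mtime(id string) (time.Time, error) { return MtimeFn(id) }

func main() {
	// Test-style override: point the package at a fake directory,
	// then restore the original resolver afterwards.
	orig := LogDirFn
	defer func() { LogDirFn = orig }()
	LogDirFn = func() (string, error) { return "/tmp/fake-logs", nil }

	p, err := Path("myrepo-worker1")
	if err != nil {
		fmt.Println("path:", err)
		return
	}
	fmt.Println(p)
}
```

With one resolver behind a function variable, the writer, the liveness check, and the CLI reader cannot drift apart on the path format, and tests need only a single override point.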

This is a follow-up to #506 (heartbeat removal) and does not depend on
it being merged first — it builds on the same `sessionLogMtimeFn` that
PR already introduced, just moving it to a shared package.

Co-authored-by: Lobsterdog Contributors <noreply@lobsterdog.dev>