feat(health-check): add 9router health workflow#126

Merged: TechNickAI merged 3 commits into main from feature/9router-health-check on May 21, 2026

@TechNickAI
Owner

Summary

  • add a new workflows/9router-health/ workflow for host-gated 9router monitoring
  • keep host-specific details in gitignored CLAUDE.local.md, with public AGENT.md placeholders only
  • detect liveness, /api/health, provider-error signatures, stream-failure ratio, and log bloat within the existing health-check pattern

Scope

  • new workflows/9router-health/AGENT.md
  • new workflows/9router-health/agent_notes.md
  • new workflows/9router-health/logs/.gitkeep
  • CHANGELOG.md entry for the workflow

Host gating and PII discipline

  • applicability is controlled by host-local CLAUDE.local.md (installed, port, base_url, log_dir, restart_command, stderr_log, log_bloat_threshold_gb)
  • public workflow intentionally avoids hardcoded IPs, personal paths, hostnames, launchd labels, chat IDs, tokens, and raw logs
  • non-9router hosts should return HEARTBEAT_OK, log non-applicability, and send no notifications
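
The gating rule above can be sketched as follows. This is a minimal illustration, not the workflow's actual logic: the `- `key: value`` line format is borrowed from the AGENT.md placeholders quoted later in this review, and the names `parse_local_config`, `gate`, and the `RUN_CHECKS` status are hypothetical.

```python
import re

# Assumed line format, modeled on the AGENT.md placeholders: - `key: value`
_KEY_LINE = re.compile(r"^\s*-\s*`(?P<key>[a-z_]+):\s*(?P<value>.*?)`\s*$")


def parse_local_config(text: str) -> dict[str, str]:
    """Collect key/value pairs from the host-local config text."""
    config: dict[str, str] = {}
    for line in text.splitlines():
        match = _KEY_LINE.match(line)
        if match:
            config[match.group("key")] = match.group("value").strip()
    return config


def gate(config: dict[str, str]) -> tuple[bool, str]:
    """Return (applicable, status) for this host.

    Hosts without 9router installed report HEARTBEAT_OK and run no checks.
    RUN_CHECKS is a hypothetical status name used only for this sketch.
    """
    if config.get("installed", "").lower() != "true":
        return False, "HEARTBEAT_OK"
    return True, "RUN_CHECKS"
```

A host whose local file lacks `installed: true` falls through to `HEARTBEAT_OK`, so the caller can log non-applicability and skip notifications.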

Severity / signature table

  • P1: 9router-process-down, 9router-local-probe-failed, severe 9router-stream-failures
  • P2: 9router-anthropic-429-storm, 9router-credential-404, 9router-temperature-deprecate, moderate 9router-stream-failures
  • P3: 9router-log-bloat, 9router-stuck-or-idle-ambiguous

Performance / exec rules

  • target full configured-host run time: ≤ 5s
  • bounded tail-window log scans only
  • gtimeout for /api/health probes when available
  • no heredocs in workflow exec snippets, no oversized commands, mirroring bridge-health discipline
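
One way to keep log scans bounded is to read only the last N bytes of a file before matching. The sketch below assumes a byte-based window and a regex per signature; the workflow itself may bound its scans differently (for example by line count), and `tail_bytes` and `count_hits` are hypothetical names.

```python
import os
import re


def tail_bytes(path: str, max_bytes: int = 256 * 1024) -> str:
    """Return at most the last max_bytes of a file, decoded leniently."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        # Seek back by the window size, but never before the file start.
        fh.seek(max(0, size - max_bytes))
        return fh.read().decode("utf-8", errors="replace")


def count_hits(path: str, pattern: str, max_bytes: int = 256 * 1024) -> int:
    """Count regex matches inside the bounded tail window only."""
    return len(re.findall(pattern, tail_bytes(path, max_bytes)))
```

Because the window starts at an arbitrary byte offset, its first line may be cut in half; that can drop one match at the boundary, and the work per run stays proportional to the window rather than to the log size.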

Dry-run evidence (Cora local, delivery suppressed)

  • listening: node:91124:100.123.215.95:20128
  • /api/health: 200
  • anthropic 429 hits (bounded stderr tail): 566 → normalized degraded / P2 as expected
  • credential 404 hits: 0
  • temperature-deprecate hits: 0
  • log dir: 1386 MB with 5 GB threshold
  • stream sample: 0 recent [STREAM] lines
  • synthetic temp-file injection matched all three regression signatures without mutating real logs or sending notifications
  • artifact path used during local validation: workflows/9router-health/logs/2026-05-20-173832-dryrun.md
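
The log-bloat figure above reduces to a size comparison. A minimal sketch, assuming 1 GB is treated as 1024 MB and the check is a strict greater-than (neither detail is stated in the PR); the function names are hypothetical.

```python
import os


def dir_size_mb(path: str) -> float:
    """Sum file sizes under path, in MB, skipping files that vanish mid-walk."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                continue
    return total / (1024 * 1024)


def is_log_bloat(size_mb: float, threshold_gb: float) -> bool:
    """True when the log directory exceeds the configured threshold."""
    # Assumption: 1 GB == 1024 MB; the workflow may use decimal units instead.
    return size_mb > threshold_gb * 1024
```

With the dry-run numbers, `is_log_bloat(1386, 5)` is `False`, which matches the absence of a `9router-log-bloat` finding in the evidence.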

Rollout plan

  1. merge to main after review
  2. install locally on Cora with delivery.mode: none
  3. require ≥ 3 consecutive green runs and one alert-path test
  4. soak 24h on Cora before any fleet rollout proposal
  5. only roll out to other hosts after Nick approval

Tracking

  • plan: ~/.openclaw-bosun/workspace/memory/plans/2026-05-20-9router-health-check.md
  • preflight: ~/.openclaw-bosun/workspace/memory/plans/2026-05-20-9router-preflight.md

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be79ee9f44

Comment thread CHANGELOG.md
Comment on lines +12 to +13
- `9router-health` workflow — Monitors host-gated 9router liveness, provider-error
signatures, stream failures, and log bloat with public-repo-safe local configuration.

P2: Update README workflow table for new 9router workflow

Adding 9router-health to the changelog without updating the README creates documentation drift for users who rely on the workflow catalog; the repository guideline in AGENTS.md explicitly requires keeping README skill/workflow tables in sync when workflows are added or versioned. In the current tree, the README workflow table (README.md lines 145–155) still omits 9router-health, so discovery and version visibility are immediately inconsistent after this commit.

Comment thread workflows/9router-health/AGENT.md
Comment thread workflows/9router-health/AGENT.md
Comment thread workflows/9router-health/AGENT.md Outdated
…og retention, stale-config protection, first-run discovery

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit c1fb507.

- `port: <local-port>`
- `base_url: http://<host-or-ip>:<port>`
- `log_dir: ~/path/to/9router/logs`
- `restart_command: <local-service-restart-command>`

Required restart_command suppresses alerts despite being unused

Medium Severity

restart_command is listed in the required schema, but the Remediation Posture explicitly disables automatic restarts and no signature's suggested action references this value. When any required key is missing, the first-run logic enters report-only mode with "send no alerts," meaning a host that simply omits an unused restart_command loses all alerting — including P1 process-down notifications. The sibling bridge-health workflow avoids this by keeping restart commands in a separate optional section rather than in the required config schema.
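
The failure mode described above, and one possible shape of a fix, can be sketched as a required/optional split. This is an illustration only and not the change that was merged: the key names come from the PR description, while the split itself, the mode names, and the function names are assumptions modeled on the bridge-health arrangement the review mentions.

```python
# Keys whose absence should push the host into report-only mode.
REQUIRED_KEYS = frozenset(
    {"installed", "port", "base_url", "log_dir", "stderr_log", "log_bloat_threshold_gb"}
)
# Keys that no signature depends on; omitting them must not mute alerts.
OPTIONAL_KEYS = frozenset({"restart_command"})


def missing_required(config: dict[str, str]) -> list[str]:
    """List required keys absent from the host-local config."""
    return sorted(REQUIRED_KEYS - set(config))


def alert_mode(config: dict[str, str]) -> str:
    """Return 'report-only' if any required key is missing, else 'alerting'."""
    return "report-only" if missing_required(config) else "alerting"
```

Under this split, a host that omits only `restart_command` stays in `alerting` mode and still receives P1 process-down notifications.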

@TechNickAI TechNickAI merged commit 745aea1 into main May 21, 2026
8 checks passed