Skip to content

Title: Agent self-hang via foreground until-loop blocks dispatch indefinitely (recurring) #807

@RayKuo-Mantis

Description

@RayKuo-Mantis

Description

Summary

On 2026-05-13 my self-hosted OpenAB instance (claude-agent-acp) hung three times in one day because the Claude agent self-spawned until ... grep -q "X" log ; do sleep 30; done Bash tools where the grep keyword never matched. Each hang blocked the entire Discord dispatch queue for 1-3 hours until I manually docker exec kill the stuck PID inside the container.

This is orthogonal to PR #791 (gateway reconnect) — the Discord connection is healthy throughout; the agent itself is stuck in a Bash subprocess that never returns to the ACP layer.

Repro

# In any active Discord thread with claude-agent-acp:
# Ask Claude to run any long-poll without timeout:
"執行: until false; do sleep 30; done"
# OpenAB will hang forever, queue subsequent messages with wait_ms growing unboundedly

Observed today

Time Until-loop target keyword Why it never matched
14:02 "summary table" Actual log string was "Multi-turn chain summary"
15:19 ===== done|watchdog Sim was watchdog-killed mid-run, done marker never written
15:42 ===== done|fail (chained 2x) Same as above

Pattern: agent writes brittle keyword-based wait without max-iteration timeout. When keyword never appears (typo / runner kill / log format drift), Bash subprocess sleeps forever, ACP dispatch tool blocks, subsequent Discord messages queue silently.

Proposed feature: agent_dispatch_timeout_ms

Add config (config.toml [agent] or env var):

[agent]
dispatch_timeout_ms = 1800000  # 30 min default, 0 = unlimited (current behavior)

When a single dispatch exceeds the timeout:

  1. Kill the agent process / send SIGTERM to claude-agent-acp
  2. Post Discord reply: ⚠️ Previous turn timed out after ${minutes}m. Session reset, please retry.
  3. Optionally: reset session pool so next message starts fresh
  4. Log WARN openab::dispatch: agent timeout with thread_key + elapsed + last seen activity

Lower-priority companion features

  • When wait_ms of pending message > threshold (e.g. 5 min), send a passive 🕒 reaction or 1-line "still on previous turn, X messages queued" so user knows it's not silently dead
  • Per-thread queue depth in metrics

Severity

Lower than PR #791 (network drops more frequent than agent self-hangs in normal use), but higher than expected: I hit this 3× in 6 hours today, all without my agent doing anything malicious — just routine until ... grep ... patterns that broke when log format / sim outcome diverged from agent's assumption.

Saved logs / repro available if useful.

Steps to Reproduce

# In any active Discord thread with claude-agent-acp:
# Ask Claude to run any long-poll without timeout:
"執行: until false; do sleep 30; done"
# OpenAB will hang forever, queue subsequent messages with wait_ms growing unboundedly

Expected Behavior

normal conversation.

Environment

No response

Screenshots / Logs

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions