Description
Summary
On 2026-05-13 my self-hosted OpenAB instance (claude-agent-acp) hung three times in one day because the Claude agent spawned `until ... grep -q "X" log ; do sleep 30; done` Bash tools where the grep keyword never matched. Each hang blocked the entire Discord dispatch queue for 1-3 hours until I manually killed the stuck PID inside the container via `docker exec`.
This is orthogonal to PR #791 (gateway reconnect) — the Discord connection is healthy throughout; the agent itself is stuck in a Bash subprocess that never returns to the ACP layer.
Repro
```shell
# In any active Discord thread with claude-agent-acp:
# Ask Claude to run any long-poll without a timeout:
"Run: until false; do sleep 30; done"
# OpenAB hangs forever; subsequent messages queue with wait_ms growing unboundedly
```
Observed today
| Time | Until-loop target keyword | Why it never matched |
|-------|---------------------------|----------------------|
| 14:02 | `"summary table"` | Actual log string was `"Multi-turn chain summary"` |
| 15:19 | `===== done\|watchdog` | Sim was watchdog-killed mid-run, done marker never written |
| 15:42 | `===== done\|fail` (chained 2x) | Same as above |
Pattern: agent writes brittle keyword-based wait without max-iteration timeout. When keyword never appears (typo / runner kill / log format drift), Bash subprocess sleeps forever, ACP dispatch tool blocks, subsequent Discord messages queue silently.
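On the agent side, the brittle pattern can be hardened by capping the iteration count so the wait can never sleep forever. A minimal sketch (the `wait_for_marker` helper name and its parameters are hypothetical, not part of OpenAB):

```shell
# Hypothetical bounded replacement for `until grep -q ...; do sleep 30; done`.
# If the marker never appears, the loop gives up after max_iters polls
# instead of blocking the ACP dispatch tool indefinitely.
wait_for_marker() {
  log=$1; marker=$2; interval=${3:-30}; max_iters=${4:-60}  # default cap: 60 * 30s = 30 min
  i=0
  until grep -q -- "$marker" "$log" 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$max_iters" ]; then
      echo "timeout waiting for '$marker' in $log" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example: the marker is already in the log, so this returns immediately.
printf '===== done|watchdog\n' > /tmp/sim.log
wait_for_marker /tmp/sim.log '===== done' 1 5 && echo "marker found"
# → marker found
```

The `--` before the pattern keeps markers like `===== done` from being parsed as grep options.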
Proposed feature: `agent_dispatch_timeout_ms`
Add a config option (`config.toml` `[agent]` section, or an env var):
```toml
[agent]
dispatch_timeout_ms = 1800000  # 30 min default, 0 = unlimited (current behavior)
```
When a single dispatch exceeds the timeout:
- Kill the agent process / send SIGTERM to claude-agent-acp
- Post a Discord reply: "⚠️ Previous turn timed out after ${minutes}m. Session reset, please retry."
- Optionally: reset the session pool so the next message starts fresh
- Log `WARN openab::dispatch: agent timeout` with thread_key + elapsed + last-seen activity
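The intended kill-on-timeout semantics can be illustrated at the shell level with coreutils `timeout`; this is only a sketch of the behavior (the real enforcement would live in OpenAB's dispatch path, and `sleep 5` stands in for a stuck agent turn):

```shell
# Sketch of the proposed semantics with illustrative values:
# 0 = unlimited (current behavior); otherwise SIGTERM the turn after the budget.
dispatch_timeout_ms=2000
timeout_s=$((dispatch_timeout_ms / 1000))

if [ "$timeout_s" -eq 0 ]; then
  sh -c 'sleep 1'   # unlimited: let the turn run to completion
else
  # coreutils timeout exits with 124 when it had to kill the command
  timeout --signal=TERM "$timeout_s" sh -c 'sleep 5'
  if [ $? -eq 124 ]; then
    echo "WARN openab::dispatch: agent timeout after ${timeout_s}s"
  fi
fi
```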
Lower-priority companion features
- When `wait_ms` of a pending message exceeds a threshold (e.g. 5 min), send a passive 🕒 reaction or a one-line "still on previous turn, X messages queued" so the user knows it's not silently dead
- Expose per-thread queue depth in metrics
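The threshold rule above is cheap to sketch; all values below are illustrative, not real OpenAB state:

```shell
# Hypothetical passive-notification check: if the oldest pending message has
# waited past the threshold, emit a one-line status instead of staying silent.
threshold_ms=300000    # 5 min
wait_ms=412000         # example: oldest queued message has waited ~6.9 min
queued=3               # example per-thread queue depth

if [ "$wait_ms" -gt "$threshold_ms" ]; then
  echo "🕒 still on previous turn, $queued messages queued"
fi
# → 🕒 still on previous turn, 3 messages queued
```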
Severity
Lower than PR #791 (network drops are more frequent than agent self-hangs in normal use), but higher than expected: I hit this 3× in 6 hours today, all without my agent doing anything malicious, just routine `until ... grep ...` patterns that broke when the log format / sim outcome diverged from the agent's assumptions.
Saved logs / repro available if useful.
Steps to Reproduce
```shell
# In any active Discord thread with claude-agent-acp:
# Ask Claude to run any long-poll without a timeout:
"Run: until false; do sleep 30; done"
# OpenAB hangs forever; subsequent messages queue with wait_ms growing unboundedly
```
Expected Behavior
Normal conversation continues; a single stuck Bash subprocess should not block the dispatch queue indefinitely.
Environment
No response
Screenshots / Logs
No response