Bug Report: Tool Executor Deadlock After Long-Running exec_shell in Durable Task
Environment:
- codewhale version: 0.8.49
- Platform: Windows (PowerShell)
- Model: deepseek-v4-flash
- Mode: agent (durable task)
Summary:
After an exec_shell call with a long timeout (600s) is dispatched inside a
durable task, ALL subsequent tool calls (including read_file, list_dir, and
new exec_shell) hang indefinitely in "running" state. Even new threads/retries
issued hours later cannot recover — they also hang. This is NOT an API server
issue — the DeepSeek API inference stream (agent_reasoning deltas) continues
to work normally throughout.
Steps to Reproduce:
- Create a durable task (task_create) with mode: agent, running in a
workspace that contains a Node.js script.
- The agent calls exec_shell with a long-running command, e.g.:
cd && node extract-pdf.js
with timeout_ms: 600000 (10 minutes).
- The node process does not finish quickly (e.g. it hangs on PDF parsing or
its stdout buffer fills).
- Wait. The 10-minute timeout passes, but exec_shell never returns.
- The agent's next tool calls (e.g. read_file, list_dir) also never return.
- Even hours later, when the runtime retries (creating a new thread for the
same task), the new read_file/list_dir calls also hang.
Expected Behavior:
- exec_shell should return a timeout error after the specified timeout_ms
elapses.
- Subsequent tool calls should execute normally.
- A retry/new thread should work cleanly.
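For illustration, here is a minimal sketch of the timeout behavior I would expect, written in Python only because it is compact. I do not know how codewhale implements exec_shell internally, so the names exec_shell_with_timeout and ShellResult are hypothetical and this is not the project's actual code. It follows the pattern from the Python subprocess documentation: wait with a timeout, kill the child when the timeout expires, then reap it.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShellResult:
    # Hypothetical result type, not codewhale's real one.
    timed_out: bool
    exit_code: Optional[int]
    stdout: str
    stderr: str


def exec_shell_with_timeout(argv: List[str], timeout_ms: int) -> ShellResult:
    """Run argv and kill the child if it outlives timeout_ms."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        # communicate() drains both pipes while waiting, so the child
        # cannot stall on a full stdout buffer.
        out, err = proc.communicate(timeout=timeout_ms / 1000)
        return ShellResult(False, proc.returncode, out, err)
    except subprocess.TimeoutExpired:
        # TerminateProcess on Windows, SIGKILL on POSIX.
        proc.kill()
        # Reap the child and collect whatever output it produced.
        out, err = proc.communicate()
        return ShellResult(True, proc.returncode, out, err)


if __name__ == "__main__":
    # Stand-in for a long-running "node extract-pdf.js".
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    started = time.monotonic()
    result = exec_shell_with_timeout(sleeper, timeout_ms=500)
    print(result.timed_out, round(time.monotonic() - started, 1))
```

One caveat: proc.kill() only terminates the direct child. If the command runs through a shell (as "cd && node ..." would on PowerShell), the node process is a grandchild and may survive while still holding the pipes open, which could make the final communicate() block. On Windows that probably requires killing the whole process tree, for example with a job object or taskkill /T /F.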
Actual Behavior:
- All tool calls from these tasks remain in "status": "running" with
"ended_at": null indefinitely.
- The tool executor appears to enter a global deadlock state that persists
across threads.
Affected Resources (from my session):
- Task task_d233593b — created 2026-06-01 18:36 UTC, tool calls still
hanging 12+ hours later.
- Task task_28d61c4a — created 2026-06-01 18:34 UTC, same behavior.
Timeline of a single affected task (task_d233593b):
timestamp (UTC)  action
───────────────────────────────────────────────────────
18:36:08         Task queued
18:36:10         read_file("extract-pdf.js")
                 → status: running, never ended
18:36:10         list_dir(".")
                 → status: running, never ended
18:36:13         exec_shell("node extract-pdf.js", timeout=600s)
                 → status: running, never ended
02:11:14         [retry] read_file("extract-pdf.js")
                 → status: running, never ended
02:11:14         list_dir(".")
                 → status: running, never ended
02:13:04         read_file("package.json")
                 → status: running, never ended
Key Evidence (why this is a codewhale tool executor bug, NOT an API issue):
- read_file hangs
read_file is a purely local filesystem operation with no network
dependency. If it hangs, the tool executor itself is broken — the API
connection is irrelevant.
- Multiple tool types all hang simultaneously
read_file, list_dir, and exec_shell all hang at the same time. This is a
systemic executor issue, not a per-tool bug.
- API inference stream is healthy
The agent_reasoning delta events streamed in successfully throughout the
entire period (verified via task_read timeline). This proves the DeepSeek
API connection is fully functional and the model is generating output
normally. The hang is purely on the tool execution side.
- New threads cannot recover
Hours later (02:11 UTC), the runtime created a new thread for the same
task. Its very first read_file also hung immediately. The deadlock state
is global/persistent across threads — not a per-thread issue.
- task_list confirms both tasks still stuck
As of the time of this report, both task_d233593b and task_28d61c4a remain
in "running" status with duration_ms: null (no end time recorded).
Root Cause Hypothesis:
The tool executor's internal state machine appears to block on the outstanding
exec_shell future. When the timeout fires but the exec_shell handler fails to
properly abort/cancel the underlying child process (e.g. the node process
running extract-pdf.js), the executor enters a deadlock: its dispatch loop
waits forever for that one slot to free.
Since the executor likely uses a shared channel/queue for all tool calls,
every subsequent submission queues up behind the stuck call and never
executes. This also explains why new threads are affected — they share the
same executor instance.
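The hypothesized failure mode can be reproduced in miniature. The sketch below is Python and is not codewhale code: it assumes, as the hypothesis does, a single shared worker that processes tool calls in order. The function names stuck_exec_shell and read_file_stub are stand-ins I made up for the demonstration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def demo():
    """Show one stuck call blocking an unrelated call on a shared worker."""
    release = threading.Event()

    def stuck_exec_shell() -> str:
        # Stand-in for an exec_shell whose child process never exits.
        release.wait()
        return "exec_shell finished"

    def read_file_stub() -> str:
        # Stand-in for a purely local, normally instant tool call.
        return "file contents"

    # A single worker models one dispatch loop shared by every caller.
    executor = ThreadPoolExecutor(max_workers=1)
    executor.submit(stuck_exec_shell)
    second = executor.submit(read_file_stub)

    try:
        second.result(timeout=0.2)
        blocked = False
    except FutureTimeout:
        # read_file never started: it is queued behind the stuck call.
        blocked = True

    # Freeing the stuck slot lets the queued call run at once.
    release.set()
    after_release = second.result(timeout=5)
    executor.shutdown()
    return blocked, after_release


if __name__ == "__main__":
    print(demo())
```

In this model the second call is not broken at all; it simply never gets a turn. That matches what I observed for read_file and list_dir, although the real cause inside codewhale may differ.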
Suggested Fix Areas:
- exec_shell timeout should force-kill the child process and return a
ToolTimeout error instead of blocking indefinitely.
- The tool executor should have a watchdog or independent per-call timeout
that does not rely on the tool handler's cooperation to be triggered.
- The executor state should be isolated per-thread so one stuck thread does
not poison others.
- Consider adding a hard upper bound on exec_shell execution time regardless
of the user-provided timeout_ms.
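To make the watchdog and upper-bound suggestions concrete, here is a rough Python sketch under stated assumptions. Everything in it is hypothetical (run_with_watchdog, effective_timeout, CallResult, and the 900-second ceiling are names and values I chose for illustration), and it is not meant to describe how codewhale is or should be structured. The idea is that each call gets its own worker and the caller waits with a deadline, so a handler that never returns cannot hold up the caller or later calls.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Optional

MAX_TIMEOUT_S = 900.0  # hypothetical hard ceiling, value chosen arbitrarily


@dataclass
class CallResult:
    status: str  # "ok", "error", or "timeout"
    value: Any = None
    error: Optional[str] = None


def effective_timeout(requested_s: float, ceiling_s: float = MAX_TIMEOUT_S) -> float:
    """Clamp a user-provided timeout to the hard ceiling."""
    return min(requested_s, ceiling_s)


def run_with_watchdog(handler: Callable[..., Any], timeout_s: float, *args: Any) -> CallResult:
    """Run handler in its own thread and stop waiting after the deadline."""
    deadline = effective_timeout(timeout_s)
    box = {}

    def target() -> None:
        try:
            box["value"] = handler(*args)
        except Exception as exc:  # report handler failures to the caller
            box["error"] = repr(exc)

    # Daemon thread: an abandoned handler cannot keep the process alive.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(deadline)

    if worker.is_alive():
        # The handler did not cooperate; return anyway instead of blocking.
        return CallResult("timeout", error=f"ToolTimeout after {deadline:.1f}s")
    if "error" in box:
        return CallResult("error", error=box["error"])
    return CallResult("ok", value=box.get("value"))


if __name__ == "__main__":
    never = threading.Event()
    print(run_with_watchdog(never.wait, 0.2).status)
    print(run_with_watchdog(lambda: "file contents", 0.2).status)
    never.set()
```

This only stops the caller from waiting; the abandoned worker thread and any child process it started still have to be cleaned up separately, which is why the force-kill in the first suggestion is needed as well.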
Additional Data Available on Request:
- Full task_read output for task_d233593b (3873 lines of timeline and events)
- Full task_read output for task_28d61c4a