Skip to content

Bug Report: Tool Executor Deadlock After Long-Running exec_shell in Durable Task #2571

@chinaqy110

Description

@chinaqy110

Bug Report: Tool Executor Deadlock After Long-Running exec_shell in Durable Task

Environment:

  • codewhale version: 0.8.49
  • Platform: Windows (PowerShell)
  • Model: deepseek-v4-flash
  • Mode: agent (durable task)

Summary:
After an exec_shell call with a long timeout (600s) is dispatched inside a
durable task, ALL subsequent tool calls (including read_file, list_dir, and
new exec_shell) hang indefinitely in "running" state. Even new threads/retries
issued hours later cannot recover — they also hang. This is NOT an API server
issue — the DeepSeek API inference stream (agent_reasoning deltas) continues
to work normally throughout.

Steps to Reproduce:

  1. Create a durable task (task_create) with mode: agent, running in a
    workspace that contains a node script.
  2. The agent calls exec_shell with a long-running command, e.g.:
    cd && node extract-pdf.js
    with timeout_ms: 600000 (10 minutes).
  3. The node process does not finish quickly (e.g. it hangs on PDF parsing or
    stdout buffer fills).
  4. Wait. The 10-minute timeout passes, but exec_shell never returns.
  5. The agent's next tool calls (e.g. read_file, list_dir) also never return.
  6. Even hours later, when the runtime retries (creating a new thread for the
    same task), the new read_file/list_dir calls also hang.

Expected Behavior:

  • exec_shell should return a timeout error after the specified timeout_ms
    elapses.
  • Subsequent tool calls should execute normally.
  • A retry/new thread should work cleanly.

Actual Behavior:

  • All tool calls from these tasks remain in "status": "running" with
    "ended_at": null indefinitely.
  • The tool executor appears to enter a global deadlock state that persists
    across threads.

Affected Resources (from my session):

  • Task task_d233593b — created 2026-06-01 18:36 UTC, tool calls still
    hanging 12+ hours later.
  • Task task_28d61c4a — created 2026-06-01 18:34 UTC, same behavior.

Timeline of a single affected task (task_d233593b):

timestamp (UTC) action
───────────────────────────────────────────────────────
18:36:08 Task queued
18:36:10 read_file("extract-pdf.js")
→ status: running, never ended
18:36:10 list_dir(".")
→ status: running, never ended
18:36:13 exec_shell("node extract-pdf.js",
timeout=600s)
→ status: running, never ended
02:11:14 [retry] read_file("extract-pdf.js")
→ status: running, never ended
02:11:14 list_dir(".")
→ status: running, never ended
02:13:04 read_file("package.json")
→ status: running, never ended

Key Evidence (why this is a codewhale tool executor bug, NOT an API issue):

  1. read_file hangs
    read_file is a purely local filesystem operation with no network
    dependency. If it hangs, the tool executor itself is broken — the API
    connection is irrelevant.

  2. Multiple tool types all hang simultaneously
    read_file, list_dir, and exec_shell all hang at the same time. This is a
    systemic executor issue, not a per-tool bug.

  3. API inference stream is healthy
    The agent_reasoning delta events streamed in successfully throughout the
    entire period (verified via task_read timeline). This proves the DeepSeek
    API connection is fully functional and the model is generating output
    normally. The hang is purely on the tool execution side.

  4. New threads cannot recover
    Hours later (02:11 UTC), the runtime created a new thread for the same
    task. Its very first read_file also hung immediately. The deadlock state
    is global/persistent across threads — not a per-thread issue.

  5. task_list confirms both tasks still stuck
    As of the time of this report, both task_d233593b and task_28d61c4a remain
    in "running" status with duration_ms: null (no end time recorded).

Root Cause Hypothesis:

The tool executor's internal state machine appears to block on the outstanding
exec_shell future. When the timeout fires but the exec_shell handler fails to
properly abort/cancel the underlying child process (e.g. the node process
running extract-pdf.js), the executor enters a deadlock: its dispatch loop
waits forever for that one slot to free.

Since the executor likely uses a shared channel/queue for all tool calls,
every subsequent submission queues up behind the stuck call and never
executes. This also explains why new threads are affected — they share the
same executor instance.

Suggested Fix Areas:

  • exec_shell timeout should force-kill the child process and return a
    ToolTimeout error instead of blocking indefinitely.
  • The tool executor should have a watchdog or independent per-call timeout
    that does not rely on the tool handler's cooperation to be triggered.
  • The executor state should be isolated per-thread so one stuck thread does
    not poison others.
  • Consider adding a hard upper bound on exec_shell execution time regardless
    of the user-provided timeout_ms.

Additional Data Available on Request:

  • Full task_read output for task_d233593b (3873 lines of timeline and events)
  • Full task_read output for task_28d61c4a

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    Status
    In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions