daemon leaks pty masters — long-running daemon exhausts macOS pty pool ("out of pty devices") #3

@nc9

Summary

A long-running daemon accumulates pty master fds it never closes. On macOS the default pool is kern.tty.ptmx_max = 511; once the daemon's leaked masters (plus other apps) hit that cap, every taskmux start <task> fails with:

Error: out of pty devices

and no stopped task can be restarted anywhere — across all registered projects.

Evidence (v0.9.10, macOS, Darwin 24.6.0)

Daemon up for a few hours, 18 projects registered, ~20 tasks actually running:

$ ls /dev/ttys* | wc -l
527
$ sysctl kern.tty.ptmx_max
kern.tty.ptmx_max: 511
$ lsof /dev/ptmx | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head
 498 python3.1 76376     <- taskmux daemon
  59 iTerm2 27729
  10 iTermServ 27773

498 pty masters held for ~20 live tasks. Immediately after sudo taskmux daemon restart, the fresh daemon (same workload, all auto_start tasks back up) holds 23 — so ~475 of those fds were leaked, not in use:

  23 python3.1 90453     <- daemon after restart
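
For anyone who wants to track this over time, here is a minimal sketch that tallies /dev/ptmx holders from the text output of lsof /dev/ptmx, equivalent to the awk/sort/uniq pipeline above. The function names count_ptmx_holders and leaked_estimate are made up for illustration and are not part of taskmux; it assumes the default lsof column layout (COMMAND first, PID second).

```python
from collections import Counter


def count_ptmx_holders(lsof_output: str) -> Counter:
    """Count open /dev/ptmx handles per (command, pid) from `lsof /dev/ptmx` text.

    Assumes the default lsof layout: COMMAND in column 1, PID in column 2.
    """
    counts: Counter = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose second field is "PID".
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        counts[(fields[0], int(fields[1]))] += 1
    return counts


def leaked_estimate(held_before_restart: int, held_after_restart: int) -> int:
    """Rough leak estimate: masters held before a restart minus masters held
    by a fresh daemon running the same workload."""
    return held_before_restart - held_after_restart
```

With the numbers above, leaked_estimate(498, 23) gives the ~475 figure. Comparing the daemon's count against kern.tty.ptmx_max periodically would flag the leak well before start begins to fail.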

Likely cause

Master fds aren't closed when a task exits/stops/restarts. Workloads with churny tasks (auto-restart loops, crashing dev servers, periodic restart) leak one master per task start, so uptime × churn eventually exhausts the pool. The supervisor's own restart cycles presumably leak too, which would explain reaching ~475 within hours.
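
I haven't read the daemon's spawn code, so this is only a sketch of the fd lifecycle I'd expect, not a patch. It uses the stdlib pty.openpty and subprocess.Popen; run_task_with_pty is a hypothetical helper, and a real supervisor would presumably read the master from its event loop rather than block like this. The points that matter are that the parent closes its copy of the slave right after spawning, and closes the master on every exit path.

```python
import os
import pty
import subprocess


def run_task_with_pty(argv: list[str]) -> tuple[int, bytes]:
    """Run argv on a fresh pty; return (exit_code, output) with no fds left open."""
    master_fd, slave_fd = pty.openpty()
    try:
        try:
            proc = subprocess.Popen(
                argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd
            )
        finally:
            # The child has its own copies; the parent's slave fd is no longer
            # needed, and keeping it open would prevent EOF on the master.
            os.close(slave_fd)

        chunks = []
        while True:
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                # Linux reports EIO once every slave fd has been closed.
                break
            if not data:
                # Other platforms may report plain EOF instead.
                break
            chunks.append(data)
        return proc.wait(), b"".join(chunks)
    finally:
        # The step the leak suggests is missing: release the master on every
        # path, including a failed spawn or an exception while reading.
        os.close(master_fd)
```

If the daemon keeps masters in a per-task table, the same close would belong wherever a task entry is dropped or replaced on stop/restart, so a restart loop cannot accumulate one fd per cycle.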

Impact / workaround

  • Hard failure of taskmux start/restart for all projects once the pool is exhausted.
  • Recovery requires sudo taskmux daemon restart (root needed for :443/:80 + resolver), bouncing every registered project's tasks.

Related observation (cosmetic but bit us)

While a task's upstream is down, the HTTPS proxy answers requests with its own plain-text body:

taskmux: no upstream for pagecog.localhost hint: run `taskmux start <task>` for the host '@' in project 'pagecog'.

Any app code that surfaces fetch-error/response text verbatim ends up showing that internal hint string in its UI (we shipped exactly that into a wizard error banner before sanitizing). Suggest serving it as a proper 502 with a content-type: text/html error page (and maybe an x-taskmux: 1 header), so app-level error handling can distinguish proxy chrome from upstream responses.
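
To make the suggestion concrete, a rough sketch of what the proxy could return and how an app could detect it. The helper names no_upstream_response and is_proxy_generated are hypothetical, the HTML body is just an example, and the x-taskmux header is the one proposed above, not something taskmux sends today as far as I know.

```python
import html
from http import HTTPStatus

PROXY_MARKER_HEADER = "x-taskmux"  # proposed header name, not an existing one


def no_upstream_response(host: str) -> tuple[int, dict[str, str], bytes]:
    """Build a 502 HTML error page that is clearly marked as proxy-generated."""
    safe_host = html.escape(host)  # the host comes from the request, so escape it
    body = (
        "<!doctype html><html><head><title>502 Bad Gateway</title></head>"
        f"<body><h1>502 Bad Gateway</h1><p>taskmux: no upstream for {safe_host}</p>"
        "</body></html>"
    ).encode("utf-8")
    headers = {
        "content-type": "text/html; charset=utf-8",
        "content-length": str(len(body)),
        PROXY_MARKER_HEADER: "1",
    }
    return HTTPStatus.BAD_GATEWAY.value, headers, body


def is_proxy_generated(headers: dict[str, str]) -> bool:
    """App-side check: did this response come from the proxy, not the upstream?"""
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get(PROXY_MARKER_HEADER) == "1"
```

App code can then branch on is_proxy_generated(response_headers) and show its own generic "service unavailable" message instead of echoing the body.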
