Summary
A long-running daemon accumulates pty master fds it never closes. On macOS the default pool is kern.tty.ptmx_max = 511; once the daemon's leaked masters (plus other apps) hit that cap, every taskmux start <task> fails with:
Error: out of pty devices
and no stopped task can be restarted anywhere — across all registered projects.
Evidence (v0.9.10, macOS, Darwin 24.6.0)
Daemon up for a few hours, 18 projects registered, ~20 tasks actually running:
$ ls /dev/ttys* | wc -l
527
$ sysctl kern.tty.ptmx_max
kern.tty.ptmx_max: 511
$ lsof /dev/ptmx | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head
498 python3.1 76376 <- taskmux daemon
59 iTerm2 27729
10 iTermServ 27773
498 pty masters held for ~20 live tasks. Immediately after sudo taskmux daemon restart, the fresh daemon (same workload, all auto_start tasks back up) holds 23 — so ~475 of those fds were leaked, not in use:
23 python3.1 90453 <- daemon after restart
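The per-process tally above can be scripted if someone wants to watch the count over time. A minimal sketch, assuming the usual `lsof` column layout (COMMAND first, PID second, one header row); `count_holders` is a hypothetical helper name, not something taskmux provides:

```python
from collections import Counter


def count_holders(lsof_output: str) -> Counter:
    """Tally open handles per (command, pid) from `lsof` text output."""
    counts: Counter = Counter()
    # Skip the header row (COMMAND PID USER ...), then count one entry per line.
    for line in lsof_output.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:
            counts[(fields[0], fields[1])] += 1
    return counts
```

Feeding it the output of `lsof /dev/ptmx` (for example via `subprocess.run(["lsof", "/dev/ptmx"], capture_output=True, text=True).stdout`) and calling `.most_common()` should give the same ranking as the `sort | uniq -c` pipeline.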
Likely cause
Master fds aren't closed when a task exits/stops/restarts. Workloads with churny tasks (auto-restart loops, crashing dev servers, periodic restart) leak one master per task start, so uptime × churn eventually exhausts the pool. The supervisor's own restart cycles presumably leak too, which would explain reaching ~475 within hours.
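For illustration, this is the lifecycle we would expect around each task start. It is a sketch only: we have not read the supervisor code, `run_task` is a hypothetical name, and the real daemon presumably streams output asynchronously rather than blocking like this. The point is just that both pty ends get closed on every path, including a failed spawn:

```python
import os
import pty
import subprocess


def run_task(argv: list[str]) -> tuple[int, bytes]:
    """Run argv on a fresh pty and return (exit code, captured output)."""
    master_fd, slave_fd = pty.openpty()
    try:
        try:
            proc = subprocess.Popen(
                argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd
            )
        finally:
            # The child has its own copy of the slave; the parent never needs it.
            os.close(slave_fd)
        chunks = []
        while True:
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                break  # Linux raises EIO once the child side is gone
            if not data:
                break  # macOS/BSD report plain EOF instead
            chunks.append(data)
        return proc.wait(), b"".join(chunks)
    finally:
        # The step this report suspects is missing: release the master
        # whether the task exited, crashed, or never started.
        os.close(master_fd)
```

If the daemon keeps the master in a per-task object instead, the equivalent would be closing it in whatever code path handles exit, stop, and restart.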
Impact / workaround
- Hard failure of taskmux start/restart for all projects once the pool is exhausted.
- Recovery requires sudo taskmux daemon restart (root needed for :443/:80 + resolver), bouncing every registered project's tasks.
Related observation (cosmetic but bit us)
While a task's upstream is down, the HTTPS proxy answers requests with its own plain-text body:
taskmux: no upstream for pagecog.localhost hint: run `taskmux start <task>` for the host '@' in project 'pagecog'.
Any app code that surfaces fetch-error/response text verbatim ends up showing that internal hint string in its UI (we shipped exactly that into a wizard error banner before sanitizing). Suggest serving it as a proper 502 with a content-type: text/html error page (and maybe an x-taskmux: 1 header) so app-level error handling can distinguish proxy chrome from upstream responses.
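A rough sketch of what we mean, with both sides shown in Python for brevity. Everything here is hypothetical (function names, the exact HTML, the `X-Taskmux` header spelling); it only illustrates the shape of the suggestion, not how the proxy is actually built:

```python
import html


def no_upstream_response(host: str) -> tuple[int, dict[str, str], bytes]:
    """Build a 502 response for a host whose upstream task is down."""
    body = (
        "<!doctype html><title>502 Bad Gateway</title>"
        f"<p>taskmux: no upstream for {html.escape(host)}</p>"
    ).encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
        "X-Taskmux": "1",  # marks the response as generated by the proxy itself
    }
    return 502, headers, body


def is_proxy_error(headers: dict[str, str]) -> bool:
    """Client-side check: did the proxy, not the upstream app, answer?"""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return any(name.lower() == "x-taskmux" for name in headers)
```

With a marker header like this, app code can show its own generic "service unavailable" message instead of echoing the proxy's hint text.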