acw is a tmux-side automation layer for Codex sessions.
The current runtime model has seven code-level pieces:
- `bin/auto_continue_watchd.py` - the CLI and command/control plane
  - wires the probe, state, daemon, repair, and UI modules together
- `bin/acw_probe.py` - strict tmux/process/Codex thread discovery
  - pane snippet extraction and thread provenance helpers
- `bin/acw_state.py` - the SQLite-backed manager state store
  - desired session rows plus observed runtime rows
- `bin/acw_daemon.py` - daemon-owned per-session runtime state and shared event-loop helpers
- `bin/acw_rpc.py` - Unix-socket request/response transport between the CLI and the daemon
- `bin/acw_ui.py` - status-table and doctor rendering helpers
- `bin/auto_continue_logwatch.py` - shared low-level Codex-log, pane, and tmux-send helpers
  - still usable as a direct single-session CLI for debugging/tests
  - no longer discovered or spawned by the manager at runtime
The key simplification is that acw no longer launches or discovers one OS
watcher process per managed pane. The daemon is the only long-running
manager-owned process for one tmux socket, and each managed session is a
state machine inside it.
acw is built around a strict 1:1:1:1 relationship:
- one live tmux pane
- one live Codex thread id
- one managed session record
- one live managed runtime row
Window or agent name is metadata attached to that mapping, not the primary identity.
If any of those entities are:
- missing
- duplicated
- mismatched
that is treated as broken state. acw reports the invariant error instead of
guessing.
All manager state is scoped to one tmux server.
The scope key is:
- `AUTO_CONTINUE_TMUX_SOCKET` when explicitly set
- otherwise the live `$TMUX` socket path, except that the default tmux socket is normalized to the empty default scope

This lets:
- the default tmux server behave consistently whether acw is invoked inside or outside tmux
- private harness sockets keep fully isolated daemon/runtime state
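That resolution rule can be sketched roughly as follows (the helper name and the example default-socket path are illustrative assumptions, not the real acw API):

```python
DEFAULT_SCOPE = ""  # the normalized scope for the default tmux socket

def resolve_scope(env: dict, default_socket: str = "/tmp/tmux-1000/default") -> str:
    """Pick the tmux scope key: explicit override, else the live $TMUX socket."""
    explicit = env.get("AUTO_CONTINUE_TMUX_SOCKET")
    if explicit:
        return explicit
    tmux = env.get("TMUX")
    if tmux:
        # $TMUX looks like "<socket-path>,<server-pid>,<session-id>"
        socket_path = tmux.split(",", 1)[0]
        # Normalize the default socket so in-tmux and out-of-tmux
        # invocations land on the same scope.
        if socket_path == default_socket:
            return DEFAULT_SCOPE
        return socket_path
    return DEFAULT_SCOPE
```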
Durable desired state lives in:
`~/.codex/acw_state.sqlite`
The `agents` table holds manager-owned intent for one managed session:
- `thread_id`
- `pane`
- `name`
- `message`
- `tmux_socket`
- `paused`
- `generation`

Agent rows also carry durable health metadata written by the worker:
- `health`
- `health_detail`
- `health_ts`
- `last_continue_at`
The session row is the desired state for the daemon.
Observed managed-worker state lives in:
`~/.codex/acw_state.sqlite`
The `agent_runtimes` table is written by the daemon and describes the
currently attached live session runtime for one managed session:
- daemon-assigned runtime id (`id=` in the UI; it identifies one live managed session runtime and is not an OS pid)
- daemon pid
- runtime status
- paused flag
- generation
- watch log path
- heartbeat timestamp
The runtime row is the daemon-owned observed state. Pane and thread identity
stay in agents, so SQLite enforces the 1:1 mapping between one managed agent,
one pane, one thread, and one live runtime row.
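Under the assumption that the constraints live in the schema itself, the two tables could look roughly like this (column names follow the lists above; types and constraint details are guesses, not the real schema):

```python
import sqlite3

# Illustrative schema: UNIQUE/PRIMARY KEY constraints make SQLite itself
# reject a second agent on the same pane or a second runtime on one thread.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agents (
    thread_id        TEXT PRIMARY KEY,      -- one row per managed thread
    pane             TEXT NOT NULL UNIQUE,  -- one live pane per agent
    name             TEXT,
    message          TEXT,
    tmux_socket      TEXT NOT NULL,
    paused           INTEGER NOT NULL DEFAULT 0,
    generation       INTEGER NOT NULL DEFAULT 0,
    health           TEXT,
    health_detail    TEXT,
    health_ts        REAL,
    last_continue_at REAL
);
CREATE TABLE IF NOT EXISTS agent_runtimes (
    id           INTEGER PRIMARY KEY,       -- daemon-assigned, not an OS pid
    thread_id    TEXT NOT NULL UNIQUE
                 REFERENCES agents(thread_id),  -- one live runtime per agent
    daemon_pid   INTEGER NOT NULL,
    status       TEXT NOT NULL,
    paused       INTEGER NOT NULL,
    generation   INTEGER NOT NULL,
    watch_log    TEXT,
    heartbeat_ts REAL
);
"""

def open_state_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```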
The `daemon_scopes` table tracks one shared daemon heartbeat per tmux scope:
- scope key (`tmux_socket`)
- daemon pid
- control socket path
- heartbeat timestamp
- daemon start timestamp
That lets status and doctor report "shared daemon is down" as one scope
failure instead of inventing many pane-local runtime failures.
The manager still probes tmux and live processes directly for correctness:
- `tmux list-panes -a`
- `tmux display-message`
- tmux pane capture
- one `ps` process snapshot for pane process trees
This is how acw proves:
- pane existence
- live window names
- pane shell pid
- whether a live Codex process exists in the pane
Thread identity is proven only from strong evidence:
- pid -> thread mappings from Codex `state_*.sqlite`
- explicit `codex resume <thread-id>` argv
- thread ids embedded in Codex-owned open files
acw never guesses a thread id from cwd, timestamps, or other weak signals.
The daemon reads:
`~/.codex/log/codex-tui.log`
It uses that log for:
- completion events
- interrupt events
- startup replay of one pending completion
The older design used one auto_continue_logwatch.py process per pane. That
caused a lot of avoidable complexity:
- one watcher pid per pane
- pid files as primary runtime identity
- repeated `ps` scans just to discover watcher state
- pause/resume implemented as OS stop/continue
- duplicated global `codex-tui.log` tailers
The shared daemon simplifies the control plane:
- one daemon process per tmux socket
- one reconcile loop
- one desired-state model
- one runtime-state model
- per-session state machines instead of manager-owned per-pane processes
- one shared `codex-tui.log` tail feeding all managed sessions
- no manager-side legacy process migration path
The hidden `_daemon` subcommand is started on demand by the CLI.
At a high level the loop does this:
- Drain new Codex events from one shared `codex-tui.log` tail.
- Load all session rows for the current tmux socket scope.
- Reconcile them against the in-memory per-session runtime map.
  - Create missing runtimes.
  - Drop runtimes whose session was deleted.
  - Reinitialize runtimes whose `(pane, message, generation)` signature changed.
- Route the drained events to the matching session runtime by `thread_id`.
- Refresh session health and upsert one runtime row per live managed session.
- Exit automatically after a short idle period when the scope has no sessions.
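A minimal sketch of that reconcile step, assuming runtimes are plain objects keyed by thread id (function and field names are illustrative, not the real acw API):

```python
def reconcile(sessions, runtimes, make_runtime):
    """One pass of the daemon reconcile step.

    sessions: {thread_id: session row dict} -- desired state from SQLite
    runtimes: {thread_id: runtime object}   -- in-memory per-session state
    make_runtime: factory building a fresh runtime from a session row
    """
    # Drop runtimes whose session row was deleted.
    for thread_id in list(runtimes):
        if thread_id not in sessions:
            del runtimes[thread_id]
    for thread_id, row in sessions.items():
        signature = (row["pane"], row["message"], row["generation"])
        runtime = runtimes.get(thread_id)
        # Create missing runtimes, and reinitialize on a signature change.
        if runtime is None or runtime.signature != signature:
            runtimes[thread_id] = make_runtime(row)
    return runtimes
```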
That means normal commands are cheap:
- they ensure the daemon exists
- they send one local Unix-socket RPC to the daemon
- the daemon persists desired state and waits for runtime convergence
- progress lines stream back over the same socket
They do not spawn or discover per-pane manager processes anymore.
Each tmux scope now has one daemon control socket:
- `$XDG_RUNTIME_DIR/acw-<uid>/acw_daemon.<scope>.sock`
- falls back to `/tmp/acw-<uid>/acw_daemon.<scope>.sock` when `XDG_RUNTIME_DIR` is not set
- kept out of `~/.codex/` so that long private test-home paths do not exceed the AF_UNIX socket-path limit
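A sketch of that path computation (helper name is illustrative; a real scope key derived from a socket path would also need escaping before being embedded in a filename):

```python
import os

def daemon_socket_path(scope: str, uid: int, env: dict) -> str:
    """Compute the per-scope daemon control socket path."""
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    base = runtime_dir if runtime_dir else "/tmp"
    # A short, stable base keeps the path under the AF_UNIX limit
    # (typically around 108 bytes) even when $HOME is a long test path.
    return os.path.join(base, f"acw-{uid}", f"acw_daemon.{scope}.sock")
```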
The CLI uses it for mutating commands like:
- `start`
- `stop`
- `pause`
- `resume`
- `restart`
- `edit`

The transport is intentionally small:
- local Unix domain socket only
- one JSON request per connection
- streamed `progress` messages
- one final `result` or `error`
SQLite remains the durable source of truth, but the socket is the fast control
path. That removes the old manager-side "write state, then poll the database"
behavior from normal mutations and gives `restart --restart-codex` and similar
flows a place to surface step-by-step daemon progress.
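A minimal client-side sketch of that transport, assuming newline-delimited JSON messages (the exact wire framing and message keys are assumptions, not the real acw_rpc protocol):

```python
import json
import socket

def rpc_call(sock_path: str, request: dict, on_progress=print):
    """Send one JSON request over a Unix socket and collect the final reply.

    Assumed wire format: any number of {"progress": ...} lines, then exactly
    one {"result": ...} or {"error": ...} line, all newline-delimited JSON.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(sock_path)
        conn.sendall(json.dumps(request).encode() + b"\n")
        buf = b""
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                raise RuntimeError("daemon closed the connection early")
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                msg = json.loads(line)
                if "progress" in msg:
                    on_progress(msg["progress"])  # stream step-by-step lines
                elif "error" in msg:
                    raise RuntimeError(msg["error"])
                else:
                    return msg["result"]
```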
Each managed session runtime inside the daemon keeps:
- pane
- thread id
- continue message
- generation
- paused flag
- last handled turn
- one pending startup completion, if any
- recent health state
- watch-log path
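A minimal dataclass sketch of that per-session state (field names mirror the list above and are illustrative, not the real acw_daemon types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionRuntime:
    pane: str
    thread_id: str
    message: str            # the continue message
    generation: int
    paused: bool = False
    last_handled_turn: int = 0
    pending_startup_completion: Optional[dict] = None
    health: str = "unknown"
    watch_log_path: str = ""

    @property
    def signature(self):
        # Reconcile reinitializes the runtime when this tuple changes.
        return (self.pane, self.message, self.generation)
```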
The daemon uses shared helper code from `bin/auto_continue_logwatch.py` for:
- Codex completion parsing
- interrupt parsing
- pane-visible error detection
- pending-startup completion detection
- health evaluation
- tmux `send-keys`
The important daemon-specific behaviors are:
- `pause` is a logical session flag, not `SIGSTOP`
- `resume` still restarts the generation instead of thawing an old reader, so stale log backlog is not replayed
- one shared log tail means the daemon never duplicates global log reads across sessions
This preserves the correctness fix from the old resume rewrite without keeping per-pane OS processes around.
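One way a logical pause can still consume events, so nothing queues up for replay on resume (function and field names are illustrative):

```python
def handle_completion(runtime, turn: int, send_keys) -> bool:
    """React to one Codex completion event for a managed session.

    Pause is a logical flag: a paused session still consumes events (so no
    backlog accumulates), it just does not send the continue message.
    Returns True when a continue was actually sent.
    """
    if turn <= runtime.last_handled_turn:
        return False  # already handled; the shared tail may re-deliver
    runtime.last_handled_turn = turn
    if runtime.paused:
        return False  # logical pause, not SIGSTOP
    send_keys(runtime.pane, runtime.message)
    return True
```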
start now means:
- Resolve the target pane.
- Prove the live Codex thread.
- Resolve or edit the continue message.
- Reject conflicting pane/thread/session state.
- Upsert the agent row for that pane/thread.
- Ensure the shared daemon is running.
- Wait for the matching runtime row.
stop is destructive manager intent:
- remove the saved session row
- let the daemon tear down the runtime row
- forget the session from `acw status`
`pause` sets `paused=true` in the saved session row and waits for the daemon
runtime row to reflect that state.
resume does not unfreeze an old reader in place.
Instead it restarts the managed session generation from the current pane/thread state so backlog is not replayed.
restart is implemented as:
- resolve pane and prove thread
- load the saved message unless `--message` overrides it
- optionally relaunch Codex with `codex resume <thread-id>`
- bump the session `generation`
- ensure the daemon is running
- wait for the runtime row with the new generation
With no target, restart iterates all live managed panes for the current
scope.
The Codex relaunch path is still explicit and strict:
- prove the current thread
- terminate the live Codex subprocess tree in the pane
- wait for a real shell prompt
- run `codex resume <thread-id>`
- handle the directory-trust prompt if it appears
- request the new session generation
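The tmux side of that flow could be built roughly like this (command construction only; interrupting with `C-c` is an assumption, and the real flow waits for the process tree to die and for a shell prompt between the two steps):

```python
def relaunch_commands(pane: str, thread_id: str) -> list:
    """Build the tmux commands for the strict Codex relaunch path.

    Only the argv lists are shown; a caller would run each with subprocess
    and interleave the waits described above.
    """
    return [
        # Interrupt the live Codex subprocess tree in the pane.
        ["tmux", "send-keys", "-t", pane, "C-c"],
        # Once a real shell prompt is back, resume the proven thread.
        ["tmux", "send-keys", "-t", pane, f"codex resume {thread_id}", "Enter"],
    ]
```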
edit is just:
- prove pane and thread
- seed the editor from the saved session message
- persist the new message
- restart the session generation if the pane is managed
status builds a fresh runtime model from:
- live panes
- live managed runtime rows
- saved session rows
It does not collapse ambiguity away. If there are duplicate or missing entities, the table shows them as invariant errors.
doctor is the strict proof path for one target.
It reports:
- tmux reachability
- auth/state prerequisites
- target resolution
- live thread identity
- thread provenance
- runtime/session agreement
- a concrete next command when possible
repair is built on the same invariant model as status and doctor.
It only auto-applies safe changes, such as:
- recreating a missing saved session for a proven live pane
- restarting one mismatched managed session generation
- syncing the saved session name to the live tmux window name
It does not guess how to resolve genuinely ambiguous multi-pane or multi-thread corruption.
cleanup is intentionally narrower than repair.
It only removes stale artifacts, such as:
- dead runtime rows from a vanished daemon
- dead session rows whose thread has neither a live pane nor a live worker
The command-layer runtime model lives in `bin/auto_continue_watchd.py`, while
the probing, persistence, daemon-session, and UI helpers now live in
`bin/acw_probe.py`, `bin/acw_state.py`, `bin/acw_daemon.py`, and
`bin/acw_ui.py`.
The main record types are:
- `LivePaneRecord`
- `LiveWatcherRecord`
- `SavedSessionRecord`
- `RuntimeSnapshot`
- `RuntimeModel`
These are indexed separately by:
- pane id
- thread id
- session name
The critical design rule is:
never merge multiplicity away.
If two workers claim one thread, or two panes claim one thread, the model keeps both and reports the invariant violation.
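A sketch of an index that keeps multiplicity visible instead of overwriting duplicates (names are illustrative):

```python
from collections import defaultdict

def index_by(records, key):
    """Index records without merging multiplicity away: every key maps to a
    list, so duplicates stay visible instead of silently overwriting."""
    index = defaultdict(list)
    for record in records:
        index[key(record)].append(record)
    return index

def invariant_errors(index, entity: str):
    """Report each key claimed by more than one record as an invariant error."""
    return [
        f"{entity} {k!r} claimed by {len(v)} records"
        for k, v in sorted(index.items())
        if len(v) > 1
    ]
```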
Thread discovery is strict by design.
For one live pane:
- `tmux list-panes` provides the pane shell pid.
- One `ps` snapshot walks the process tree below that shell.
- Any live `codex` process is mapped to a thread id through:
  - Codex SQLite pid rows first
  - then explicit `resume <thread-id>` argv
  - then Codex-owned open files
If that still does not prove a thread, acw returns an error.
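The evidence ordering above can be sketched as follows (the open-file pattern matched in step 3 is purely illustrative, not Codex's real file layout):

```python
def prove_thread_id(codex_pid, sqlite_pid_map, argv, open_files):
    """Strict thread proof: strong evidence in order, or an error.

    sqlite_pid_map: {pid: thread_id} from Codex state_*.sqlite
    argv: the codex process argv list
    open_files: paths the codex process holds open
    """
    # 1. Codex's own pid -> thread mapping is the strongest evidence.
    if codex_pid in sqlite_pid_map:
        return sqlite_pid_map[codex_pid]
    # 2. An explicit `codex resume <thread-id>` argv.
    if "resume" in argv:
        i = argv.index("resume")
        if i + 1 < len(argv):
            return argv[i + 1]
    # 3. Thread ids embedded in Codex-owned open files (pattern is made up).
    for path in open_files:
        for part in path.split("/"):
            if part.startswith("thread-"):
                return part[len("thread-"):]
    # Never guess from cwd or timestamps: no proof is an error.
    raise LookupError("could not prove a thread id for this pane")
```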
The tmux window-renamed hook is still installed by the manager.
When a window name changes, acw syncs the saved session name field for the
thread attached to panes in that window.
Because pane/thread identity is rebuilt from live state on every command, a
rename is naturally visible to later status, doctor, and edit calls even
without trusting the hook alone.
Durable manager-owned state lives under `~/.codex/`; only the daemon control
socket lives outside it.
Important files:
- `acw_state.sqlite`
  - `agents` table: desired state for one managed session
  - `agent_runtimes` table: observed daemon runtime state for one live managed session
  - `daemon_scopes` table: daemon heartbeat/control metadata for one tmux scope
- `acw_daemon.<scope>.pid` - daemon pid for one tmux socket scope
- `acw_daemon.<scope>.log` - daemon stdout/stderr log
- `auto_continue_logwatch.<pane>.log` - per-session event log written by the daemon
The daemon rewrite improves command latency in three ways:
- no manager-side discovery of one OS watcher process per pane
- one shared `codex-tui.log` reader instead of one per managed session
- no process spawning for normal start/edit/restart flows
On this checkout after the rewrite:
- `status` is about 0.57s
- `edit %0` is about 0.23s
After removing the transitional legacy layer and collapsing the daemon to one
shared log tail plus per-session state objects, a direct
`_build_runtime_snapshot()` microbenchmark on this checkout is about 85ms
best-of-3 for 5 loops.
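The quoted numbers follow a best-of-N shape that can be reproduced with a tiny helper like this (illustrative, not the project's actual benchmark harness):

```python
import time

def best_of(fn, repeats: int = 3, loops: int = 5) -> float:
    """Best-of-`repeats` average over `loops` calls of fn, in seconds.

    The same shape as the best-of-3 / 5-loop microbenchmark quoted above.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        best = min(best, (time.perf_counter() - start) / loops)
    return best
```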
The remaining cost is mostly:
- live tmux metadata
- one process-tree snapshot for thread proof
- rendering
The rewrite is covered at three levels:
- unit tests for manager/runtime logic in `test/test_watchd_unit.py`
- unit tests for worker logic in `test/test_logwatch_unit.py`
- real Codex integration tests in `test/test_real_codex_integration.py`
The important real-manager cases are:
- `start` against a live Codex pane
- multi-target `resume`
- `restart --restart-codex`
- dead-worker recovery
- `doctor` and `status` behavior on live panes
The current architecture is:
- strict about identity
- explicit about ambiguity
- centered on thread-keyed desired state
- backed by one daemon per tmux socket
- still external to Codex and tmux internals
That keeps the operator model simple:
- `start` means "manage this pane/thread"
- `pause` means "stop continuing this managed session"
- `resume` and `restart` mean "request a fresh session generation"
- `stop` means "forget this managed session"
- `repair` means "fix live invariant problems"
- `cleanup` means "delete stale artifacts"