Conversation
The transport could ask libssh to read or write when the socket or channel was not actually ready. We synthesized write progress, trusted stale readable notifications, and finalized channels before the peer had closed them. Under load this produced truncated close/flush runs, fatal "Resource temporarily unavailable" socket errors, and occasional wedges. This change makes channel payload writes wait for libssh's own channel_write_wontblock callback, keeps EOF and CLOSE behind session flush completion, and filters POLLIN through a fresh readiness check before handing it back to libssh. It also restores accepted and post-connect session sockets to nonblocking mode after libssh switches them back to blocking. These changes align the binding with libssh's external event-loop contract. Read and write callbacks now only run when the transport can actually make progress, and channel teardown waits until both local flush and remote close have completed.
A rare close/flush stress failure still remained after the main transport hardening. Under four concurrent workers the client could occasionally truncate the payload around 487000 bytes, which matched the earlier write-side failure signature. The remaining hole was that we still trusted libuv's UV_WRITABLE notification directly. libssh treats a write-side EAGAIN as a fatal socket error once we tell it the fd will not block, so a stale or already-consumed writable event could still tear the session down. This change confirms POLLOUT with a fresh poll(0) check before advertising writability to libssh on both client and server session poll callbacks. That keeps the binding symmetric with the existing read-side hardening and only hands libssh readiness that the kernel still agrees with.
The close/flush stress gate still had a brittle failure mode after the transport fixes. Each iteration spins up a server, completes an SSH handshake and auth exchange, issues a subsystem request, transfers a 1 MiB payload, and waits for channel and session teardown. Under higher worker counts that full wall-clock path can exceed the old one-second budget without the transport being stuck. This change increases the test timeout to two seconds. That keeps the budget tight enough to catch real stalls while removing false stress failures caused by scheduler and handshake tail latency rather than transport corruption.
The close and flush stress gate is intended to catch truncated transfers and teardown races, not to enforce a tight wall-clock budget. After the transport fixes it could still exceed the shorter timeout under the default stress worker count because each iteration pays for server startup, SSH handshake, auth, a subsystem request, a 1 MiB transfer, and full channel/session shutdown. This change raises the timeout from two to five seconds. That keeps the test focused on correctness while still failing genuinely stuck runs, and it removes false stress failures caused by normal concurrency tail latency.
A normal Client.close() or ServerSession.close() could tear down every child channel immediately, even when channels still had queued data or libssh still had buffered session writes. Under load that dropped in-flight payloads and turned orderly shutdown into spurious session errors. This change separates graceful parent shutdown from forced error cleanup. Normal client and session close now mark child channels for EOF and CLOSE, keep polling until channel queues and libssh pending writes drain, and only finalize the parent transport once the child channels are gone. Error paths still escalate to immediate teardown. This is correct because parent shutdown now follows the same write-drain contract as individual channels instead of bypassing it at a higher level.
It was too easy to use channel close as if it were a half-close. The Acton API only said "Close the channel", while libssh close semantics explicitly discard unread inbound data. That made request-response flows look valid even though they were racing their own replies. This change documents the distinction directly on Channel and ServerChannel: send_eof keeps the read side open, while close is a full close that may drop unread inbound bytes. It also adds concurrency coverage around shared-session traffic so the safe pattern stays exercised under load. This is correct because the binding now states the contract that libssh already implements, and the new stress cases follow the same half-close pattern production request-response traffic needs.
A server-side failure was meant to force child sessions closed, but that intent was lost before it reached the session layer. server_fail set the server state to ERROR, then server_close_internal immediately overwrote it to CLOSING and checked the new state when closing child sessions. In practice that quietly degraded server error paths into graceful session drains. This change captures whether the server is already in an error close before switching the server state to CLOSING, then passes that latched decision into each child session close. This is correct because graceful server shutdown still drains child sessions, while real server failures now preserve the immediate error-close behavior they were already trying to request.
Server-side channel teardown had two holes in exceptional paths. Errored server channels were marked SCHAN_STATE_ERROR and then removed from the session list without running the finalizer, which left the actor-side channel id and EOF bookkeeping behind. At the same time, client and server channel callback structs were allocated by the binding, removed from libssh on teardown, and then leaked because ownership never returned to the binding. This change finalizes errored server channels before unlinking them from the session, frees channel callback structs after removing them from libssh, and only marks callbacks as installed after ssh_add_channel_callbacks succeeds. This is correct because every channel teardown path now converges on the same cleanup logic, and callback lifetime now matches libssh's callback registration API instead of relying on list removal to free caller-owned memory.
Sending a server-side exit status already reported a close when it failed, but it left the channel otherwise live. That meant callers could receive an error close callback while the channel remained in the session list and kept participating in later drive cycles. This change marks the channel as failed when ssh_channel_request_send_exit_status returns an error, so the normal server-channel error teardown path runs on the next session drive. This is correct because exit-status send failures are channel write failures like the other server-side send paths, and they should terminate the channel instead of only emitting a notification.
Per-write queue nodes were retained after successful writes and also left behind when channels were finalized with queued data still present. Under long stress this leaked one heap allocation per queued write chunk on both the client and server channel paths. Free write queue nodes as they are consumed and drain any remaining nodes during channel finalization. This keeps graceful close semantics unchanged while making channel teardown release the queue state it no longer owns.
Client and server channel entrypoints trusted actor handles without checking which client or session actually owned the underlying channel. A wrong-handle call could queue writes or accept requests on an unrelated channel, and in the client case could also poison the real channel with a spurious "Channel not ready" error. Track the owning client for each client channel and validate ownership on both client and server channel APIs before mutating channel state. Mismatched handles are now ignored instead of touching the foreign channel. A regression test drives two live sessions, crosses both the client and server channel handles, and asserts that no payload leaks into the wrong channel.
RunCommand timeouts only closed the channel, so callers could miss a terminal callback or see late channel events mutate state after timeout. Several server reply paths also assumed callback registration and libssh control-packet sends always succeeded, which hid real errors under load. The stress suite had two tests that raced their own requests or teardown and could time out even when the transport behaved correctly. This change makes RunCommand timeout an explicit terminal error and ignores late open, data, exit, and close callbacks after completion. It checks server-side callback setup and libssh reply/send results, treats SSH_AGAIN as backpressure instead of failure, and adds explicit coverage for RunCommand timeout and exit-status delivery. The ownership and unsupported-request stress tests now use isolated port ranges and a more ordered shutdown so they exercise the binding instead of their own close races. This is correct because command completion is now single-shot, server control replies no longer fail silently, and the stress suite keeps reporting transport regressions instead of test-induced flakes.
Accepted server sessions were started before any ServerSession actor attached to them. That let sockets sit in a pre-attach limbo with no bounded lifetime, and a late attach could still publish a session that had already closed underneath it. This defers per-session polling until attach, starts a bounded pre-attach timeout for newly accepted sessions, and skips publishing a ServerSession when native attach finds the session already closed. The session is now either attached and driven through the normal close path, or failed before it becomes visible to user code. That removes a dead-session publication race and prevents pre-attach sessions from lingering indefinitely.
The pre-attach timeout reused the session auth timer slot. Attach stopped that timer, but uv_close is asynchronous, so the pointer stayed non-null long enough for key exchange to reach AUTH with no real auth timeout armed for the session. This gives the pre-attach guard its own timer handle, derives its budget from server auth policy when available, and adds a regression test that stalls hostkey acceptance until the server must time out authentication after attach. Attach timeout and auth timeout now have independent lifetimes, so stopping one can no longer mask the other. Sessions either attach within the configured budget or continue into AUTH with a real auth timeout still enforced.
Long stress runs drove RSS into multi-gigabyte territory while file descriptor counts stayed flat. The poll and timer handles were heap-allocated for every client and session, but the libuv close callbacks only cleared pointers and never released the handle memory. Disconnect-only readiness could also be dropped before reaching libssh because hangups were filtered behind readable-byte checks. This frees poll and timer handles from their close callbacks, treats POLLHUP and POLLERR as read-side work, and forwards UV_DISCONNECT into the libssh poll mapping instead of discarding it. Handle lifetime now matches the allocation path, so repeated open and close churn does not retain closed libuv objects. Peer hangups also reach libssh promptly even when no additional bytes are readable, which makes close detection less timing-sensitive under load.
Channel close callbacks were delivered after the native ssh_channel had already been freed, but before the actor handle was invalidated. That left a re-entrancy window where on_close or EOF handlers could call back into write or close on a channel whose native state was already gone. This marks client and server channels closed and clears their exported actor ids before delivering exit, EOF, and close notifications from the finalize path. Callbacks still observe the expected terminal notifications, but any re-entrant channel operation now resolves to a dead handle instead of a half-finalized native channel. That removes a latent teardown footgun without changing normal close ordering.
A channel could still accept writes after its parent client or server session had left the ready state, which left a gap where late payloads could be queued during shutdown. Attached server sessions could also sit in key exchange without a real timeout after the attach timer stopped, and disconnect-only poll edges were not subscribed consistently. This change makes channel validation require a ready parent session, starts the normal auth timeout as soon as an attached session begins running, and treats attached key exchange timeout separately from the pre-attach timeout. It also subscribes session polls to disconnect notifications and makes the readiness probe treat EOF and reset as terminal readable state. These checks align the binding with the actual session state machine. Late writes are rejected once shutdown starts, attached sessions cannot stall indefinitely before authentication, and close paths see terminal socket state promptly instead of depending on another read event.
Native SSH contexts and channel structs were finalized logically but their wrapper allocations stayed live until process exit. That showed up as steady RSS growth under stress even with flat fd counts. The client also treated a channel close as a terminal request result even when an exec request was still pending, so a rejected request could race with close and either look successful or poison sibling channels. This change defers releasing client, server, and session contexts until their libuv handles are actually closed, frees channel structs when they leave the active lists, and drops stored close reasons during final release. It also keeps a remote close from resolving a pending client request until teardown knows whether the request was ever answered, and reports a closed channel as a request failure only if the request remains unresolved at finalization. This is correct because the native wrapper lifetime now matches the lifetime of the libuv handles that still reference it, and request completion is no longer inferred from close ordering alone. Rejected or aborted commands fail deterministically, while sibling channels and long stress runs no longer accumulate unreachable native state.
Server channel-open accept still assumed libssh's convenience helper could distinguish nonblocking backpressure from failure. It cannot: ssh_message_channel_request_open_reply_accept() returns NULL for any negative status, including SSH_AGAIN, so an accepted open could be followed by a default reject when the confirmation packet was only write-pending. The temporary server-channel wrapper also leaked on the callback-setup failure path. This change switches the binding to ssh_message_channel_request_open_reply_accept_channel() so SSH_AGAIN is preserved as a pending write while the accepted channel remains live. The code now only rejects the open on real errors, carries write-pending state forward, and frees the temporary wrapper if callback setup fails. The concurrent channel write test now opens more channels and has a larger timeout budget so it exercises the accept path under higher load without turning wall-clock variance into noise. This is correct because libssh already binds the channel and queues the open confirmation before returning SSH_AGAIN. Treating that as a pending write keeps channel ownership and protocol state aligned instead of sending a contradictory second reply.
AuthTimeoutTester assumed attached sessions would always finish key exchange inside a 200ms timeout and then fail in the auth state. Under the default stress worker count that was no longer reliable, so the test intermittently reported SSH key exchange timeout even though the path under test was the post-attach auth timeout. This change raises the test server's auth timeout to one second and gives the actor-level watchdog more headroom. The test still verifies that a real client which never progresses past host key handling eventually closes with SSH authentication timeout, but it no longer conflates scheduler-delayed key exchange with the auth timeout semantics it is meant to cover. This is correct because the regression being protected is that the auth timeout must still fire after attach. Giving key exchange a realistic budget keeps that assertion stable under stress without weakening the expected close reason.
Close callbacks could run before channel finalization while the native handle was still live. That made terminal callback ordering ambiguous and allowed reentrant on_close handlers to race the rest of teardown. This change defers close notification until finalization, strengthens the close-callback regression to require EOF and exit delivery before close, adds a hostkey-wait disconnect regression, and retunes the auth timeout test to fit the module runner's fixed timeout budget. That keeps terminal callbacks on a single teardown edge, exercises the disconnect path without depending on a slow auth-timeout wait, and preserves auth-timeout coverage in the normal test suite.
One malformed TCP peer could still leave a gap in the audit because the suite did not prove that the server stayed healthy for the next SSH client. This adds a regression that connects a raw TCP peer, sends garbage, and then connects a real SSH client to the same server. The test asserts the bad peer is cleaned up and the following SSH session still authenticates and closes normally. That gives us direct coverage for per-session failure isolation and exercises key-exchange teardown under churn without relying on a longer soak run.
The suite still had a blind spot around short-lived non-SSH peers. A single bad connection was covered, but repeated raw connects and abrupt closes were not, which left per-session cleanup and server isolation under-tested. This adds one regression that proves a garbage-speaking TCP peer does not poison the next real SSH client, and another that churns repeated raw connect-close cycles before opening a good SSH session. That gives direct coverage for bad-peer isolation and rapid session teardown without changing the transport itself, so future regressions in accept, attach, or close paths should fail quickly.
The server accepted sessions and channels without explicit limits, and several connection-local setup failures during accept could tear down the whole listener. That left the binding exposed to resource exhaustion and made one bad peer capable of poisoning otherwise healthy traffic. This adds configurable server admission limits for concurrent sessions and channels per session, bounds the accept work done in a single poll turn, and rejects over-limit channel opens without failing the surrounding session. It also downgrades accepted-socket setup failures so they close or discard only the affected connection instead of escalating to a server-wide failure. That keeps admission pressure local to the offending peer and gives the server predictable resource bounds while preserving service for healthy clients on the same listener.
The SSH test module mixed wrapper functions with top-level actor tests, but Acton was auto-discovering the actors directly. That bypassed the wrapper logic that skips deterministic tests in stress mode and made it hard to control timeout-sensitive cases consistently. This change switches the module to an explicit env-test registry and a manual test runner entrypoint. The wrappers now own test selection, so stress-only skips and per-case harness behavior apply to the tests they were written for. That is the correct model for this module because several of the harder SSH edge cases need different harness behavior in deterministic runs and in stress runs, and auto-discovery was silently ignoring that split.
Attached server sessions could remain in key exchange forever. The pre-attach timer was stopped as soon as the session actor attached, but the auth timer was only started after key exchange completed, so a peer that stalled between those phases had no timeout at all. This change keeps the attach timer scoped to the pre-attach phase, starts a dedicated timeout for attached key exchange, and only arms the auth timer after key exchange succeeds. That is correct because the server now enforces a timeout for each session phase without overlapping their responsibilities, so a client cannot evade both limits by attaching and then never completing key exchange.
The key exchange timeout stress case could report a failure even when the server closed the session for the expected reason. Under heavy stress, the test's own watchdog could win a narrow race against the close callback and turn a correct late close into a spurious error. This change gives the test a grace window before it declares failure. The key exchange timeout still has to happen, but the harness now allows for scheduler and callback latency under high concurrency. That is correct because the test is supposed to prove the timeout state machine, not enforce a tight wall-clock deadline on when the callback is delivered under load.
Long stress still had three sharp edges after the earlier transport work. The server handed a generated host key to ssh_bind and then kept freeing the same key itself, channel drive could synthesize EOF from ssh_channel_is_eof() before libssh had finished delivering buffered data callbacks, and the longest stress cases still burned time on bind collisions and outer test timeouts rather than on the transport they were meant to exercise. This change clears the wrapper host-key pointer once IMPORT_KEY hands ownership to ssh_bind, removes eager EOF notification from the live client and server channel drive loops, and widens the retry and timeout budget for the long-running concurrent-write and key-exchange timeout stress cases. That matches libssh ownership and callback contracts more closely and keeps whole-module stress focused on real SSH failures instead of spurious double frees, premature EOF delivery, or avoidable bind and timing noise.
SSH client and server actors could be dropped without an explicit close, leaving native state alive with no cleanup path from the Acton edge. The binding also needed terminal callbacks to see a live wrapper long enough to clear native ids before user code could re-enter those handles. This change adds __cleanup__ hooks to the public SSH wrapper actors, native cleanup entry points in the C binding, and opt-in GC cleanup probes for client and server wrappers. It also stores client and server backpointers as hidden GC pointers and clears exposed native ids before terminal callbacks run. Explicit close remains the primary resource-management path, but abandoned top-level SSH wrappers now release native state eventually instead of leaking it. Clearing ids before close callbacks also makes late re-entry on terminal events a safe no-op instead of touching freed native handles.
Server session attach still exposed stale native state across the deferred actor handoff. The server passed a raw session pointer through the pending-session callback, and ServerSession attach drove libssh before the server had been told that the session actor was ready. Under rapid disconnect churn that could reorder session close ahead of session ready, and it left the attach path dependent on a raw pointer remaining valid until the deferred attach ran. This change gives each pending server session a stable token, resolves affinity and attach against the server's live session list, and only drives the attached session after the ready callback has been queued. It also keeps closed client and server channel contexts on retired lists owned by the session or client and frees them only when the owning transport finalizes. That keeps the async attach handoff from chasing stale session pointers, preserves the expected ready-before-close ordering at the Acton boundary, and avoids reusing channel callback userdata while late close processing is still unwinding.
Timer-driven keepalive packets were one of the few remaining SSH transport paths without direct regression coverage. The suite already stressed channel close, multiplexed writes, auth churn, and raw disconnects, but it did not hold an authenticated session open long enough for repeated keepalives to overlap with channel progress. Add a keepalive traffic tester that drives one channel over a session with aggressive client and server keepalive intervals while writes are still in flight. The test keeps the session alive long enough for multiple keepalive ticks, verifies that the full payload reaches the server, and requires the reply and close path to complete cleanly. That gives the transport suite coverage for timer-driven libssh traffic interleaving with normal channel I/O, which is exactly the control-path overlap that would otherwise stay invisible until long-running stress.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a66c745189
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "libssh": { | ||
| "url": "https://github.com/actonlang/libssh/archive/refs/heads/zig-build.tar.gz", | ||
| "hash": "122076b4676ec7d777e5efc24ae4b7f1ab4fa3df407ab9649fdd55f9fc594ca053ec", | ||
| "path": "../acton-deps/libssh", |
There was a problem hiding this comment.
Restore a resolvable libssh dependency source
Pointing zig_dependencies.libssh to "../acton-deps/libssh" makes builds depend on a sibling directory that is not part of this repository, so clean checkouts/CI environments fail at dependency resolution before compilation. This commit also removed the previously self-contained url/hash source, so there is no fallback for users who do not have that local path.
Useful? React with 👍 / 👎.
No description provided.