Skip to content

Kll#3

Open
plajjan wants to merge 38 commits into
mainfrom
kll
Open

Kll#3
plajjan wants to merge 38 commits into
mainfrom
kll

Conversation

@plajjan
Copy link
Copy Markdown
Contributor

@plajjan plajjan commented Apr 24, 2026

No description provided.

Kristian Larsson added 30 commits January 6, 2026 16:41
The transport could ask libssh to read or write when the socket or
channel was not actually ready. We synthesized write progress,
trusted stale readable notifications, and finalized channels before
the peer had closed them. Under load this produced truncated
close/flush runs, fatal "Resource temporarily unavailable" socket
errors, and occasional wedges.

This change makes channel payload writes wait for libssh's own
channel_write_wontblock callback, keeps EOF and CLOSE behind
session flush completion, and filters POLLIN through a fresh
readiness check before handing it back to libssh. It also restores
accepted and post-connect session sockets to nonblocking mode
after libssh switches them back to blocking.

These changes align the binding with libssh's external event-loop
contract. Read and write callbacks now only run when the
transport can actually make progress, and channel teardown waits
until both local flush and remote close have completed.
A rare close/flush stress failure still remained after the main
transport hardening. Under four concurrent workers the client could
occasionally truncate the payload around 487000 bytes, which matched
the earlier write-side failure signature.

The remaining hole was that we still trusted libuv's UV_WRITABLE
notification directly. libssh treats a write-side EAGAIN as a fatal
socket error once we tell it the fd will not block, so a stale or
already-consumed writable event could still tear the session down.

This change confirms POLLOUT with a fresh poll(0) check before
advertising writability to libssh on both client and server session
poll callbacks. That keeps the binding symmetric with the existing
read-side hardening and only hands libssh readiness that the kernel
still agrees with.
The close/flush stress gate still had a brittle failure mode after the
transport fixes. Each iteration spins up a server, completes an SSH
handshake and auth exchange, issues a subsystem request, transfers a
1 MiB payload, and waits for channel and session teardown. Under
higher worker counts that full wall-clock path can exceed the old
one-second budget without the transport being stuck.

This change increases the test timeout to two seconds. That keeps the
budget tight enough to catch real stalls while removing false stress
failures caused by scheduler and handshake tail latency rather than
transport corruption.
The close and flush stress gate is intended to catch truncated
transfers and teardown races, not to enforce a tight wall-clock
budget. After the transport fixes it could still exceed the shorter
timeout under the default stress worker count because each iteration
pays for server startup, SSH handshake, auth, a subsystem request, a
1 MiB transfer, and full channel/session shutdown.

This change raises the timeout from two to five seconds. That keeps
the test focused on correctness while still failing genuinely stuck
runs, and it removes false stress failures caused by normal
concurrency tail latency.
A normal Client.close() or ServerSession.close() could tear
down every child channel immediately, even when channels
still had queued data or libssh still had buffered session
writes. Under load that dropped in-flight payloads and
turned orderly shutdown into spurious session errors.

This change separates graceful parent shutdown from forced
error cleanup. Normal client and session close now mark
child channels for EOF and CLOSE, keep polling until
channel queues and libssh pending writes drain, and only
finalize the parent transport once the child channels are
gone. Error paths still escalate to immediate teardown.

This is correct because parent shutdown now follows the
same write-drain contract as individual channels instead of
bypassing it at a higher level.
It was too easy to use channel close as if it were a
half-close. The Acton API only said "Close the channel",
while libssh close semantics explicitly discard unread
inbound data. That made request-response flows look valid
even though they were racing their own replies.

This change documents the distinction directly on Channel
and ServerChannel: send_eof keeps the read side open,
while close is a full close that may drop unread inbound
bytes. It also adds concurrency coverage around
shared-session traffic so the safe pattern stays exercised
under load.

This is correct because the binding now states the
contract that libssh already implements, and the new
stress cases follow the same half-close pattern
production request-response traffic needs.
A server-side failure was meant to force child sessions
closed, but that intent was lost before it reached the
session layer. server_fail set the server state to ERROR,
then server_close_internal immediately overwrote it to
CLOSING and checked the new state when closing child
sessions. In practice that quietly degraded server error
paths into graceful session drains.

This change captures whether the server is already in an
error close before switching the server state to CLOSING,
then passes that latched decision into each child session
close.

This is correct because graceful server shutdown still
drains child sessions, while real server failures now
preserve the immediate error-close behavior they were
already trying to request.
Server-side channel teardown had two holes in exceptional
paths. Errored server channels were marked
SCHAN_STATE_ERROR and then removed from the session list
without running the finalizer, which left the actor-side
channel id and EOF bookkeeping behind. At the same time,
client and server channel callback structs were allocated
by the binding, removed from libssh on teardown, and then
leaked because ownership never returned to the binding.

This change finalizes errored server channels before
unlinking them from the session, frees channel callback
structs after removing them from libssh, and only marks
callbacks as installed after ssh_add_channel_callbacks
succeeds.

This is correct because every channel teardown path now
converges on the same cleanup logic, and callback
lifetime now matches libssh's callback registration
API instead of relying on list removal to free
caller-owned memory.
Sending a server-side exit status already reported a close
when it failed, but it left the channel otherwise live.
That meant callers could receive an error close callback
while the channel remained in the session list and kept
participating in later drive cycles.

This change marks the channel as failed when
ssh_channel_request_send_exit_status returns an error, so
the normal server-channel error teardown path runs on the
next session drive.

This is correct because exit-status send failures are
channel write failures like the other server-side send
paths, and they should terminate the channel instead of
only emitting a notification.
Per-write queue nodes were retained after successful writes and also
left behind when channels were finalized with queued data still
present. Under long stress this leaked one heap allocation per queued
write chunk on both the client and server channel paths.

Free write queue nodes as they are consumed and drain any remaining
nodes during channel finalization. This keeps graceful close semantics
unchanged while making channel teardown release the queue state it no
longer owns.
Client and server channel entrypoints trusted actor handles without
checking which client or session actually owned the underlying
channel. A wrong-handle call could queue writes or accept requests on
an unrelated channel, and in the client case could also poison the real
channel with a spurious "Channel not ready" error.

Track the owning client for each client channel and validate ownership
on both client and server channel APIs before mutating channel state.
Mismatched handles are now ignored instead of touching the foreign
channel. A regression test drives two live sessions, crosses both the
client and server channel handles, and asserts that no payload leaks
into the wrong channel.
RunCommand timeouts only closed the channel, so callers could miss
a terminal callback or see late channel events mutate state after
timeout. Several server reply paths also assumed callback
registration and libssh control-packet sends always succeeded,
which hid real errors under load. The stress suite had two tests
that raced their own requests or teardown and could time out even
when the transport behaved correctly.

This change makes RunCommand timeout an explicit terminal error
and ignores late open, data, exit, and close callbacks after
completion. It checks server-side callback setup and libssh
reply/send results, treats SSH_AGAIN as backpressure instead of
failure, and adds explicit coverage for RunCommand timeout and
exit-status delivery. The ownership and unsupported-request stress
tests now use isolated port ranges and a more ordered shutdown so
they exercise the binding instead of their own close races.

This is correct because command completion is now single-shot,
server control replies no longer fail silently, and the stress
suite keeps reporting transport regressions instead of
test-induced flakes.
Accepted server sessions were started before any ServerSession actor
attached to them. That let sockets sit in a pre-attach limbo with no
bounded lifetime, and a late attach could still publish a session that
had already closed underneath it.

This defers per-session polling until attach, starts a bounded
pre-attach timeout for newly accepted sessions, and skips publishing a
ServerSession when native attach finds the session already closed.

The session is now either attached and driven through the normal close
path, or failed before it becomes visible to user code. That removes
a dead-session publication race and prevents pre-attach sessions from
lingering indefinitely.
The pre-attach timeout reused the session auth timer slot. Attach
stopped that timer, but uv_close is asynchronous, so the pointer stayed
non-null long enough for key exchange to reach AUTH with no real auth
timeout armed for the session.

This gives the pre-attach guard its own timer handle, derives its budget
from server auth policy when available, and adds a regression test that
stalls hostkey acceptance until the server must time out authentication
after attach.

Attach timeout and auth timeout now have independent lifetimes, so
stopping one can no longer mask the other. Sessions either attach within
the configured budget or continue into AUTH with a real auth timeout
still enforced.
Long stress runs drove RSS into multi-gigabyte territory while file
descriptor counts stayed flat. The poll and timer handles were
heap-allocated for every client and session, but the libuv close
callbacks only cleared pointers and never released the handle memory.
Disconnect-only readiness could also be dropped before reaching libssh
because hangups were filtered behind readable-byte checks.

This frees poll and timer handles from their close callbacks, treats
POLLHUP and POLLERR as read-side work, and forwards UV_DISCONNECT into
the libssh poll mapping instead of discarding it.

Handle lifetime now matches the allocation path, so repeated open and
close churn does not retain closed libuv objects. Peer hangups also
reach libssh promptly even when no additional bytes are readable, which
makes close detection less timing-sensitive under load.
Channel close callbacks were delivered after the native ssh_channel had
already been freed, but before the actor handle was invalidated. That
left a re-entrancy window where on_close or EOF handlers could call
back into write or close on a channel whose native state was already
gone.

This marks client and server channels closed and clears their exported
actor ids before delivering exit, EOF, and close notifications from the
finalize path.

Callbacks still observe the expected terminal notifications, but any
re-entrant channel operation now resolves to a dead handle instead of a
half-finalized native channel. That removes a latent teardown footgun
without changing normal close ordering.
A channel could still accept writes after its parent client or server
session had left the ready state, which left a gap where late payloads
could be queued during shutdown. Attached server sessions could also sit
in key exchange without a real timeout after the attach timer stopped,
and disconnect-only poll edges were not subscribed consistently.

This change makes channel validation require a ready parent session,
starts the normal auth timeout as soon as an attached session begins
running, and treats attached key exchange timeout separately from the
pre-attach timeout. It also subscribes session polls to disconnect
notifications and makes the readiness probe treat EOF and reset as
terminal readable state.

These checks align the binding with the actual session state machine.
Late writes are rejected once shutdown starts, attached sessions cannot
stall indefinitely before authentication, and close paths see terminal
socket state promptly instead of depending on another read event.
Native SSH contexts and channel structs were finalized logically
but their wrapper allocations stayed live until process exit. That
showed up as steady RSS growth under stress even with flat fd
counts. The client also treated a channel close as a terminal
request result even when an exec request was still pending, so a
rejected request could race with close and either look successful
or poison sibling channels.

This change defers releasing client, server, and session contexts
until their libuv handles are actually closed, frees channel
structs when they leave the active lists, and drops stored close
reasons during final release. It also keeps a remote close from
resolving a pending client request until teardown knows whether
the request was ever answered, and reports a closed channel as a
request failure only if the request remains unresolved at
finalization.

This is correct because the native wrapper lifetime now matches
the lifetime of the libuv handles that still reference it, and
request completion is no longer inferred from close ordering
alone. Rejected or aborted commands fail deterministically, while
sibling channels and long stress runs no longer accumulate
unreachable native state.
Server channel-open accept still assumed libssh's convenience
helper could distinguish nonblocking backpressure from failure.
It cannot: ssh_message_channel_request_open_reply_accept()
returns NULL for any negative status, including SSH_AGAIN, so an
accepted open could be followed by a default reject when the
confirmation packet was only write-pending. The temporary
server-channel wrapper also leaked on the callback-setup failure
path.

This change switches the binding to
ssh_message_channel_request_open_reply_accept_channel() so
SSH_AGAIN is preserved as a pending write while the accepted
channel remains live. The code now only rejects the open on real
errors, carries write-pending state forward, and frees the
temporary wrapper if callback setup fails. The concurrent channel
write test now opens more channels and has a larger timeout
budget so it exercises the accept path under higher load without
turning wall-clock variance into noise.

This is correct because libssh already binds the channel and
queues the open confirmation before returning SSH_AGAIN.
Treating that as a pending write keeps channel ownership and
protocol state aligned instead of sending a contradictory second
reply.
AuthTimeoutTester assumed attached sessions would always finish
key exchange inside a 200ms timeout and then fail in the auth
state. Under the default stress worker count that was no longer
reliable, so the test intermittently reported SSH key exchange
timeout even though the path under test was the post-attach auth
timeout.

This change raises the test server's auth timeout to one second
and gives the actor-level watchdog more headroom. The test still
verifies that a real client which never progresses past host key
handling eventually closes with SSH authentication timeout, but
it no longer conflates scheduler-delayed key exchange with the
auth timeout semantics it is meant to cover.

This is correct because the regression being protected is that
the auth timeout must still fire after attach. Giving key
exchange a realistic budget keeps that assertion stable under
stress without weakening the expected close reason.
Close callbacks could run before channel finalization while the
native handle was still live. That made terminal callback ordering
ambiguous and allowed reentrant on_close handlers to race the rest of
teardown.

This change defers close notification until finalization, strengthens
the close-callback regression to require EOF and exit delivery before
close, adds a hostkey-wait disconnect regression, and retunes the auth
timeout test to fit the module runner's fixed timeout budget.

That keeps terminal callbacks on a single teardown edge, exercises the
disconnect path without depending on a slow auth-timeout wait, and
preserves auth-timeout coverage in the normal test suite.
One malformed TCP peer could still leave a gap in the audit
because the suite did not prove that the server stayed healthy for the
next SSH client.

This adds a regression that connects a raw TCP peer, sends garbage,
and then connects a real SSH client to the same server. The test
asserts the bad peer is cleaned up and the following SSH session still
authenticates and closes normally.

That gives us direct coverage for per-session failure isolation and
exercises key-exchange teardown under churn without relying on a longer
soak run.
The suite still had a blind spot around short-lived non-SSH peers. A
single bad connection was covered, but repeated raw connects and abrupt
closes were not, which left per-session cleanup and server isolation
under-tested.

This adds one regression that proves a garbage-speaking TCP peer does
not poison the next real SSH client, and another that churns repeated
raw connect-close cycles before opening a good SSH session.

That gives direct coverage for bad-peer isolation and rapid session
teardown without changing the transport itself, so future regressions
in accept, attach, or close paths should fail quickly.
The server accepted sessions and channels without explicit limits,
and several connection-local setup failures during accept could tear
down the whole listener. That left the binding exposed to resource
exhaustion and made one bad peer capable of poisoning otherwise
healthy traffic.

This adds configurable server admission limits for concurrent
sessions and channels per session, bounds the accept work done in a
single poll turn, and rejects over-limit channel opens without
failing the surrounding session. It also downgrades accepted-socket
setup failures so they close or discard only the affected
connection instead of escalating to a server-wide failure.

That keeps admission pressure local to the offending peer and gives
the server predictable resource bounds while preserving service for
healthy clients on the same listener.
Kristian Larsson added 7 commits March 20, 2026 00:09
The SSH test module mixed wrapper functions with top-level actor tests,
but Acton was auto-discovering the actors directly. That bypassed the
wrapper logic that skips deterministic tests in stress mode and made it
hard to control timeout-sensitive cases consistently.

This change switches the module to an explicit env-test registry and a
manual test runner entrypoint. The wrappers now own test selection, so
stress-only skips and per-case harness behavior apply to the tests they
were written for.

That is the correct model for this module because several of the harder
SSH edge cases need different harness behavior in deterministic runs and
in stress runs, and auto-discovery was silently ignoring that split.
Attached server sessions could remain in key exchange forever. The
pre-attach timer was stopped as soon as the session actor attached, but
the auth timer was only started after key exchange completed, so a peer
that stalled between those phases had no timeout at all.

This change keeps the attach timer scoped to the pre-attach phase,
starts a dedicated timeout for attached key exchange, and only arms the
auth timer after key exchange succeeds.

That is correct because the server now enforces a timeout for each
session phase without overlapping their responsibilities, so a client
cannot evade both limits by attaching and then never completing key
exchange.
The key exchange timeout stress case could report a failure even when
the server closed the session for the expected reason. Under heavy
stress, the test's own watchdog could win a narrow race against the
close callback and turn a correct late close into a spurious error.

This change gives the test a grace window before it declares failure.
The key exchange timeout still has to happen, but the harness now allows
for scheduler and callback latency under high concurrency.

That is correct because the test is supposed to prove the timeout state
machine, not enforce a tight wall-clock deadline on when the callback is
delivered under load.
Long stress still had three sharp edges after the earlier transport
work. The server handed a generated host key to ssh_bind and then kept
freeing the same key itself, channel drive could synthesize EOF from
ssh_channel_is_eof() before libssh had finished delivering buffered
data callbacks, and the longest stress cases still burned time on bind
collisions and outer test timeouts rather than on the transport they
were meant to exercise.

This change clears the wrapper host-key pointer once IMPORT_KEY hands
ownership to ssh_bind, removes eager EOF notification from the live
client and server channel drive loops, and widens the retry and timeout
budget for the long-running concurrent-write and key-exchange timeout
stress cases.

That matches libssh ownership and callback contracts more closely and
keeps whole-module stress focused on real SSH failures instead of
spurious double frees, premature EOF delivery, or avoidable bind and
timing noise.
SSH client and server actors could be dropped without an
explicit close, leaving native state alive with no cleanup
path from the Acton edge. The binding also needed terminal
callbacks to see a live wrapper long enough to clear native
ids before user code could re-enter those handles.

This change adds __cleanup__ hooks to the public SSH wrapper
actors, native cleanup entry points in the C binding, and
opt-in GC cleanup probes for client and server wrappers. It
also stores client and server backpointers as hidden GC
pointers and clears exposed native ids before terminal
callbacks run.

Explicit close remains the primary resource-management path,
but abandoned top-level SSH wrappers now release native state
eventually instead of leaking it. Clearing ids before close
callbacks also makes late re-entry on terminal events a
safe no-op instead of touching freed native handles.
Server session attach still exposed stale native state across
the deferred actor handoff. The server passed a raw session
pointer through the pending-session callback, and
ServerSession attach drove libssh before the server had been
told that the session actor was ready. Under rapid disconnect
churn that could reorder session close ahead of session
ready, and it left the attach path dependent on a raw
pointer remaining valid until the deferred attach ran.

This change gives each pending server session a stable
token, resolves affinity and attach against the server's live
session list, and only drives the attached session after
the ready callback has been queued. It also keeps closed
client and server channel contexts on retired lists owned
by the session or client and frees them only when the
owning transport finalizes.

That keeps the async attach handoff from chasing stale
session pointers, preserves the expected ready-before-close
ordering at the Acton boundary, and avoids reusing channel
callback userdata while late close processing is still
unwinding.
Timer-driven keepalive packets were one of the few remaining
SSH transport paths without direct regression coverage. The
suite already stressed channel close, multiplexed writes, auth
churn, and raw disconnects, but it did not hold an
authenticated session open long enough for repeated keepalives
to overlap with channel progress.

Add a keepalive traffic tester that drives one channel over a
session with aggressive client and server keepalive intervals
while writes are still in flight. The test keeps the session
alive long enough for multiple keepalive ticks, verifies that
the full payload reaches the server, and requires the reply
and close path to complete cleanly.

That gives the transport suite coverage for timer-driven
libssh traffic interleaving with normal channel I/O, which is
exactly the control-path overlap that would otherwise stay
invisible until long-running stress.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a66c745189

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread build.act.json
"libssh": {
"url": "https://github.com/actonlang/libssh/archive/refs/heads/zig-build.tar.gz",
"hash": "122076b4676ec7d777e5efc24ae4b7f1ab4fa3df407ab9649fdd55f9fc594ca053ec",
"path": "../acton-deps/libssh",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore a resolvable libssh dependency source

Pointing zig_dependencies.libssh to "../acton-deps/libssh" makes builds depend on a sibling directory that is not part of this repository, so clean checkouts/CI environments fail at dependency resolution before compilation. This commit also removed the previously self-contained url/hash source, so there is no fallback for users who do not have that local path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant