fix(net): surface gvproxy bind errors through FFI#612
Conversation
47aefa2 to
0872494
Compare
|
Where the error is produced — go func() {
vn, err := virtualnetwork.New(tapConfig) // :440 the bind happens here (EADDRINUSE, EACCES, …)
if err != nil {
initErr <- err // :443 hand the error out of the goroutine
return
}
initErr <- nil // :446 success
// ...
}()Where it's captured — if err := <-initErr; err != nil { // :534 wait for the bind result
setErr(err) // copy the reason into *errOut
cancel(); delete(instances, id); listener.Close(); os.Remove(socketPath)
return -1 // :547 ← turn the failure into -1
}Pre-fix there was no The Rust side ( |
| os.Remove(socketPath) | ||
| }() | ||
|
|
||
| // Wait for virtualnetwork.New to complete before returning a valid id. |
There was a problem hiding this comment.
will this hurt performances in normal cases?
There was a problem hiding this comment.
gvproxy_create_latency_test integrated for normal case (virtual network.New wait < 0.5ms threshold), usually 0.1~0.3ms on Amazon c8i.xlarge ubuntu 22.04
A/B test (latency increase because of this waiting) also in this range
thx for the instruction
`virtualnetwork.New` failures (e.g. host port EADDRINUSE when another
process holds the port a user passed to `-p HOST:GUEST`) used to die
inside a goroutine at gvproxy-bridge/main.go:417, while the surrounding
`gvproxy_create` had already returned a valid id. The Rust runtime
shipped the broken socket downstream, the guest booted, and the
failure surfaced ~20s later as "DNS lookup … i/o timeout" from inside
the guest — multiple layers from the root cause.
Add an `initErr` channel so `gvproxy_create` waits for the
`virtualnetwork.New` result before returning. On failure it tears
down the instance and returns -1; the Rust runtime maps that to
`Network("gvproxy_create failed")` so the user sees a clear, named,
fail-fast error in well under a second.
Regression guard: `src/cli/tests/gvproxy_port_conflict.rs` —
non-boxlite `TcpListener` holds an ephemeral host port, then
`boxlite run -p PORT:80 alpine:latest true` must exit non-zero with
stderr containing "gvproxy_create failed".
The previous commit surfaced bind failures as `gvproxy_create failed`,
but that string alone doesn't tell the user *why* — they still have to
read box debug files to find "address already in use".
Extend `gvproxy_create` with an `errOut **C.char` out-parameter; on
each -1 return path the Go side writes the underlying `err.Error()` as
a heap-allocated C string. The Rust caller reads, frees, and folds
the detail into `BoxliteError::Network`, so `boxlite run` stderr now
reads:
Error: network error: gvproxy_create failed:
cannot add network services: listen tcp 0.0.0.0:27499:
bind: address already in use
instead of an opaque "gvproxy_create failed".
Regression guard: `gvproxy_port_conflict_fails_fast_with_named_error`
now also asserts stderr contains "address already in use" — locks down
the FFI plumbing against future regressions that would re-collapse
the message.
The 3 assertions guard distinct regression classes — boxlite's exit code, the named gvproxy error, and the underlying OS-level bind detail. With `assert!` short-circuiting, a regression that breaks all three only shows the first failure on the test panic, forcing iterative debugging. Collect each check into a Vec and emit a single combined panic at the end so a maintainer sees every level that regressed in one test run. Failure mode on plain main (verified): "3 of 3 checks failed: - L1 [boxlite rc != 0]: ... - L2 [stderr names gvproxy]: ... - L3 [stderr carries OS detail]: ..."
Companion to gvproxy_port_conflict_fails_fast_with_named_error, exercising a different `virtualnetwork.New` bind failure mode: EACCES when a non-root caller maps a privileged host port (e.g. -p 80:80), instead of EADDRINUSE when the port is busy. Same FFI/error plumbing, different kernel error string. Confirms the fix surfaces whatever the kernel returned at bind time rather than hard-coding a port-conflict shortcut. Verified on main worktree: 3 of 3 soft-assertions fire (boxlite rc=0, stderr contains neither "gvproxy_create failed" nor "permission denied" — just qcow2 warns and "Auto-stopping non-detached box"). On the fix branch, all three pass. Skipped when the runner has permission to bind <1024 (root / CAP_NET_BIND_SERVICE / lowered ip_unprivileged_port_start) — the test premise doesn't hold there.
0872494 to
93cdbc3
Compare
The fix makes gvproxy_create block on `<-initErr` until virtualnetwork.New finishes, prompting the review question of normal-case cost. virtualnetwork.New is the upper bound on that wait; this pins its median (over N runs, warmed, robust to one-off GC/scheduler spikes) under a 0.5ms budget — observed ~120µs. Two-side verified: a 600µs injection pushes the median to ~1.2ms and fails. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface gvproxy
virtualnetwork.Newbind failures (port-in-use, privileged-port, …) through the GoinitErrchannel → FFIerrOut→BoxliteError::Networkinstead of swallowing them into an opaque box failureTest plan
Two regression tests, each verified two-sided (fix reverted vs applied):
gvproxy_port_conflict_fails_fast_with_named_error— host pre-binds the port, thenboxlite run -pcollides (EADDRINUSE).gvproxy_privileged_port_fails_fast_with_named_error— non-rootboxlite run -p 80:80(EACCES).rc=0— boots despite the bind failurerc≠0— fails fastgvproxy_create failed, no OS reasongvproxy_create failed: … address already in use/… permission deniedPre-fix the bind error was swallowed in a goroutine
logrusline and the box booted with broken networking; post-fix it interrupts startup with the named OS reason.