diff --git a/.gitignore b/.gitignore index 7426f7e..a01f895 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,4 @@ externals/ lib/modules/ *.o *.bin +__pycache__ diff --git a/README.md b/README.md index 52ad34e..631c551 100644 --- a/README.md +++ b/README.md @@ -71,7 +71,7 @@ The build signs `build/elfuse` before use. Override the signing identity with `--sysroot`, and attaching `gdb` / `lldb` to the built-in stub. - [docs/testing.md](docs/testing.md): build prerequisites, the `make check` flow, the QEMU cross-check matrix, and fixture handling. -- [docs/internals.md](docs/internals.md): canonical technical reference — +- [docs/internals.md](docs/internals.md): canonical technical reference -- HVF constraints, EL1 shim and HVC protocol, page-table splitting, syscall translation tables, threads/futex, fork/clone IPC, signals, ptrace, and the GDB stub. @@ -109,7 +109,7 @@ Boundaries to be aware of: virtualization. `/proc`, `/dev`, and mount data are compatibility views. - HVF allows one VM per host process, so Linux-style `fork` is implemented via `posix_spawn` plus state transfer (a fast CoW path is used when - available — see [docs/internals.md](docs/internals.md)). + available -- see [docs/internals.md](docs/internals.md)). - `MAP_SHARED` is treated as `MAP_PRIVATE`; this matches single-process guest semantics and unblocks tools that expect file-backed mappings. - Unsupported syscalls return Linux-style errors rather than silently diff --git a/docs/internals.md b/docs/internals.md index b92e68d..e9ca130 100644 --- a/docs/internals.md +++ b/docs/internals.md @@ -82,7 +82,7 @@ Apple HVF imposes a handful of constraints that shape the rest of the design: - System registers cannot be set via `MSR` from the guest because `HCR_EL2.TSC=1` traps all `MSR` writes. Boot-time sysreg installation (RES1 bits, MMU enable, TTBR0, etc.) goes through HVC #4 from the EL1 - shim. Runtime EL0 sysreg traps — `MSR TPIDR_EL0` and similar — are + shim. 
Runtime EL0 sysreg traps -- `MSR TPIDR_EL0` and similar -- are handled by the HVC #12 system-instruction trap path. - Only `HV_SYS_REG_*` constants from Hypervisor.framework may be used for register IDs. @@ -116,9 +116,9 @@ interp_base - varies: Dynamic linker (g->interp_base, --sysroot only) The guest size is determined by the VM's configured IPA width (capped at 40-bit / 1 TiB): -- 36-bit IPA (64 GiB) — native AArch64 on Apple M2: `mmap_limit ≈ 56 GiB`, +- 36-bit IPA (64 GiB) -- native AArch64 on Apple M2: `mmap_limit ≈ 56 GiB`, `interp_base ≈ 60 GiB` -- 40-bit IPA (1 TiB) — native AArch64 on Apple M3 and later: +- 40-bit IPA (1 TiB) -- native AArch64 on Apple M3 and later: `mmap_limit ≈ 1016 GiB`, `interp_base ≈ 1020 GiB` Both `mmap_limit` and `interp_base` are computed at runtime from `guest_size` @@ -356,7 +356,7 @@ are supported per VM. - The classic ops: `FUTEX_WAIT`, `FUTEX_WAKE`, `FUTEX_WAIT_BITSET`, `FUTEX_WAKE_BITSET`, `FUTEX_REQUEUE`, `FUTEX_CMP_REQUEUE`, `FUTEX_WAKE_OP`. - A subset of priority-inheritance ops: `FUTEX_LOCK_PI`, `FUTEX_UNLOCK_PI`, - `FUTEX_TRYLOCK_PI`. Priority semantics are not actually inherited — the + `FUTEX_TRYLOCK_PI`. Priority semantics are not actually inherited -- the ops behave as ordinary mutex acquire/release, which is enough for glibc and musl to make forward progress. - `futex_waitv` (syscall 449) for batch waits across up to 128 futex @@ -423,7 +423,7 @@ state transfer: When `g->shm_fd >= 0` the guest memory is file-backed (`mkstemp` + `unlink`, `MAP_SHARED`). Fork sends the backing fd over `SCM_RIGHTS`: -- Parent stays on `MAP_SHARED` and does NOT remap — HVF caches the host +- Parent stays on `MAP_SHARED` and does NOT remap -- HVF caches the host VA→PA mapping from `hv_vm_map`, and a `MAP_FIXED` remap does not update Stage-2, so a remapping parent would observe stale pages. 
- Child maps the fd `MAP_PRIVATE`, producing an instant CoW clone with zero @@ -431,7 +431,7 @@ When `g->shm_fd >= 0` the guest memory is file-backed (`mkstemp` + `unlink`, - The IPC header sets `has_shm = 1` and `num_regions = 0`, skipping memory serialization entirely. - Child calls `guest_init_from_shm()` instead of `guest_init()`, and must - restore `g->ttbr0` from the IPC header — `guest_init_from_shm` zeroes the + restore `g->ttbr0` from the IPC header -- `guest_init_from_shm` zeroes the struct, and without `ttbr0` page-table walks fail for all high VAs. This path is roughly 50× faster than the legacy IPC copy path on large guest @@ -496,7 +496,7 @@ Key points: `alarm()` for its per-iteration vCPU watchdog. - `signal_check_timer()` is called from the vCPU loop after each syscall. - After `SYSCALL_EXEC_HAPPENED`, the vCPU loop verifies that `ELR_EL1` is - non-zero — a defensive check against HVF register-sync bugs. + non-zero -- a defensive check against HVF register-sync bugs. - Each `thread_entry_t` carries its own `blocked` mask. `rt_sigprocmask` operates on `current_thread->blocked`, and child threads inherit the parent's mask at clone time. @@ -511,11 +511,11 @@ which the tracer reads and writes registers through the snapshot protocol. In `src/syscall/proc.c`: -- `PTRACE_SEIZE` — attach without stopping; sets `ptraced = 1`. -- `PTRACE_CONT` — resume the stopped tracee, optionally injecting a signal. -- `PTRACE_INTERRUPT` — force the tracee into ptrace-stop via +- `PTRACE_SEIZE` -- attach without stopping; sets `ptraced = 1`. +- `PTRACE_CONT` -- resume the stopped tracee, optionally injecting a signal. +- `PTRACE_INTERRUPT` -- force the tracee into ptrace-stop via `hv_vcpus_exit()`. -- `PTRACE_GETREGSET` / `PTRACE_SETREGSET` (`NT_PRSTATUS`) — read or write +- `PTRACE_GETREGSET` / `PTRACE_SETREGSET` (`NT_PRSTATUS`) -- read or write the tracee's register snapshot. Writes are applied on resume. ### Snapshot Protocol @@ -601,14 +601,14 @@ correctly. 
`elf_resolve_interp()` in `src/core/elf.c` is shared between `src/debug/` is split by role: -- `gdbstub.c` — session lifecycle, stop/resume flow, packet dispatch -- `gdbstub-rsp.c` — RSP packet transport and hex helpers -- `gdbstub-reg.c` — register snapshot layout, restore flow, `target.xml` +- `gdbstub.c` -- session lifecycle, stop/resume flow, packet dispatch +- `gdbstub-rsp.c` -- RSP packet transport and hex helpers +- `gdbstub-reg.c` -- register snapshot layout, restore flow, `target.xml` The stub runs in all-stop mode. Because Hypervisor.framework register access must happen on the owning thread, the stopped vCPU snapshots its own state; the GDB-handler thread reads and updates the snapshot, and the owning thread -restores the modified state on resume — the same pattern used by ptrace. +restores the modified state on resume -- the same pattern used by ptrace. The split mirrors the architectural boundary: transport and encoding are independent of guest execution; register layout is independent of socket I/O; @@ -618,10 +618,10 @@ stop/resume sequencing remains tightly coupled to process and thread state. `elfuse` uses several layers of validation: -- `make check` — fast guest tests plus the BusyBox applet smoke suite. -- `make test-busybox` — applet coverage in isolation. -- `make test-gdbstub` — debugger integration. -- `make test-matrix` — cross-checks elfuse against QEMU on the same corpus. +- `make check` -- fast guest tests plus the BusyBox applet smoke suite. +- `make test-busybox` -- applet coverage in isolation. +- `make test-gdbstub` -- debugger integration. +- `make test-matrix` -- cross-checks elfuse against QEMU on the same corpus. The rule for contributors is simple: match the validation depth to the subsystem you changed. 
Procfs, process state, dynamic linking, and debugging diff --git a/docs/testing.md b/docs/testing.md index 1021f1b..d113342 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -33,6 +33,7 @@ The most useful development targets are: ```sh make elfuse make check +make test-rosetta-all make test-gdbstub make test-matrix make lint @@ -46,6 +47,10 @@ What they do: applet smoke suite. The BusyBox binary is auto-resolved from `externals/test-fixtures/aarch64-musl/staticbin/bin/busybox` if present, or downloaded into `build/busybox` on first run. +- `make test-rosetta-all`: Rosetta-specific x86_64 acceptance scripts + (`test-rosetta-cli`, `test-rosetta-failure-modes`, + `test-rosetta-statics`, `test-rosetta-alpine`, + `test-rosetta-audit`, `test-rosetta-jit`, `test-rosetta-glibc`) - `make test-busybox`: just the BusyBox suite, useful when iterating on a single applet failure without rerunning the unit suite - `make test-fuse-alpine`: validate guest `/dev/fuse` + `mount("fuse")` @@ -63,8 +68,8 @@ make elfuse make check ``` -For changes that touch procfs, path handling, `/dev`, FUSE, networking, dynamic linking, or -guest process semantics, run the matrix as well: +For changes that touch procfs, path handling, `/dev`, FUSE, networking, dynamic +linking, or guest process semantics, run the matrix as well: ```sh make test-matrix @@ -76,31 +81,172 @@ iterate on a single applet failure without rerunning the unit suite. ## Test Matrix -The matrix driver lives in `tests/test-matrix.sh`. It runs the same guest test -corpus in two execution modes: +The matrix driver lives in `tests/test-matrix.sh`. 
It currently covers three +execution modes: - `elfuse-aarch64`: every binary is executed via `build/elfuse` on macOS - `qemu-aarch64`: the same binaries run natively inside an Alpine `aarch64-linux-musl` minirootfs booted by `qemu-system-aarch64` +- `elfuse-x86_64`: Rosetta-for-Linux acceptance scripts against the staged + Alpine x86_64 fixture tree The goal is not to compare performance. The goal is to compare guest-observable behavior against a ground-truth Linux AArch64 environment so that any divergence in syscall translation, procfs emulation, or process semantics is caught early. +The x86_64 mode is narrower: it aggregates the Rosetta-specific acceptance +scripts and their per-binary summaries into the same matrix runner, including +the Rosetta thread/signal audit smoke, the LuaJIT guest-JIT probe, and the +glibc dynamic-binary acceptance helper. -Run a single mode with `bash tests/test-matrix.sh elfuse-aarch64` or -`bash tests/test-matrix.sh qemu-aarch64`; `all` runs both back-to-back. +Run a single mode with `bash tests/test-matrix.sh elfuse-aarch64`, +`bash tests/test-matrix.sh qemu-aarch64`, or +`bash tests/test-matrix.sh elfuse-x86_64`; `all` runs all three back-to-back. Fixture handling is self-contained: - On first use, `tests/fetch-fixtures.sh` downloads the required Alpine packages and the `linux-virt` kernel into `externals/test-fixtures/` and assembles an initramfs. Subsequent runs are zero-config. -- The same fixture tree is reused by both matrix modes. +- The same fixture tree is reused across the matrix modes. +- When Rosetta mode is requested and the translator is installed, + `tests/test-matrix.sh` auto-fetches the x86_64 fixture tree + (`INCLUDE_X86_64=1`) on demand. - QEMU mode requires `qemu-system-aarch64` on `PATH` (Homebrew `qemu` provides it). - musl is the only Alpine libc; the glibc-dynamic suite is skipped unless `GUEST_GLIBC_*` environment variables point at an external sysroot. 
+## Rosetta Limitations + +`elfuse-x86_64` is expected to inherit two Rosetta-internal limitations that are +not treated as elfuse regressions: + +- `SA_RESETHAND` is not reset reliably because Rosetta shadows guest signal + handler state internally. This matches the vendored `externals/hyper-linux` + reference behavior. +- `clone(..., CLONE_SETTLS, tls=0, ...)` can hang. The upstream reproducer is + the raw-thread path in `externals/hyper-linux/test/test-thread.c`, and the + same limitation is documented in `externals/hyper-linux/hl.1`. + +The x86_64 matrix branch is therefore a Rosetta acceptance gate, not a claim +that translated guests fully match native Linux thread and signal semantics. + +## x86_64 Acceptance Inventory and Per-Host Baselines + +The `elfuse-x86_64` matrix mode aggregates seven sub-suites. Each one +emits a deterministic per-binary pass list; the matrix runner sums +those into a single `Results:` line and compares against a per-host +baseline. The exact labels each sub-suite emits, and the contract +they verify, are: + +- `tests/test-rosetta-cli.sh` (4): `rosetta-disabled-flag`, + `rosetta-disabled-env`, `rosetta-gdb`, `rosetta-default` -- + command-line gating of the translator path (opt-out flag, env + override, `--gdb` rejection, install-hint surface). + +- `tests/test-rosetta-failure-modes.sh` (3): `no-rosetta-flag`, + `no-rosetta-env`, `gdb-x86_64` -- command-line rejection paths. + Self-contained against a synthesized minimal x86_64 ELF; no + external fixture tree required. The dynamic-linker bring-up and + mid-process execve scenarios that used to live here are now + exclusively in the glibc and statics suites against the vendored + rootfs (see `glibc-hello` / `glibc-hello-via-ldso` and + `env-execve`). 
+ +- `tests/test-rosetta-statics.sh` (20): `echo`, `true`, `false`, + `printenv`, `expr-zero`, `expr-mul`, `basename`, `dirname`, + `stat-self`, `factor`, `seq`, `sha256sum`, `md5sum`, `uname-m`, `arch`, + `busybox-arch-subcommand`, `date-utc`, `id-u`, `nproc`, + `env-execve` -- statically-linked Alpine busybox applets, + exercising VZ ioctl gate, `/proc/self/exe` redirect, high-VA mmap, + and the kbuf alias. + +- `tests/test-rosetta-alpine.sh` (33): `cat-fruits-first-line`, + `wc-l-fruits`, `wc-l-lines`, `wc-c-lines`, `ls-data`, `stat-data`, + `find-by-name`, `du-sk-data`, `sha256-fruits`, + `sha256-lines-matches-host`, `sha512-lines`, `md5-fruits`, + `cksum-fruits`, `sort-first`, `sort-reverse-first`, `pipe-sort-wc`, + `pipe-tr-uppercase`, `pipe-cat-grep`, `pipe-sed-subst`, + `pipe-awk-field`, `head-n3`, `tail-n3`, `pipe-sort-uniq`, + `pipe-cut-field`, `pipe-rev`, `tac-reverse-first-line`, `seq-1-5`, + `seq-step`, `factor-prime`, `factor-composite`, `diff-identical`, + `diff-differs`, `pipe-base64-decode` -- broader file I/O, text + processing, and host-shell pipelines stitched through Rosetta on + every stage. + +- `tests/test-rosetta-audit.sh` (2): `audit-known-limitations`, + `tls0-known-hang` -- bookkeeping probe that asserts the documented + Rosetta shadowing failures (above) remain the only divergences; + fails loudly if a new threading/signal-state edge case starts + diverging. + +- `tests/test-rosetta-jit.sh` (2): `luajit-trace`, + `luajit-coroutine` -- guest-side JIT under translation + (LuaJIT trace emission + coroutine allocation), covering the + small-mprotect RW->RX and per-thread icache observation path that + rosetta's own JIT does not exercise. 
+ +- `tests/test-rosetta-glibc.sh` (7): `glibc-hello`, + `glibc-hello-via-ldso`, `glibc-hello-list`, `glibc-dlopen`, + `glibc-tls`, `glibc-gdtls`, `glibc-pthread-tls` -- + dynamically-linked glibc x86_64 binary acceptance through + `--sysroot` against the staged minimal glibc rootfs under + `externals/test-fixtures/x86_64-glibc/rootfs/`. The first three + cover load-time `PT_INTERP` resolution and `ld.so --list` + introspection. `glibc-dlopen` runs `dlopen("libm.so.6")` plus a + `dlsym(sqrt)` round-trip to exercise the runtime fresh-`.so`-mmap + codepath, which is distinct from the load-time path the first + three probes touch. `glibc-tls` reads and writes two + initial-exec `__thread` variables (one integer, one pointer) so a + broken FS-register to `TPIDR_EL0` translation surfaces as a + value mismatch rather than as a silent skip. `glibc-gdtls` + `dlopen`s a companion `libgdtls.so` whose `__thread` variable + must use the general-dynamic model (calls `__tls_get_addr`); + this is the only probe that exercises that lowering path, which + the initial-exec probe cannot reach. `glibc-pthread-tls` + `pthread_create`s a worker thread that reads and writes its own + `__thread` slot; the probe asserts the worker saw its own + default value (not the main thread's overwritten marker) and that + the main thread's slot survives the worker's write, so a broken + per-thread `TPIDR_EL0` setup on additional threads surfaces as + isolation failure rather than as a silent crash. + +Total: 71 expected passes, 0 expected failures. + +### Per-Host Baseline Capture + +The matrix runner keys its `elfuse-x86_64` baseline by detected host +SoC class. Two classes matter because `sys_mmap_fixed_high_va` takes +different paths under different IPA widths: + +- `apple-m1-m2`: 36-bit native IPA, exercises the overflow-segment + path. Captured on this codebase against Apple M1 hardware + (MacBookAir10,1). The seven sub-suites land at 71/0/0. 
+ +- `apple-m3-plus`: 40-bit native IPA, exercises the bisected-slab + path (and the M5 slab-bisection variant). Currently held equal to + `apple-m1-m2` pending operator capture on real M3+ hardware. When + that capture lands, only the + `EXPECTED_MIN_PASS[elfuse-x86_64:apple-m3-plus]` and + `EXPECTED_FAIL[elfuse-x86_64:apple-m3-plus]` entries in + `tests/test-matrix.sh` move; the M1/M2 row stays intact. + +- `apple-unknown`: fallback for SoC brand strings the detector does + not recognise. Inherits the M1/M2 numbers and triggers a one-line + warning so a new SoC does not silently graft onto an existing row. + +Class detection reads `sysctl -n machdep.cpu.brand_string` and matches +against `Apple M1`/`Apple M2` (M1/M2) and `Apple M3`/`Apple M4`/`Apple +M5` (M3+). To exercise the M3+ row from an M1/M2 host (and vice +versa) without changing the detector, set +`MATRIX_HOST_CLASS_OVERRIDE=apple-m3-plus` (or `apple-m1-m2`, +`apple-unknown`) before invoking `tests/test-matrix.sh`. + +When the seven sub-suites grow or trim a test, the per-sub-suite +counts in the comment block above `EXPECTED_MIN_PASS` and the +inventory list above must move in the same commit so the per-host +baseline stays in sync with reality. + ## Test Inventory The repository contains several layers of validation: diff --git a/mk/common.mk b/mk/common.mk index 746c6ae..084390f 100644 --- a/mk/common.mk +++ b/mk/common.mk @@ -1,4 +1,4 @@ -# mk/common.mk — Generic build rules +# mk/common.mk -- Generic build rules # # Per-file compilation with automatic dependency tracking, verbosity # control, and kernel-style build output. Inspired by libiui's build @@ -18,7 +18,7 @@ $(BUILD_DIR): # Automatic header dependency generation (-MMD -MP) DEPFLAGS = -MMD -MP -MF $(BUILD_DIR)/$(subst /,_,$*).d -# Pattern rules — source to object. +# Pattern rules -- source to object. # GENERATED_HEADERS are order-only prerequisites so clean builds have the # build-generated includes available before compilation. 
.d files track the # real header dependencies after the first compile. Generators whose output diff --git a/mk/config.mk b/mk/config.mk index 0c18aa9..232da91 100644 --- a/mk/config.mk +++ b/mk/config.mk @@ -19,13 +19,19 @@ NATIVE_TESTS := tests/test-multi-vcpu.c tests/test-rwx.c SPECIAL_TEST_SRCS := tests/test-lowbase-mem.c SPECIAL_TEST_BINS := $(BUILD_DIR)/test-lowbase-mem-200000 $(BUILD_DIR)/test-lowbase-mem-300000 +# x86_64-only sources that back the vendored Rosetta fixtures in +# tests/fixtures/rosetta/. They are not buildable with the aarch64 +# cross-toolchain and would fail link with undefined dlopen/pthread +# symbols even if compiled, so exclude them from the aarch64 glob. +ROSETTA_X86_64_SRCS := $(wildcard tests/x86_64-glibc-*.c tests/x86_64-rosetta-*.c) + ifdef GUEST_TEST_BINARIES TEST_DIR := $(GUEST_TEST_BINARIES)/bin TEST_DEPS := TEST_HELLO_DEP := else TEST_DIR := $(BUILD_DIR) - TEST_C_SRCS := $(filter-out $(NATIVE_TESTS) $(SPECIAL_TEST_SRCS),$(wildcard tests/*.c)) + TEST_C_SRCS := $(filter-out $(NATIVE_TESTS) $(SPECIAL_TEST_SRCS) $(ROSETTA_X86_64_SRCS),$(wildcard tests/*.c)) TEST_C_BINS := $(patsubst tests/%.c,$(BUILD_DIR)/%,$(TEST_C_SRCS)) TEST_DEPS := $(BUILD_DIR)/test-hello $(TEST_C_BINS) $(SPECIAL_TEST_BINS) TEST_HELLO_DEP := $(BUILD_DIR)/test-hello diff --git a/mk/shim.mk b/mk/shim.mk index d14ad23..235c635 100644 --- a/mk/shim.mk +++ b/mk/shim.mk @@ -28,7 +28,7 @@ $(BUILD_DIR)/shim_blob.h: $(BUILD_DIR)/shim.bin cmp -s "$$tmp" "$@" 2>/dev/null || mv "$$tmp" "$@"; \ rm -f "$$tmp" -# Version header — regenerates when HEAD or index changes. +# Version header -- regenerates when HEAD or index changes. # cmp trick avoids unnecessary rebuilds when version string is unchanged. 
VERSION_DEPS := $(wildcard .git/HEAD .git/index) mk/config.mk $(BUILD_DIR)/version.h: $(VERSION_DEPS) | $(BUILD_DIR) diff --git a/mk/tests.mk b/mk/tests.mk index f04a6f5..71014b3 100644 --- a/mk/tests.mk +++ b/mk/tests.mk @@ -5,7 +5,8 @@ test-dynamic test-dynamic-coreutils test-glibc-dynamic \ test-glibc-coreutils test-perf \ test-rosetta-cli test-rosetta-statics test-rosetta-failure-modes \ - test-rosetta-alpine test-rosetta-all bench-rosetta \ + test-rosetta-alpine test-rosetta-audit test-rosetta-jit \ + test-rosetta-glibc test-rosetta-all bench-rosetta \ test-matrix test-matrix-elfuse-aarch64 test-matrix-qemu-aarch64 \ test-full test-multi-vcpu test-rwx test-sysroot-rename \ test-case-collision test-case-collision-fallback test-sysroot-create-paths \ @@ -152,9 +153,19 @@ test-rosetta-failure-modes: $(ELFUSE_BIN) test-rosetta-alpine: $(ELFUSE_BIN) $(call RUN_OPTIONAL_SKIP77,bash tests/test-rosetta-alpine.sh $(ELFUSE_BIN),test-rosetta-alpine) +test-rosetta-audit: $(ELFUSE_BIN) + $(call RUN_OPTIONAL_SKIP77,bash tests/test-rosetta-audit.sh $(ELFUSE_BIN),test-rosetta-audit) + +test-rosetta-jit: $(ELFUSE_BIN) + $(call RUN_OPTIONAL_SKIP77,bash tests/test-rosetta-jit.sh $(ELFUSE_BIN),test-rosetta-jit) + +test-rosetta-glibc: $(ELFUSE_BIN) + $(call RUN_OPTIONAL_SKIP77,bash tests/test-rosetta-glibc.sh $(ELFUSE_BIN),test-rosetta-glibc) + ## Run every Rosetta-specific test target in sequence. test-rosetta-all: test-rosetta-cli test-rosetta-failure-modes \ - test-rosetta-statics test-rosetta-alpine + test-rosetta-statics test-rosetta-alpine \ + test-rosetta-audit test-rosetta-jit test-rosetta-glibc ## Wall-clock bench harness for x86_64-via-Rosetta workloads. Prints ## best-of-N samples plus the aarch64 reference where available. 
Set @@ -443,8 +454,8 @@ test-perf: $(ELFUSE_BIN) $(PERF_DEPS) ## Alias for test-perf perf: test-perf -# Test matrix (elfuse + qemu, aarch64) -## Run full test matrix (all modes: elfuse + qemu, aarch64) +# Test matrix (elfuse aarch64 + qemu aarch64 + elfuse x86_64/Rosetta) +## Run full test matrix (all modes: elfuse-aarch64, qemu-aarch64, elfuse-x86_64) test-matrix: $(ELFUSE_BIN) $(TEST_DEPS) @bash tests/test-matrix.sh all @@ -456,8 +467,7 @@ test-matrix-elfuse-aarch64: $(ELFUSE_BIN) $(TEST_DEPS) test-matrix-qemu-aarch64: $(ELFUSE_BIN) $(TEST_DEPS) @bash tests/test-matrix.sh qemu-aarch64 -## Probe the x86_64-via-Rosetta matrix wiring. Fails closed until the runtime -## and fixture corpus are complete enough to execute real coverage. +## Run test matrix: elfuse x86_64-via-Rosetta mode test-matrix-elfuse-x86_64: $(ELFUSE_BIN) $(TEST_DEPS) @bash tests/test-matrix.sh elfuse-x86_64 diff --git a/src/core/bootstrap.c b/src/core/bootstrap.c index eb61c63..c6522df 100644 --- a/src/core/bootstrap.c +++ b/src/core/bootstrap.c @@ -288,7 +288,7 @@ static bool build_boot_regions(mem_region_t *regions, * guest memory; rosetta itself reads the target via fd 3 once it is * running. Adding those segments to the page-table builder would emit * ghost L2/L3 entries at the binary's x86_64 link address (typically - * 0x400000) pointing into uninitialised primary-buffer GPAs. The + * 0x400000) pointing into uninitialized primary-buffer GPAs. The * rosetta image's own segments are registered by rosetta_prepare's * separate region append in the bootstrap caller. 
*/ @@ -704,7 +704,6 @@ int guest_bootstrap_rosetta_post_reset(guest_t *g, mem_region_t regions[MAX_BOOT_REGIONS]; int nregions = 0; rosetta_result_t rr; - if (rosetta_prepare(g, elf_host_path, regions, &nregions, MAX_BOOT_REGIONS, verbose, &rr) < 0) { log_error("rosetta_prepare failed during exec re-bootstrap"); @@ -712,7 +711,7 @@ int guest_bootstrap_rosetta_post_reset(guest_t *g, } /* build_boot_regions skips ELF segments when g->is_rosetta is set, so a - * zero-initialised guest_bootstrap_t is enough to drive it here. + * zero-initialized guest_bootstrap_t is enough to drive it here. */ guest_bootstrap_t boot_stub; memset(&boot_stub, 0, sizeof(boot_stub)); @@ -729,9 +728,9 @@ int guest_bootstrap_rosetta_post_reset(guest_t *g, } g->ttbr0 = ttbr0; - /* Re-publish /proc/self/maps style metadata. Mirrors the bootstrap path - * so the post-exec view reports rosetta-as-anonymous-mapping plus the - * heap, stack, stack-guard, shim, and shim-data. + /* Re-publish /proc/self/maps style metadata. Mirrors the bootstrap path so + * the post-exec view reports rosetta-as-anonymous-mapping plus the heap, + * stack, stack-guard, shim, and shim-data. */ register_elf_segment_regions(g, &rr.rosetta_info, 0, g->rosetta_guest_base - g->rosetta_va_base, diff --git a/src/core/guest.c b/src/core/guest.c index d00a49a..6393b00 100644 --- a/src/core/guest.c +++ b/src/core/guest.c @@ -535,6 +535,7 @@ void guest_destroy(guest_t *g) } } g->nregions = 0; + g->npreannounced = 0; /* Close the shm fd if guest memory owns one (parent with shm backing) */ if (g->shm_fd >= 0) { close(g->shm_fd); @@ -950,7 +951,7 @@ int guest_map_va_range(guest_t *g, if (l2[l2_idx] & PT_VALID) { /* Block already mapped -- caller may want guest_update_perms / * guest_split_block instead. Skip silently to mirror upstream's - * sys_mmap_high_va "reuse existing GPA" behaviour. + * sys_mmap_high_va "reuse existing GPA" behavior. 
*/ continue; } @@ -970,7 +971,7 @@ int guest_install_kbuf_user_alias(guest_t *g) if (!g || !g->kbuf_gpa || !g->ttbr0) { log_error( "guest_install_kbuf_user_alias: kbuf or ttbr0 not " - "initialised"); + "initialized"); return -1; } @@ -1785,6 +1786,38 @@ int guest_region_add_ex_owned_gpa(guest_t *g, return 0; } +int guest_preannounce(guest_t *g, + uint64_t start, + uint64_t end, + int prot, + int flags, + uint64_t offset, + const char *name) +{ + if (g->npreannounced >= GUEST_MAX_PREANNOUNCED) + return -1; + + int i = g->npreannounced; + while (i > 0 && g->preannounced[i - 1].start > start) { + g->preannounced[i] = g->preannounced[i - 1]; + i--; + } + + guest_region_t *r = &g->preannounced[i]; + memset(r, 0, sizeof(*r)); + r->start = start; + r->end = end; + r->gpa_base = start; + r->prot = prot; + r->flags = flags; + r->offset = offset; + r->backing_fd = -1; + if (name) + str_copy_trunc(r->name, name, sizeof(r->name)); + g->npreannounced++; + return 0; +} + void guest_region_remove(guest_t *g, uint64_t start, uint64_t end) { int i = 0; @@ -2027,6 +2060,7 @@ static void guest_region_clear(guest_t *g) } } g->nregions = 0; + g->npreannounced = 0; } /* Page table builder. */ @@ -2538,7 +2572,7 @@ static uint64_t *find_l2_entry(guest_t *g, uint64_t va) return &l2[l2_idx]; } -/* Split a 2MiB L2 block descriptor into 512 × 4KiB L3 page descriptors. +/* Split a 2MiB L2 block descriptor into 512 x 4KiB L3 page descriptors. * The caller provides the L2 entry via find_l2_entry. * Extracts the output IPA from the existing descriptor. */ diff --git a/src/core/guest.h b/src/core/guest.h index 2dda869..5429392 100644 --- a/src/core/guest.h +++ b/src/core/guest.h @@ -161,6 +161,13 @@ typedef struct { */ #define GUEST_MAX_REGIONS 4096 +/* Preannounced regions appear only in /proc/self/maps and are NOT consulted by + * mmap / mprotect / munmap conflict checks. 
Used for runtimes such as Rosetta + * that snapshot a code map from /proc/self/maps before they reserve or remap + * their own address ranges via MAP_FIXED_NOREPLACE. + */ +#define GUEST_MAX_PREANNOUNCED 16 + /* HVF stage-2 mapping segment. The slab is mapped to HVF in pieces so that * file-backed MAP_SHARED regions can have real host-VA overlays applied via * mmap MAP_FIXED|MAP_SHARED of a file fd. HVF requires hv_vm_unmap to target @@ -392,6 +399,8 @@ typedef struct { /* Semantic region tracking for munmap/mprotect/proc-self-maps */ guest_region_t regions[GUEST_MAX_REGIONS]; int nregions; /* Number of active regions */ + guest_region_t preannounced[GUEST_MAX_PREANNOUNCED]; + int npreannounced; /* /proc/self/maps-only shadow regions */ /* HVF stage-2 segment list: the union of segments[0..n_segments) covers the * live IPA range that is currently hv_vm_map'd to HVF. Sorted by ipa. @@ -956,6 +965,25 @@ int guest_region_add_ex_owned_gpa(guest_t *g, const char *name, int owned_backing_fd); +/* Add a preannounced region that appears in /proc/self/maps only. + * These entries are kept separate from regions[] so they do not cause + * -EEXIST on guest MAP_FIXED_NOREPLACE reservations. + * + * No producer wires this up today. The storage, fork-IPC, and + * /proc/self/maps consumer are kept as scaffolding for runtimes that + * consult /proc/self/maps before reserving VA ranges. Preannouncing + * the x86_64 image during Rosetta bring-up was tried and rejected: it + * perturbed Rosetta's internal allocation tracker. The hook stays + * until a workload needs an advertise-only entry. + */ +int guest_preannounce(guest_t *g, + uint64_t start, + uint64_t end, + int prot, + int flags, + uint64_t offset, + const char *name); + /* Remove all region coverage in [start, end). Regions fully contained are * deleted; partially overlapping regions are trimmed or split. 
*/ diff --git a/src/core/rosetta.c b/src/core/rosetta.c index 7805490..6f442c7 100644 --- a/src/core/rosetta.c +++ b/src/core/rosetta.c @@ -6,7 +6,7 @@ * * rosetta_prepare loads the Apple Rosetta binary into the primary buffer at * a low GPA and exposes it at its statically-linked high VA (0x800000000000) - * via a non-identity mem_region_t.va_base. The TTBR1 kbuf is initialised at + * via a non-identity mem_region_t.va_base. The TTBR1 kbuf is initialized at * a 256 MiB window just below the rosetta image. rosetta_finalize wires the * bootstrap-visible pieces needed to enter the translator: fd 3 setup, * binfmt-style argv construction, cmdline refresh, and the TTBR0 kbuf alias. @@ -801,7 +801,7 @@ static ssize_t rosettad_send_fd(int sock, uint8_t payload, int send_fd) /* Translate subprocess */ -/* Spawn `elfuse rosettad translate ` and wait for it +/* Spawn 'elfuse rosettad translate ' and wait for it * to exit. Returns 0 if the translator exited successfully and the * output file is non-empty, -1 otherwise. */ @@ -885,7 +885,7 @@ static int translate_via_rosettad(const char *in_path, const char *out_path) * (hit -> return cached fd), or spawn the translator and publish the * result. Returns an O_RDONLY fd pointing at the AOT file on success, * -1 on any failure. *out_digest is always written when the SHA-256 - * succeeds; the caller passes it back to rosetta so subsequent `d` + * succeeds; the caller passes it back to rosetta so subsequent 'd' * lookups reuse the same key. 
*/ static int rosettad_translate(int bin_fd, diff --git a/src/core/rosetta.h b/src/core/rosetta.h index 34b95f5..f728ea4 100644 --- a/src/core/rosetta.h +++ b/src/core/rosetta.h @@ -169,7 +169,7 @@ typedef struct { /* First-pass rosetta setup, runs before guest_build_page_tables(): parse * the rosetta binary, place its segments in the primary buffer (or reload - * into the existing placement on execve), initialise the TTBR1 kbuf, and + * into the existing placement on execve), initialize the TTBR1 kbuf, and * append page-table regions for the builder. A single non-identity * mem_region_t covers the rosetta image, mapping its statically-linked high * VA to the chosen low GPA via mem_region_t.va_base. diff --git a/src/core/shim.S b/src/core/shim.S index f51756d..7c1dbd2 100644 --- a/src/core/shim.S +++ b/src/core/shim.S @@ -248,7 +248,7 @@ bad_exception: * registers X9-X15 are NOT saved by the compiler across SVC calls. The shim * must save/restore ALL 31 GPRs. * - * Stack frame: 256 bytes (16 pairs × 16 bytes) + * Stack frame: 256 bytes (16 pairs x 16 bytes) * [sp+0] x0,x1 [sp+16] x2,x3 [sp+32] x4,x5 * [sp+48] x6,x7 [sp+64] x8,x9 [sp+80] x10,x11 * [sp+96] x12,x13 [sp+112] x14,x15 [sp+128] x16,x17 @@ -605,7 +605,7 @@ tlbi_selective: * TLBI VAE1IS takes a Xt operand of (VA[55:12] | (ASID << 48)). The * guest runs single-ASID at EL0, so just shift the VA right by 12. * Issue all TLBI ops, then a single DSB ISH + IC IALLU + DSB + ISB - * matches broadcast semantics (preserves I-cache invalidation behaviour + * matches broadcast semantics (preserves I-cache invalidation behavior * for callers like file-backed mmap of executable pages). * * Defensive: if x10 == 0, skip the loop. 
The per-vCPU host-side diff --git a/src/debug/crashreport.c b/src/debug/crashreport.c index 3ef4c3b..2c4b789 100644 --- a/src/debug/crashreport.c +++ b/src/debug/crashreport.c @@ -170,13 +170,9 @@ void crash_report(hv_vcpu_t vcpu, fprintf(stderr, "\n"); if (vcpu) { - fprintf(stderr, "## Registers\n"); - uint64_t pc = 0, cpsr = 0; hv_vcpu_get_reg(vcpu, HV_REG_PC, &pc); hv_vcpu_get_reg(vcpu, HV_REG_CPSR, &cpsr); - fprintf(stderr, "PC = 0x%016llx CPSR = 0x%016llx\n", - (unsigned long long) pc, (unsigned long long) cpsr); uint64_t esr = 0, far_reg = 0, elr = 0, spsr = 0, sctlr = 0, sp_el0 = 0, tpidr = 0; @@ -188,6 +184,24 @@ void crash_report(hv_vcpu_t vcpu, hv_vcpu_get_sys_reg(vcpu, HV_SYS_REG_SP_EL0, &sp_el0); hv_vcpu_get_sys_reg(vcpu, HV_SYS_REG_TPIDR_EL0, &tpidr); + /* The Rosetta breadcrumb has its own section header so + * downstream parsers can keep treating "## Registers" as the + * first line of the register section. Emitting the banner + * inline above that header used to break that assumption. + */ + if ((g && g->is_rosetta) || proc_rosetta_active()) { + fprintf(stderr, "## Rosetta\n"); + fprintf(stderr, + "via Apple Rosetta: aarch64 PC=0x%016llx " + "ELR=0x%016llx TPIDR_EL0=0x%016llx\n\n", + (unsigned long long) pc, (unsigned long long) elr, + (unsigned long long) tpidr); + } + + fprintf(stderr, "## Registers\n"); + fprintf(stderr, "PC = 0x%016llx CPSR = 0x%016llx\n", + (unsigned long long) pc, (unsigned long long) cpsr); + fprintf(stderr, "ESR = 0x%016llx EC=0x%02x (%s)\n", (unsigned long long) esr, (unsigned) ((esr >> 26) & 0x3f), esr_ec_name(esr)); diff --git a/src/debug/gdbstub.c b/src/debug/gdbstub.c index a8f3e5b..4fb7009 100644 --- a/src/debug/gdbstub.c +++ b/src/debug/gdbstub.c @@ -757,7 +757,7 @@ static void handle_thread_info(int first) } /* Build comma-separated hex thread ID list. - * Worst case: MAX_THREADS(64) × 17 chars (16 hex + comma) + 1 prefix + NUL. + * Worst case: MAX_THREADS(64) x 17 chars (16 hex + comma) + 1 prefix + NUL. 
*/ char reply[2048]; int pos = 0; diff --git a/src/runtime/fork-state.c b/src/runtime/fork-state.c index ed50f16..f9746cd 100644 --- a/src/runtime/fork-state.c +++ b/src/runtime/fork-state.c @@ -428,7 +428,9 @@ static int fork_ipc_send_backing_fds(int ipc_sock, int fork_ipc_send_process_state(int ipc_sock, const guest_region_t *regions_snapshot, - uint32_t num_guest_regions) + uint32_t num_guest_regions, + const guest_region_t *preannounced_snapshot, + uint32_t num_preannounced) { char cwd[LINUX_PATH_MAX] = {0}; getcwd(cwd, sizeof(cwd)); @@ -476,6 +478,14 @@ int fork_ipc_send_process_state(int ipc_sock, num_guest_regions * sizeof(guest_region_t)) < 0) return -1; + if (fork_ipc_write_all(ipc_sock, &num_preannounced, + sizeof(num_preannounced)) < 0) + return -1; + if (num_preannounced > 0 && + fork_ipc_write_all(ipc_sock, preannounced_snapshot, + num_preannounced * sizeof(guest_region_t)) < 0) + return -1; + if (fork_ipc_send_backing_fds(ipc_sock, regions_snapshot, num_guest_regions) < 0) return -1; @@ -656,6 +666,27 @@ int fork_ipc_recv_process_state(int ipc_fd, guest_t *g, signal_state_t *sig) } g->nregions = (int) num_guest_regions; + uint32_t num_preannounced = 0; + if (fork_ipc_read_all(ipc_fd, &num_preannounced, sizeof(num_preannounced)) < + 0) { + log_error("fork-child: failed to read preannounced count"); + return -1; + } + uint32_t recv_preannounced = num_preannounced; + if (recv_preannounced > GUEST_MAX_PREANNOUNCED) + recv_preannounced = GUEST_MAX_PREANNOUNCED; + if (recv_preannounced > 0 && + fork_ipc_read_all(ipc_fd, g->preannounced, + recv_preannounced * sizeof(guest_region_t)) < 0) { + log_error("fork-child: failed to read preannounced regions"); + return -1; + } + if (num_preannounced > recv_preannounced && + fork_ipc_drain_bytes(ipc_fd, (num_preannounced - recv_preannounced) * + sizeof(guest_region_t)) < 0) + return -1; + g->npreannounced = (int) recv_preannounced; + /* Capture parent state before clearing the inherited overlay/backing fd * fields. 
parent_had_fd lets recv_backing_fds iterate in the same order the * sender used (regions with backing_fd >= 0); the parent_ovl_* arrays let diff --git a/src/runtime/fork-state.h b/src/runtime/fork-state.h index 8386dca..53233d1 100644 --- a/src/runtime/fork-state.h +++ b/src/runtime/fork-state.h @@ -98,5 +98,7 @@ int fork_ipc_recv_fd_table(int ipc_fd, guest_t *g); int fork_ipc_send_process_state(int ipc_sock, const guest_region_t *regions_snapshot, - uint32_t num_guest_regions); + uint32_t num_guest_regions, + const guest_region_t *preannounced_snapshot, + uint32_t num_preannounced); int fork_ipc_recv_process_state(int ipc_fd, guest_t *g, signal_state_t *sig); diff --git a/src/runtime/forkipc.c b/src/runtime/forkipc.c index 52847cb..a2cfe9c 100644 --- a/src/runtime/forkipc.c +++ b/src/runtime/forkipc.c @@ -1226,6 +1226,7 @@ int64_t sys_clone(hv_vcpu_t vcpu, mmap_fork_anon_shared_txn_t *anon_shared_txn = NULL; guest_region_t *regions_snapshot = NULL; + guest_region_t preannounced_snapshot[GUEST_MAX_PREANNOUNCED]; /* Convert MAP_SHARED|MAP_ANONYMOUS regions that have no backing fd * into memfd-backed overlay regions. 
The conversion seeds a private @@ -1402,6 +1403,11 @@ int64_t sys_clone(hv_vcpu_t vcpu, } memcpy(regions_snapshot, g->regions, snap_sz); } + int npreannounced_snapshot = g->npreannounced; + if (npreannounced_snapshot > 0) { + memcpy(preannounced_snapshot, g->preannounced, + (size_t) npreannounced_snapshot * sizeof(guest_region_t)); + } if (fork_ipc_send_fd_table(ipc_sock) < 0) { log_error("clone: failed to send fd table"); @@ -1409,8 +1415,10 @@ int64_t sys_clone(hv_vcpu_t vcpu, } uint32_t num_guest_regions = (uint32_t) nregions_snapshot; + uint32_t num_preannounced = (uint32_t) npreannounced_snapshot; if (fork_ipc_send_process_state(ipc_sock, regions_snapshot, - num_guest_regions) < 0) { + num_guest_regions, preannounced_snapshot, + num_preannounced) < 0) { log_error("clone: failed to send process state"); goto fail_snapshot; } diff --git a/src/runtime/futex.c b/src/runtime/futex.c index 0d02d6b..683d95c 100644 --- a/src/runtime/futex.c +++ b/src/runtime/futex.c @@ -73,8 +73,8 @@ static _Atomic int futex_interrupt_requested = 0; #define FUTEX_WAKE_BITSET 10 /* Strips the FUTEX_PRIVATE_FLAG (0x80) and FUTEX_CLOCK_REALTIME bits so the - * dispatch switch sees only the base operation. Emulation does not - * differentiate private vs shared futexes (single-process guest). + * dispatch switch sees only the base operation. Emulation doesn't differentiate + * private vs shared futexes (single-process guest). */ #define FUTEX_CMD_MASK 0x7F @@ -97,14 +97,16 @@ static _Atomic int futex_interrupt_requested = 0; * * os_sync_available is set in futex_init() when the runtime supports the * os_sync_wait_on_address family (macOS 14.4+). 
Plain FUTEX_WAIT remains on - * the bucket path until Darwin can preserve Linux's -EAGAIN race semantics, - * so os_sync_wait_enabled stays false for now and the wake-side helper stays + * the bucket path until Darwin can preserve Linux's -EAGAIN race semantics, so + * os_sync_wait_enabled stays false for now and the wake-side helper stays * dormant too. * * The wait quantum is capped at 100 ms so proc_exit_group_requested() and * futex_interrupt_pending() get noticed promptly without a process-wide * broadcast channel. The 1-second EINTR simulation that the bucket path uses - * for shutdown-stalled multi-threaded runtimes is preserved here. + * for shutdown-stalled multi-threaded runtimes is preserved here, but only + * once more than one guest thread is active. Single-threaded guests should not + * see synthetic EINTR churn on indefinite waits. */ #if ELFUSE_HAVE_OS_SYNC_WAIT_ON_ADDRESS static bool os_sync_available; @@ -114,6 +116,11 @@ static bool os_sync_wait_enabled; #define FUTEX_OS_SYNC_POLL_CAP_NS (100ULL * 1000 * 1000) #define FUTEX_OS_SYNC_EINTR_SIM_MS 1000 +static inline bool futex_should_simulate_periodic_eintr(void) +{ + return !thread_is_single_active(); +} + /* Hash table */ #define FUTEX_BUCKETS 64 @@ -267,25 +274,25 @@ static int futex_make_deadline(guest_t *g, } /* Compute the relative wait quantum until an absolute CLOCK_REALTIME deadline, - * capped at cap_ns. Operates on (sec, nsec) pairs to avoid overflowing - * int64_t when delta_sec * NSEC_PER_SEC could exceed INT64_MAX: - * linux_timespec_is_valid() accepts tv_sec up to FUTEX_TIMESPEC_SEC_MAX - * (== INT64_MAX/4), and the absolute-deadline path forwards that value - * unchanged into the host timespec. Multiplying tv_sec * 1e9 first would - * overflow signed arithmetic for adversarial guest inputs. + * capped at cap_ns. 
Operates on (sec, nsec) pairs to avoid overflowing int64_t + * when delta_sec * NSEC_PER_SEC could exceed INT64_MAX: linux_timespec_is_valid + * accepts tv_sec up to FUTEX_TIMESPEC_SEC_MAX (== INT64_MAX/4), and the + * absolute-deadline path forwards that value unchanged into the host timespec. + * Multiplying tv_sec * 1e9 first would overflow signed arithmetic for + * adversarial guest inputs. * - * Borrow-normalize the (delta_sec, delta_nsec) pair before comparing so a - * caller who hits delta_sec == 1 with delta_nsec < 0 (e.g., deadline tv_nsec - * just past now tv_nsec when now is near a second boundary) does not get - * billed the full cap when only a few nanoseconds remain. After the borrow - * delta_nsec lives in [0, NSEC_PER_SEC); a single borrow always suffices - * because both inputs are normalized. + * Borrow-normalize the (delta_sec, delta_nsec) pair before comparing so a caller + * who hits delta_sec == 1 with delta_nsec < 0 (e.g., deadline tv_nsec just past + * now tv_nsec when now is near a second boundary) does not get billed the full + * cap when only a few nanoseconds remain. After the borrow delta_nsec lives in + * [0, NSEC_PER_SEC); a single borrow always suffices because both inputs are + * normalized. * - * Once delta_sec >= 1 (post-borrow) the cap (~100 ms) dominates regardless - * of delta_nsec, so the function returns cap_ns. delta_sec == 0 falls - * through to min(delta_nsec, cap_ns). delta_sec < 0 (or delta_sec == 0 and - * delta_nsec == 0) means the deadline has elapsed; return 0 so the caller - * surfaces ETIMEDOUT without re-arming. + * Once delta_sec >= 1 (post-borrow) the cap (~100 ms) dominates regardless of + * delta_nsec, so the function returns cap_ns. delta_sec == 0 falls through to + * min(delta_nsec, cap_ns). delta_sec < 0 (or delta_sec == 0 and delta_nsec == + * 0) means the deadline has elapsed; return 0 so the caller surfaces ETIMEDOUT + * without re-arming.
*/ static uint64_t futex_remaining_ns(const struct timespec *deadline, uint64_t cap_ns) @@ -386,7 +393,9 @@ static int64_t futex_os_sync_wait(guest_t *g, return -LINUX_EAGAIN; struct timeval wait_start; - if (!has_timeout) + bool simulate_periodic_eintr = + !has_timeout && futex_should_simulate_periodic_eintr(); + if (simulate_periodic_eintr) gettimeofday(&wait_start, NULL); /* Bound consecutive EFAULT retries. Apple documents EFAULT as transient @@ -430,7 +439,7 @@ static int64_t futex_os_sync_wait(guest_t *g, if (proc_exit_group_requested() || futex_interrupt_pending()) return -LINUX_EINTR; - if (!has_timeout) { + if (simulate_periodic_eintr) { struct timeval now; gettimeofday(&now, NULL); long elapsed_ms = (now.tv_sec - wait_start.tv_sec) * 1000 + @@ -530,7 +539,9 @@ static int64_t futex_wait(guest_t *g, * condition and retrying. */ struct timeval wait_start; - if (!has_timeout) + bool simulate_periodic_eintr = + !has_timeout && futex_should_simulate_periodic_eintr(); + if (simulate_periodic_eintr) gettimeofday(&wait_start, NULL); while (!__atomic_load_n(&waiter.woken, __ATOMIC_ACQUIRE)) { @@ -555,18 +566,20 @@ static int64_t futex_wait(guest_t *g, break; } - /* Simulate periodic signal delivery: return -EINTR after 1 second - * of blocking. This prevents deadlocks in multi-threaded runtimes - * that rely on signal-interrupted futex_wait for scheduler context - * switching. + /* Simulate periodic signal delivery only for multi-threaded + * guests. Single-threaded glibc startup paths can legitimately + * park in FUTEX_WAIT forever until a real wake arrives, and + * synthetic EINTR here breaks that contract. 
*/ - struct timeval now; - gettimeofday(&now, NULL); - long elapsed_ms = (now.tv_sec - wait_start.tv_sec) * 1000 + - (now.tv_usec - wait_start.tv_usec) / 1000; - if (elapsed_ms >= 1000) { - ret = -LINUX_EINTR; - break; + if (simulate_periodic_eintr) { + struct timeval now; + gettimeofday(&now, NULL); + long elapsed_ms = (now.tv_sec - wait_start.tv_sec) * 1000 + + (now.tv_usec - wait_start.tv_usec) / 1000; + if (elapsed_ms >= FUTEX_OS_SYNC_EINTR_SIM_MS) { + ret = -LINUX_EINTR; + break; + } } } } diff --git a/src/runtime/procemu.c b/src/runtime/procemu.c index 3acf491..c1b2e31 100644 --- a/src/runtime/procemu.c +++ b/src/runtime/procemu.c @@ -79,6 +79,64 @@ static char proc_tmpdir[128]; static bool proc_tmpdir_ok; static pthread_mutex_t proc_tmpdir_lock = PTHREAD_MUTEX_INITIALIZER; +typedef struct { + uint64_t start, end; + int prot, flags; + uint64_t offset; + char name[64]; +} maps_entry_t; + +static void maps_entry_insert(maps_entry_t *entries, + int *nentries, + uint64_t start, + uint64_t end, + int prot, + int flags, + uint64_t offset, + const char *name) +{ + if (*nentries >= MAPS_ENTRY_MAX || end <= start) + return; + + int i = *nentries; + while (i > 0 && entries[i - 1].start > start) { + entries[i] = entries[i - 1]; + i--; + } + + maps_entry_t *e = &entries[i]; + e->start = start; + e->end = end; + e->prot = prot; + e->flags = flags; + e->offset = offset; + if (name && name[0]) + str_copy_trunc(e->name, name, sizeof(e->name)); + else + e->name[0] = '\0'; + (*nentries)++; +} + +static void maps_entries_merge_adjacent(maps_entry_t *entries, int *nentries) +{ + if (*nentries <= 1) + return; + + int out = 0; + for (int i = 1; i < *nentries; i++) { + if (entries[i].start == entries[out].end && + entries[i].prot == entries[out].prot && + entries[i].flags == entries[out].flags && + entries[i].offset == entries[out].offset && + strcmp(entries[i].name, entries[out].name) == 0) { + entries[out].end = entries[i].end; + continue; + } + entries[++out] = entries[i]; + } + 
*nentries = out + 1; +} + /* Synthetic /sys/devices/system/cpu directory backing store. Populated lazily * on first access (Java GC, Go runtime, libnuma probe these to size thread * pools). Layout matches the minimal subset Linux exposes: @@ -1793,48 +1851,69 @@ int proc_intercept_open(const guest_t *g, int off = 0; /* Build a flat array of (va_start, va_end, prot, flags, offset, name) - * from regions[] with merging. + * from regions[] plus /proc/self/maps-only preannounced[] entries. + * preannounced[] is intentionally NOT consulted by mmap conflict + * detection, so advertise-only Rosetta/JIT regions do not trip + * MAP_FIXED_NOREPLACE with -EEXIST. */ - typedef struct { - uint64_t start, end; - int prot, flags; - uint64_t offset; - char name[64]; - } maps_entry_t; maps_entry_t entries[MAPS_ENTRY_MAX]; int nentries = 0; - /* Convert regions[] to maps entries (identity-mapped) */ - for (int i = 0; i < g->nregions && nentries < MAPS_ENTRY_MAX - 1; i++) { + /* Convert regions[] to maps entries. regions[] is already sorted by + * start address; merge contiguous runs that came from one mmap. + */ + for (int i = 0; i < g->nregions && nentries < MAPS_ENTRY_MAX; i++) { const guest_region_t *r = &g->regions[i]; uint64_t start = r->start & ~0xFFFULL; uint64_t end = (r->end + 0xFFF) & ~0xFFFULL; - /* Try to merge with previous entry if contiguous and same - * prot/flags/name. This collapses many 2 MiB blocks into a single - * maps line, matching real Linux kernel behavior. 
- */ - if (nentries > 0) { - maps_entry_t *prev = &entries[nentries - 1]; - if (start == prev->end && r->prot == prev->prot && - r->flags == prev->flags && !strcmp(r->name, prev->name)) { - prev->end = end; + if (nentries > 0 && entries[nentries - 1].end == start && + entries[nentries - 1].prot == r->prot && + entries[nentries - 1].flags == r->flags && + entries[nentries - 1].offset == r->offset && + !strcmp(entries[nentries - 1].name, r->name)) { + entries[nentries - 1].end = end; + continue; + } + maps_entry_insert(entries, &nentries, start, end, r->prot, r->flags, + r->offset, r->name); + } + + /* Add preannounced entries only while they still have an uncovered + * tail. Once the union of live regions covers the full advertised + * interval, suppress the shadow entry so /proc/self/maps shows only + * the realized split VMAs. A partial union must stay visible because + * some reserved-but-not-realized span remains to advertise. + */ + for (int i = 0; i < g->npreannounced && nentries < MAPS_ENTRY_MAX; + i++) { + const guest_region_t *r = &g->preannounced[i]; + bool shadowed = false; + uint64_t covered_end = r->start; + + for (int j = 0; j < g->nregions; j++) { + const guest_region_t *live = &g->regions[j]; + + if (live->end <= covered_end) continue; + if (live->start > covered_end) + break; + + covered_end = live->end; + if (covered_end >= r->end) { + shadowed = true; + break; } } - maps_entry_t *e = &entries[nentries++]; - e->start = start; - e->end = end; - e->prot = r->prot; - e->flags = r->flags; - e->offset = r->offset; - if (r->name[0]) { - str_copy_trunc(e->name, r->name, sizeof(e->name)); - } else { - e->name[0] = '\0'; - } + if (shadowed) + continue; + + maps_entry_insert(entries, &nentries, r->start & ~0xFFFULL, + (r->end + 0xFFFULL) & ~0xFFFULL, r->prot, + r->flags, r->offset, r->name); } + maps_entries_merge_adjacent(entries, &nentries); /* Emit lines after merging so buffer accounting is centralized. 
*/ for (int i = 0; i < nentries && off < (int) sizeof(buf) - 256; i++) { @@ -2185,7 +2264,7 @@ int proc_intercept_open(const guest_t *g, (uint64_t) vm_stat.inactive_count * page_size / 1024; uint64_t purgeable_kb = (uint64_t) vm_stat.purgeable_count * page_size / 1024; - /* Available ≈ free + inactive + purgeable (Linux heuristic) */ + /* Available ~= free + inactive + purgeable (Linux heuristic) */ avail_kb = free_kb + inactive_kb + purgeable_kb; if (avail_kb > total_kb) avail_kb = total_kb; diff --git a/src/runtime/thread.h b/src/runtime/thread.h index a8d35ab..7a566ff 100644 --- a/src/runtime/thread.h +++ b/src/runtime/thread.h @@ -105,8 +105,8 @@ typedef struct { * resume, dirty changes are applied back to the vCPU. */ uint8_t gdb_reg_snapshot[788]; /* Register snapshot for GDB - * Layout: 31×GPR(8) + SP(8) + PC(8) - * + CPSR(4) + 32×V(16) + FPSR(4) + FPCR(4) + * Layout: 31xGPR(8) + SP(8) + PC(8) + * + CPSR(4) + 32xV(16) + FPSR(4) + FPCR(4) */ bool gdb_regs_dirty; /* GDB handler modified snapshot */ diff --git a/src/syscall/io.c b/src/syscall/io.c index 60f6f34..ee183dd 100644 --- a/src/syscall/io.c +++ b/src/syscall/io.c @@ -94,8 +94,9 @@ static void termios_copy_cc_to_linux(uint8_t linux_cc[19], const cc_t mac_cc[]) { for (int i = 0; i < 19; i++) { int mac_idx = linux_mac_cc[i]; - // cppcheck-suppress negativeIndex - // RANGE_CHECK guards mac_idx >= 0 before the array access. + /* cppcheck-suppress negativeIndex + * RANGE_CHECK guards mac_idx >= 0 before the array access. + */ linux_cc[i] = RANGE_CHECK(mac_idx, 0, NCCS) ? mac_cc[mac_idx] : 0; } } @@ -186,7 +187,7 @@ static int64_t rosetta_vz_ioctl(guest_t *g, uint64_t request, uint64_t arg) return 1; } case ROSETTA_VZ_CAPS: { - /* caps is zero-initialised: VZ_SECONDARY and the trailing NUL of any + /* caps is zero-initialized: VZ_SECONDARY and the trailing NUL of any * partially-copied binary path are already in place. 
*/ uint8_t caps[ROSETTA_CAPS_SIZE] = {0}; diff --git a/src/syscall/mem.c b/src/syscall/mem.c index 32620c2..092720d 100644 --- a/src/syscall/mem.c +++ b/src/syscall/mem.c @@ -70,6 +70,38 @@ typedef struct { uint64_t start, end; } remove_range_t; +typedef struct { + uint64_t start; + uint64_t end; + uint64_t gpa_base; + int prot; + int flags; + uint64_t offset; + int backing_fd; + bool overlay_active; + uint64_t overlay_start; + uint64_t overlay_end; + char name[sizeof(((guest_region_t *) 0)->name)]; +} region_snapshot_t; + +static int capture_region_snapshots(guest_t *g, + uint64_t start, + uint64_t end, + region_snapshot_t *snaps, + int max_snaps); +static void close_region_snapshots(region_snapshot_t *snaps, int n); +static int restore_snapshot_overlays_in_place(guest_t *g, + const region_snapshot_t *snaps, + int n); +static int restore_snapshot_page_tables(guest_t *g, + uint64_t start, + uint64_t end, + const region_snapshot_t *snaps, + int n); +static int restore_region_snapshots(guest_t *g, + region_snapshot_t *snaps, + int n); + static int region_count_after_removes(const guest_t *g, const remove_range_t *ranges, int nranges) @@ -383,14 +415,76 @@ static bool region_range_overlaps(const guest_t *g, return first < g->nregions && g->regions[first].start < end; } -static int64_t sys_mmap_fixed_high_va(guest_t *g, - uint64_t addr, - uint64_t length, - int prot, - int flags, - guest_fd_t fd, - uint64_t offset, - bool is_noreplace) +static bool high_va_replaceable_region(const guest_region_t *r) +{ + return r && !region_has_live_overlay(r) && + (r->flags & LINUX_MAP_SHARED) == 0; +} + +static bool high_va_replaceable_gpa_base(guest_t *g, + uint64_t start, + uint64_t end, + uint64_t *out_gpa_base, + int *out_flags, + uint64_t *out_offset) +{ + int idx = -1; + for (int i = 0; i < g->nregions; i++) { + if (g->regions[i].end <= start) + continue; + if (g->regions[i].start > start) + return false; + idx = i; + break; + } + if (idx < 0) + return false; + + uint64_t 
cursor = start; + uint64_t gpa_cursor = 0; + bool first = true; + + for (int i = idx; i < g->nregions && cursor < end; i++) { + const guest_region_t *r = &g->regions[i]; + uint64_t seg_start = (r->start < cursor) ? cursor : r->start; + uint64_t seg_end = (r->end > end) ? end : r->end; + if (seg_start != cursor || seg_end <= seg_start || + !high_va_replaceable_region(r)) + return false; + + if (first) { + uint64_t intra_region = cursor - r->start; + gpa_cursor = r->gpa_base + intra_region; + if (out_flags) + *out_flags = r->flags; + if (out_offset) + *out_offset = r->offset + intra_region; + first = false; + } else if (r->gpa_base != gpa_cursor) { + return false; + } + + cursor = seg_end; + gpa_cursor += seg_end - seg_start; + } + + if (cursor != end || first) + return false; + + if (out_gpa_base) + *out_gpa_base = gpa_cursor - (end - start); + return true; +} + +static int64_t sys_mmap_high_va(guest_t *g, + uint64_t addr, + uint64_t length, + int prot, + int flags, + guest_fd_t fd, + uint64_t offset, + bool replace_existing, + bool is_noreplace) { int64_t ret = -LINUX_ENOMEM; @@ -403,28 +497,55 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, int host_backing_fd = -1; int track_backing_fd = -1; bool close_host_backing_fd = false; - /* High-water mark of VA installed by the mapping loop; reachable from - * the fail label so the rollback knows what to invalidate. Must be - * initialized before any goto fail that runs before the loop. + /* High-water mark of VA installed by the mapping loop; reachable from the + * fail label so the rollback knows what to invalidate. Must be initialized + * before any goto fail that runs before the loop. */ uint64_t va_installed_end = 0; - /* If a fresh block has been block-mapped (live RW/RX over the full - * 2 MiB) but has not yet had its L3 split-inherited entries zeroed, - * the rollback must clear the full block, not just [addr, addr+length). 
- * Tracks at most one in-flight fresh block at a time; UINT64_MAX means - * no in-flight fresh block needs full-scope rollback. + + /* If a fresh block has been block-mapped (live RW/RX over the full 2 MiB) + * but has not yet had its L3 split-inherited entries zeroed, the rollback + * must clear the full block, not just [addr, addr+length). Tracks at most + * one in-flight fresh block at a time; UINT64_MAX means no in-flight fresh + * block needs full-scope rollback. */ uint64_t inflight_fresh_block_va = UINT64_MAX; + uint64_t replaced_gpa_base = 0; + int replaced_flags = 0; + uint64_t replaced_offset = 0; + bool replaced_region_removed = false; + region_snapshot_t *replaced_snaps = NULL; + int replaced_nsnaps = 0; + bool replaced_ptes_modified = false; + uint8_t *map_host = NULL; + /* When the high-VA replacement reuses an existing host backing, + * populate_existing is about to clobber map_host with memset / pread + * before guest_install_va_pages and guest_region_add_ex_owned_gpa commit. + * Snapshot the original bytes so the fail path can restore them; without + * this, a late-step failure leaves the guest's old mapping pointing at + * corrupted memory. + */ + uint8_t *replaced_bytes_snap = NULL; + bool replaced_bytes_dirty = false; + + /* Sibling vCPUs may otherwise observe transient zeroes, partial file + * contents, or rollback bytes while populate_existing rewrites map_host in + * place. The overlay paths already use the same pattern; mmap_lock only + * serializes memory syscalls, not vCPU execution. Track whether the + * siblings_quiesced bracket is open so the success-return and the + * fail-path both resume. + */ + bool siblings_quiesced = false; if (!is_anon && is_shared) return -LINUX_ENODEV; /* Reject wrap before reusing addr + length anywhere below. The caller - * page-rounds length, but addr is guest-supplied and a huge length - * against a high VA can still overflow. 
Also reject the case where - * addr + length is too close to UINT64_MAX for ALIGN_UP to round up - * the 2 MiB boundary without wrapping to 0 (which would make va_end - * smaller than va_start and underflow backing_span). + * page-rounds length, but addr is guest-supplied and a huge length against + * a high VA can still overflow. Also reject the case where addr + length is + * too close to UINT64_MAX for ALIGN_UP to round up the 2 MiB boundary + * without wrapping to 0 (which would make va_end smaller than va_start and + * underflow backing_span). */ if (length == 0 || addr > UINT64_MAX - length) return -LINUX_ENOMEM; @@ -434,26 +555,74 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (guest_kbuf_user_va_overlap(addr, length)) return -LINUX_ENOMEM; + /* Set when this call enters the replace-an-existing-mapping branch + * (region_range_overlaps + replaceable + snapshots captured). Used + * everywhere the function needs to decide between the fresh-allocation + * path and the reuse-the-existing-backing path. The earlier proxy + * (replaced_gpa_base != 0) mis-classified replacements targeting a + * region backed at GPA 0 (a valid guest physical address) as fresh + * allocations, which silently bypassed the byte snapshot, region + * remove, and rollback restore work. + */ + bool replacing_existing = false; + + /* Cap the byte-snapshot allocation that populate_existing needs for + * rollback. The mapping itself can still be arbitrarily large in the + * fresh-allocation path; only the replace-an-existing branch needs the + * host-side malloc, so the cap only applies to replacement. 256 MiB is + * comfortably above realistic Rosetta dynamic-linker reservations and + * far below the multi-GiB malloc bombs a hostile guest could otherwise + * force. Reject early with -ENOMEM so the caller falls back to a smaller + * MAP_FIXED footprint rather than triggering the host OOM killer. 
+ */ + enum { HIGH_VA_SNAPSHOT_MAX = (size_t) 256 << 20 }; + + if (region_range_overlaps(g, addr, addr + length) && + length > HIGH_VA_SNAPSHOT_MAX) + return -LINUX_ENOMEM; + if (region_range_overlaps(g, addr, addr + length)) { if (is_noreplace) return -LINUX_EEXIST; - /* High-VA MAP_FIXED replacement is still limited to fresh ranges. - * Replacing partially-overlapping non-identity mappings needs a more - * complete VA-aware rollback path than the low-VA slab code uses. - */ - return -LINUX_ENOMEM; + if (!replace_existing) + return -LINUX_ENOMEM; + if (!high_va_replaceable_gpa_base(g, addr, addr + length, + &replaced_gpa_base, &replaced_flags, + &replaced_offset)) + return -LINUX_ENOMEM; + if (!region_has_capacity_after_removes( + g, &(remove_range_t) {addr, addr + length}, 1, 1)) + return -LINUX_ENOMEM; + replaced_snaps = malloc(GUEST_MAX_REGIONS * sizeof(*replaced_snaps)); + if (!replaced_snaps) + return -LINUX_ENOMEM; + replaced_nsnaps = capture_region_snapshots( + g, addr, addr + length, replaced_snaps, GUEST_MAX_REGIONS); + if (replaced_nsnaps < 0) { + free(replaced_snaps); + return replaced_nsnaps; + } + replacing_existing = true; } uint64_t va_start = ALIGN_DOWN(addr, BLOCK_2MIB); uint64_t va_end = ALIGN_UP(addr + length, BLOCK_2MIB); uint64_t backing_span = va_end - va_start; - uint64_t backing_gpa_start = ALIGN_UP( - (g->mmap_end > g->mmap_next) ? g->mmap_end : g->mmap_next, BLOCK_2MIB); - uint64_t backing_limit = - g->kbuf_gpa ? g->kbuf_gpa : (g->interp_base - INFRA_RESERVE); - if (backing_gpa_start >= backing_limit || - backing_span > backing_limit - backing_gpa_start) - return -LINUX_ENOMEM; + uint64_t backing_gpa_start = 0; + uint64_t backing_limit = 0; + + if (replacing_existing) { + backing_gpa_start = replaced_gpa_base - (addr - va_start); + } else { + backing_gpa_start = + ALIGN_UP((g->mmap_end > g->mmap_next) ? g->mmap_end : g->mmap_next, + BLOCK_2MIB); + backing_limit = + g->kbuf_gpa ? 
g->kbuf_gpa : (g->interp_base - INFRA_RESERVE); + if (backing_gpa_start >= backing_limit || + backing_span > backing_limit - backing_gpa_start) + return -LINUX_ENOMEM; + } if (!is_anon) { if (fuse_fd_refuse_mmap(fd)) { @@ -497,10 +666,17 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, int map_perms = (prot == LINUX_PROT_NONE) ? MEM_PERM_RW : prot_to_perms(prot); + if (replacing_existing) { + map_host = host_ptr_for_gpa(g, backing_gpa_start + (addr - va_start)); + if (!map_host) + goto fail; + goto populate_existing; + } + /* Mapping loop installs PT state in block-sized steps. Any L1/L2 tables - * newly allocated during this call are left in place on rollback: they - * are zero descriptors after invalidation and harmless until reused by - * a later mmap. + * newly allocated during this call are left in place on rollback: they are + * zero descriptors after invalidation and harmless until reused by a later + * mmap. */ va_installed_end = va_start; @@ -513,10 +689,10 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, memset(host, 0, BLOCK_2MIB); /* Detect freshness BEFORE guest_map_va_range so the decision is not - * confused by a prior high-VA mmap into the same 2 MiB block. A - * fresh block needs its split-inherited L3 entries zeroed so gap - * pages do not silently inherit block-level perms; a pre-existing - * block must be left alone so earlier mappings into the same block + * confused by a prior high-VA mmap into the same 2 MiB block. A fresh + * block needs its split-inherited L3 entries zeroed so gap pages do not + * silently inherit block-level perms; a pre-existing block must be left + * alone so earlier mappings into the same block * survive. 
*/ bool fresh_block = !guest_va_block_mapped(g, va); @@ -525,17 +701,16 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, goto fail; va_installed_end = va + BLOCK_2MIB; - /* Fresh blocks are live with full-2 MiB block-level perms from - * here until guest_invalidate_ptes zeros the split-inherited L3 - * entries. If split or invalidate fails in between, the rollback - * must scrub the entire block; record it for the fail path. + /* Fresh blocks are live with full-2 MiB block-level perms from here + * until guest_invalidate_ptes zeros the split-inherited L3 entries. + * If split or invalidate fails in between, the rollback must scrub + * the entire block; record it for the fail path. */ if (fresh_block) inflight_fresh_block_va = va; - /* Always split so guest_install_va_pages can write 4 KiB L3 PTEs - * for the actual mapped range; pre-existing tables make split a - * no-op. + /* Always split so guest_install_va_pages can write 4 KiB L3 PTEs for + * the actual mapped range; pre-existing tables make split a no-op. */ if (guest_split_block(g, va) < 0) goto fail; @@ -543,20 +718,49 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (fresh_block) { if (guest_invalidate_ptes(g, va, va + BLOCK_2MIB) < 0) goto fail; - /* L3 entries are zeroed; the block is no longer live at - * 2 MiB scope and the narrow rollback is sufficient. + /* L3 entries are zeroed; the block is no longer live at 2 MiB scope + * and the narrow rollback is sufficient. 
*/ inflight_fresh_block_va = UINT64_MAX; } } - uint8_t *map_host = - host_ptr_for_gpa(g, backing_gpa_start + (addr - va_start)); + map_host = host_ptr_for_gpa(g, backing_gpa_start + (addr - va_start)); if (!map_host) goto fail; - if (!is_anon && prot != LINUX_PROT_NONE) { +populate_existing: + /* Snapshot the existing host backing before the destructive write so + * a later guest_install_va_pages / guest_region_add failure can + * restore the guest's original mapping bytes from the fail path + * instead of leaving it pointing at zeroed-or-partially-written + * memory. The fresh-allocation path lands here too, but its map_host + * sits on a brand-new GPA range that no guest mapping currently + * observes, so the snapshot is only needed when replacing_existing. + */ + if (replacing_existing && (is_anon || prot != LINUX_PROT_NONE)) { + replaced_bytes_snap = malloc(length); + if (!replaced_bytes_snap) { + ret = -LINUX_ENOMEM; + goto fail; + } + /* Quiesce siblings before the snapshot read so the memcpy + * cannot see torn writes from another vCPU running guest code + * on the existing mapping, and so the destructive memset / + * pread below stays invisible to concurrent readers until the + * region tables commit (or the fail path restores the bytes). + */ + thread_quiesce_siblings(); + siblings_quiesced = true; + memcpy(replaced_bytes_snap, map_host, length); + } + + if (is_anon) { + memset(map_host, 0, length); + replaced_bytes_dirty = replacing_existing; + } else if (prot != LINUX_PROT_NONE) { memset(map_host, 0, length); + replaced_bytes_dirty = replacing_existing; uint8_t *dst = map_host; size_t remaining = length; off_t file_off = (off_t) offset; @@ -565,7 +769,12 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (nr < 0) { if (errno == EINTR) continue; - break; + /* Real host I/O failure (not EINTR); previously the loop broke + * without setting ret and the syscall returned a "successful" + * partially-zero mapping. 
+ */ + ret = linux_errno(); + goto fail; } if (nr == 0) break; @@ -575,35 +784,46 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, } } - /* Install L3 PTEs for the actual mapped range. Fresh blocks were - * fully invalidated in the loop above so their gap pages do not - * inherit block-level perms; pre-existing blocks are left untouched - * so prior high-VA mmaps into the same 2 MiB block survive. + /* Install L3 PTEs for the actual mapped range. Fresh blocks were fully + * invalidated in the loop above so their gap pages do not inherit + * block-level perms; pre-existing blocks are left untouched so prior + * high-VA mmaps into the same 2 MiB block survive. * - * PROT_NONE still needs an explicit invalidate for the requested - * pages: when the range lands inside a reused 2 MiB block, leaving - * the inherited L3 descriptors intact would make the new guard range + * PROT_NONE still needs an explicit invalidate for the requested pages: + * when the range lands inside a reused 2 MiB block, leaving the + * inherited L3 descriptors intact would make the new guard range * spuriously accessible. 
*/ if (prot == LINUX_PROT_NONE) { + replaced_ptes_modified = replacing_existing; if (guest_invalidate_ptes(g, addr, addr + length) < 0) goto fail; } else { uint64_t gpa_for_addr = backing_gpa_start + (addr - va_start); + replaced_ptes_modified = replacing_existing; if (guest_install_va_pages(g, addr, length, gpa_for_addr, prot_to_perms(prot)) < 0) goto fail; } uint64_t backing_gpa_end = backing_gpa_start + backing_span; - if (backing_gpa_end > g->mmap_next) - g->mmap_next = backing_gpa_end; - if (backing_gpa_end > g->mmap_end) - g->mmap_end = backing_gpa_end; + if (!replacing_existing) { + if (backing_gpa_end > g->mmap_next) + g->mmap_next = backing_gpa_end; + if (backing_gpa_end > g->mmap_end) + g->mmap_end = backing_gpa_end; + } uint64_t gpa_base = backing_gpa_start + (addr - va_start); - if (!region_has_capacity_after_removes(g, NULL, 0, 1)) + if (!region_has_capacity_after_removes( + g, + replacing_existing ? &(remove_range_t) {addr, addr + length} : NULL, + replacing_existing ? 1 : 0, 1)) goto fail; + if (replacing_existing) { + guest_region_remove(g, addr, addr + length); + replaced_region_removed = true; + } if (guest_region_add_ex_owned_gpa(g, addr, addr + length, gpa_base, prot, flags, offset, NULL, track_backing_fd) < 0) @@ -612,18 +832,36 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (close_host_backing_fd && host_backing_fd >= 0) close(host_backing_fd); host_fd_ref_close(&backing_ref); + if (replaced_snaps) { + close_region_snapshots(replaced_snaps, replaced_nsnaps); + free(replaced_snaps); + } + if (replaced_bytes_snap) { + free(replaced_bytes_snap); + replaced_bytes_snap = NULL; + } + if (siblings_quiesced) + thread_resume_siblings(); return (int64_t) addr; fail: - /* Roll back PT state installed by this call. The success path - * preserves pre-existing 2 MiB blocks (so prior high-VA mmaps in the - * same block survive); the rollback must respect that same - * invariant. 
Two cases: + /* If populate_existing already overwrote the original mapping's bytes, the + * snapshot has the pre-replacement contents; copy them back before any + * later cleanup so the guest's old mapping comes out of rollback pointing + * at the same data it had before this call. The snapshot is freed + * unconditionally below. + */ + if (replaced_bytes_dirty && replaced_bytes_snap && map_host) + memcpy(map_host, replaced_bytes_snap, length); + + /* Roll back PT state installed by this call. The success path preserves + * pre-existing 2 MiB blocks (so prior high-VA mmaps in the same block + * survive); the rollback must respect that same invariant. Two cases: * - * 1. An in-flight fresh block: block-mapped at full-2 MiB perms but - * not yet invalidated. Zero the entire 2 MiB so no stray RW/RX - * mapping survives across the failure. + * 1. An in-flight fresh block: block-mapped at full-2 MiB perms but not + * yet invalidated. Zero the entire 2 MiB so no stray RW/RX mapping + * survives across the failure. * 2. 
The requested subrange [addr, addr+length): pre-existing * blocks and completed fresh blocks were only ever written * inside this range by guest_install_va_pages, so a narrow @@ -639,7 +877,7 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (guest_invalidate_ptes(g, inflight_fresh_block_va, inflight_fresh_block_va + BLOCK_2MIB) < 0) { log_error( - "sys_mmap_fixed_high_va: rollback invalidate failed for " + "sys_mmap_high_va: rollback invalidate failed for " "fresh block [0x%llx, 0x%llx)", (unsigned long long) inflight_fresh_block_va, (unsigned long long) (inflight_fresh_block_va + BLOCK_2MIB)); @@ -648,7 +886,7 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, if (va_installed_end > va_start) { if (guest_invalidate_ptes(g, addr, addr + length) < 0) { log_error( - "sys_mmap_fixed_high_va: rollback invalidate failed for " + "sys_mmap_high_va: rollback invalidate failed for " "VA [0x%llx, 0x%llx)", (unsigned long long) addr, (unsigned long long) (addr + length)); @@ -656,9 +894,48 @@ static int64_t sys_mmap_fixed_high_va(guest_t *g, } if (track_backing_fd >= 0) close(track_backing_fd); + /* Restore region/PTE snapshots when this call mutated regions[] or + * the page tables; otherwise just drop the snapshot allocation. + * Whichever path runs, the common cleanup below frees snapshots + * and fds and resumes siblings, so a restore failure only needs + * to override the returned errno -- it must not skip cleanup. + * (Earlier code used replaced_gpa_base != 0 as the proxy for + * "replacing existing", which mis-classified replacements over a + * GPA-0-backed region; replacing_existing is now set explicitly + * when snapshots are captured.) 
+ */ + if (replaced_snaps && replacing_existing && replaced_region_removed) { + int restore_err = + restore_region_snapshots(g, replaced_snaps, replaced_nsnaps); + if (restore_err == 0 && replaced_ptes_modified) + restore_err = restore_snapshot_page_tables( + g, addr, addr + length, replaced_snaps, replaced_nsnaps); + if (restore_err < 0) + ret = restore_err; + } else if (replaced_snaps && replacing_existing && replaced_ptes_modified) { + int restore_err = restore_snapshot_page_tables( + g, addr, addr + length, replaced_snaps, replaced_nsnaps); + if (restore_err < 0) + ret = restore_err; + else + (void) restore_snapshot_overlays_in_place(g, replaced_snaps, + replaced_nsnaps); + } + if (replaced_snaps) { + close_region_snapshots(replaced_snaps, replaced_nsnaps); + free(replaced_snaps); + } + if (replaced_bytes_snap) + free(replaced_bytes_snap); if (close_host_backing_fd && host_backing_fd >= 0) close(host_backing_fd); host_fd_ref_close(&backing_ref); + /* Close the siblings_quiesced bracket as the very last step, so + * the byte restore + region/PTE restore + fd cleanup all complete + * before any sibling vCPU resumes guest execution. 
+ */ + if (siblings_quiesced) + thread_resume_siblings(); return ret; } @@ -737,20 +1014,6 @@ static int restore_file_overlay_range(guest_t *g, return 0; } -typedef struct { - uint64_t start; - uint64_t end; - uint64_t gpa_base; - int prot; - int flags; - uint64_t offset; - int backing_fd; - bool overlay_active; - uint64_t overlay_start; - uint64_t overlay_end; - char name[sizeof(((guest_region_t *) 0)->name)]; -} region_snapshot_t; - typedef struct { uint64_t overlay_start; uint64_t overlay_len; @@ -1532,8 +1795,8 @@ int64_t sys_mmap(guest_t *g, return -LINUX_ENOMEM; if (addr >= g->guest_size) - return sys_mmap_fixed_high_va(g, addr, length, prot, flags, fd, - offset, is_noreplace); + return sys_mmap_high_va(g, addr, length, prot, flags, fd, offset, + true, is_noreplace); /* High-VA MAP_FIXED (rosetta's JIT slabs at 240 TiB, code caches * at 85 TiB, etc.) is not safe to expose yet. The previous draft @@ -1737,20 +2000,51 @@ int64_t sys_mmap(guest_t *g, uint8_t *dst = (uint8_t *) g->host_base + result_off; size_t remaining = length; off_t file_off = offset; + bool read_io_err = false; + int saved_errno = 0; while (remaining > 0) { ssize_t nr = pread(host_backing_fd, dst, remaining, file_off); if (nr < 0) { if (errno == EINTR) continue; - break; /* partial read is acceptable (zeroed tail) */ + /* Real host I/O error (not EINTR). EOF zero- + * fill stays an accepted outcome (nr == 0 + * below); an I/O failure returning a + * "successful" partially-zero mapping is not. + * Restore the prior region/PTE state and + * surface the errno to the caller. 
+ */ + read_io_err = true; + saved_errno = errno; + break; } if (nr == 0) - break; /* EOF */ + break; /* EOF; remaining bytes stay zeroed */ dst += nr; remaining -= (size_t) nr; file_off += nr; } + if (read_io_err) { + int restore_err = restore_region_snapshots( + g, replaced_snaps, replaced_nsnaps); + if (restore_err == 0) + restore_err = restore_snapshot_page_tables( + g, result_off, result_off + length, replaced_snaps, + replaced_nsnaps); + if (track_backing_fd >= 0) + close(track_backing_fd); + if (restore_err < 0) { + dispose_region_snapshots(&replaced_snaps, + &replaced_nsnaps); + host_fd_ref_close(&backing_ref); + return restore_err; + } + dispose_region_snapshots(&replaced_snaps, &replaced_nsnaps); + host_fd_ref_close(&backing_ref); + errno = saved_errno; + return linux_errno(); + } } } else { /* Restore slab backing under any pre-existing MAP_SHARED file @@ -1788,6 +2082,13 @@ int64_t sys_mmap(guest_t *g, * file backing once the final guest range is known. */ if (!is_fixed) { + if (g->is_rosetta && addr >= g->guest_size && + addr <= 0x0000FFFFFFFFFFFFULL) { + int64_t high_hint = sys_mmap_high_va(g, addr, length, prot, flags, + fd, offset, false, false); + if (high_hint >= 0) + return high_hint; + } if (needs_exec && !(prot & LINUX_PROT_WRITE)) { /* PROT_EXEC without PROT_WRITE: allocate from the RX mmap region. * Apple HVF enforces W^X on 2MiB block page table entries, so @@ -2001,12 +2302,14 @@ int64_t sys_mmap(guest_t *g, size_t remaining = length; off_t file_off = offset; bool read_err = false; + int saved_errno = 0; while (remaining > 0) { ssize_t nr = pread(host_backing_fd, dst, remaining, file_off); if (nr < 0) { if (errno == EINTR) continue; read_err = true; + saved_errno = errno; break; } if (nr == 0) @@ -2015,8 +2318,14 @@ int64_t sys_mmap(guest_t *g, remaining -= (size_t) nr; file_off += nr; } - if (read_err && remaining == length) { - /* Total failure (no bytes read). Undo the mapping. 
*/ + if (read_err) { + /* Any host I/O error (total OR partial) is fatal. The + * previous "remaining == length" gate silently kept + * partial-read mappings with a zeroed tail when an + * I/O error fired mid-stream, which made truncated + * file contents visible to the guest as a successful + * mmap. + */ int rollback_err = rollback_fresh_mmap_allocation( g, result_off, length, false, 0, 0, saved_mmap_next, saved_mmap_end, saved_mmap_rx_next, saved_mmap_rx_end, @@ -2026,6 +2335,7 @@ int64_t sys_mmap(guest_t *g, host_fd_ref_close(&backing_ref); if (rollback_err < 0) return rollback_err; + errno = saved_errno; return linux_errno(); } } @@ -2702,23 +3012,15 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice) uint64_t zend = (r->end < end) ? r->end : end; memset((uint8_t *) g->host_base + zstart, 0, zend - zstart); if (!(r->flags & LINUX_MAP_ANONYMOUS)) { - uint64_t file_off = r->offset + (zstart - r->start); - uint8_t *dst = (uint8_t *) g->host_base + zstart; - size_t remaining = zend - zstart; - while (remaining > 0) { - ssize_t nr = - pread(r->backing_fd, dst, remaining, (off_t) file_off); - if (nr < 0) { - if (errno == EINTR) - continue; - return linux_errno(); - } - if (nr == 0) - break; /* EOF: tail stays zero per mmap rules. */ - dst += nr; - file_off += nr; - remaining -= (size_t) nr; - } + /* EOF leaves the tail zero per mmap rules; the helper + * already returns 0 in that case after stopping the + * read loop. + */ + int err = read_file_range_to_guest( + g, zstart, r->backing_fd, r->offset + (zstart - r->start), + zend - zstart); + if (err < 0) + return err; } } return 0; @@ -2758,7 +3060,7 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice) case LINUX_MADV_PAGEOUT: /* Advisory hints: accept silently. Linux walks vmas and returns * -ENOMEM for any unmapped sub-range; mirror that for fidelity. 
- * No host swap means PAGEOUT/COLD do not actually evict — keeping + * No host swap means PAGEOUT/COLD do not actually evict -- keeping * data in place is a stricter guarantee than Linux's. */ if (!madvise_range_mapped(g, off, length)) @@ -3005,7 +3307,10 @@ static int64_t pwrite_all_at(int fd, while (len > 0) { size_t chunk = len > (uint64_t) SSIZE_MAX ? (size_t) SSIZE_MAX : (size_t) len; - ssize_t nw = pwrite(fd, src, chunk, (off_t) file_off); + ssize_t nw; + do { + nw = pwrite(fd, src, chunk, (off_t) file_off); + } while (nw < 0 && errno == EINTR); if (nw < 0) return linux_errno(); if (nw == 0) @@ -3032,8 +3337,10 @@ static int64_t sync_shared_aliases_range(guest_t *g, size_t chunk_len = (size_t) (chunk_end - chunk_start); memset(original, 0, chunk_len); - ssize_t nr = - pread(backing_fd, original, chunk_len, (off_t) chunk_start); + ssize_t nr; + do { + nr = pread(backing_fd, original, chunk_len, (off_t) chunk_start); + } while (nr < 0 && errno == EINTR); if (nr < 0) return linux_errno(); @@ -3104,7 +3411,10 @@ static int64_t refresh_shared_region_range(guest_t *g, while (len > 0) { size_t chunk = len > (uint64_t) SSIZE_MAX ? (size_t) SSIZE_MAX : (size_t) len; - ssize_t nr = pread(r->backing_fd, buf, chunk, (off_t) file_off); + ssize_t nr; + do { + nr = pread(r->backing_fd, buf, chunk, (off_t) file_off); + } while (nr < 0 && errno == EINTR); if (nr < 0) return linux_errno(); if (nr == 0) diff --git a/src/syscall/poll.c b/src/syscall/poll.c index 9509fa6..09507c5 100644 --- a/src/syscall/poll.c +++ b/src/syscall/poll.c @@ -779,7 +779,7 @@ int64_t sys_epoll_ctl(guest_t *g, int epfd, int op, int fd, uint64_t event_gva) * deletes even when oneshot_armed: with multi-filter EPOLLONESHOT, only * the filter that fired was removed by EV_ONESHOT; the other filter is * still registered and must be cleaned. 
Issue each delete in its own - * kevent call so an ENOENT on one filter does not abort the other — + * kevent call so an ENOENT on one filter does not abort the other -- * with a single batched call and NULL eventlist, kevent stops at the * first failed change and leaks the survivor. */ diff --git a/src/syscall/proc.c b/src/syscall/proc.c index 51ea50f..bfa640e 100644 --- a/src/syscall/proc.c +++ b/src/syscall/proc.c @@ -1007,7 +1007,7 @@ static void drain_external_guest_signal(void) } } -/* HVC #4 (set sysreg) register index → hv_sys_reg_t mapping. +/* HVC #4 (set sysreg) register index -> hv_sys_reg_t mapping. * Index must match the encoding the shim writes to X0 in shim.S; out-of-range * IDs trip the HVC #4 default branch in vcpu_run_loop(). */ diff --git a/src/syscall/proc.h b/src/syscall/proc.h index 6b1e161..9dff428 100644 --- a/src/syscall/proc.h +++ b/src/syscall/proc.h @@ -86,7 +86,7 @@ const char *proc_get_elfuse_path(void); void proc_set_rosetta_enabled(bool enabled); bool proc_rosetta_enabled(void); -/* Runtime indicator: true once the guest_t has been initialised in rosetta +/* Runtime indicator: true once the guest_t has been initialized in rosetta * mode. Distinct from proc_rosetta_enabled which reflects the user opt-in. * Code paths that lack direct guest_t access (proc_intercept_readlink) can * branch on the runtime state without threading g through every signature. diff --git a/src/syscall/sys.c b/src/syscall/sys.c index ee4cde2..9166850 100644 --- a/src/syscall/sys.c +++ b/src/syscall/sys.c @@ -146,7 +146,7 @@ static void sysinfo_refresh_cached_locked(time_t now_sec) } } - /* Load averages (× 65536 for fixed-point). */ + /* Load averages (x 65536 for fixed-point). 
*/ double loadavg[3]; if (getloadavg(loadavg, 3) == 3) { cached_sysinfo.loads[0] = (uint64_t) (loadavg[0] * 65536.0); diff --git a/src/syscall/syscall.c b/src/syscall/syscall.c index 4548021..68cad6d 100644 --- a/src/syscall/syscall.c +++ b/src/syscall/syscall.c @@ -71,10 +71,9 @@ * meaningful (Apple Silicon TSO toggle); older platforms simply leave actlr * at 0, which falls through to PR_SET_MEM_MODEL_DEFAULT. * - * The guard checks the SDK version rather than the macro presence: on - * macOS 15+ the symbol is an enumerator (not a #define), so a plain - * #ifndef would always fire and shadow the SDK name with a macro of the - * same spelling. + * The guard checks the SDK version rather than the macro presence: on macOS 15+ + * the symbol is an enumerator (not a #define), so a plain #ifndef would always + * fire and shadow the SDK name with a macro of the same spelling. */ #if __MAC_OS_X_VERSION_MAX_ALLOWED < 150000 #define HV_SYS_REG_ACTLR_EL1 ((hv_sys_reg_t) 0xc081) @@ -99,18 +98,18 @@ void syscall_init(void) wakeup_pipe_init(); } -/* Memory syscall implementations (sys_brk, sys_mmap, sys_mremap, etc.) - * are in syscall/mem.c. FD table in syscall/fdtable.c. Errno/flag - * translation in syscall/translate.c. +/* Memory syscall implementations (sys_brk, sys_mmap, sys_mremap, etc.) are in + * syscall/mem.c. FD table in syscall/fdtable.c. Errno/flag translation in + * syscall/translate.c. */ /* Syscall handler table. */ -/* Each sc_xxx wrapper adapts one (or a group of fall-through) case(s) from - * the old switch into a uniform signature. Returns int64_t result, or a - * sentinel: SYSCALL_EXEC_HAPPENED for exec/sigreturn, (INT64_MIN | code) - * for exit/exit_group. Wrappers that need mmap_lock acquire it internally. - * Wrappers that need the vCPU handle use current_thread->vcpu. +/* Each sc_xxx wrapper adapts one (or a group of fall-through) case(s) from the + * old switch into a uniform signature. 
Returns int64_t result, or a sentinel: + * SYSCALL_EXEC_HAPPENED for exec/sigreturn, (INT64_MIN | code) for + * exit/exit_group. Wrappers that need mmap_lock acquire it internally. Wrappers + * that need the vCPU handle use current_thread->vcpu. */ typedef int64_t (*syscall_handler_t)(guest_t *g, uint64_t x0, @@ -122,40 +121,40 @@ typedef int64_t (*syscall_handler_t)(guest_t *g, bool verbose); /* Exit sentinel: high bits mark this as an exit, low 8 bits carry the code. - * Cannot collide with SYSCALL_EXEC_HAPPENED (-0x10000): exit sentinel - * has INT64_MIN's sign bit set, exec sentinel does not. + * Cannot collide with SYSCALL_EXEC_HAPPENED (-0x10000): exit sentinel has + * INT64_MIN's sign bit set, exec sentinel does not. */ #define SC_EXIT_SENTINEL INT64_MIN /* Wrapper macros. * - * These macros eliminate the boilerplate of ~120 sc_xxx wrappers that - * follow one of three patterns: + * These macros eliminate the boilerplate of ~120 sc_xxx wrappers that follow + * one of three patterns: * * SC_FORWARD(name, expr): cast args and forward to sys_xxx / signal_xxx * SC_LOCKED(name, expr): same, but hold mmap_lock during the call * SC_STUB(name, val): return a constant (alias for SC_FORWARD) * * All parameters are marked (void) to suppress -Wunused-parameter. - * The body expression may reference g, x0–x5, and verbose freely. + * The body expression may reference g, x0-x5, and verbose freely. 
*/ /* clang-format off */ #define SC_FORWARD(name, body) \ - static int64_t name(guest_t *g, uint64_t x0, uint64_t x1, uint64_t x2, \ + static int64_t name(guest_t *g, uint64_t x0, uint64_t x1, uint64_t x2, \ uint64_t x3, uint64_t x4, uint64_t x5, bool verbose) \ { \ - (void) g; (void) x0; (void) x1; (void) x2; \ - (void) x3; (void) x4; (void) x5; (void) verbose; \ + (void) g; (void) x0; (void) x1; (void) x2; \ + (void) x3; (void) x4; (void) x5; (void) verbose; \ return (body); \ } #define SC_LOCKED(name, body) \ - static int64_t name(guest_t *g, uint64_t x0, uint64_t x1, uint64_t x2, \ + static int64_t name(guest_t *g, uint64_t x0, uint64_t x1, uint64_t x2, \ uint64_t x3, uint64_t x4, uint64_t x5, bool verbose) \ { \ - (void) g; (void) x0; (void) x1; (void) x2; \ - (void) x3; (void) x4; (void) x5; (void) verbose; \ + (void) g; (void) x0; (void) x1; (void) x2; \ + (void) x3; (void) x4; (void) x5; (void) verbose; \ pthread_mutex_lock(&mmap_lock); \ int64_t r = (body); \ pthread_mutex_unlock(&mmap_lock); \ @@ -363,22 +362,22 @@ SC_FORWARD(sc_futex, sys_futex(g, x0, (int) x1, (uint32_t) x2, x3, x4, (uint32_t /* Sync. * - * Linux sync(2) flushes all dirty buffers. Forwarding to host sync() - * stalls because the guest slab is mmap'd MAP_SHARED to an internal - * tempfile (g->shm_fd) for the CoW fork fast path: a global flush has - * to walk multi-GB of demand-paged dirty pages from that tempfile, plus - * the same from any other elfuse process running on the host. The slab - * tempfile is implementation detail; the guest never opened it. Iterate - * the guest fd table and the region overlay backing fds, dup each under - * its lock, release the lock, and fsync the dups outside any guest lock - * so a slow disk cannot stall concurrent mmap/fd operations on other - * threads. fsync on non-regular fds returns EINVAL on macOS, which is - * benign and ignored. Always returns 0 to mirror sync(2)'s "void" spirit. + * Linux sync(2) flushes all dirty buffers. 
Forwarding to host sync() stalls + * because the guest slab is mmap'd MAP_SHARED to an internal tempfile + * (g->shm_fd) for the CoW fork fast path: a global flush has to walk multi-GB + * of demand-paged dirty pages from that tempfile, plus the same from any other + * elfuse process running on the host. The slab tempfile is an implementation + * detail; the guest never opened it. Iterate the guest fd table and the region + * overlay backing fds, dup each under its lock, release the lock, and fsync the + * dups outside any guest lock so a slow disk cannot stall concurrent mmap/fd + * operations on other threads. fsync on non-regular fds returns EINVAL on + * macOS, which is benign and ignored. Always returns 0 to mirror sync(2)'s + * "void" spirit. */ -/* Inline fallback: under malloc failure the bulk-dup path cannot proceed, - * so iterate one fd at a time, dupping under the matching lock and fsync - * outside it. Slower (acquires/releases fd_lock per regular fd) but keeps - * sync(2) honest under memory pressure instead of silently no-opping. +/* Inline fallback: under malloc failure the bulk-dup path cannot proceed, so + * iterate one fd at a time, dupping under the matching lock and fsync outside + * it. Slower (acquires/releases fd_lock per regular fd) but keeps sync(2) + * honest under memory pressure instead of silently no-opping. */ static void sc_sync_fdtable_inline(void) { @@ -399,9 +398,9 @@ static void sc_sync_fdtable_inline(void) static void sc_sync_regions_inline(guest_t *g) { /* Region count can change under us once mmap_lock is released, so - * resnapshot under the lock each iteration; the i index is a live - * cursor into g->regions[] so a concurrent insertion (always at the - * sorted position) cannot make us skip an entry permanently. + * resnapshot under the lock each iteration; the i index is a live cursor + * into g->regions[] so a concurrent insertion (always at the sorted + * position) cannot make us skip an entry permanently. 
*/ for (int i = 0;; i++) { pthread_mutex_lock(&mmap_lock); @@ -1727,6 +1726,7 @@ int syscall_dispatch(hv_vcpu_t vcpu, guest_t *g, int *exit_code, bool verbose) if (tp != FD_REGULAR && tp != FD_STDIO && tp != FD_PIPE && tp != FD_SOCKET) goto slow_path; + /* Proc-backed fds may need synthetic read/write handling (for * example, oom_* rereads recompute content on each read and proc * dirfds steer relative *at() resolution). Keep them on the slow diff --git a/src/utils.h b/src/utils.h index 1a1fc08..c0c3a7d 100644 --- a/src/utils.h +++ b/src/utils.h @@ -151,7 +151,7 @@ static inline int fd_set_nonblock(int fd) /* Carry overflow/underflow between tv_nsec and tv_sec so the result is a * canonical timespec with 0 <= tv_nsec < 1e9. Uses div/mod (which truncate * toward zero in C99) plus a single borrow so the LONG_MIN case never - * negates tv_nsec — that would be undefined behavior. + * negates tv_nsec -- that would be undefined behavior. * * NSEC_PER_SEC is also defined by mach/clock_types.h and dispatch/time.h * on macOS; the guard avoids redefinition warnings when those system @@ -200,7 +200,7 @@ static inline uint64_t bit_mask64_low(unsigned int n) return n >= 64 ? UINT64_MAX : (BIT64(n) - 1); } -/* Position of the lowest set bit. word must be non-zero — __builtin_ctzll +/* Position of the lowest set bit. word must be non-zero -- __builtin_ctzll * is undefined on zero. Range: 0..63. */ static inline int bit_ctz64(uint64_t word) diff --git a/tests/bench-rosetta.sh b/tests/bench-rosetta.sh index 48e031a..4886e44 100755 --- a/tests/bench-rosetta.sh +++ b/tests/bench-rosetta.sh @@ -29,7 +29,7 @@ esac FIXTURES="${FIXTURES_DIR:-externals/test-fixtures}" STATICBIN_LONG="${FIXTURES}/x86_64-musl/staticbin/bin" -ROSETTA_PATH=/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta +ROSETTA_PATH="${MATRIX_ROSETTA_TRANSLATOR:-/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta}" SHORTDIR=/tmp/elfuse-br if [ ! 
-x "$ROSETTA_PATH" ]; then diff --git a/tests/fetch-fixtures.sh b/tests/fetch-fixtures.sh index 36751c5..6edc295 100755 --- a/tests/fetch-fixtures.sh +++ b/tests/fetch-fixtures.sh @@ -53,14 +53,16 @@ INITRAMFS="${FIXTURES}/initramfs.cpio.gz" # Pinned package versions (Alpine 3.21). When bumping ALPINE_VERSION, refresh # these by querying the repo's APKINDEX. declare -A PKGS=( - ["main:linux-virt"]="6.12.90-r0" + ["main:linux-virt"]="6.12.91-r0" ["main:busybox-static"]="1.37.0-r14" ["main:dropbear"]="2024.86-r0" ["main:zlib"]="1.3.2-r0" ["main:utmps-libs"]="0.1.2.3-r2" ["main:skalibs-libs"]="2.14.3.0-r0" ["main:musl"]="1.2.5-r11" + ["main:musl-dev"]="1.2.5-r11" ["main:musl-utils"]="1.2.5-r11" + ["main:libgcc"]="14.2.0-r4" ["main:libcrypto3"]="3.3.7-r0" ["main:acl-libs"]="2.3.2-r1" ["main:libattr"]="2.5.2-r2" @@ -82,6 +84,7 @@ declare -A PKGS=( ["main:ncurses-terminfo-base"]="6.5_p20241006-r3" ["main:lua5.4"]="5.4.7-r0" ["main:lua5.4-libs"]="5.4.7-r0" + ["main:luajit"]="2.1_p20240815-r0" ["main:jq"]="1.7.1-r0" ["main:oniguruma"]="6.9.9-r0" ["main:sqlite"]="3.48.0-r4" @@ -91,7 +94,7 @@ declare -A PKGS=( # Subset whose binaries are exposed as standalone "static-bins" suite paths. # Most are dynamic but link only against musl/zlib/etc., already in rootfs/. -# Applet list (hardcoded — busybox 1.37 inventory). Busybox does not have +# Applet list (hardcoded -- busybox 1.37 inventory). Busybox does not have # b2sum / numfmt / base32; those tests fall through to the dynamic-coreutils # suite where the real coreutils binary is available. STATIC_APPLETS=( @@ -212,10 +215,10 @@ main() cp -R "$modstage/lib/modules" "$ROOTFS/lib/" 2> /dev/null rm -rf "$modstage" - # /init (custom — no openrc, just bring up minimum services for ssh). + # /init (custom -- no openrc, just bring up minimum services for ssh). cat > "${ROOTFS}/init" << 'EOF' #!/bin/sh -# Custom init — sets up enough for dropbear ssh + 9p shared mounts. 
+# Custom init -- sets up enough for dropbear ssh + 9p shared mounts. set +e exec /dev/console 2>&1 @@ -243,7 +246,7 @@ ifconfig lo 127.0.0.1 up ifconfig eth0 10.0.2.15 netmask 255.255.255.0 up || echo "qemu-runner: eth0 up failed" route add default gw 10.0.2.2 2>/dev/null -# Dropbear — pre-baked host keys, pubkey auth only (passwords disabled). +# Dropbear -- pre-baked host keys, pubkey auth only (passwords disabled). mkdir -p /etc/dropbear /var/empty /var/log chmod 700 /root /root/.ssh chown -R 0:0 /root diff --git a/tests/fixtures/rosetta/README.md b/tests/fixtures/rosetta/README.md new file mode 100644 index 0000000..8a73e1a --- /dev/null +++ b/tests/fixtures/rosetta/README.md @@ -0,0 +1,45 @@ +Rosetta x86_64 test fixtures vendored for self-contained matrix coverage. + +- `x86_64-rosetta-audit` + - static x86_64 Linux ELF built from `tests/x86_64-rosetta-audit.c` +- `x86_64-rosetta-tls0` + - static x86_64 Linux ELF built from `tests/x86_64-rosetta-tls0.c` +- `x86_64-glibc-rootfs.tar.gz` + - minimal x86_64 glibc rootfs used by `tests/test-rosetta-glibc.sh` + - contains `hello-dynamic`, `dlopen-probe`, `tls-probe`, + `gdtls-probe`, `pthread-tls-probe`, the glibc loader, `libc.so.6`, + `libm.so.6`, and `libgdtls.so` + - `hello-dynamic` built from `tests/x86_64-glibc-hello.c` + - `dlopen-probe` built from `tests/x86_64-glibc-dlopen.c` + - `tls-probe` built from `tests/x86_64-glibc-tls.c` + - `gdtls-probe` built from `tests/x86_64-glibc-gdtls.c` + - `libgdtls.so` built from `tests/x86_64-glibc-gdtls-lib.c` + - `pthread-tls-probe` built from `tests/x86_64-glibc-pthread-tls.c` + +These fixtures exist so `make test-rosetta-all` and +`bash tests/test-matrix.sh elfuse-x86_64` do not require a private build host, +`ld.lld`, or an ad hoc local cross-toolchain. + +The cited `tests/*.c` sources are not wired into any in-tree build rule +(the elfuse Makefile builds aarch64 host binaries; these fixtures are +x86_64 Linux ELFs). 
When one of them changes, the binary has to be +rebuilt out of tree on an x86_64 Linux host and the result re-vendored +here. Rough recipe: + +``` +# On an x86_64 Linux host with gcc + the matching glibc dev headers: +gcc -O2 -o hello-dynamic tests/x86_64-glibc-hello.c +gcc -O2 -o dlopen-probe tests/x86_64-glibc-dlopen.c -ldl +gcc -O2 -o tls-probe tests/x86_64-glibc-tls.c +gcc -O2 -fPIC -shared -o libgdtls.so tests/x86_64-glibc-gdtls-lib.c +gcc -O2 -o gdtls-probe tests/x86_64-glibc-gdtls.c -ldl +gcc -O2 -pthread -o pthread-tls-probe tests/x86_64-glibc-pthread-tls.c +gcc -O2 -static -o x86_64-rosetta-audit tests/x86_64-rosetta-audit.c +gcc -O2 -static -o x86_64-rosetta-tls0 tests/x86_64-rosetta-tls0.c +# Stage the matching ld.so / libc.so.6 / libm.so.6 from the same host +# into a rootfs/ tree alongside libgdtls.so under lib/x86_64-linux-gnu/, +# then tar -czf x86_64-glibc-rootfs.tar.gz rootfs/. +``` + +The two static audit fixtures and the rootfs tarball then drop into +this directory verbatim. diff --git a/tests/fixtures/rosetta/x86_64-glibc-rootfs.tar.gz b/tests/fixtures/rosetta/x86_64-glibc-rootfs.tar.gz new file mode 100644 index 0000000..1029278 Binary files /dev/null and b/tests/fixtures/rosetta/x86_64-glibc-rootfs.tar.gz differ diff --git a/tests/fixtures/rosetta/x86_64-rosetta-audit b/tests/fixtures/rosetta/x86_64-rosetta-audit new file mode 100755 index 0000000..39ee37d Binary files /dev/null and b/tests/fixtures/rosetta/x86_64-rosetta-audit differ diff --git a/tests/fixtures/rosetta/x86_64-rosetta-tls0 b/tests/fixtures/rosetta/x86_64-rosetta-tls0 new file mode 100755 index 0000000..923a18e Binary files /dev/null and b/tests/fixtures/rosetta/x86_64-rosetta-tls0 differ diff --git a/tests/lib/rosetta-test.sh b/tests/lib/rosetta-test.sh new file mode 100644 index 0000000..b49a750 --- /dev/null +++ b/tests/lib/rosetta-test.sh @@ -0,0 +1,72 @@ +# Shared reporting helpers for the tests/test-rosetta-*.sh scripts. 
+# +# Copyright 2026 elfuse contributors +# SPDX-License-Identifier: Apache-2.0 +# +# shellcheck shell=bash +# +# Sources tests/lib/test-runner.sh and exposes report_pass / report_fail +# / report_skip on top of test_report so per-binary output matches the +# matrix runner's aarch64 format ([ OK ] / [ FAIL ] / [ SKIP ] aligned +# to TEST_LABEL_WIDTH). Each Rosetta script still owns its pass/fail +# /skip/total counters; this lib only centralizes the report sites and +# the trailing Results: summary line that tests/test-matrix.sh scrapes. + +# Align the LABEL column with tests/test-matrix.sh so the aggregated +# matrix output looks uniform across aarch64 and x86_64 modes. +: "${TEST_LABEL_WIDTH:=45}" + +_rosetta_test_lib_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=tests/lib/test-runner.sh +. "${_rosetta_test_lib_dir}/test-runner.sh" + +# report_pass / report_fail / report_skip accept a single label argument +# matching the original Rosetta helpers' single-string contract; the +# whole label (including any embedded "rc=..." detail) is shown in the +# LABEL column so existing call sites do not need a manual split. + +report_pass() +{ + test_report ok "$1" + pass=$((pass + 1)) +} + +report_fail() +{ + test_report fail "$1" + fail=$((fail + 1)) +} + +report_skip() +{ + test_report skip "$1" + skip=$((skip + 1)) +} + +# Emit the canonical Results line that tests/test-matrix.sh's +# suite_summary_fields regex consumes. Optional first argument +# overrides the (of N) field when the script tracks total +# independently of pass+fail+skip (the existing Rosetta scripts +# do, because in-script skips do not bump total). +report_summary() +{ + local total="${1:-$((pass + fail + skip))}" + printf '\n' + printf 'Results: %s passed, %s failed, %s skipped (of %s)\n' \ + "$pass" "$fail" "$skip" "$total" +} + +# Locate timeout(1) on macOS hosts: not built in, but Homebrew coreutils +# ships it as 'timeout' (and the legacy 'gtimeout' alias). 
Sets the +# TIMEOUT shell variable in the caller. On failure, prints an install +# hint and exits 77 (suite-skip), matching what every per-script copy +# of this block used to do. +require_timeout() +{ + TIMEOUT="$(command -v timeout 2> /dev/null \ + || command -v gtimeout 2> /dev/null || true)" + if [ -z "$TIMEOUT" ]; then + printf 'timeout(1) not found in PATH; install via: brew install coreutils\n' >&2 + exit 77 + fi +} diff --git a/tests/lib/test-runner.sh b/tests/lib/test-runner.sh index a0892a9..40b97bf 100644 --- a/tests/lib/test-runner.sh +++ b/tests/lib/test-runner.sh @@ -10,10 +10,10 @@ : "${TEST_LABEL_WIDTH:=14}" : "${TEST_TIMEOUT:=10}" -# Resolve a working `timeout` binary. macOS doesn't ship one, so fall back to +# Resolve a working 'timeout' binary. macOS doesn't ship one, so fall back to # GNU coreutils' gtimeout. Wrap as a function so callers keep using the bare -# name `timeout`. Resolution order: TIMEOUT_BIN env override, `timeout` on -# PATH, `gtimeout` on PATH, then Homebrew's stable opt symlinks for ARM and +# name 'timeout'. Resolution order: TIMEOUT_BIN env override, 'timeout' on +# PATH, 'gtimeout' on PATH, then Homebrew's stable opt symlinks for ARM and # Intel macOS (the install prefix differs between the two). if [ -n "${TIMEOUT_BIN:-}" ]; then timeout() @@ -34,7 +34,7 @@ elif ! command -v timeout > /dev/null 2>&1; then done fi if [ -n "$_timeout_bin" ]; then - # shellcheck disable=SC2317 # Invoked indirectly via `timeout` callers. + # shellcheck disable=SC2317 # Invoked indirectly via 'timeout' callers. eval "timeout() { \"$_timeout_bin\" \"\$@\"; }" else echo "test-runner: no 'timeout' or 'gtimeout' in PATH." >&2 @@ -137,7 +137,7 @@ run() return fi - # Wrap every invocation in `timeout` so a hanging guest tool cannot + # Wrap every invocation in 'timeout' so a hanging guest tool cannot # freeze the entire suite. 
run_pipe and run_timeout already do this; # the omission here used to let a deadlocked elfuse syscall path # hang make check forever. diff --git a/tests/qemu-runner.sh b/tests/qemu-runner.sh index 3687ea5..d0b4d24 100755 --- a/tests/qemu-runner.sh +++ b/tests/qemu-runner.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# qemu-runner.sh — Boot qemu-system-aarch64 with the elfuse test fixtures +# qemu-runner.sh -- Boot qemu-system-aarch64 with the elfuse test fixtures # initramfs and provide qemu_exec() for command execution over ssh. # # Sourced by tests/test-matrix.sh (for the qemu-aarch64 mode) but also @@ -162,7 +162,7 @@ qemu_start() # Each call opens a fresh ssh connection. Avoids ControlMaster pitfalls # (master dying mid-suite cascades rc=255 to every later command) at the -# cost of ~100ms handshake overhead per call — a flat ~15s across the +# cost of ~100ms handshake overhead per call -- a flat ~15s across the # full matrix, well within the suite's tolerance. _qemu_ssh_raw() { @@ -226,7 +226,7 @@ qemu_stop() # script exit. trap 'qemu_stop' EXIT -# CLI driver: when run directly, support `qemu-runner.sh start|exec|stop`. +# CLI driver: when run directly, support 'qemu-runner.sh start|exec|stop'. if [ "${BASH_SOURCE[0]:-$0}" = "$0" ]; then cmd="${1:-help}" shift || true diff --git a/tests/test-busybox.sh b/tests/test-busybox.sh index 6f89935..7f70d38 100755 --- a/tests/test-busybox.sh +++ b/tests/test-busybox.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# test-busybox.sh — Busybox 1.37.0 applet smoke tests for elfuse +# test-busybox.sh -- Busybox 1.37.0 applet smoke tests for elfuse # # Copyright 2026 elfuse contributors # Copyright 2025 Moritz Angermann, zw3rk pte. ltd. 
diff --git a/tests/test-coreutils.sh b/tests/test-coreutils.sh index 157efce..b1f4448 100755 --- a/tests/test-coreutils.sh +++ b/tests/test-coreutils.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# test-coreutils.sh — GNU coreutils integration suite for elfuse +# test-coreutils.sh -- GNU coreutils integration suite for elfuse # # Copyright 2026 elfuse contributors # Copyright 2025 Moritz Angermann, zw3rk pte. ltd. diff --git a/tests/test-cross-fork-mapshared.c b/tests/test-cross-fork-mapshared.c index 7df2b11..0cc2789 100644 --- a/tests/test-cross-fork-mapshared.c +++ b/tests/test-cross-fork-mapshared.c @@ -132,7 +132,7 @@ static bool send_byte(int fd) return n == 1; } -/* Test 1: File-backed MAP_SHARED — parent and child see each other's +/* Test 1: File-backed MAP_SHARED -- parent and child see each other's * writes through the same disk file without msync. */ static void test_file_backed_cross_fork(void) @@ -253,7 +253,7 @@ static void test_file_backed_cross_fork(void) close(fd); } -/* Test 2: Anonymous MAP_SHARED — typical parent-child IPC pattern +/* Test 2: Anonymous MAP_SHARED -- typical parent-child IPC pattern * (Postgres, multi-process daemons). elfuse must convert the region * to memfd-backed at fork time so both sides observe writes. */ @@ -339,7 +339,7 @@ static void test_anon_shared_cross_fork(void) munmap(p, 4096); } -/* Test 3: shm-backed MAP_SHARED via /dev/shm — same as test 1 but +/* Test 3: shm-backed MAP_SHARED via /dev/shm -- same as test 1 but * exercises the shm path (musl/glibc shm_open emulation in elfuse). 
*/ static void test_shm_cross_fork(void) diff --git a/tests/test-dynamic-coreutils.sh b/tests/test-dynamic-coreutils.sh index 55db318..6165bdb 100755 --- a/tests/test-dynamic-coreutils.sh +++ b/tests/test-dynamic-coreutils.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# test-dynamic-coreutils.sh — Dynamically-linked GNU coreutils test suite for elfuse +# test-dynamic-coreutils.sh -- Dynamically-linked GNU coreutils test suite for elfuse # # Copyright 2026 elfuse contributors # Copyright 2025 Moritz Angermann, zw3rk pte. ltd. diff --git a/tests/test-epoll-edge.c b/tests/test-epoll-edge.c index 3b83456..c968466 100644 --- a/tests/test-epoll-edge.c +++ b/tests/test-epoll-edge.c @@ -491,7 +491,7 @@ int main(void) if (n != 1 || (out.events & EPOLLOUT)) { FAIL("expected only IN to fire (OUT must not be ready)"); } else { - /* MOD to IN-only — drops OUT entirely. The implementation + /* MOD to IN-only -- drops OUT entirely. The implementation * must remove the surviving EVFILT_WRITE; with the kevent * batched-delete bug it would leak. */ diff --git a/tests/test-fuse-alpine.sh b/tests/test-fuse-alpine.sh index 537ab12..cd2b364 100755 --- a/tests/test-fuse-alpine.sh +++ b/tests/test-fuse-alpine.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -# test-fuse-alpine.sh — Validate guest FUSE inside the Alpine musl sysroot. +# test-fuse-alpine.sh -- Validate guest FUSE inside the Alpine musl sysroot. # # Copyright 2026 elfuse contributors # SPDX-License-Identifier: Apache-2.0 diff --git a/tests/test-futex-pi.c b/tests/test-futex-pi.c index 59ca4f9..66b84b7 100644 --- a/tests/test-futex-pi.c +++ b/tests/test-futex-pi.c @@ -200,10 +200,54 @@ static void test_pi_dead_owner(void) /* Test 3: EINTR injection after ~1s */ +/* Sibling that keeps the guest in a multi-threaded state for the duration of + * the EINTR probe. 
The synthetic EINTR injection in futex_wait only fires + * while thread_is_single_active() is false; a single-threaded guest must be + * allowed to park in FUTEX_WAIT indefinitely so it does not break glibc + * startup paths. The probe therefore has to run with at least one other guest + * thread alive. + * + * The sibling sleeps on a timed futex_wait against keepalive_word with a + * 5-second timeout. The timeout dodges the EINTR injection ('!has_timeout' is + * what gates the sim), and 5 s is long enough to outlast the worst-case parent + * EINTR window (1 s with up to 100 ms poll jitter, plus a safety margin). After + * the parent's probe returns, the parent flips keepalive_word and wakes the + * sibling. + */ +static volatile int sibling_keepalive __attribute__((aligned(4))) = 1; +static char sibling_stack_buf[8192] __attribute__((aligned(16))); + +static void sibling_alive_thread(void) +{ + struct timespec ts = {5, 0}; + while (__atomic_load_n(&sibling_keepalive, __ATOMIC_SEQ_CST) == 1) { + raw_syscall6(__NR_futex, (long) &sibling_keepalive, + FUTEX_WAIT | FUTEX_PRIVATE, 1, (long) &ts, 0, 0); + } + raw_exit(0); +} + static void test_futex_eintr(void) { TEST("futex_wait EINTR after ~1s"); + /* Spawn the sibling so thread_is_single_active() is false during the wait. + * CLONE flags match test_pi_dead_owner. + */ + sibling_keepalive = 1; + void *sibling_top = sibling_stack_buf + sizeof(sibling_stack_buf); + int sibling_tid_val = 0; + long sret = raw_clone(0x7d0f00, sibling_top, &sibling_tid_val, 0, + (int *) &sibling_tid_val); + if (sret < 0) { + FAIL("sibling clone failed"); + return; + } + if (sret == 0) { + sibling_alive_thread(); + raw_exit(1); /* unreachable */ + } + /* Create a futex word that no one will wake. * futex_wait with no timeout should return -EINTR after ~1 second * (elfuse's simulated periodic signal delivery). 
@@ -219,7 +263,16 @@ static void test_futex_eintr(void) long elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000; - /* Expect -EINTR (Linux errno 4) after 800ms–3000ms. + /* Tear down the sibling now that the EINTR check is done. */ + __atomic_store_n(&sibling_keepalive, 0, __ATOMIC_SEQ_CST); + raw_futex_wake((int *) &sibling_keepalive, 1); + for (int i = 0; i < 100; i++) { + if (__atomic_load_n(&sibling_tid_val, __ATOMIC_SEQ_CST) == 0) + break; + usleep(10000); + } + + /* Expect -EINTR (Linux errno 4) after 800ms-3000ms. * The 1s timeout has jitter from 100ms polling intervals. */ if (r == -4 /* -EINTR */ && elapsed_ms >= 800 && elapsed_ms <= 3000) { diff --git a/tests/test-gdbstub.sh b/tests/test-gdbstub.sh index f056a54..dfbfc4b 100755 --- a/tests/test-gdbstub.sh +++ b/tests/test-gdbstub.sh @@ -279,7 +279,7 @@ run_lldb \ -o "memory read 0x400020 --count 6 --format c" \ -o "process kill" \ -o "quit" -# 0x400020 is msg: "hello\n" — should contain 'h','e','l','l','o' +# 0x400020 is msg: "hello\n" -- should contain 'h','e','l','l','o' ok=0 if echo "$LLDB_OUT" | grep -q "hello"; then ok=1 @@ -342,7 +342,7 @@ run_lldb \ -o "register read pc" \ -o "process kill" \ -o "quit" -# Should hit breakpoint at 0x400014 (mov x0, #0 — the exit setup) +# Should hit breakpoint at 0x400014 (mov x0, #0 -- the exit setup) ok=0 if echo "$LLDB_OUT" | grep -qi "0x.*400014"; then ok=1 @@ -456,7 +456,7 @@ run_lldb \ sleep 0.5 ok=0 if ! kill -0 "$elfuse_pid" 2> /dev/null; then - # elfuse exited after detach — guest ran to completion + # elfuse exited after detach -- guest ran to completion ok=1 fi report "detach: guest continues and exits after detach" $ok diff --git a/tests/test-matrix.sh b/tests/test-matrix.sh index 0408ae3..e6a6140 100755 --- a/tests/test-matrix.sh +++ b/tests/test-matrix.sh @@ -24,6 +24,9 @@ set -euo pipefail REPO_ROOT="$(cd "$(dirname "$0")/.." 
&& pwd)" FIXTURES="${REPO_ROOT}/externals/test-fixtures" +# Allow tests to point the translator probe at a missing path to exercise +# the non-Rosetta-host skip path without uninstalling the translator. +: "${MATRIX_ROSETTA_TRANSLATOR:=/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta}" MODE="${1:?Usage: $0 }" @@ -54,7 +57,7 @@ ELFUSE="${ELFUSE:-${REPO_ROOT}/build/elfuse}" : "${GUEST_X86_64_GLIBC_SYSROOT:=}" : "${GUEST_X86_64_GLIBC_DYNAMIC_COREUTILS:=}" -# Reuse the shared per-test reporter so the output matches `make check` +# Reuse the shared per-test reporter so the output matches 'make check' # (which drives tests through tests/driver.sh). TEST_LABEL_WIDTH controls the # left-aligned name column and must be set before the source so the helper # library picks it up. @@ -82,6 +85,21 @@ ensure_fixtures() fi } +ensure_x86_fixtures() +{ + if [ ! -x "$MATRIX_ROSETTA_TRANSLATOR" ]; then + return 0 + fi + if [ -x "${FIXTURES}/x86_64-musl/staticbin/bin/busybox" ] \ + && [ -d "${FIXTURES}/x86_64-musl/dyn-bin" ] \ + && [ -f "${FIXTURES}/x86_64-musl/rootfs/lib/ld-musl-x86_64.so.1" ] \ + && [ -x "${FIXTURES}/x86_64-musl/rootfs/usr/bin/luajit" ]; then + return 0 + fi + printf "Fetching x86_64 Rosetta fixtures (one-time download)\n" + INCLUDE_X86_64=1 bash "${REPO_ROOT}/tests/fetch-fixtures.sh" +} + setup_fixtures() { local mode="$1" @@ -139,7 +157,7 @@ run_elfuse() timeout 30 "$ELFUSE" "${args[@]}" "$@" 2> /dev/null } -# `timeout` cannot wrap a shell function, so this runner inlines the path +# 'timeout' cannot wrap a shell function, so this runner inlines the path # rewriting + ssh invocation that qemu_exec would otherwise do. # Repo paths under REPO_ROOT are rewritten to /mnt/host/...; the remote # command is launched with cwd=/mnt/host so unqualified paths in test @@ -195,7 +213,7 @@ run_elfuse_sysroot() # procfs compatibility path exercised by test-io-opt. 
test-sysfs-cpu asserts # the elfuse stub contract (cache/topology subtree empty, possible == online, # cpuN count == online count) which a real kernel does not honor. All listed -# tests still run in elfuse-aarch64 mode and in `make check`; the qemu +# tests still run in elfuse-aarch64 mode and in 'make check'; the qemu # reference run skips them. QEMU_SKIP="test-thread test-stress test-futex-pi test-io-opt test-sysfs-cpu" @@ -230,7 +248,7 @@ report_timeout() } # Account for an optional binary or fixture being absent. The previous -# pattern (`if [ -e "$bin/X" ]; then test_check ... fi`) silently erased +# pattern ('if [ -e "$bin/X" ]; then test_check ... fi') silently erased # the assertion when X was missing, so the suite summary could report # "all passed" while major coverage blocks never ran. require_binary # always increments skip and emits a skip line, so absences are visible @@ -246,9 +264,57 @@ require_binary() return 1 } +suite_summary_fields() +{ + local output="$1" + printf '%s\n' "$output" \ + | sed -n \ + 's/^Results: \([0-9][0-9]*\) passed, \([0-9][0-9]*\) failed, \([0-9][0-9]*\) skipped (of \([0-9][0-9]*\)).*/\1 \2 \3 \4/p' \ + | tail -n 1 +} + +run_summary_suite() +{ + local label="$1" + shift + + local output rc + if output=$("$@" 2>&1); then + rc=0 + else + rc=$? + fi + printf '%s\n' "$output" + + local fields + fields="$(suite_summary_fields "$output")" + if [ -n "$fields" ]; then + local suite_pass=0 suite_fail=0 suite_skip=0 suite_total=0 + read -r suite_pass suite_fail suite_skip suite_total <<< "$fields" + # Force decimal: a sub-suite that ever emits a zero-padded count + # ('08', '09') would otherwise trip bash's "invalid octal" error + # inside $((...)) and abort the matrix under 'set -e'. 
+ pass=$((pass + 10#${suite_pass:-0})) + fail=$((fail + 10#${suite_fail:-0})) + skip=$((skip + 10#${suite_skip:-0})) + return "$rc" + fi + + if [ "$rc" -eq 77 ]; then + skip_suite "$label" "suite skipped" + return 0 + fi + if [ "$rc" -eq 0 ]; then + pass=$((pass + 1)) + else + fail=$((fail + 1)) + fi + return "$rc" +} + # Suite-level analog of require_binary for whole fixture directories. # The label names the suite that is being skipped. Use this in place of -# bare `printf "SKIP\n"` lines so the skip counter reflects reality. +# bare 'printf "SKIP\n"' lines so the skip counter reflects reality. skip_suite() { local label="$1" reason="$2" @@ -495,8 +561,8 @@ run_coreutils_tests() printf "\nCoreutils encoding%s\n" "$_COREUTILS_SUFFIX" # The if/then form contains require_binary's exit status so missing - # binaries do not propagate as a function-exit-1 under `set -e`. The - # earlier `&& test_check` chain failed the matrix script outright + # binaries do not propagate as a function-exit-1 under 'set -e'. The + # earlier '&& test_check' chain failed the matrix script outright # whenever the LAST optional binary in a function was absent. 
if require_binary "base32" "$bindir/base32"; then test_check "$runner" "base32" "NBSWY" "$bindir/base32" "$TEST_TMPDIR/hello.txt" @@ -512,6 +578,49 @@ run_coreutils_tests() fi } +run_rosetta_x86_64_suites() +{ + local rc=0 + + printf "Rosetta CLI gating\n" + run_summary_suite "rosetta-cli" \ + bash "${REPO_ROOT}/tests/test-rosetta-cli.sh" "$ELFUSE" || rc=1 + + printf "\nRosetta failure modes\n" + run_summary_suite "rosetta-failure-modes" \ + bash "${REPO_ROOT}/tests/test-rosetta-failure-modes.sh" "$ELFUSE" || rc=1 + + if [ -x "$MATRIX_ROSETTA_TRANSLATOR" ]; then + printf "\nRosetta statics\n" + run_summary_suite "rosetta-statics" \ + bash "${REPO_ROOT}/tests/test-rosetta-statics.sh" "$ELFUSE" || rc=1 + + printf "\nRosetta Alpine corpus\n" + run_summary_suite "rosetta-alpine" \ + bash "${REPO_ROOT}/tests/test-rosetta-alpine.sh" "$ELFUSE" || rc=1 + + printf "\nRosetta thread/signal audit\n" + run_summary_suite "rosetta-audit" \ + bash "${REPO_ROOT}/tests/test-rosetta-audit.sh" "$ELFUSE" || rc=1 + + printf "\nRosetta guest JIT\n" + run_summary_suite "rosetta-jit" \ + bash "${REPO_ROOT}/tests/test-rosetta-jit.sh" "$ELFUSE" || rc=1 + + printf "\nRosetta glibc dynamic\n" + run_summary_suite "rosetta-glibc" \ + bash "${REPO_ROOT}/tests/test-rosetta-glibc.sh" "$ELFUSE" || rc=1 + else + local suite + for suite in rosetta-statics rosetta-alpine rosetta-audit rosetta-jit \ + rosetta-glibc; do + skip_suite "$suite" "Rosetta translator not installed" + done + fi + + return "$rc" +} + run_busybox_tests() { local runner="$1" bb="$2" @@ -631,14 +740,6 @@ run_suite() dyn_runner="run_qemu" ;; elfuse-x86_64) - printf "Unsupported %s: x86_64-via-Rosetta matrix is not runnable yet\n" \ - "$mode" - printf "Rosetta fork, execve, high-VA mmap, and the rosettad bridge " - printf "are in place; what remains is wiring the standalone " - printf "tests/test-rosetta-*.sh suites into a first-class matrix " - printf "branch with per-host pass/fail baselines. 
Run the " - printf "tests/test-rosetta-*.sh scripts directly in the meantime.\n\n" - return 2 ;; *) echo "Unknown mode: $mode" @@ -646,8 +747,10 @@ run_suite() ;; esac - cleanup_fixtures - setup_fixtures "$mode" + if [ "$mode" != "elfuse-x86_64" ]; then + cleanup_fixtures + setup_fixtures "$mode" + fi printf "\nTesting: %s\n\n" "$mode" @@ -655,6 +758,25 @@ run_suite() fail=0 skip=0 + if [ "$mode" = "elfuse-x86_64" ]; then + run_rosetta_x86_64_suites || true + + local total=$((pass + fail + skip)) + if [ "$fail" -eq 0 ] && [ "$skip" -eq 0 ]; then + printf " All %d tests passed\n\n" "$pass" + else + printf " Results: %d passed, %d failed, %d skipped (of %d)\n\n" \ + "$pass" "$fail" "$skip" "$total" + fi + if [ ! -x "$MATRIX_ROSETTA_TRANSLATOR" ]; then + printf " Expected: elfuse-x86_64 ran host-independent guardrails; " + printf "Rosetta-only suites skipped because the translator is absent.\n\n" + return "$fail" + fi + verify_expected_counts "$mode" + return $? + fi + run_unit_tests "$runner" "$GUEST_TEST_BINARIES" run_coreutils_tests "$runner" "$GUEST_COREUTILS" run_busybox_tests "$runner" "$GUEST_BUSYBOX" @@ -721,22 +843,98 @@ run_suite() return $? } -# Per-mode expected outcome envelope. Each mode lists the minimum pass count -# and the exact failure count the matrix is allowed to report. Skip counts -# are advisory because some skips depend on the host environment (qemu -# running on Apple Silicon vs Linux, x86_64 fixtures present or not). When -# the runtime advances, bump these counts in the same commit that changes -# behaviour so reviewers see the new headline numbers explicitly. -declare -A EXPECTED_MIN_PASS=( - [elfuse - aarch64]=180 - [qemu - aarch64]=180 - [elfuse - x86_64]=0 -) -declare -A EXPECTED_FAIL=( - [elfuse - aarch64]=0 - [qemu - aarch64]=1 - [elfuse - x86_64]=0 -) +# Per-mode expected outcome envelope. Each mode lists the minimum pass +# count and the exact failure count the matrix is allowed to report. 
+# Skip counts are advisory because some skips depend on the host +# environment (qemu running on Apple Silicon vs Linux, x86_64 fixtures +# present or not). When the runtime advances, bump these counts in the +# same commit that changes behaviour so reviewers see the new headline +# numbers explicitly. +# +# elfuse-x86_64 baseline (71) is the sum of the seven Rosetta sub-suites +# in run_rosetta_x86_64_suites: +# test-rosetta-cli.sh = 4 +# test-rosetta-failure-modes.sh = 3 +# test-rosetta-statics.sh = 20 +# test-rosetta-alpine.sh = 33 +# test-rosetta-audit.sh = 2 +# test-rosetta-jit.sh = 2 +# test-rosetta-glibc.sh = 7 +# The full per-binary inventory and the per-host capture process live +# in docs/testing.md "x86_64 Acceptance Inventory and Per-Host +# Baselines". Bump these counts in the same commit that grows or trims +# any sub-suite's Results line so the matrix gate stays in sync. +# +# Keys MUST match the lookup strings exactly. Every subscript here is +# explicitly quoted ("elfuse-aarch64", etc.) because shfmt parses +# unquoted bareword subscripts as arithmetic, which expands [a-b] to +# [a - b]. Bash then treats the spaced and unspaced forms as different +# keys, so the baseline gate silently goes dead. This regression has +# bitten the tree three times; keep the quotes and do not rewrite this +# block back into the 'declare -A NAME=( ... )' initialiser form +# either (the same arithmetic-rewrite hits subscripts in that form). +declare -A EXPECTED_MIN_PASS +declare -A EXPECTED_FAIL +EXPECTED_MIN_PASS["elfuse-aarch64"]=180 +EXPECTED_FAIL["elfuse-aarch64"]=0 +EXPECTED_MIN_PASS["qemu-aarch64"]=180 +EXPECTED_FAIL["qemu-aarch64"]=1 + +# x86_64 baselines are keyed by detected host SoC class +# (see detect_x86_64_host_class below). The two M-series classes +# diverge inside sys_mmap_fixed_high_va on IPA width: apple-m1-m2 is +# 36-bit (overflow-segment path), apple-m3-plus is 40-bit (bisected +# -slab path on M5). 
The seven Rosetta sub-suites currently emit fixed +# pass counts regardless of IPA width, so both rows start at 71; an +# operator with M3+ hardware updates the apple-m3-plus row in place +# when their observed counts diverge. apple-unknown is the fallback +# row for SoC strings the detector does not recognise yet. +EXPECTED_MIN_PASS["elfuse-x86_64:apple-m1-m2"]=71 +EXPECTED_FAIL["elfuse-x86_64:apple-m1-m2"]=0 +EXPECTED_MIN_PASS["elfuse-x86_64:apple-m3-plus"]=71 +EXPECTED_FAIL["elfuse-x86_64:apple-m3-plus"]=0 +EXPECTED_MIN_PASS["elfuse-x86_64:apple-unknown"]=71 +EXPECTED_FAIL["elfuse-x86_64:apple-unknown"]=0 + +# Host SoC class detector for x86_64 baseline selection. Reads +# machdep.cpu.brand_string (sysctl), which Apple Silicon Macs publish +# as "Apple M1", "Apple M2 Pro", "Apple M3 Max", etc. The MATRIX_HOST +# _CLASS_OVERRIDE env var exists so the M3+ row can be exercised from +# an M1/M2 host (and vice versa) without modifying the detector. The +# detector intentionally returns a stable apple-unknown rather than +# guessing on never-seen brand strings so new SoCs do not silently +# graft onto an existing row. +# Validate MATRIX_HOST_CLASS_OVERRIDE at script entry. The detector is +# invoked from $(...) inside verify_expected_counts, where an exit only +# terminates the subshell and the parent silently sees an empty class. +# Pre-validating here makes a typo (e.g. "apple-m3" missing -plus) +# fail loudly before any sub-suite runs. 
+if [ -n "${MATRIX_HOST_CLASS_OVERRIDE:-}" ]; then + case "$MATRIX_HOST_CLASS_OVERRIDE" in + apple-m1-m2 | apple-m3-plus | apple-unknown) ;; + *) + printf 'MATRIX_HOST_CLASS_OVERRIDE: unknown class "%s"; ' \ + "$MATRIX_HOST_CLASS_OVERRIDE" >&2 + printf 'expected one of apple-m1-m2 / apple-m3-plus / apple-unknown\n' >&2 + exit 2 + ;; + esac +fi + +detect_x86_64_host_class() +{ + if [ -n "${MATRIX_HOST_CLASS_OVERRIDE:-}" ]; then + printf '%s\n' "$MATRIX_HOST_CLASS_OVERRIDE" + return 0 + fi + local brand + brand="$(sysctl -n machdep.cpu.brand_string 2> /dev/null || true)" + case "$brand" in + *"Apple M1"* | *"Apple M2"*) printf 'apple-m1-m2\n' ;; + *"Apple M3"* | *"Apple M4"* | *"Apple M5"*) printf 'apple-m3-plus\n' ;; + *) printf 'apple-unknown\n' ;; + esac +} # Known-failure annotations. These are tests that fail by design under a # given mode and are tracked here so the matrix runner can distinguish them @@ -756,28 +954,59 @@ KNOWN_FAILURES_ELFUSE_X86_64="test-signal-thread test-thread test-stress" verify_expected_counts() { local mode="$1" - local exp_min="${EXPECTED_MIN_PASS[$mode]:-}" - local exp_fail="${EXPECTED_FAIL[$mode]:-}" + local key="$mode" + local host_class="" + if [ "$mode" = "elfuse-x86_64" ]; then + host_class="$(detect_x86_64_host_class)" + key="${mode}:${host_class}" + fi + + local exp_min="${EXPECTED_MIN_PASS[$key]:-}" + local exp_fail="${EXPECTED_FAIL[$key]:-}" if [ -z "$exp_min" ] || [ -z "$exp_fail" ]; then - # Mode without a recorded baseline (e.g. an experimental local mode). - # Stay silent so the matrix runner is still usable as a smoke probe. + # No recorded baseline for this key (experimental local mode, or + # an x86_64 host class the detector did not classify). Stay + # silent so the matrix runner remains usable as a smoke probe. return 0 fi + # Uncaptured rows: apple-m3-plus inherits the M1/M2 numbers pending + # operator capture on real M3+ hardware; apple-unknown means the SoC + # brand string did not match a known class at all. 
In both cases + # surface that the baseline is not authoritative for this host so a + # genuine M3+ divergence is not silently absorbed. + if [ "$mode" = "elfuse-x86_64" ]; then + case "$host_class" in + apple-m3-plus) + printf " Note: elfuse-x86_64 baseline for %s is held equal to\n" \ + "$host_class" + printf " apple-m1-m2 pending capture on real M3+ hardware. If\n" + printf " your numbers diverge, update only the\n" + printf " EXPECTED_*[elfuse-x86_64:apple-m3-plus] rows.\n" + ;; + apple-unknown) + printf " Note: host SoC did not match a known M-series class;\n" + printf " falling back to the elfuse-x86_64:apple-unknown row.\n" + printf " Add the new SoC to detect_x86_64_host_class so future\n" + printf " runs gate against the right baseline.\n" + ;; + esac + fi + local err=0 if [ "$pass" -lt "$exp_min" ]; then printf " Expected-pass deviation: %s saw %d pass, baseline %d.\n" \ - "$mode" "$pass" "$exp_min" + "$key" "$pass" "$exp_min" err=1 fi if [ "$fail" -ne "$exp_fail" ]; then printf " Expected-fail deviation: %s saw %d fail, baseline %d.\n" \ - "$mode" "$fail" "$exp_fail" + "$key" "$fail" "$exp_fail" err=1 fi if [ "$err" -eq 0 ]; then printf " Expected: %s within baseline (>= %d pass, exactly %d fail).\n\n" \ - "$mode" "$exp_min" "$exp_fail" + "$key" "$exp_min" "$exp_fail" else printf " Bump the EXPECTED_* table in tests/test-matrix.sh if this\n" printf " shift is intentional.\n\n" @@ -790,13 +1019,19 @@ verify_expected_counts() # the user iterate before the x86_64 corpus exists, without pulling the # aarch64 musl rootfs and qemu kernel they will not use. 
case "$MODE" in - elfuse-x86_64) ;; + elfuse-x86_64) + ensure_x86_fixtures + ;; + all) + ensure_fixtures + ensure_x86_fixtures + ;; *) ensure_fixtures ;; esac total_fail=0 if [ "$MODE" = "all" ]; then - for m in elfuse-aarch64 qemu-aarch64; do + for m in elfuse-aarch64 qemu-aarch64 elfuse-x86_64; do run_suite "$m" || total_fail=$((total_fail + $?)) done exit "$total_fail" diff --git a/tests/test-negative.c b/tests/test-negative.c index 2c068c0..6e8f5ca 100644 --- a/tests/test-negative.c +++ b/tests/test-negative.c @@ -395,7 +395,7 @@ static void test_einval(void) { /* TIMER_ABSTIME with a deadline already in the past must return 0 * immediately (no sleep). Use tv_sec=0 to stay in the kernel's - * "valid timespec" space — Linux rejects negative tv_sec with + * "valid timespec" space -- Linux rejects negative tv_sec with * -EINVAL even before the deadline-in-past check. */ struct timespec ts = {.tv_sec = 0, .tv_nsec = 0}; diff --git a/tests/test-perf.sh b/tests/test-perf.sh index 81879af..78f90a6 100755 --- a/tests/test-perf.sh +++ b/tests/test-perf.sh @@ -53,7 +53,7 @@ PERF_FAILED=0 # Collect $RUNS timing samples for a command, print median and stats. # Args: label command... -# Earlier revisions swallowed every sample's exit status with `|| true`, +# Earlier revisions swallowed every sample's exit status with '|| true', # which made a missing native binary, an elfuse crash, or a host SIP # block silently degrade into "median 0 ms PASS". Now any non-zero # sample aborts the timing for that label and flips PERF_FAILED so the @@ -130,7 +130,7 @@ for _ in $(seq 1 100); do cat "$SYSCALL_C" >> "$TMPFILE"; done TMPSIZE=$(wc -c < "$TMPFILE" | tr -d ' ') printf " ${CYAN}(test file: %s bytes)${RESET}\n" "$TMPSIZE" # sh -c spawns a child shell that does not inherit the outer pipefail -# from the script's `set -o pipefail`. Run the pipeline under bash so +# from the script's 'set -o pipefail'. 
Run the pipeline under bash so # pipefail is available on systems whose /bin/sh is not bash-compatible. benchmark "native cat|wc" bash -c "set -o pipefail; cat '$TMPFILE' | wc -l" benchmark "elfuse cat|wc" bash -c "set -o pipefail; '$ELFUSE' '$TOOL_BIN/cat' '$TMPFILE' | wc -l" diff --git a/tests/test-rosetta-alpine.sh b/tests/test-rosetta-alpine.sh index e985aea..fbae766 100755 --- a/tests/test-rosetta-alpine.sh +++ b/tests/test-rosetta-alpine.sh @@ -33,45 +33,23 @@ esac FIXTURES="${FIXTURES_DIR:-externals/test-fixtures}" STATICBIN_LONG="${FIXTURES}/x86_64-musl/staticbin/bin" ROOTFS="${FIXTURES}/x86_64-musl/rootfs" -ROSETTA_PATH=/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta +ROSETTA_PATH="${MATRIX_ROSETTA_TRANSLATOR:-/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta}" SHORTDIR=/tmp/elfuse-ra STATICBIN="${SHORTDIR}/bin" DATA="${SHORTDIR}/data" +# Shared report_pass / report_fail / report_skip + Results: summary +# emitter. Matches the matrix runner's aarch64 per-binary format so +# tests/test-matrix.sh elfuse-x86_64 output reads uniformly. +# shellcheck source=tests/lib/rosetta-test.sh +. "$(dirname "$0")/lib/rosetta-test.sh" + pass=0 fail=0 skip=0 total=0 -c_green() -{ - printf '\033[0;32m%s\033[0m' "$*" -} -c_red() -{ - printf '\033[0;31m%s\033[0m' "$*" -} -c_yellow() -{ - printf '\033[1;33m%s\033[0m' "$*" -} -report_pass() -{ - printf '%s %s\n' "$(c_green ' PASS:')" "$*" - pass=$((pass + 1)) -} -report_fail() -{ - printf '%s %s\n' "$(c_red ' FAIL:')" "$*" - fail=$((fail + 1)) -} -report_skip() -{ - printf '%s %s\n' "$(c_yellow ' SKIP:')" "$*" - skip=$((skip + 1)) -} - # Pre-flight. if [ ! -x "$ROSETTA_PATH" ]; then printf 'rosetta translator not found at %s\n' "$ROSETTA_PATH" >&2 @@ -87,15 +65,7 @@ if [ ! -x "$ELFUSE" ]; then exit 1 fi -# macOS ships no built-in timeout(1); Homebrew coreutils installs it as -# /opt/homebrew/bin/timeout (and the legacy gtimeout alias). Detect either -# binary so this suite runs on macOS hosts without preconfigured PATH. 
-TIMEOUT="$(command -v timeout 2> /dev/null || command -v gtimeout 2> /dev/null \ - || true)" -if [ -z "$TIMEOUT" ]; then - printf 'timeout(1) not found in PATH; install via: brew install coreutils\n' >&2 - exit 77 -fi +require_timeout # Stage short-path symlink farm and a small data corpus. rm -rf "$SHORTDIR" @@ -434,9 +404,7 @@ run_pipe "pipe-base64-decode" "rosetta-bridge" \ # Summary # --------------------------------------------------------------------------- -printf '\n' -printf 'Results: %s passed, %s failed, %s skipped (of %s)\n' \ - "$pass" "$fail" "$skip" "$total" +report_summary "$total" if [ "$fail" -gt 0 ]; then exit 1 diff --git a/tests/test-rosetta-audit.sh b/tests/test-rosetta-audit.sh new file mode 100644 index 0000000..c69e2a5 --- /dev/null +++ b/tests/test-rosetta-audit.sh @@ -0,0 +1,71 @@ +#!/usr/bin/env bash +# test-rosetta-audit.sh - Rosetta thread/signal audit smoke +# +# Copyright 2026 elfuse contributors +# SPDX-License-Identifier: Apache-2.0 + +set -euo pipefail + +ELFUSE_INPUT="${1:-build/elfuse}" +case "$ELFUSE_INPUT" in + /*) ELFUSE="$ELFUSE_INPUT" ;; + *) ELFUSE="$(pwd)/$ELFUSE_INPUT" ;; +esac + +FIXTURES="${FIXTURES_DIR:-externals/test-fixtures}" +ROSETTA_PATH="${MATRIX_ROSETTA_TRANSLATOR:-/Library/Apple/usr/libexec/oah/RosettaLinux/rosetta}" +AUDIT_BIN="$(pwd)/tests/fixtures/rosetta/x86_64-rosetta-audit" +TLS0_BIN="$(pwd)/tests/fixtures/rosetta/x86_64-rosetta-tls0" + +# shellcheck source=tests/lib/rosetta-test.sh +. "$(dirname "$0")/lib/rosetta-test.sh" + +pass=0 +fail=0 +skip=0 +total=0 + +if [ ! -x "$ROSETTA_PATH" ]; then + printf 'rosetta translator not found at %s\n' "$ROSETTA_PATH" >&2 + exit 77 +fi +if [ ! -x "$ELFUSE" ]; then + printf 'elfuse binary not found: %s\n' "$ELFUSE" >&2 + exit 1 +fi + +require_timeout + +if [ ! -x "$AUDIT_BIN" ] || [ ! 
-x "$TLS0_BIN" ]; then + printf 'vendored Rosetta audit fixtures missing under tests/fixtures/rosetta/\n' >&2 + exit 77 +fi + +total=$((total + 1)) +set +e +audit_out="$("$ELFUSE" "$AUDIT_BIN" 2>&1)" +audit_rc=$? +set -e +if [ "$audit_rc" -eq 41 ] && printf '%s\n' "$audit_out" | grep -q 'XFAIL sa-resethand-shadowed'; then + report_pass "audit-known-limitations" +elif [ "$audit_rc" -eq 0 ] && printf '%s\n' "$audit_out" | grep -q 'PASS sa-resethand-reset'; then + report_pass "audit-known-limitations" +else + report_fail "audit-known-limitations: rc=$audit_rc" + printf '%s\n' "$audit_out" >&2 +fi + +total=$((total + 1)) +set +e +"$TIMEOUT" 5 "$ELFUSE" "$TLS0_BIN" > /tmp/elfuse-rosetta-tls0.out 2>&1 +tls0_rc=$? +set -e +if [ "$tls0_rc" -eq 124 ]; then + report_pass "tls0-known-hang" +else + report_fail "tls0-known-hang: rc=$tls0_rc" + cat /tmp/elfuse-rosetta-tls0.out >&2 +fi +rm -f /tmp/elfuse-rosetta-tls0.out + +report_summary "$total" diff --git a/tests/test-rosetta-cli.sh b/tests/test-rosetta-cli.sh index 6e6b31f..563cff1 100755 --- a/tests/test-rosetta-cli.sh +++ b/tests/test-rosetta-cli.sh @@ -8,6 +8,18 @@ set -euo pipefail ELFUSE="${1:-build/elfuse}" +# Shared report_pass / report_fail / report_skip + test-runner.sh +# colors. The matrix runner reads only the Results: line emitted by +# report_summary at the bottom; per-binary lines now match the aarch64 +# modes' [ OK ] / [ FAIL ] format. +# shellcheck source=tests/lib/rosetta-test.sh +. "$(dirname "$0")/lib/rosetta-test.sh" + +pass=0 +fail=0 +skip=0 +total=0 + tmpdir="$(mktemp -d "${TMPDIR:-/tmp}/elfuse-rosetta-cli.XXXXXX")" trap 'rm -rf "$tmpdir"' EXIT @@ -54,19 +66,22 @@ run_expect_fail() local label="$1" local pattern="$2" shift 2 + total=$((total + 1)) local stderr="$tmpdir/${label}.stderr" if "$@" > /dev/null 2> "$stderr"; then - printf 'FAIL %s: command succeeded unexpectedly\n' "$label" >&2 + report_fail "$label (command succeeded unexpectedly)" cat "$stderr" >&2 + report_summary "$total" exit 1 fi if ! 
grep -q -- "$pattern" "$stderr"; then - printf 'FAIL %s: stderr did not contain %s\n' "$label" "$pattern" >&2 + report_fail "$label (stderr did not contain $pattern)" cat "$stderr" >&2 + report_summary "$total" exit 1 fi - printf 'PASS %s\n' "$label" + report_pass "$label" } # Rosetta is on by default and architecture is auto-detected from the ELF @@ -94,3 +109,5 @@ run_expect_fail "rosetta-gdb" \ run_expect_fail "rosetta-default" \ "requires the Rosetta Linux translator\\|translate produced empty/missing output\\|Translation failed, invalid path or invalid executable\\|VMAllocationTracker\\|Rosetta is only intended to run on Apple Silicon" \ "$ELFUSE" "$x64_elf" + +report_summary "$total" diff --git a/tests/test-rosetta-failure-modes.sh b/tests/test-rosetta-failure-modes.sh index b97683f..362e4ef 100755 --- a/tests/test-rosetta-failure-modes.sh +++ b/tests/test-rosetta-failure-modes.sh @@ -1,20 +1,28 @@ #!/usr/bin/env bash -# test-rosetta-failure-modes.sh - Probe known x86_64-via-Rosetta limits +# +# CLI gating for x86_64-via-Rosetta # # Copyright 2026 elfuse contributors # SPDX-License-Identifier: Apache-2.0 # -# Verifies that known-unsupported scenarios fail with a clear, stable -# error rather than crashing or succeeding silently. Treats the failure -# itself as the test: every probe is expected to exit non-zero AND emit -# a recognisable error fragment. +# Verifies that the three command-line gates around x86_64 guests +# reject as designed. Treats the failure itself as the test: every +# probe is expected to exit non-zero AND emit a recognisable error +# fragment. # # Categories covered: -# 1. Mid-process aarch64 -> x86_64 execve: rejected -ENOEXEC -# 2. Dynamic x86_64 binary (PT_INTERP): "failed to mmap segment" -# 3. --gdb on x86_64 ELF: rejected by main.c -# 4. --no-rosetta with x86_64: rejected at exec.c -# 5. ELFUSE_NO_ROSETTA=1 with x86_64: same rejection via env +# 1. --gdb on x86_64 ELF: rejected by main.c +# 2. 
--no-rosetta with x86_64: rejected at exec.c +# 3. ELFUSE_NO_ROSETTA=1 with x86_64: same rejection via env +# +# The end-to-end dynamic-linker bring-up under Rosetta is covered by +# tests/test-rosetta-glibc.sh (glibc-hello / glibc-hello-via-ldso), +# and mid-process execve re-bootstrap is covered by +# tests/test-rosetta-statics.sh (env-execve). Those tests carry the +# same code-path scrutiny as the dynamic / execve probes that used to +# live here, against the vendored fixture trees that are always +# present, so this script no longer needs the x86_64-musl Alpine +# corpus and no longer self-stages it. # # Usage: tests/test-rosetta-failure-modes.sh [path/to/elfuse] @@ -26,44 +34,19 @@ case "$ELFUSE_INPUT" in *) ELFUSE="$(pwd)/$ELFUSE_INPUT" ;; esac -FIXTURES="${FIXTURES_DIR:-externals/test-fixtures}" -STATICBIN_LONG="${FIXTURES}/x86_64-musl/staticbin/bin" -DYNBIN_LONG="${FIXTURES}/x86_64-musl/dyn-bin" SHORTDIR=/tmp/elfuse-rfm +# Shared report_pass / report_fail / report_skip + Results: summary +# emitter. Matches the matrix runner's aarch64 per-binary format so +# tests/test-matrix.sh elfuse-x86_64 output reads uniformly. +# shellcheck source=tests/lib/rosetta-test.sh +. "$(dirname "$0")/lib/rosetta-test.sh" + pass=0 fail=0 skip=0 total=0 -c_green() -{ - printf '\033[0;32m%s\033[0m' "$*" -} -c_red() -{ - printf '\033[0;31m%s\033[0m' "$*" -} -c_yellow() -{ - printf '\033[1;33m%s\033[0m' "$*" -} -report_pass() -{ - printf '%s %s\n' "$(c_green ' PASS:')" "$*" - pass=$((pass + 1)) -} -report_fail() -{ - printf '%s %s\n' "$(c_red ' FAIL:')" "$*" - fail=$((fail + 1)) -} -report_skip() -{ - printf '%s %s\n' "$(c_yellow ' SKIP:')" "$*" - skip=$((skip + 1)) -} - # Expect a non-zero exit AND a stderr fragment match. # Args: