Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ externals/
lib/modules/
*.o
*.bin
__pycache__
4 changes: 2 additions & 2 deletions README.md
@@ -71,7 +71,7 @@ The build signs `build/elfuse` before use. Override the signing identity with
`--sysroot`, and attaching `gdb` / `lldb` to the built-in stub.
- [docs/testing.md](docs/testing.md): build prerequisites, the `make check`
flow, the QEMU cross-check matrix, and fixture handling.
- [docs/internals.md](docs/internals.md): canonical technical reference
- [docs/internals.md](docs/internals.md): canonical technical reference --
HVF constraints, EL1 shim and HVC protocol, page-table splitting, syscall
translation tables, threads/futex, fork/clone IPC, signals, ptrace, and
the GDB stub.
@@ -109,7 +109,7 @@ Boundaries to be aware of:
virtualization. `/proc`, `/dev`, and mount data are compatibility views.
- HVF allows one VM per host process, so Linux-style `fork` is implemented
via `posix_spawn` plus state transfer (a fast CoW path is used when
available see [docs/internals.md](docs/internals.md)).
available -- see [docs/internals.md](docs/internals.md)).
- `MAP_SHARED` is treated as `MAP_PRIVATE`; this matches single-process
guest semantics and unblocks tools that expect file-backed mappings.
- Unsupported syscalls return Linux-style errors rather than silently
38 changes: 19 additions & 19 deletions docs/internals.md
@@ -82,7 +82,7 @@ Apple HVF imposes a handful of constraints that shape the rest of the design:
- System registers cannot be set via `MSR` from the guest because
`HCR_EL2.TSC=1` traps all `MSR` writes. Boot-time sysreg installation
(RES1 bits, MMU enable, TTBR0, etc.) goes through HVC #4 from the EL1
shim. Runtime EL0 sysreg traps `MSR TPIDR_EL0` and similar are
shim. Runtime EL0 sysreg traps -- `MSR TPIDR_EL0` and similar -- are
handled by the HVC #12 system-instruction trap path.
- Only `HV_SYS_REG_*` constants from Hypervisor.framework may be used for
register IDs.
@@ -116,9 +116,9 @@ interp_base - varies: Dynamic linker (g->interp_base, --sysroot only)
The guest size is determined by the VM's configured IPA width (capped at
40-bit / 1 TiB):

- 36-bit IPA (64 GiB) native AArch64 on Apple M2: `mmap_limit ≈ 56 GiB`,
- 36-bit IPA (64 GiB) -- native AArch64 on Apple M2: `mmap_limit ≈ 56 GiB`,
`interp_base ≈ 60 GiB`
- 40-bit IPA (1 TiB) native AArch64 on Apple M3 and later:
- 40-bit IPA (1 TiB) -- native AArch64 on Apple M3 and later:
`mmap_limit ≈ 1016 GiB`, `interp_base ≈ 1020 GiB`
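
The address-space sizes quoted in the two bullets follow from the IPA width alone: an N-bit IPA spans 2^N bytes. A minimal sketch of that arithmetic, assuming a hypothetical helper named `ipa_gib` that is not part of elfuse. It does not reproduce `mmap_limit` or `interp_base`, which are derived at runtime from `guest_size`.

```sh
# Hypothetical helper (not from the elfuse sources): convert an IPA width
# in bits to an address-space size in GiB.
# An N-bit space spans 2^N bytes, and 1 GiB is 2^30 bytes.
ipa_gib() {
    echo $(( (1 << $1) >> 30 ))
}

ipa_gib 36   # prints 64   (64 GiB)
ipa_gib 40   # prints 1024 (1 TiB)
```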

Both `mmap_limit` and `interp_base` are computed at runtime from `guest_size`
@@ -356,7 +356,7 @@ are supported per VM.
- The classic ops: `FUTEX_WAIT`, `FUTEX_WAKE`, `FUTEX_WAIT_BITSET`,
`FUTEX_WAKE_BITSET`, `FUTEX_REQUEUE`, `FUTEX_CMP_REQUEUE`, `FUTEX_WAKE_OP`.
- A subset of priority-inheritance ops: `FUTEX_LOCK_PI`, `FUTEX_UNLOCK_PI`,
`FUTEX_TRYLOCK_PI`. Priority semantics are not actually inherited the
`FUTEX_TRYLOCK_PI`. Priority semantics are not actually inherited -- the
ops behave as ordinary mutex acquire/release, which is enough for glibc
and musl to make forward progress.
- `futex_waitv` (syscall 449) for batch waits across up to 128 futex
@@ -423,15 +423,15 @@ state transfer:
When `g->shm_fd >= 0` the guest memory is file-backed (`mkstemp` + `unlink`,
`MAP_SHARED`). Fork sends the backing fd over `SCM_RIGHTS`:

- Parent stays on `MAP_SHARED` and does NOT remap HVF caches the host
- Parent stays on `MAP_SHARED` and does NOT remap -- HVF caches the host
VA→PA mapping from `hv_vm_map`, and a `MAP_FIXED` remap does not update
Stage-2, so a remapping parent would observe stale pages.
- Child maps the fd `MAP_PRIVATE`, producing an instant CoW clone with zero
data copy.
- The IPC header sets `has_shm = 1` and `num_regions = 0`, skipping memory
serialization entirely.
- Child calls `guest_init_from_shm()` instead of `guest_init()`, and must
restore `g->ttbr0` from the IPC header `guest_init_from_shm` zeroes the
restore `g->ttbr0` from the IPC header -- `guest_init_from_shm` zeroes the
struct, and without `ttbr0` page-table walks fail for all high VAs.

This path is roughly 50× faster than the legacy IPC copy path on large guest
@@ -496,7 +496,7 @@ Key points:
`alarm()` for its per-iteration vCPU watchdog.
- `signal_check_timer()` is called from the vCPU loop after each syscall.
- After `SYSCALL_EXEC_HAPPENED`, the vCPU loop verifies that `ELR_EL1` is
non-zero a defensive check against HVF register-sync bugs.
non-zero -- a defensive check against HVF register-sync bugs.
- Each `thread_entry_t` carries its own `blocked` mask. `rt_sigprocmask`
operates on `current_thread->blocked`, and child threads inherit the
parent's mask at clone time.
@@ -511,11 +511,11 @@ which the tracer reads and writes registers through the snapshot protocol.

In `src/syscall/proc.c`:

- `PTRACE_SEIZE` attach without stopping; sets `ptraced = 1`.
- `PTRACE_CONT` resume the stopped tracee, optionally injecting a signal.
- `PTRACE_INTERRUPT` force the tracee into ptrace-stop via
- `PTRACE_SEIZE` -- attach without stopping; sets `ptraced = 1`.
- `PTRACE_CONT` -- resume the stopped tracee, optionally injecting a signal.
- `PTRACE_INTERRUPT` -- force the tracee into ptrace-stop via
`hv_vcpus_exit()`.
- `PTRACE_GETREGSET` / `PTRACE_SETREGSET` (`NT_PRSTATUS`) read or write
- `PTRACE_GETREGSET` / `PTRACE_SETREGSET` (`NT_PRSTATUS`) -- read or write
the tracee's register snapshot. Writes are applied on resume.

### Snapshot Protocol
@@ -601,14 +601,14 @@ correctly. `elf_resolve_interp()` in `src/core/elf.c` is shared between

`src/debug/` is split by role:

- `gdbstub.c` session lifecycle, stop/resume flow, packet dispatch
- `gdbstub-rsp.c` RSP packet transport and hex helpers
- `gdbstub-reg.c` register snapshot layout, restore flow, `target.xml`
- `gdbstub.c` -- session lifecycle, stop/resume flow, packet dispatch
- `gdbstub-rsp.c` -- RSP packet transport and hex helpers
- `gdbstub-reg.c` -- register snapshot layout, restore flow, `target.xml`

The stub runs in all-stop mode. Because Hypervisor.framework register access
must happen on the owning thread, the stopped vCPU snapshots its own state;
the GDB-handler thread reads and updates the snapshot, and the owning thread
restores the modified state on resume the same pattern used by ptrace.
restores the modified state on resume -- the same pattern used by ptrace.

The split mirrors the architectural boundary: transport and encoding are
independent of guest execution; register layout is independent of socket I/O;
@@ -618,10 +618,10 @@ stop/resume sequencing remains tightly coupled to process and thread state.

`elfuse` uses several layers of validation:

- `make check` fast guest tests plus the BusyBox applet smoke suite.
- `make test-busybox` applet coverage in isolation.
- `make test-gdbstub` debugger integration.
- `make test-matrix` cross-checks elfuse against QEMU on the same corpus.
- `make check` -- fast guest tests plus the BusyBox applet smoke suite.
- `make test-busybox` -- applet coverage in isolation.
- `make test-gdbstub` -- debugger integration.
- `make test-matrix` -- cross-checks elfuse against QEMU on the same corpus.

The rule for contributors is simple: match the validation depth to the
subsystem you changed. Procfs, process state, dynamic linking, and debugging
160 changes: 153 additions & 7 deletions docs/testing.md
@@ -33,6 +33,7 @@ The most useful development targets are:
```sh
make elfuse
make check
make test-rosetta-all
make test-gdbstub
make test-matrix
make lint
@@ -46,6 +47,10 @@ What they do:
applet smoke suite. The BusyBox binary is auto-resolved from
`externals/test-fixtures/aarch64-musl/staticbin/bin/busybox` if present, or
downloaded into `build/busybox` on first run.
- `make test-rosetta-all`: Rosetta-specific x86_64 acceptance scripts
(`test-rosetta-cli`, `test-rosetta-failure-modes`,
`test-rosetta-statics`, `test-rosetta-alpine`,
`test-rosetta-audit`, `test-rosetta-jit`, `test-rosetta-glibc`)
- `make test-busybox`: just the BusyBox suite, useful when iterating on a
single applet failure without rerunning the unit suite
- `make test-fuse-alpine`: validate guest `/dev/fuse` + `mount("fuse")`
@@ -63,8 +68,8 @@ make elfuse
make check
```

For changes that touch procfs, path handling, `/dev`, FUSE, networking, dynamic linking, or
guest process semantics, run the matrix as well:
For changes that touch procfs, path handling, `/dev`, FUSE, networking, dynamic
linking, or guest process semantics, run the matrix as well:

```sh
make test-matrix
@@ -76,31 +81,172 @@ iterate on a single applet failure without rerunning the unit suite.

## Test Matrix

The matrix driver lives in `tests/test-matrix.sh`. It runs the same guest test
corpus in two execution modes:
The matrix driver lives in `tests/test-matrix.sh`. It currently covers three
execution modes:

- `elfuse-aarch64`: every binary is executed via `build/elfuse` on macOS
- `qemu-aarch64`: the same binaries run natively inside an Alpine
`aarch64-linux-musl` minirootfs booted by `qemu-system-aarch64`
- `elfuse-x86_64`: Rosetta-for-Linux acceptance scripts against the staged
Alpine x86_64 fixture tree

The goal is not to compare performance. The goal is to compare guest-observable
behavior against a ground-truth Linux AArch64 environment so that any divergence
in syscall translation, procfs emulation, or process semantics is caught early.
The x86_64 mode is narrower: it aggregates the Rosetta-specific acceptance
scripts and their per-binary summaries into the same matrix runner, including
the Rosetta thread/signal audit smoke, the LuaJIT guest-JIT probe, and the
glibc dynamic-binary acceptance helper.

Run a single mode with `bash tests/test-matrix.sh elfuse-aarch64` or
`bash tests/test-matrix.sh qemu-aarch64`; `all` runs both back-to-back.
Run a single mode with `bash tests/test-matrix.sh elfuse-aarch64`,
`bash tests/test-matrix.sh qemu-aarch64`, or
`bash tests/test-matrix.sh elfuse-x86_64`; `all` runs all three back-to-back.
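
As a rough illustration of the mode argument described above, a driver could expand it into the list of modes to run as sketched below. The function name `expand_modes`, the exit code, and the order used for `all` are assumptions for illustration; this is not the actual contents of `tests/test-matrix.sh`.

```sh
# Hypothetical sketch: map a mode argument to the list of modes to run.
expand_modes() {
    case "$1" in
        elfuse-aarch64|qemu-aarch64|elfuse-x86_64)
            # A single named mode runs on its own.
            echo "$1" ;;
        all)
            # "all" runs every mode back-to-back (order assumed here).
            echo "elfuse-aarch64 qemu-aarch64 elfuse-x86_64" ;;
        *)
            echo "unknown mode: $1" >&2
            return 2 ;;
    esac
}

expand_modes all
```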

Fixture handling is self-contained:

- On first use, `tests/fetch-fixtures.sh` downloads the required Alpine
packages and the `linux-virt` kernel into `externals/test-fixtures/` and
assembles an initramfs. Subsequent runs are zero-config.
- The same fixture tree is reused by both matrix modes.
- The same fixture tree is reused across the matrix modes.
- When Rosetta mode is requested and the translator is installed,
`tests/test-matrix.sh` auto-fetches the x86_64 fixture tree
(`INCLUDE_X86_64=1`) on demand.
- QEMU mode requires `qemu-system-aarch64` on `PATH` (Homebrew `qemu`
provides it).
- musl is the only Alpine libc; the glibc-dynamic suite is skipped unless
`GUEST_GLIBC_*` environment variables point at an external sysroot.

## Rosetta Limitations

`elfuse-x86_64` is expected to inherit two Rosetta-internal limitations that are
not treated as elfuse regressions:

- `SA_RESETHAND` is not reset reliably because Rosetta shadows guest signal
handler state internally. This matches the vendored `externals/hyper-linux`
reference behavior.
- `clone(..., CLONE_SETTLS, tls=0, ...)` can hang. The upstream reproducer is
the raw-thread path in `externals/hyper-linux/test/test-thread.c`, and the
same limitation is documented in `externals/hyper-linux/hl.1`.

The x86_64 matrix branch is therefore a Rosetta acceptance gate, not a claim
that translated guests fully match native Linux thread and signal semantics.

## x86_64 Acceptance Inventory and Per-Host Baselines

The `elfuse-x86_64` matrix mode aggregates seven sub-suites. Each one
emits a deterministic per-binary pass list; the matrix runner sums
those into a single `Results:` line and compares against a per-host
baseline. The exact labels each sub-suite emits, and the contract
they verify, are:

- `tests/test-rosetta-cli.sh` (4): `rosetta-disabled-flag`,
`rosetta-disabled-env`, `rosetta-gdb`, `rosetta-default` --
command-line gating of the translator path (opt-out flag, env
override, `--gdb` rejection, install-hint surface).

- `tests/test-rosetta-failure-modes.sh` (3): `no-rosetta-flag`,
`no-rosetta-env`, `gdb-x86_64` -- command-line rejection paths.
Self-contained against a synthesized minimal x86_64 ELF; no
external fixture tree required. The dynamic-linker bring-up and
mid-process execve scenarios that used to live here are now
exclusively in the glibc and statics suites against the vendored
rootfs (see `glibc-hello` / `glibc-hello-via-ldso` and
`env-execve`).

- `tests/test-rosetta-statics.sh` (20): `echo`, `true`, `false`,
`printenv`, `expr-zero`, `expr-mul`, `basename`, `dirname`,
`stat-self`, `factor`, `seq`, `sha256sum`, `md5sum`, `uname-m`, `arch`,
`busybox-arch-subcommand`, `date-utc`, `id-u`, `nproc`,
`env-execve` -- statically linked Alpine BusyBox applets,
exercising the VZ ioctl gate, the `/proc/self/exe` redirect, high-VA
mmap, and the kbuf alias.

- `tests/test-rosetta-alpine.sh` (33): `cat-fruits-first-line`,
`wc-l-fruits`, `wc-l-lines`, `wc-c-lines`, `ls-data`, `stat-data`,
`find-by-name`, `du-sk-data`, `sha256-fruits`,
`sha256-lines-matches-host`, `sha512-lines`, `md5-fruits`,
`cksum-fruits`, `sort-first`, `sort-reverse-first`, `pipe-sort-wc`,
`pipe-tr-uppercase`, `pipe-cat-grep`, `pipe-sed-subst`,
`pipe-awk-field`, `head-n3`, `tail-n3`, `pipe-sort-uniq`,
`pipe-cut-field`, `pipe-rev`, `tac-reverse-first-line`, `seq-1-5`,
`seq-step`, `factor-prime`, `factor-composite`, `diff-identical`,
`diff-differs`, `pipe-base64-decode` -- broader file I/O, text
processing, and host-shell pipelines stitched through Rosetta on
every stage.

- `tests/test-rosetta-audit.sh` (2): `audit-known-limitations`,
`tls0-known-hang` -- bookkeeping probe that asserts the documented
Rosetta shadowing failures (above) remain the only divergences;
fails loudly if a new threading/signal-state edge case starts
diverging.

- `tests/test-rosetta-jit.sh` (2): `luajit-trace`,
`luajit-coroutine` -- guest-side JIT under translation
(LuaJIT trace emission + coroutine allocation), covering the
small-mprotect RW->RX and per-thread icache observation path that
Rosetta's own JIT does not exercise.

- `tests/test-rosetta-glibc.sh` (7): `glibc-hello`,
`glibc-hello-via-ldso`, `glibc-hello-list`, `glibc-dlopen`,
`glibc-tls`, `glibc-gdtls`, `glibc-pthread-tls` --
dynamically linked glibc x86_64 binary acceptance through
`--sysroot` against the staged minimal glibc rootfs under
`externals/test-fixtures/x86_64-glibc/rootfs/`. The first three
cover load-time `PT_INTERP` resolution and `ld.so --list`
introspection. `glibc-dlopen` runs `dlopen("libm.so.6")` plus a
`dlsym(sqrt)` round-trip to exercise the runtime fresh-`.so`-mmap
codepath, which is distinct from the load-time path the first
three probes touch. `glibc-tls` reads and writes two
initial-exec `__thread` variables (one integer, one pointer) so a
broken FS-register to `TPIDR_EL0` translation surfaces as a
value mismatch rather than as a silent skip. `glibc-gdtls`
`dlopen`s a companion `libgdtls.so` whose `__thread` variable
must use the general-dynamic model (calls `__tls_get_addr`);
this is the only probe that exercises that lowering path, which
the initial-exec probe cannot reach. `glibc-pthread-tls`
`pthread_create`s a worker thread that reads and writes its own
`__thread` slot; the probe asserts the worker saw its own
default value (not the main thread's overwritten marker) and that
the main thread's slot survives the worker's write, so a broken
per-thread `TPIDR_EL0` setup on additional threads surfaces as
isolation failure rather than as a silent crash.

Total: 71 expected passes, 0 expected failures.
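
The total can be re-derived from the per-suite counts in the inventory above. A quick sanity check, with variable names chosen here for illustration only:

```sh
# Per-suite expected pass counts, copied from the inventory list.
cli=4
failure_modes=3
statics=20
alpine=33
audit=2
jit=2
glibc=7

total=$(( cli + failure_modes + statics + alpine + audit + jit + glibc ))
echo "expected passes: $total"   # prints "expected passes: 71"
```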

### Per-Host Baseline Capture

The matrix runner keys its `elfuse-x86_64` baseline by detected host
SoC class. Two hardware classes matter because `sys_mmap_fixed_high_va`
takes different paths under different IPA widths; a third class is a
fallback for hosts the detector does not recognise:

- `apple-m1-m2`: 36-bit native IPA, exercises the overflow-segment
path. Captured on this codebase against Apple M1 hardware
(MacBookAir10,1). The seven sub-suites land at 71/0/0.

- `apple-m3-plus`: 40-bit native IPA, exercises the bisected-slab
path (and the M5 slab-bisection variant). Currently held equal to
`apple-m1-m2` pending operator capture on real M3+ hardware. When
that capture lands, only the
`EXPECTED_MIN_PASS[elfuse-x86_64:apple-m3-plus]` and
`EXPECTED_FAIL[elfuse-x86_64:apple-m3-plus]` entries in
`tests/test-matrix.sh` move; the M1/M2 row stays intact.

- `apple-unknown`: fallback for SoC brand strings the detector does
not recognise. Inherits the M1/M2 numbers and triggers a one-line
warning so a new SoC does not silently graft onto an existing row.

Class detection reads `sysctl -n machdep.cpu.brand_string` and matches
against `Apple M1`/`Apple M2` (M1/M2) and `Apple M3`/`Apple M4`/`Apple
M5` (M3+). To exercise the M3+ row from an M1/M2 host (and vice
versa) without changing the detector, set
`MATRIX_HOST_CLASS_OVERRIDE=apple-m3-plus` (or `apple-m1-m2`,
`apple-unknown`) before invoking `tests/test-matrix.sh`.
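
The detection and override rules described above can be sketched roughly as follows. The function names `classify_brand` and `host_class` are hypothetical, and the real detector in `tests/test-matrix.sh` may match differently (for example, it also emits a warning for the fallback class, which this sketch omits).

```sh
# Hypothetical sketch: map a CPU brand string to a baseline host class.
classify_brand() {
    case "$1" in
        *"Apple M1"*|*"Apple M2"*)              echo apple-m1-m2 ;;
        *"Apple M3"*|*"Apple M4"*|*"Apple M5"*) echo apple-m3-plus ;;
        *)                                      echo apple-unknown ;;
    esac
}

# Resolve the host class, letting an explicit override win over detection.
host_class() {
    if [ -n "${MATRIX_HOST_CLASS_OVERRIDE:-}" ]; then
        echo "$MATRIX_HOST_CLASS_OVERRIDE"
        return
    fi
    # If sysctl fails or prints nothing, the empty string falls through
    # to the apple-unknown branch.
    classify_brand "$(sysctl -n machdep.cpu.brand_string 2>/dev/null)"
}
```

With these definitions, `MATRIX_HOST_CLASS_OVERRIDE=apple-m3-plus host_class` reports the M3+ class regardless of the hardware it runs on.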

When the seven sub-suites grow or trim a test, the per-sub-suite
counts in the comment block above `EXPECTED_MIN_PASS` and the
inventory list above must move in the same commit so the per-host
baseline stays in sync with reality.

## Test Inventory

The repository contains several layers of validation:
4 changes: 2 additions & 2 deletions mk/common.mk
@@ -1,4 +1,4 @@
# mk/common.mk Generic build rules
# mk/common.mk -- Generic build rules
#
# Per-file compilation with automatic dependency tracking, verbosity
# control, and kernel-style build output. Inspired by libiui's build
@@ -18,7 +18,7 @@ $(BUILD_DIR):
# Automatic header dependency generation (-MMD -MP)
DEPFLAGS = -MMD -MP -MF $(BUILD_DIR)/$(subst /,_,$*).d

# Pattern rules source to object.
# Pattern rules -- source to object.
# GENERATED_HEADERS are order-only prerequisites so clean builds have the
# build-generated includes available before compilation. .d files track the
# real header dependencies after the first compile. Generators whose output
8 changes: 7 additions & 1 deletion mk/config.mk
@@ -19,13 +19,19 @@ NATIVE_TESTS := tests/test-multi-vcpu.c tests/test-rwx.c
SPECIAL_TEST_SRCS := tests/test-lowbase-mem.c
SPECIAL_TEST_BINS := $(BUILD_DIR)/test-lowbase-mem-200000 $(BUILD_DIR)/test-lowbase-mem-300000

# x86_64-only sources that back the vendored Rosetta fixtures in
# tests/fixtures/rosetta/. They are not buildable with the aarch64
# cross-toolchain and would fail link with undefined dlopen/pthread
# symbols even if compiled, so exclude them from the aarch64 glob.
ROSETTA_X86_64_SRCS := $(wildcard tests/x86_64-glibc-*.c tests/x86_64-rosetta-*.c)

ifdef GUEST_TEST_BINARIES
TEST_DIR := $(GUEST_TEST_BINARIES)/bin
TEST_DEPS :=
TEST_HELLO_DEP :=
else
TEST_DIR := $(BUILD_DIR)
TEST_C_SRCS := $(filter-out $(NATIVE_TESTS) $(SPECIAL_TEST_SRCS),$(wildcard tests/*.c))
TEST_C_SRCS := $(filter-out $(NATIVE_TESTS) $(SPECIAL_TEST_SRCS) $(ROSETTA_X86_64_SRCS),$(wildcard tests/*.c))
TEST_C_BINS := $(patsubst tests/%.c,$(BUILD_DIR)/%,$(TEST_C_SRCS))
TEST_DEPS := $(BUILD_DIR)/test-hello $(TEST_C_BINS) $(SPECIAL_TEST_BINS)
TEST_HELLO_DEP := $(BUILD_DIR)/test-hello
2 changes: 1 addition & 1 deletion mk/shim.mk
@@ -28,7 +28,7 @@ $(BUILD_DIR)/shim_blob.h: $(BUILD_DIR)/shim.bin
cmp -s "$$tmp" "$@" 2>/dev/null || mv "$$tmp" "$@"; \
rm -f "$$tmp"

# Version header regenerates when HEAD or index changes.
# Version header -- regenerates when HEAD or index changes.
# cmp trick avoids unnecessary rebuilds when version string is unchanged.
VERSION_DEPS := $(wildcard .git/HEAD .git/index) mk/config.mk
$(BUILD_DIR)/version.h: $(VERSION_DEPS) | $(BUILD_DIR)