
Harden Rosetta runtime path for dynamic linking#43

Open
jserv wants to merge 7 commits into main from rosetta

jserv (Contributor) commented May 26, 2026

The static-binary path through Apple Rosetta has been in tree, but the dynamic-linker bring-up, the fault-rollback story for high-VA fixed mmap, and the crash diagnostics under translation were all left at "good enough for happy path".


Summary by cubic

Enables dynamic linking for x86_64 guests under Rosetta, adds safe rollback for high-VA MAP_FIXED replacements, and improves crash diagnostics. Also introduces a self-contained Rosetta acceptance matrix with glibc/TLS/JIT probes and vendored fixtures, and refines /proc/self/maps preannounce handling.

  • New Features

    • Dynamic linker path for Rosetta with high-VA mapping and rollback safeguards.
    • Preannounced regions used only for /proc/self/maps to avoid MAP_FIXED_NOREPLACE conflicts.
    • Crash reports include a dedicated "Rosetta" section for stable parsing.
    • Rosetta test matrix: new suites (CLI gates, statics, Alpine, audit, JIT, glibc), vendored x86_64 fixtures, and make targets (test-rosetta-*, test-rosetta-all). Translator path can be overridden via MATRIX_ROSETTA_TRANSLATOR.
  • Bug Fixes

    • Correct MAP_FIXED replace-over-GPA-0 detection; snapshot/restore original bytes with a bounded buffer and vCPU quiesce during commit.
    • Distinguish EOF from I/O errors in pread/pwrite loops; reuse the range-reader in MADV_DONTNEED.
    • Limit synthetic FUTEX_WAIT EINTR to multi-threaded guests to preserve single-thread startup behavior.
    • Carry Rosetta preannounce/proc state across fork IPC and expose the translated image via /proc/self/exe.
    • /proc/self/maps preannounce: hide entries only when fully covered by live regions; partial coverage stays visible. Clarified that no producer currently emits preannounce entries.

Written for commit 5720c2e. Summary will update on new commits.

jserv added 6 commits May 26, 2026 08:06
This applies a uniform sweep so source comments stay strictly ASCII.
Substitutions applied on comment lines only:
- em-dash to --
- en-dash to -
- multiplication sign to x
- right arrow to ->
- left arrow to <-
- approximately-equal to ~=
- backticks around inline tokens to single quotes
- C++ // comments to /* */ blocks
- British spellings (behaviour, initialise, initialised) to American

The C comment-shape rule is also satisfied: every multi-line block has
its first sentence on the opening line, continues with ' *', and ends
with the closing delimiter on its own line. shfmt and clang-format
round-trip clean over the touched files.
The synthetic ~1s EINTR injection on indefinite futex_wait is now
gated on thread_is_single_active() being false. The single-threaded
probe in tests/test-futex-pi.c (test_futex_eintr) was written when
the injection fired unconditionally. It now hangs forever, because
the gate leaves a single-threaded guest parked on an untimed wait
until a real wake arrives, which is the documented contract for
glibc startup paths.

Update the probe to match the new contract: spawn a sibling thread
that parks on a timed FUTEX_WAIT (the has_timeout=true branch is
exempt from the EINTR injection that would otherwise wake it
spuriously), run the parent's untimed wait while that sibling is
alive, and tear the sibling down via keepalive-flip + futex_wake once
the parent's wait returns. The 5-second sibling timeout comfortably
outlasts the parent's EINTR window of 1s plus 100ms of jitter.

The probe's assertion (-EINTR after 800ms-3000ms) is unchanged. The
new scaffolding is the only thing required to keep the test green
against the runtime contract that was tightened in the matching
src/runtime/futex.c hunk.
The per-region file-restore loop inside sys_madvise(MADV_DONTNEED)
duplicated the exact EINTR-retry + EOF-tolerant pread shape that
read_file_range_to_guest already provides. Call the helper and drop
the inline 17-line copy; EOF is still acceptable (helper returns 0
without populating the tail, and the surrounding memset already
zeroed the range), and host I/O failure still surfaces as the same
linux_errno() value.
The static-binary path through Apple Rosetta has been in tree, but the
dynamic-linker bring-up, the fault-rollback story for high-VA fixed
mmap, and the crash diagnostics under translation were all left at
"good enough for happy path". This change closes the gaps the recent
multi-model review surfaced:

src/syscall/mem.c (sys_mmap_high_va)
- Replacing an existing mapping over a backing GPA of 0 is now
  correct. The previous proxy of `replaced_gpa_base != 0` mis-classified
  GPA-0-backed regions as fresh allocations, so the byte snapshot,
  region remove, and rollback restore were silently skipped. An
  explicit `replacing_existing` boolean is set when the replaceable-
  region capture succeeds and is used everywhere the old proxy was.
- Snapshot the original bytes before the destructive memset/pread on
  the reused backing so a later guest_install_va_pages or
  guest_region_add_ex_owned_gpa failure restores the guest's old
  mapping verbatim instead of leaving it at zero or partial-file. The
  snapshot is freed unconditionally on every return path.
- Quiesce sibling vCPUs around the destructive write + commit window
  using the same primitive the overlay path uses, so concurrent
  guest readers cannot observe transient zeros, partial file
  contents, or rollback bytes.
- Cap the snapshot allocation at 256 MiB so a guest cannot force an
  arbitrary-size host malloc through a multi-GiB MAP_FIXED
  replacement. The fresh-allocation path is unconstrained; only the
  reuse branch needs the bounded host buffer.
- pread loops on the file-backed branches now distinguish EOF from
  I/O error. The previous "partial read is fine" swallow returned a
  successful mapping with a zero-filled tail even when the host
  read failed mid-stream; the new path rolls back and surfaces the
  errno to the guest. The fix is applied uniformly to every pread
  site in mem.c (low-VA fixed populate, low-VA non-fixed file-backed,
  refresh_shared_region_range, and the MADV_DONTNEED restore), and
  the pwrite loop in pwrite_all_at gains the same EINTR retry.

src/debug/crashreport.c
- The (via Apple Rosetta) breadcrumb now sits under its own
  ## Rosetta section header so any downstream parser keyed on
  ## Registers being the first line of the register section keeps
  working. The PC, ELR, and TPIDR_EL0 values are unchanged.

src/runtime/futex.c
- The synthetic ~1s EINTR injection on indefinite futex_wait is
  narrowed to multi-threaded guests. Single-threaded glibc startup
  paths can legitimately park in FUTEX_WAIT forever until a real
  wake arrives, and the prior unconditional injection broke that
  contract. The injection is now gated on thread_is_single_active()
  returning false.

src/core/{bootstrap,guest,guest.h}, src/runtime/{fork-state,fork-state.h,forkipc,procemu}
- Adjacent runtime adjustments that the polish above needs: a
  preannounce table consulted only by /proc/self/maps formatting (so
  guest MAP_FIXED_NOREPLACE over an advertise-only range does not
  trip -EEXIST), fork-IPC carriage of the Rosetta-specific path-
  publication state, and the /proc/self/exe redirection that exposes
  the Rosetta translator as the running image under binfmt-misc
  conventions.
Wire the elfuse-x86_64 mode in tests/test-matrix.sh into a first-class
matrix branch that aggregates seven Rosetta acceptance sub-suites,
gates against a per-host-class baseline, and runs self-contained
against vendored fixtures (no SSH to a build host required at test
time).

Matrix runner (tests/test-matrix.sh)
- New detect_x86_64_host_class() reads machdep.cpu.brand_string and
  maps to apple-m1-m2 (36-bit IPA, overflow-segment path),
  apple-m3-plus (40-bit IPA, bisected-slab on M5), or apple-unknown.
- Composite assoc-array keys (elfuse-x86_64:apple-m1-m2, etc.) carry
  per-host expected counts. Every subscript is explicitly quoted so
  shfmt cannot rewrite [a-b] into [a - b] and silently break the gate.
- MATRIX_HOST_CLASS_OVERRIDE env var lets one class exercise another's
  row for testing; the override is validated at script entry so a
  typo exits 2 before any sub-suite runs instead of falling through
  to a no-op gate.
- run_summary_suite aggregates per-sub-suite Results lines into the
  unified pass/fail/skip counters; arithmetic is hardened against
  empty/malformed sub-suite output and forces decimal so a sub-suite
  emitting 08 or 09 does not abort the matrix under set -e.
- ensure_x86_fixtures checks for staticbin/busybox, dyn-bin/,
  ld-musl-x86_64.so.1, and luajit so a stale partial cache triggers
  a re-fetch.
- The mode skips cleanly (suite-level) when the Rosetta translator
  is absent, so make test-matrix all stays green on non-Rosetta
  hosts.

Shared reporter (tests/lib/rosetta-test.sh)
- Sources tests/lib/test-runner.sh and exposes report_pass /
  report_fail / report_skip + report_summary so per-binary output
  across the seven sub-suites matches the aarch64 modes'
  LABEL [ OK ] format. A require_timeout helper centralizes the
  macOS gtimeout fallback and the exit-77 skip when neither
  timeout(1) nor gtimeout is on PATH.

Sub-suites
- test-rosetta-cli.sh, -failure-modes, -statics, -alpine reshape onto
  the shared lib and drop their per-script color/report helpers.
  failure-modes is trimmed to the three command-line gates
  (no-rosetta-flag, no-rosetta-env, gdb-x86_64) since the
  dynamic-linker and execve-re-bootstrap probes are now covered by
  glibc-hello and env-execve respectively against the vendored
  fixture tree.
- New test-rosetta-audit.sh asserts the documented Rosetta
  limitations (SA_RESETHAND not reset, CLONE_SETTLS tls=0 hang) are
  still the only divergences from the upstream hyper-linux audit.
- New test-rosetta-jit.sh exercises LuaJIT trace emission and
  coroutine allocation under translation.
- New test-rosetta-glibc.sh runs the seven glibc dynamic-acceptance
  probes against a vendored minimal glibc rootfs: hello,
  hello-via-ldso, hello-list (load-time PT_INTERP + ld.so
  introspection), dlopen of libm.so.6 plus dlsym sqrt round-trip,
  initial-exec TLS through FS-to-TPIDR_EL0, general-dynamic TLS via
  dlopen + __tls_get_addr against a companion libgdtls.so, and
  pthread per-thread TLS isolation.

Vendored fixtures (tests/fixtures/rosetta/)
- x86_64-rosetta-audit and x86_64-rosetta-tls0 static ELFs built
  from the new x86_64-rosetta-*.c sources.
- x86_64-glibc-rootfs.tar.gz containing the seven probe binaries
  plus ld-linux-x86-64.so.2, libc.so.6, libm.so.6, and libgdtls.so.
- README.md documents the rebuild recipe (out-of-tree, on an
  x86_64 Linux host) since these binaries cannot be cross-compiled
  with the aarch64 toolchain the in-tree Makefile targets.

Probe sources (tests/x86_64-*.c)
- x86_64-glibc-hello.c, -dlopen.c, -tls.c, -gdtls-lib.c, -gdtls.c,
  -pthread-tls.c stage the glibc dynamic-linking + TLS surface.
- x86_64-rosetta-audit.c and -rosetta-tls0.c stage the threading +
  signal-shadowing audit and the CLONE_SETTLS-tls=0 hang reproducer.
- gdtls-lib.c pins the TLS model to global-dynamic via
  __attribute__((tls_model("global-dynamic"))) so the probe
  actually exercises the GD lowering path rather than letting the
  compiler relax to local-dynamic.

Build system
- mk/config.mk filters tests/x86_64-*.c out of the aarch64 cross-
  compile glob since those sources need -ldl and -pthread links the
  cross-toolchain does not carry, and they back the vendored
  fixtures rather than the in-tree test corpus.
- mk/tests.mk grows test-rosetta-all and the per-suite targets.
- tests/fetch-fixtures.sh: linux-virt pin bumped from 6.12.90-r0 to
  6.12.91-r0 because the older version was rotated off the Alpine
  CDN, which had been silently breaking INCLUDE_X86_64=1 fetches.
- tests/bench-rosetta.sh: ROSETTA_PATH honors MATRIX_ROSETTA_TRANSLATOR
  so the bench runs through the same translator probe override the
  rest of the rosetta suite uses.
The source-comment side was swept in the prior style commit; carry the
same convention through the tracked Markdown docs so the project ships
uniform ASCII typography across both the .c/.h/.S sources and the
README plus the docs it links to.
@jserv jserv requested a review from Max042004 May 26, 2026 00:15
jserv (Contributor, Author) commented May 26, 2026

@devarajabc: Please help validate on Apple M2- and M5-based machines.
@Max042004: Please help validate on Apple M4-based machines.

Steps:

make distclean
make test-rosetta-all

cubic-dev-ai[bot]

This comment was marked as resolved.

The elfuse-x86_64 inventory total disagreed with its captured baseline,
guest_preannounce sat in tree without a caller and without an explanation
of why, and the /proc/self/maps shadow predicate fired on any overlap
while its comment claimed full coverage. Each item is fixed at the
smallest defensible scope.

src/runtime/procemu.c (build_proc_self_maps preannounce path)
- Replace the "any overlap suppresses" check with a union-coverage walk
  over the sorted regions[] table. A preannounced entry is shadowed
  only when the union of live regions ends at or past r->end without
  leaving an intermediate gap; partial coverage keeps the advertise
  entry visible so reserved-but-not-realized spans stay observable in
  /proc/self/maps. The walk runs in one O(nregions) pass per
  preannounce row, tracking a covered_end cursor, breaking on the
  first gap and skipping entries that fall entirely behind the cursor.
  Comment rewritten so the predicate, the gap case, and the
  split-VMA mirroring are explicit.

src/core/guest.h (guest_preannounce docstring)
- Document that no producer wires this hook up today. The storage,
  fork-IPC, and consumer plumbing are retained as scaffolding for
  future runtimes that consult /proc/self/maps before reserving VA
  ranges; preannouncing the x86_64 image during Rosetta bring-up was
  tried and rejected because it perturbed the translator's internal
  allocation tracker. A future reader (or review bot) now sees the
  scaffolding posture at the API surface and does not need to chase
  the rationale through history.