Speedup vDSO CNTVCT and amortized urandom #48

Open: jserv wants to merge 1 commit into main from perf

@jserv (Contributor) commented May 27, 2026

vDSO clock_gettime drops from a 1256 ns SVC trap to 2.5 ns via a CNTVCT-based fast path (493x speedup, 20x below the 50 ns design target). The trampoline emits a 28-instruction A64 sequence that reads CNTVCT_EL0, LDAR-acquires the vvar initialized flag, and interpolates the wall clock from the anchor as delta * 125 / 3 (Apple Silicon CNTFRQ = 24 MHz, so one tick is 125/3 ns), falling back to SVC on the first call or on CNTVCT regression. The first SVC seeds the vvar via a three-state CAS (0 -> 2 -> 1) so concurrent first calls cannot tear the anchor fields. The seed is gated on ELR_EL1 matching the trampoline's svc_fallback PC so that an unrelated raw clock_gettime syscall cannot poison the anchor from an arbitrary X9 value.

/dev/urandom 1-byte reads drop from 5688 ns uncached to 2054 ns (2.77x) via a new per-fd entropy cache: a 4 KiB buffer per FD_URANDOM slot, refilled by arc4random_buf. The cache is zeroed on close via a type-to-cleanup registry that also closes pre-existing dup and fork-state race windows for every synthetic fd type.
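A rough sketch of the amortization, under stated assumptions: the `struct urandom_cache` layout, the `urandom_read` and `urandom_cache_wipe` names, and the injected `fill_fn` callback are hypothetical. The PR refills with arc4random_buf, which has the same `(void *, size_t)` shape; the deterministic `demo_fill` below is only a stand-in so the sketch builds on hosts without it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define URANDOM_CACHE_SIZE 4096  /* 4 KiB, as stated in the description */

/* Refill callback; arc4random_buf matches this signature. */
typedef void (*fill_fn)(void *buf, size_t len);

/* Hypothetical per-fd cache state. */
struct urandom_cache {
    uint8_t buf[URANDOM_CACHE_SIZE];
    size_t pos;        /* next unread byte */
    size_t avail;      /* unread bytes left in buf */
    unsigned refills;  /* refill counter, kept only for illustration */
};

/* Deterministic stand-in for the real generator. NOT a random source. */
static void demo_fill(void *buf, size_t len)
{
    memset(buf, 0xA5, len);
}

/* Serve a read from the cache, refilling only when it runs dry, so the
 * cost of one refill is spread over up to 4096 one-byte reads. */
static void urandom_read(struct urandom_cache *c, fill_fn fill,
                         void *dst, size_t len)
{
    assert(c != NULL && fill != NULL && dst != NULL);
    uint8_t *out = dst;
    while (len > 0) {
        if (c->avail == 0) {
            fill(c->buf, sizeof c->buf);
            c->pos = 0;
            c->avail = sizeof c->buf;
            c->refills++;
        }
        size_t n = len < c->avail ? len : c->avail;
        memcpy(out, c->buf + c->pos, n);
        c->pos += n;
        c->avail -= n;
        out += n;
        len -= n;
    }
}

/* Cleanup hook for close: wipe through a volatile pointer so the compiler
 * cannot drop the stores as dead writes. */
static void urandom_cache_wipe(struct urandom_cache *c)
{
    volatile uint8_t *p = (volatile uint8_t *)c;
    for (size_t i = 0; i < sizeof *c; i++)
        p[i] = 0;
}
```

In this model 4096 consecutive 1-byte reads trigger a single refill, and the next read triggers the second.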

eventfd dup shares state across aliases per the Linux contract (a refcounted slot plus an eventfd_owner[FD_TABLE_SIZE] table). The dup path holds fd_lock and sfd_lock together for the bind commit so a racing close cannot leak the refcount; the source identity is pinned via a snapshotted host fd so a racing close-and-rebind of the source cannot bind to the wrong slot. tests/test-eventfd-dup pins the shared-state contract.
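The shared-state contract can be modelled as a refcounted slot pool indexed through an owner table. This is a single-threaded sketch: the fd_lock / sfd_lock pairing and the host-fd snapshot described above are deliberately left out, and the table sizes, the `struct eventfd_slot` layout, and every helper name other than `eventfd_owner` and `FD_TABLE_SIZE` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define FD_TABLE_SIZE 64  /* illustrative; the real value is not stated */
#define EVENTFD_SLOTS 8   /* hypothetical pool size */
#define NO_SLOT (-1)

struct eventfd_slot {
    uint64_t counter;  /* shared by every alias of the eventfd */
    int refcount;      /* guest fds bound to this slot; 0 means free */
};

static struct eventfd_slot slots[EVENTFD_SLOTS];
static int eventfd_owner[FD_TABLE_SIZE];  /* guest fd -> slot index */

static void eventfd_table_init(void)
{
    for (int i = 0; i < EVENTFD_SLOTS; i++)
        slots[i] = (struct eventfd_slot){0, 0};
    for (int fd = 0; fd < FD_TABLE_SIZE; fd++)
        eventfd_owner[fd] = NO_SLOT;
}

static int slot_of(int fd)
{
    return (fd >= 0 && fd < FD_TABLE_SIZE) ? eventfd_owner[fd] : NO_SLOT;
}

/* Bind a fresh slot to fd. Returns the slot index, or -1 on failure. */
static int eventfd_bind_new(int fd)
{
    if (fd < 0 || fd >= FD_TABLE_SIZE || eventfd_owner[fd] != NO_SLOT)
        return -1;
    for (int i = 0; i < EVENTFD_SLOTS; i++) {
        if (slots[i].refcount == 0) {
            slots[i].counter = 0;
            slots[i].refcount = 1;
            eventfd_owner[fd] = i;
            return i;
        }
    }
    return -1;
}

/* dup: the new fd aliases the same slot and takes one reference. */
static int eventfd_dup(int oldfd, int newfd)
{
    int s = slot_of(oldfd);
    if (s == NO_SLOT || newfd < 0 || newfd >= FD_TABLE_SIZE ||
        eventfd_owner[newfd] != NO_SLOT)
        return -1;
    slots[s].refcount++;
    eventfd_owner[newfd] = s;
    return 0;
}

/* close: drop one reference; the slot becomes free when none remain. */
static void eventfd_close(int fd)
{
    int s = slot_of(fd);
    if (s == NO_SLOT)
        return;
    eventfd_owner[fd] = NO_SLOT;
    assert(slots[s].refcount > 0);
    slots[s].refcount--;
}

static int eventfd_add(int fd, uint64_t v)
{
    int s = slot_of(fd);
    if (s == NO_SLOT)
        return -1;
    slots[s].counter += v;
    return 0;
}

static uint64_t eventfd_peek(int fd)
{
    int s = slot_of(fd);
    return s == NO_SLOT ? 0 : slots[s].counter;
}
```

A write through one alias is visible through the other, and closing the original fd leaves the duplicate usable until the last reference is dropped.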

fork_ipc_send_fd_table filters eventfd, signalfd, timerfd, inotify, netlink, pidfd, and epoll out of the SCM_RIGHTS payload. macOS rejects kqueue fds across SCM_RIGHTS and per-class side-table state is not transferable, so a clean drop is the only honest contract. tests/test-fork-synthetic-fd pins it.
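The filter itself amounts to a type check applied before the payload is built. In the sketch below the `enum fd_type` names, `struct fd_entry`, and both function names are hypothetical; only the list of dropped classes comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fd classes; the real enum is not shown in the PR text. */
enum fd_type {
    FD_REGULAR, FD_PIPE, FD_SOCKET,
    FD_EVENTFD, FD_SIGNALFD, FD_TIMERFD, FD_INOTIFY,
    FD_NETLINK, FD_PIDFD, FD_EPOLL
};

struct fd_entry {
    int guest_fd;
    enum fd_type type;
};

/* Synthetic classes carry side-table state that cannot follow the host fd
 * across SCM_RIGHTS, so they are dropped rather than sent half-initialized. */
static bool fd_survives_fork(enum fd_type t)
{
    switch (t) {
    case FD_EVENTFD:
    case FD_SIGNALFD:
    case FD_TIMERFD:
    case FD_INOTIFY:
    case FD_NETLINK:
    case FD_PIDFD:
    case FD_EPOLL:
        return false;
    default:
        return true;
    }
}

/* Copy the transferable entries into out (capacity >= n) and return how
 * many were kept. Entry order is preserved. */
static size_t filter_fork_fds(const struct fd_entry *in, size_t n,
                              struct fd_entry *out)
{
    assert(in != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (fd_survives_fork(in[i].type))
            out[kept++] = in[i];
    }
    return kept;
}
```

Any guest fd missing from the filtered list is simply absent from the child's table, so the child gets EBADF on it and has to recreate the object.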

Startup decomposition: ELFUSE_STARTUP_TRACE=1 emits per-step wall time for VM bring-up (17 steps on test-hello, dominated by hv_vcpu_create and guest_init at roughly 0.9 ms each). Zero overhead when unset.
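One way such an environment-gated trace could look is sketched here. Everything except the ELFUSE_STARTUP_TRACE variable name is an assumption: the helper names, the output format, and the use of C11 `timespec_get` (chosen for portability; a monotonic clock would be the safer choice in real code). The environment is read once and cached, so the disabled path costs one branch per step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* -1 = environment not checked yet, 0 = tracing off, 1 = tracing on. */
static int trace_state = -1;
static uint64_t trace_last_ns;

/* Hypothetical rule: only the exact value "1" enables tracing. */
static bool trace_value_enables(const char *v)
{
    return v != NULL && strcmp(v, "1") == 0;
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call after each bring-up step. Prints the time elapsed since the
 * previous call and returns true if a line was emitted. */
static bool startup_trace_step(const char *step)
{
    if (trace_state < 0) {
        /* First call: read the environment once and cache the answer. */
        trace_state = trace_value_enables(getenv("ELFUSE_STARTUP_TRACE"));
        trace_last_ns = now_ns();
    }
    if (!trace_state)
        return false;  /* disabled: nothing but this branch */
    uint64_t t = now_ns();
    fprintf(stderr, "[startup] %-20s %10.3f ms\n", step,
            (double)(t - trace_last_ns) / 1e6);
    trace_last_ns = t;
    return true;
}
```

A caller would place `startup_trace_step("hv_vcpu_create")` right after the corresponding step, and likewise for each later step.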


Summary by cubic

Adds a CNTVCT-based vDSO fast path for clock_gettime (~500x faster) and a per‑fd /dev/urandom cache (2.7x faster 1‑byte reads). Also fixes eventfd dup to share state, filters non‑transferable synthetic FDs on fork, and adds an optional startup timing trace.

  • New Features

    • vDSO: versioned ELF with a CNTVCT_EL0 fast path for clock_gettime; seeds on first SVC with a safe fallback; exports __kernel_* symbols and updates the signal trampoline offset; resolves with glibc/musl.
    • /dev/urandom: new FD_URANDOM with a 4 KiB per‑fd cache refilled via arc4random_buf; faster small reads; cache resets on close/dup/fork.
    • Startup tracing: set ELFUSE_STARTUP_TRACE=1 to print per‑step VM bring‑up times; zero overhead when unset.
  • Bug Fixes

    • eventfd dup now shares counter/readiness across aliases (race‑free bind and refcounting); adds a test to pin the contract.
    • Fork behavior: drop eventfd, signalfd, timerfd, inotify, netlink, pidfd, and epoll from SCM_RIGHTS; the child sees EBADF and recreates them (prevents half‑state and host‑fd leaks).
    • Central FD cleanup registry for synthetic types (fuse, inotify, netlink, pidfd, timerfd, urandom), with cleanup installed atomically to close dup/fork race windows.

Written for commit a24fc53. Summary will update on new commits.

@jserv jserv requested a review from Max042004 May 27, 2026 05:55
cubic-dev-ai[bot]

This comment was marked as resolved.
