Support wakeup watermark for perf ring buffers #276

Open
utpilla wants to merge 1 commit into microsoft:main from utpilla:utpilla/Add-wakeup-watermark

@utpilla utpilla commented May 20, 2026

Closes part of #254 (first of two PRs; the follow-up will expose the per-CPU perf fds so a consumer can actually wait on them via epoll(2) / tokio::io::unix::AsyncFd).

Changes

Adds an opt-in RingBufSessionBuilder::with_wakeup_watermark(bytes: u32) that configures the kernel to defer marking each per-CPU perf fd readable until at least bytes of data have accumulated in its ring buffer.

Motivation

The kernel default is to mark the perf fd readable on every event. For a future epoll / AsyncFd-driven consumer this means one wakeup per record, which is expensive under load. The watermark lets the caller trade some latency for fewer wakeups by amortizing each wakeup over a batch of records.

  • New helper ring_data_bytes(page_count) in perf_event/rb/mod.rs computes the per-CPU ring data area in bytes (next_power_of_two(page_count) * PAGE_SIZE). This is the actual bound the watermark must stay under; making it a named function gives us a single source of truth and lets the unit tests reference exactly the same value the builder uses.
  • New RingBufBuilder<T>::set_wakeup_watermark(bytes) (crate-visible) sets FLAG_WATERMARK and wakeup_events_watermark on the underlying perf_event_attr.
  • New public RingBufSessionBuilder::with_wakeup_watermark(bytes) stores the requested value on the builder. It is threaded through every existing rebuild block alongside pages / target_pids / target_cpus.
  • RingBufSessionBuilder::build() applies the watermark to the kernel/leader perf_event_attr only. The leader fd is the one the kernel wakes when data lands in its per-CPU ring; all other event types (tracepoint, profiling, context-switch, page-fault, BPF) redirect their output into the same ring via PERF_EVENT_IOC_SET_OUTPUT, so the leader's watermark covers them.
  • build() validates that bytes < ring_data_bytes(self.pages) and returns Err(io_error(...)) otherwise. The check is necessary because the kernel does not validate the watermark at perf_event_open() time; if the watermark cannot be reached, the kernel silently never wakes the fd. The kernel's wakeup check is head - rb->wakeup > rb->watermark, and head - rb->wakeup is bounded above by the data area size, so a watermark >= that size can never trigger a wakeup. A consumer waiting on the fd would then stall while the kernel drops records and reports them as PERF_RECORD_LOST.
  • CpuRingBuf::create_reader is refactored to use the new helper (ring_data_bytes(page_count) + page_size for the mmap length, the + page_size being the perf metadata page). The previous expression page_count.next_power_of_two() + 1 (pages, with the metadata page folded into the count) computes the same total length; the refactor just expresses it in terms of the shared helper.
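
The size arithmetic and the stall described above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: ring_data_bytes takes an explicit page_size here (the PR's helper takes only page_count), and ToyRing, simulate, mmap_len and mmap_len_old are hypothetical names. The toy ring assumes a non-overwrite buffer, a consumer that drains everything but only when woken, and a wakeup mark that advances by one watermark per wakeup.

```rust
// Illustrative sketch only; names other than ring_data_bytes are hypothetical.

/// Bytes in the per-CPU data area: page count rounded up to a power of two,
/// times the page size. `page_size` is a parameter here (an assumption) so the
/// sketch does not need to query the system.
fn ring_data_bytes(page_count: usize, page_size: usize) -> usize {
    page_count.next_power_of_two() * page_size
}

/// mmap length after the refactor: data area plus one metadata page.
fn mmap_len(page_count: usize, page_size: usize) -> usize {
    ring_data_bytes(page_count, page_size) + page_size
}

/// mmap length as previously expressed: (data pages + 1 metadata page) pages.
fn mmap_len_old(page_count: usize, page_size: usize) -> usize {
    (page_count.next_power_of_two() + 1) * page_size
}

/// Toy model of one ring. It is not kernel code; it only mirrors the
/// `head - wakeup > watermark` comparison quoted in the description.
struct ToyRing {
    size: u64,      // data area size in bytes
    watermark: u64, // requested wakeup watermark in bytes
    head: u64,      // total bytes written so far
    tail: u64,      // total bytes consumed so far
    wakeup: u64,    // position of the last wakeup mark
    wakeups: u64,   // number of times the consumer was woken
    lost: u64,      // records dropped because the ring was full
}

impl ToyRing {
    fn new(size: u64, watermark: u64) -> Self {
        ToyRing { size, watermark, head: 0, tail: 0, wakeup: 0, wakeups: 0, lost: 0 }
    }

    fn write(&mut self, len: u64) {
        // Non-overwrite mode: a record that does not fit is dropped.
        if self.head - self.tail + len > self.size {
            self.lost += 1;
            return;
        }
        self.head += len;
        if self.head - self.wakeup > self.watermark {
            self.wakeup += self.watermark;
            self.wakeups += 1;
            // Assumption: the woken consumer drains the whole ring.
            self.tail = self.head;
        }
    }
}

/// Writes `records` records of `record_len` bytes; returns (wakeups, lost).
fn simulate(size: u64, watermark: u64, records: u64, record_len: u64) -> (u64, u64) {
    let mut ring = ToyRing::new(size, watermark);
    for _ in 0..records {
        ring.write(record_len);
    }
    (ring.wakeups, ring.lost)
}
```

Under this model, simulate(8192, 8192, 1000, 64) accepts the first 128 records, never wakes the consumer, and counts every later record as lost, while simulate(8192, 4096, 1000, 64) wakes the consumer periodically and loses nothing. mmap_len and mmap_len_old return the same value for any input, which is the equivalence the create_reader refactor relies on.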

Tests

Three unit tests on the boundary, all using ring_data_bytes(pages) directly so they are correct on any page size:

  • wakeup_watermark_rejected_when_equal_to_ring_size - bytes == ring_data_bytes must error.
  • wakeup_watermark_rejected_when_above_ring_size - bytes == ring_data_bytes + 1 must error.
  • wakeup_watermark_accepted_when_below_ring_size - bytes == ring_data_bytes - 1 must pass builder validation (the subsequent perf_event_open() may still fail in test environments without the relevant privileges; the test asserts only that the failure, if any, is not the watermark validation error).
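
For readers who want to see the boundary in isolation, the sketch below reproduces the validation rule against a stand-in builder. SketchBuilder, its new constructor, the fixed PAGE_SIZE of 4096, and the InvalidInput error kind are all assumptions made for illustration; the real RingBufSessionBuilder opens perf events in build() and uses the crate's own io_error helper.

```rust
use std::io;

// Assumption: a fixed page size keeps the sketch self-contained.
const PAGE_SIZE: usize = 4096;

/// Per-CPU ring data area in bytes, as described in the PR text.
fn ring_data_bytes(page_count: usize) -> usize {
    page_count.next_power_of_two() * PAGE_SIZE
}

/// Hypothetical stand-in for the session builder. It models only the
/// page count and the optional watermark.
struct SketchBuilder {
    pages: usize,
    wakeup_watermark: Option<u32>,
}

impl SketchBuilder {
    fn new(pages: usize) -> Self {
        SketchBuilder { pages, wakeup_watermark: None }
    }

    /// Opt-in: stores the requested watermark; nothing is validated yet.
    fn with_wakeup_watermark(mut self, bytes: u32) -> Self {
        self.wakeup_watermark = Some(bytes);
        self
    }

    /// Returns the validated watermark instead of a real session.
    /// Rejects any watermark that is not strictly below the ring size.
    fn build(self) -> io::Result<Option<u32>> {
        if let Some(bytes) = self.wakeup_watermark {
            let limit = ring_data_bytes(self.pages);
            if bytes as usize >= limit {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidInput,
                    format!("wakeup watermark {bytes} must be below ring size {limit}"),
                ));
            }
        }
        Ok(self.wakeup_watermark)
    }
}
```

With this stand-in, a watermark equal to ring_data_bytes(pages) or one byte above it is rejected, one byte below it is accepted, and a builder with no watermark set passes through untouched, which matches the opt-in behaviour described in the Changes section.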
