Support wakeup watermark for perf ring buffers #276
Open
utpilla wants to merge 1 commit into
Closes part of #254 (first of two PRs; the follow-up will expose the per-CPU perf fds so a consumer can actually wait on them via `epoll(2)` / `tokio::io::unix::AsyncFd`).

Changes
Adds an opt-in `RingBufSessionBuilder::with_wakeup_watermark(bytes: u32)` that configures the kernel to defer marking each per-CPU perf fd readable until at least `bytes` of data have accumulated in its ring buffer.

Motivation
The kernel default is to mark the perf fd readable on every event. For a future epoll / `AsyncFd`-driven consumer this means one wakeup per record, which is expensive under load. The watermark lets the caller amortize wakeup cost against latency.

- `ring_data_bytes(page_count)` in `perf_event/rb/mod.rs` computes the per-CPU ring data area in bytes (`next_power_of_two(page_count) * PAGE_SIZE`). This is the actual bound the watermark must stay under; making it a named function gives us a single source of truth and lets the unit tests reference exactly the same value the builder uses.
- `RingBufBuilder<T>::set_wakeup_watermark(bytes)` (crate-visible) sets `FLAG_WATERMARK` and `wakeup_events_watermark` on the underlying `perf_event_attr`.
- `RingBufSessionBuilder::with_wakeup_watermark(bytes)` stores the requested value on the builder. It is threaded through every existing rebuild block alongside `pages` / `target_pids` / `target_cpus`.
- `RingBufSessionBuilder::build()` applies the watermark to the kernel/leader `perf_event_attr` only. The leader fd is the one the kernel wakes when data lands in its per-CPU ring; all other event types (tracepoint, profiling, context-switch, page-fault, BPF) redirect their output into the same ring via `PERF_EVENT_IOC_SET_OUTPUT`, so the leader's watermark covers them.
- `build()` validates that `bytes < ring_data_bytes(self.pages)` and returns `Err(io_error(...))` otherwise. This is necessary because the kernel does not validate this at `perf_event_open()` time and silently never wakes the fd if the watermark cannot be reached: the wakeup check is `head - rb->wakeup > rb->watermark`, and `head - rb->wakeup` is bounded above by the data area size, so a watermark `>=` that size can never trigger a wakeup, and a consumer waiting on the fd would stall while the kernel drops records as `PERF_RECORD_LOST`.
- `CpuRingBuf::create_reader` is refactored to use the new helper (`ring_data_bytes(page_count) + page_size` for the mmap length, the `+ page_size` being the perf metadata page). The previous expression `page_count.next_power_of_two() + 1` (pages, with the metadata page folded into the count) computes the same total length; the refactor just expresses it in terms of the shared helper.

Tests
Three unit tests on the boundary, all using `ring_data_bytes(pages)` directly so they are correct on any page size:

- `wakeup_watermark_rejected_when_equal_to_ring_size`: `bytes == ring_data_bytes` must error.
- `wakeup_watermark_rejected_when_above_ring_size`: `bytes == ring_data_bytes + 1` must error.
- `wakeup_watermark_accepted_when_below_ring_size`: `bytes == ring_data_bytes - 1` must pass builder validation (the subsequent `perf_event_open()` may still fail in test environments without the relevant privileges; the test asserts only that the failure, if any, is not the watermark validation error).
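
For reviewers who want to see the boundary arithmetic in isolation, here is a minimal standalone sketch of the helper and the validation rule described above. It is not the code in this PR: `PAGE_SIZE` is hard-coded to 4096 for illustration (the crate presumably queries the real page size), and `validate_watermark` is a hypothetical stand-in for the check inside `build()`, using a plain `std::io::Error` rather than the crate's `io_error(...)` helper.

```rust
use std::io;

// Assumption for this sketch only: a fixed 4 KiB page size.
const PAGE_SIZE: usize = 4096;

/// Per-CPU ring data area in bytes: the page count rounded up to a
/// power of two, times the page size (as described for `ring_data_bytes`).
fn ring_data_bytes(page_count: usize) -> usize {
    page_count.next_power_of_two() * PAGE_SIZE
}

/// Hypothetical stand-in for the check in `build()`: the watermark must be
/// strictly smaller than the ring data area, otherwise it is rejected.
fn validate_watermark(bytes: u32, page_count: usize) -> io::Result<()> {
    let limit = ring_data_bytes(page_count) as u64;
    if u64::from(bytes) < limit {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("wakeup watermark {bytes} must be below ring size {limit}"),
        ))
    }
}

fn main() {
    let pages = 3; // rounds up to 4 pages of data
    let ring = ring_data_bytes(pages);

    // The mmap length is the data area plus one metadata page, which equals
    // the previous "(next_power_of_two + 1) pages" expression.
    assert_eq!(ring + PAGE_SIZE, (pages.next_power_of_two() + 1) * PAGE_SIZE);

    // The three boundary cases covered by the unit tests.
    assert!(validate_watermark(ring as u32, pages).is_err());
    assert!(validate_watermark(ring as u32 + 1, pages).is_err());
    assert!(validate_watermark(ring as u32 - 1, pages).is_ok());
}
```

The comparison is strict (`<`) because, per the description above, a watermark equal to the data area size can never be exceeded by `head - rb->wakeup`.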