Fix one_collect panics when reading in-process perf ring buffer lost records #275
collect-linux perf parsing
@copilot, also look at our in-process ring buffer implementation and make sure that this isn't an issue where we're writing a bad value that then panics during the read path. We did make changes recently in this area.
I traced this through the in-process ring buffer path and found we were emitting |
Revert read_time back to single-peek semantics on the unknown-id and invalid-header paths so we don't silently consume an unbounded run of records in a single call. On either path we now advance past just the offending record (or the minimum header size if the size field is too small) and return None, letting the caller's outer scan retry on the next pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Fixes panics in `RingBufDataSource` when consuming the in-process perf ring buffer. The root cause was that `InProcessRingBufWriter` emitted `PERF_RECORD_LOST` records without the trailing `sample_id_all` fields that the rest of the writer uses, so the reader decoded a bogus sample id and could then panic, for instance when indexing `&ring_bufs[&id]` with an unknown id.

This PR fixes the root cause and lightly hardens the reader so that similar malformed records no longer panic.
Changes
- `InProcessRingBufWriter::write_lost_record`: emit the `sample_id_all` trailer (pid/tid, time, id) on `PERF_RECORD_LOST` records so they match the layout produced for normal records. `LOST_RECORD_SIZE` bumped from 24 → 48 bytes; the timestamp is filled in via `perf_timestamp(...)`, and id/tid are left as 0.
- `RingBufDataSource::read_time`: replace the panicking `ring_bufs[&id]` index with `ring_bufs.get(&id)`. On an unknown id, log a warning, advance the cursor past just that one record, and return `None` so the caller's outer scan retries on the next pass. This preserves the original single-peek semantics: we never silently consume more than one record per call, so if the unknown-id path ever fires due to a real bug it stays visible rather than draining the ring quietly.
- `RingBufDataSource::read_time`: add a minimum-size guard (`header.size < data_offset + 16`) to avoid `u16` underflow when computing the time/id offsets for non-sample records. On a too-small record, advance by at least the header size so the ring still drains and return `None`. The cursor parameter is now `&mut CpuRingCursor` to allow advancing; both existing call sites already passed `&mut`.

Tests
- The `InProcessRingBuf` lost-record tests now assert the new 48-byte record size and read back the `sample_id_all` id field.
- `read_time_skips_unknown_ring_buffer_id` covers the new skip-on-unknown-id path in `read_time`: the first call returns `None` after advancing past the unknown record, and a follow-up call returns the time of the known record.
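
For reviewers who want the reader-side behavior in one place, here is a minimal sketch of the single-peek skip logic described under Changes. It is not the actual `RingBufDataSource::read_time`. `Cursor`, `make_record`, the flat byte slice standing in for the ring, the little-endian encoding, and the assumption that `time` and `id` occupy the last 16 bytes of every record are all hypothetical simplifications.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins; not the real one_collect types.
const HEADER_SIZE: usize = 8; // type: u32, misc: u16, size: u16
const TRAILER_SIZE: usize = 16; // assumed trailing time: u64 followed by id: u64

pub struct Cursor {
    pub pos: usize,
}

fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Test helper: builds a record whose header claims `size` bytes and whose
/// last 16 bytes hold (time, id) when the record is large enough for them.
pub fn make_record(size: u16, time: u64, id: u64) -> Vec<u8> {
    let len = (size as usize).max(HEADER_SIZE);
    let mut rec = vec![0u8; len];
    rec[6..8].copy_from_slice(&size.to_le_bytes());
    if len >= HEADER_SIZE + TRAILER_SIZE {
        rec[len - 16..len - 8].copy_from_slice(&time.to_le_bytes());
        rec[len - 8..].copy_from_slice(&id.to_le_bytes());
    }
    rec
}

/// Peeks at the record under the cursor and returns its time if its id is known.
/// A malformed or unknown record is skipped (one record per call) and yields None.
pub fn read_time(
    data: &[u8],
    ring_bufs: &HashMap<u64, usize>,
    cursor: &mut Cursor,
) -> Option<u64> {
    let rest = data.get(cursor.pos..)?;
    if rest.len() < HEADER_SIZE {
        return None; // no complete header available yet
    }
    let size = u16::from_le_bytes([rest[6], rest[7]]) as usize;

    // Minimum-size guard: without it, `size - TRAILER_SIZE` could underflow.
    if size < HEADER_SIZE + TRAILER_SIZE || size > rest.len() {
        // Skip at least one header so the scan still makes progress.
        cursor.pos += size.clamp(HEADER_SIZE, rest.len());
        return None;
    }

    let time = read_u64(rest, size - 16);
    let id = read_u64(rest, size - 8);

    // `get` instead of indexing: an unknown id is reported, not a panic.
    if ring_bufs.get(&id).is_none() {
        eprintln!("warning: skipping record with unknown ring buffer id {}", id);
        cursor.pos += size; // advance past this one record only
        return None;
    }
    Some(time)
}
```

With an unknown-id record followed by a known one, the first call returns `None` and moves the cursor by one record; the second call returns the known record's time without moving the cursor, which is the behavior the new test checks.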
Risk
Scope is limited to Linux `perf_event::rb`. The on-wire layout for `PERF_RECORD_LOST` now matches what the rest of the in-process writer already emits for normal records, so external readers that already handle `sample_id_all` for sample records will handle the new lost records the same way.
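
To make the 48-byte layout concrete, the sketch below encodes a lost record with the trailer described under Changes. It assumes the standard `perf_event_header` (8 bytes), the `id` and `lost` fields of `PERF_RECORD_LOST`, and a trailer of pid, tid, time, id. `encode_lost_record` and its parameters are hypothetical names, and little-endian encoding is used here only to keep the example deterministic; this is not the code in `write_lost_record`.

```rust
// Record type value from the Linux perf ABI.
const PERF_RECORD_LOST: u32 = 2;
// 8 (header) + 8 (id) + 8 (lost) + 24 (pid/tid, time, id trailer).
const LOST_RECORD_SIZE: usize = 48;

/// Hypothetical encoder illustrating the record layout.
pub fn encode_lost_record(
    lost_id: u64,
    lost: u64,
    time: u64,
    sample_id: u64,
) -> [u8; LOST_RECORD_SIZE] {
    let mut rec = [0u8; LOST_RECORD_SIZE];

    // perf_event_header: type (u32), misc (u16, left as 0), size (u16).
    rec[0..4].copy_from_slice(&PERF_RECORD_LOST.to_le_bytes());
    rec[6..8].copy_from_slice(&(LOST_RECORD_SIZE as u16).to_le_bytes());

    // PERF_RECORD_LOST body: id of the event that lost samples, then the count.
    rec[8..16].copy_from_slice(&lost_id.to_le_bytes());
    rec[16..24].copy_from_slice(&lost.to_le_bytes());

    // Trailer: pid (24..28) and tid (28..32) stay 0, then time and id.
    rec[32..40].copy_from_slice(&time.to_le_bytes());
    rec[40..48].copy_from_slice(&sample_id.to_le_bytes());
    rec
}
```

A reader that locates `time` and `id` by counting back from the end of the record finds them at `size - 16` and `size - 8`, the same offsets it would use for any other non-sample record carrying this trailer.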