perf(profiling): separate queue and waking thread #3611
morrisonlevi wants to merge 6 commits into master
On every sample we send data to another thread and attempt to wake that thread. The problem is the syscall here is almost as expensive as collecting the sample if the other thread is asleep, which is often the case. This branch avoids that. Instead it writes to a queue and samples are handled when that background thread wakes up for other reasons. Notably one of those reasons is every 10ms when wall-time is enabled (default) so _probably_ this should work pretty well without filling the queues.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #3611 +/- ##
==========================================
- Coverage 62.24% 62.20% -0.05%
==========================================
Files 141 141
Lines 13387 13387
Branches 1753 1753
==========================================
- Hits 8333 8327 -6
- Misses 4255 4263 +8
+ Partials 799 797 -2
see 1 file with indirect coverage changes
Under sustained load in ZTS, this could otherwise run forever.
Benchmarks [ profiler ]
Benchmark execution time: 2026-02-02 20:19:12
Comparing candidate commit a401846 in PR branch
Found 10 performance improvements and 1 performance regression! Performance is the same for 20 metrics, 5 unstable metrics.
scenario:php-profiler-exceptions-control
scenario:php-profiler-exceptions-with-profiler
scenario:php-profiler-exceptions-with-profiler-and-timeline
scenario:php-profiler-timeline-memory-with-profiler
scenario:php-profiler-timeline-memory-with-profiler-and-timeline
b5a4330 to 07df26e
This reverts commit 07df26e. This was just an experiment, but it turns out that yes, the run time cache is valuable. We had not run it on doe before, since doe is a newer tool.
Description
On master, we send every sample to another thread and attempt to wake that thread to process the sample. The problem is the syscall here is almost as expensive as collecting the sample if the other thread is asleep, which is often the case in NTS builds. Here's a profile showing this (but note it's not always this close):
This git branch avoids that. Instead it writes to a queue and samples are handled when that background thread wakes up for other reasons. Notably one of those reasons is every 10ms when wall-time is enabled (default) so probably this should work pretty well without filling the queues. The queues are larger on ZTS builds, though by a fixed multiplier (16).
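The approach described above can be sketched roughly as follows. This is a minimal illustration using only the Rust standard library, not the code from this branch: `Sample`, `SampleQueue`, `sample_queue`, `drain`, `run_wakes`, and `capacity_for` are hypothetical names, and the base capacity is left as a parameter because the description does not state it. Only the multiplier of 16 for ZTS builds comes from the description. The per-wake drain limit is an added assumption so that a single wake-up always does a bounded amount of work.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TryRecvError, TrySendError};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a collected sample.
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub id: u64,
}

/// Queue size: ZTS builds get a fixed 16x multiplier (from the description).
/// The base capacity itself is an assumption supplied by the caller.
pub fn capacity_for(base: usize, zts: bool) -> usize {
    if zts { base * 16 } else { base }
}

/// Producer half: it only enqueues and never blocks.
pub struct SampleQueue {
    tx: SyncSender<Sample>,
}

impl SampleQueue {
    /// Returns false when the sample was dropped (queue full or consumer gone).
    /// The consumer is assumed to sleep on its own timer instead of blocking
    /// in recv(), so a push finds no parked receiver that it would have to wake.
    pub fn push(&self, sample: Sample) -> bool {
        match self.tx.try_send(sample) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

/// Creates a bounded queue; capacity must be at least 1 to buffer samples.
pub fn sample_queue(capacity: usize) -> (SampleQueue, Receiver<Sample>) {
    let (tx, rx) = sync_channel(capacity);
    (SampleQueue { tx }, rx)
}

/// Takes at most `max` queued samples without blocking.
pub fn drain(rx: &Receiver<Sample>, max: usize) -> Vec<Sample> {
    let mut out = Vec::new();
    while out.len() < max {
        match rx.try_recv() {
            Ok(sample) => out.push(sample),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    out
}

/// Background side: sleeps for `period` (standing in for the periodic
/// wall-time wake-up), then drains. Returns how many samples it handled.
pub fn run_wakes(rx: &Receiver<Sample>, period: Duration, wakes: usize, max_per_wake: usize) -> usize {
    let mut handled = 0;
    for _ in 0..wakes {
        thread::sleep(period);
        handled += drain(rx, max_per_wake).len();
    }
    handled
}
```

In this sketch the sampling side would call `push` and move on, while a background thread would call something like `run_wakes` with a 10ms period. A full queue shows up as `push` returning false, which is the case the "without filling the queues" caveat is about.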
Unfortunately, thus far I haven't been able to improve things. In my tests, I'm getting an average of 129.07% CPU vs. 127% CPU for the baseline version of the tracer on the same benchmark. Using the eBPF full host profiler, the profiles show reduced time spent in prepare_and_send_message, from 191ms to 14ms. There's no obvious place that the CPU time shifted to. Here are some places I checked (the data is from one run, but I did multiple runs in the past, so this isn't an isolated result):
- TimeCollector::run, down from 1.08s to 845ms.
- collect_stack_sample, down from 260ms to 140ms.
The PR barely touches anything else.