[Performance Analysis] Adding intra-kernel timing runs #829
[Performance Analysis] Adding intra-kernel timing runs #829SergioMartin86 wants to merge 19 commits into
Conversation
…. This is important for accurate timing
There was a problem hiding this comment.
Code Review
This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.
| std::string env_timing_iterations_string = std::string(env_timing_iterations); | ||
| bool isValidValue = false; | ||
| if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; } | ||
| if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; } | ||
| if (isValidValue == false) | ||
| { | ||
| LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations); | ||
| runtime->is_timing_enabled = false; | ||
| } | ||
| } |
There was a problem hiding this comment.
The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
|
Please see the updated PR description: it contains this comparison |
Review summaryIndependent review from reading the diff against The timing feature itself is well-motivated and the avg/stddev sampling approach is sound. However, the refactor of Must fix (P1 — correctness)1. Init failure deadlocks every AICPU thread — int32_t load_orch_rc = loadOrchestrator(runtime);
if (load_orch_rc != 0) {
LOG_ERROR("Thread %d: Failed to load orchestrator", my_thread_idx_);
return load_orch_rc; // <-- init_failed_ never set
}
2. The previous The cleanup branch at line 934-937 is therefore dead, so
This is broken for the common worker-reuse path, independent of timing. 3. if (my_thread_idx_ == 0) {
deinit(runtime);
init(runtime);
}
inline void barrier() {
barrier_counter_in_.fetch_add(1);
while (barrier_counter_in_.load(...) % aicpu_thread_num_ != 0) ; // % 0 = UB
...
}Modulo by zero is UB. Also, 4. Timing iterations don't restore inout tensor data — measurement is invalid for in-place / accumulating / partial-write kernels — Each timing iteration calls
For accumulators ( Fix options: (a) snapshot inout tensor data before timing and restore between iterations — must be outside the timed region; (b) detect inout tensors at 5. Warmup is silently ineffective —
const bool reload_so = runtime->register_new_callable_id();This flag is set by the host's void notify_callable_id_registered();…but this method has no definition anywhere in So on the first launch with a new callable (the case where warmup matters most), warmup measures dlopen overhead, and every "timed" run also includes a fresh dlopen + dlsym. The reported "stable distribution around the fast peak" then is not "dlopen-free orchestration cost" — it's "orchestration cost + recurrent dlopen". The whole point of the warmup design is defeated. Fix: define Should fix6. Extra "real" run after timing — unintentional or undocumented — When timing is enabled, 7. PMU / dump_tensor re-initialized every timing iteration without paired finalize — The new 8. Stale function-level docstring — /**
* Shutdown AICore - Send exit signal via registers to all AICore kernels
*/
int32_t AicpuExecutor::runScheduling(Runtime *runtime) {The comment describes a shutdown function but precedes 9. Every scheduler thread does 10. The file includes 11. Negative env values silently no-op the loop —
12. const auto runCount = runTimes.size();
double runAvg = (double)runSum / (double)runCount;If 13. No tests — neither a smoke test that exercises Coding style — full rename pass to snake_case requiredThis codebase uses snake_case for C++ method names, local variables, and members throughout, including in the very struct this PR is editing. The PR introduces camelCase in several places that need to be renamed to match local style (
The runtime struct members ( Squash before mergeThe PR currently carries 19 commits including two "merge with main" / "Merge branch 'main'" merge commits and many WIP/fix steps. Please squash into one (or a small number of) logical commit(s) before merge. A clean single-commit rebase onto current Suggested merge gateP1 (#1–#5) are merge-blockers — particularly #1, #2, and #5, which affect correctness or measurement validity even outside the timing path. #4 (inout) is also a merge-blocker for the feature to be trustworthy; at minimum it needs documentation if a code-level fix is deferred. Should-fix items can land in the same PR or as immediate follow-ups; the squash and style rename should be in the merged version. |
We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.
We add:
By running multiple timed runs, we dissipate OS/device noise that cause random variations in running time. This noise is significant when running these extremely low-latency kernels, so, if we want to precisely measure scheduling/orchestration performance, we need to use a statistical analysis with many samples inside the same kernel launch.
Relevant Change:
See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for main timing function we added.
Why is it necessary:
Simpler already provides a way to repeat a kernel launch
Ntimes within the same run by using the--rounds Nparameter. This can be use for obtaining many samples for a statistical performance analysis. However, measuring the scheduling/orchestration time using different kernel launches leads to high variance and, for some reason, a bimodal distribution:This might indicate a bug or problem that is introduced by re-launching a kernel immediate after the previous launch finishes and needs to be taken a look at.
What this PR adds is the ability of running
Nsamples within the same kernel launch. By doing this, the bug mentioned above, along with other possible the sources of noise and variance that are unrelated to the scheduling/orchestration efficiency can be removed. This yields a stable distribution (orange) that is in the same range as the ~500kns peak:This result coincides with the slower half of the
--rounds 100samples, but it is slightly more concentrated.