Skip to content

[Performance Analysis] Adding intra-kernel timing runs #829

Open
SergioMartin86 wants to merge 19 commits into
hw-native-sys:mainfrom
huawei-csl:intraKernelTiming
Open

[Performance Analysis] Adding intra-kernel timing runs #829
SergioMartin86 wants to merge 19 commits into
hw-native-sys:mainfrom
huawei-csl:intraKernelTiming

Conversation

@SergioMartin86
Copy link
Copy Markdown

@SergioMartin86 SergioMartin86 commented May 20, 2026

We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.

We add:

  • Warmup runs: used to disregard cache intialization/dlopen/kernel launch noise.
  • Timed runs: these are actually timed, and an average + stddev is reported.

By running multiple timed runs, we dissipate OS/device noise that cause random variations in running time. This noise is significant when running these extremely low-latency kernels, so, if we want to precisely measure scheduling/orchestration performance, we need to use a statistical analysis with many samples inside the same kernel launch.

Relevant Change:

See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for main timing function we added.

Why is it necessary:

Simpler already provides a way to repeat a kernel launch N times within the same run by using the --rounds N parameter. This can be use for obtaining many samples for a statistical performance analysis. However, measuring the scheduling/orchestration time using different kernel launches leads to high variance and, for some reason, a bimodal distribution:

image

This might indicate a bug or problem that is introduced by re-launching a kernel immediate after the previous launch finishes and needs to be taken a look at.

What this PR adds is the ability of running N samples within the same kernel launch. By doing this, the bug mentioned above, along with other possible the sources of noise and variance that are unrelated to the scheduling/orchestration efficiency can be removed. This yields a stable distribution (orange) that is in the same range as the ~500kns peak:

image

This result coincides with the slower half of the --rounds 100 samples, but it is slightly more concentrated.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment on lines +232 to +241
std::string env_timing_iterations_string = std::string(env_timing_iterations);
bool isValidValue = false;
if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; }
if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; }
if (isValidValue == false)
{
LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations);
runtime->is_timing_enabled = false;
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp Outdated
@ChaoWao
Copy link
Copy Markdown
Collaborator

ChaoWao commented May 21, 2026

  • Please give the comparison data of N*kernelLaunch and 1 kernelLaunch*N inner run.

Run 100 times and trim highest 10 and lowest 10

@ChaoWao ChaoWao closed this May 21, 2026
@ChaoWao ChaoWao reopened this May 21, 2026
@SergioMartin86 SergioMartin86 marked this pull request as ready for review May 26, 2026 09:43
@SergioMartin86
Copy link
Copy Markdown
Author

  • Please give the comparison data of NkernelLaunch and 1 kernelLaunchN inner run.

Run 100 times and trim highest 10 and lowest 10

Please see the updated PR description: it contains this comparison

@ChaoWao
Copy link
Copy Markdown
Collaborator

ChaoWao commented May 27, 2026

Review summary

Independent review from reading the diff against upstream/main, cross-checked with two other reviewers (codex, gemini) running in separate processes. All three reads converged on the same correctness issues with no false positives.

The timing feature itself is well-motivated and the avg/stddev sampling approach is sound. However, the refactor of AicpuExecutor::run() that came with it introduced two correctness regressions that affect existing non-timing flows, and the timing path has its own set of issues — most importantly, the warmup loop is silently ineffective and inout tensor data drifts across iterations, both of which undermine the measurement methodology the PR is trying to establish.


Must fix (P1 — correctness)

1. Init failure deadlocks every AICPU threadaicpu_executor.cpp:238-241

int32_t load_orch_rc = loadOrchestrator(runtime);
if (load_orch_rc != 0) {
    LOG_ERROR("Thread %d: Failed to load orchestrator", my_thread_idx_);
    return load_orch_rc;          // <-- init_failed_ never set
}

init() returns the error from loadOrchestrator() without setting init_failed_ or init_done_. aicpu_execute() ignores the return value and spins on while (!init_done_) at line 890. Any orchestrator-load failure (invalid callable_id, missing SO bytes, dlopen/dlsym failure, arg-count mismatch) now hangs every AICPU thread instead of surfacing an error. Fix: init_failed_.store(true, std::memory_order_release); before the early return.

2. finished_ is never set → deinit() is never invoked → second kernel launch runs on stale stateaicpu_executor.cpp:935

The previous run() did finished_count_.fetch_add(1, std::memory_order_acq_rel) and the last thread set finished_.store(true, std::memory_order_release). The refactor split run() into separate methods but did not reintroduce this counter logic anywhere. Grep confirms: the only places finished_* is touched in the new code are the initialization to false in deinit() and the read at line 935.

The cleanup branch at line 934-937 is therefore dead, so deinit(runtime) is never called from aicpu_execute. On any second kernel launch in the same worker process:

  • init()'s CAS on initialized_ short-circuits (returns 0 immediately, no work)
  • the spin on init_done_ passes from the previous launch
  • runOrchestration / runScheduling run against the old rt, the previous callable's cached SO pointers, and a stale sched_ctx_

This is broken for the common worker-reuse path, independent of timing.

3. barrier() modulo-by-zero / counter race when thread 0 resets state mid-loopaicpu_executor.cpp:797-801, 822-825

if (my_thread_idx_ == 0) {
    deinit(runtime);
    init(runtime);
}

deinit() sets aicpu_thread_num_ = 0 (line 748). Other threads are simultaneously approaching the next iteration's barrier():

inline void barrier() {
    barrier_counter_in_.fetch_add(1);
    while (barrier_counter_in_.load(...) % aicpu_thread_num_ != 0) ;  // % 0 = UB
    ...
}

Modulo by zero is UB. Also, deinit() does not reset barrier_counter_in_ / barrier_counter_out_, so stale counters persist across the boundary and the modulo can spuriously satisfy or skip the barrier. With aicpu_thread_num_ >= 2 and timing enabled this is reachable. Fix: either don't reset shared executor state mid-timing-loop (use a dedicated reset path that only resets what each iteration actually needs), or have all threads cooperatively park before the deinit.

4. Timing iterations don't restore inout tensor data — measurement is invalid for in-place / accumulating / partial-write kernelsaicpu_executor.cpp:805-829

Each timing iteration calls runOrchestration + runScheduling against the same runtime->get_orch_args() tensors. Between iterations, deinit destroys rt (PTO2Runtime metadata) but does not touch the tensor data in HBM, and the PR adds no snapshot/restore path.

Tensor class Across iterations Measurement valid?
Pure input (read-only) Identical Yes
Pure output (kernel overwrites entirely) Entry state differs but kernel overwrites Yes
inout (read-modify-write) Iter i reads what iter i-1 wrote No — different workload each iter
Partial-write output (mask / sparse) Uncovered positions retain prior iter's writes Possibly different numerics

For accumulators (acc += x), in-place activations, KV-cache writes, LayerNorm running stats, etc., the resulting avg/stddev does not characterize the kernel's real latency distribution; it characterizes "latency on progressively drifting input". For accumulators this also leads to numerical drift across iterations (denormals, saturation), and AICore vector-unit latency for denormals is not constant — so the measured stddev can include data-dependent execution-time variation that has nothing to do with the scheduling/orchestration efficiency the PR is trying to measure.

Fix options: (a) snapshot inout tensor data before timing and restore between iterations — must be outside the timed region; (b) detect inout tensors at bind_prepared_to_runtime_impl time and refuse to enable timing with a clear error; (c) document the restriction prominently in both PR body and env-var help text.

5. Warmup is silently ineffective — dlopen runs every iteration, not just the firstaicpu_executor.cpp:268, 408; missing definition for runtime.h:272

loadOrchestrator decides whether to dlopen via:

const bool reload_so = runtime->register_new_callable_id();

This flag is set by the host's set_active_callable_id(id, is_new) and never auto-resets. Within a single kernel launch, after the first dlopen, nothing flips it to false. The header declares:

void notify_callable_id_registered();

…but this method has no definition anywhere in src/, and loadOrchestrator never calls it. Result: every init() inside performTimingRuns (including the one between warmup iterations) re-enters the reload_so == true branch and does dlclose + write SO file + dlopen + dlsym again.

So on the first launch with a new callable (the case where warmup matters most), warmup measures dlopen overhead, and every "timed" run also includes a fresh dlopen + dlsym. The reported "stable distribution around the fast peak" then is not "dlopen-free orchestration cost" — it's "orchestration cost + recurrent dlopen". The whole point of the warmup design is defeated.

Fix: define void Runtime::notify_callable_id_registered() { register_new_callable_id_ = false; } and call it at the end of loadOrchestrator's reload branch (after a successful dlopen + dlsym). Or replace the register_new_callable_id_ check with a local *p_handle_ != nullptr cache check.


Should fix

6. Extra "real" run after timing — unintentional or undocumentedaicpu_executor.cpp:898-918

When timing is enabled, performTimingRuns runs N warmup + M timed iterations, each ending with deinit/init. Then aicpu_execute unconditionally falls through to runOrchestration + runScheduling again. So the user gets N + M + 1 total runs. If the trailing real run is intentional (to produce real output after pure measurement), it should be documented and gated. As written, it looks accidental given the parallel structure of the two execution paths.

7. PMU / dump_tensor re-initialized every timing iteration without paired finalizeaicpu_executor.cpp:235; scheduler_dispatch.cpp:456-481

The new initializePerfCounters() is gated by pto2_init_done_.exchange(true, ...), but sched_ctx_.deinit() resets that flag to false. So every timing iteration re-runs pmu_aicpu_init(physical_core_ids_, cores_total_num_) and dump_tensor_init(...) without a corresponding pmu_aicpu_finalize / dump_tensor finalize in between. The paired finalize only happens once at the very end via shutdown(). If pmu_aicpu_init programs MMIO registers or allocates buffers, repeated init without finalize leaks state or double-programs hardware. Need to confirm these functions are idempotent under repeated calls; if not, pair them per-iteration or hoist the init out of the timing loop entirely.

8. Stale function-level docstringaicpu_executor.cpp:458-460

/**
 * Shutdown AICore - Send exit signal via registers to all AICore kernels
 */
int32_t AicpuExecutor::runScheduling(Runtime *runtime) {

The comment describes a shutdown function but precedes runScheduling, which neither shuts down nor touches AICore registers. Looks like a leftover from a refactor draft. Per .claude/rules/doc-consistency.md, delete it.

9. std::vector<uint64_t> runTimes allocated and pushed by every thread, read only by the orch threadaicpu_executor.cpp:805, 817

Every scheduler thread does runTimes.push_back(...) (heap allocation + amortized resize on AICPU) while only my_thread_idx_ == sched_thread_num_ reads it at line 837. Heap allocation in AICPU timing-measurement code is exactly the kind of noise this PR aims to eliminate. Move the vector under the orch-only branch, or use a fixed-size stack array sized by timing_iteration_count.

10. std::sqrt used without <cmath> includedaicpu_executor.cpp:21, 851

The file includes <math.h> (line 21), not <cmath>. std::sqrt availability in the std namespace without <cmath> is compiler-dependent. Either include <cmath> or call sqrt() from the global namespace.

11. Negative env values silently no-op the loopruntime_maker.cpp:251-269, 281-299

strtol happily parses -5; the cast to int keeps it negative; the for-loop condition i < warmupIterationCount is immediately false. Quiet no-op for "timing enabled, count = -5" is worse than the warning the parser already emits for non-numeric input. Add a val < 0 rejection in both blocks alongside the existing endptr check.

12. runCount == 0 produces NaN in the reported averageaicpu_executor.cpp:844-845

const auto runCount = runTimes.size();
double runAvg = (double)runSum / (double)runCount;

If PTO2_KERNEL_TIMING_ITERATION_COUNT=0 with timing enabled, runAvg = 0/0 = NaN. The runCount == 1 check at line 851 dodges the stddev divide-by-zero but the avg report still prints NaN. Guard runCount == 0 (log "no timing samples") or treat 0 as "skip reporting".

13. No tests — neither a smoke test that exercises PTO2_KERNEL_TIMING_ENABLED=True end-to-end, nor a unit test for the avg/stddev math. The feature is verified only by the screenshots in the PR description.


Coding style — full rename pass to snake_case required

This codebase uses snake_case for C++ method names, local variables, and members throughout, including in the very struct this PR is editing. The PR introduces camelCase in several places that need to be renamed to match local style (.claude/rules/discipline.md §2). Concretely:

  • New methods (aicpu_executor.cpp:159-165): loadOrchestrator, runScheduling, runOrchestration, performTimingRuns, getThreadId, initializePerfCounters — should be load_orchestrator, run_scheduling, run_orchestration, perform_timing_runs, get_thread_id, initialize_perf_counters. Existing peers in the same struct are already snake (init, deinit, shutdown, bind_runtime).
  • Locals in performTimingRuns: warmupIterationCount, timingIterationCount, runTimes, runSum, runCount, runAvg, runStdDev, runDiff, runTime.
  • Locals in runtime_maker.cpp: env_timing_iterations_string, isValidValue.

The runtime struct members (is_timing_enabled, warmup_iteration_count, timing_iteration_count) and their getters are already snake — only the new methods + locals need the rename.


Squash before merge

The PR currently carries 19 commits including two "merge with main" / "Merge branch 'main'" merge commits and many WIP/fix steps. Please squash into one (or a small number of) logical commit(s) before merge. A clean single-commit rebase onto current upstream/main is straightforward — the only conflict at the moment (aicpu_executor.cpp) resolves trivially against the latest main, which reverted the multi-callable dispatcher PR (#537) that previously made the rebase non-trivial.


Suggested merge gate

P1 (#1#5) are merge-blockers — particularly #1, #2, and #5, which affect correctness or measurement validity even outside the timing path. #4 (inout) is also a merge-blocker for the feature to be trustworthy; at minimum it needs documentation if a code-level fix is deferred. Should-fix items can land in the same PR or as immediate follow-ups; the squash and style rename should be in the merged version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants