[Bug] a5 PMU CNT_TOTAL returns 0 when reg_base read is slow between pmu_disable and ld_dev #800

@ChaoZheng109

Platform

a5 (Ascend 950 hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

On a5, AICore reads its own PMU MMIO directly via ld_dev. The sequence per task is roughly:

write_reg(CTRL, CTRL & ~PMU_ENABLE_BIT);   // pmu_aicore_end(): disable PMU
... fetch reg_base from somewhere ...
ld_dev(reg_base + CNT0_OFFSET);            // event counters cnt[0..9]
...
ld_dev(reg_base + CNT_TOTAL0_OFFSET);      // 64-bit cycle counter
ld_dev(reg_base + CNT_TOTAL1_OFFSET);

Empirically, the latency of the "fetch reg_base" step decides whether the CNT_TOTAL0/1 reads return valid values or 0. The event counters (cnt[0..9]) are unaffected: they hold their values after PMU disable.

Reporting this as a hardware-mechanism question because the three reg_base-fetch shapes we've tried map cleanly to three outcomes:

| Reg-base fetch shape | Where the value lives | Typical latency | CNT_TOTAL result |
| --- | --- | --- | --- |
| Read a volatile uint64_t field from a GM struct that AICore already accesses every task (so its cache line is L1-hot) | GM (cached, hot) | ~1–2 cycles (L1 hit) | Always non-zero |
| Read table[block_idx] from a separate device-memory table that AICore touches only here (cold cache line) | GM (uncached / cold) | dozens to hundreds of cycles (L1 miss → DDR) | ~25% of records read 0 |
| Read a [[block_local]] uint64_t value resolved once at kernel entry | AICore per-block private storage | ~1 cycle (scalar register access) | Always non-zero |

Same ld_dev sequence in all three cases; only the operation immediately before it changes.

Steps to Reproduce

  1. Build the kernel so AICore fetches reg_base from a cold GM table per task — i.e. [[block_local]] static __gm__ uint64_t *table; and get_reg_base() { return table[block_idx]; }, where table points to a per-core device-memory array AICore otherwise never touches.
  2. Run any PMU-profiling test on real a5 hardware with enough tasks to populate outputs/<run>/pmu.csv (we used examples/paged_attention_unroll, ~1024 tasks).
  3. Count the rows with pmu_total_cycles == 0 (column 6 of the CSV, header row skipped): awk -F, 'NR>1 && $6=="0"' outputs/<run>/pmu.csv | wc -l.
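
For environments where awk is not convenient, the count in step 3 can be sketched in C++. This is a minimal sketch, not part of the repository: the function name count_zero_rows is hypothetical, and it assumes the same layout the awk one-liner assumes (one header line, comma-separated fields with no quoting, total cycles in column 6).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts data rows whose 1-based column `col` is exactly "0".
// Mirrors `awk -F, 'NR>1 && $6=="0"'`: the first line is treated as a header and skipped.
inline std::size_t count_zero_rows(std::istream& in, std::size_t col = 6) {
    assert(col >= 1 && "columns are 1-based, as in awk");
    std::string line;
    std::size_t zeros = 0;
    bool is_header = true;
    while (std::getline(in, line)) {
        if (is_header) {  // NR>1 in the awk filter
            is_header = false;
            continue;
        }
        std::istringstream fields(line);
        std::string field;
        std::size_t idx = 0;
        // Walk the comma-separated fields until the requested column is reached.
        while (std::getline(fields, field, ',')) {
            if (++idx == col) {
                if (field == "0") {
                    ++zeros;
                }
                break;
            }
        }
    }
    return zeros;
}
```

To use it, open outputs/<run>/pmu.csv with std::ifstream and pass the stream to count_zero_rows. Rows with fewer than `col` fields are not counted, which matches the awk filter's behavior for short lines.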

Expected Behavior

CNT_TOTAL returns a valid cycle count whenever the kernel actually executed — i.e. should behave the same way the event counters do, sticky after PMU disable.

Actual Behavior

Cold-GM-table reg-base fetch:

| Log level | Total rows | Rows with pmu_total_cycles == 0 |
| --- | --- | --- |
| debug | 1024 | 482 (≈47%) |
| warn | 1024 | 265 (≈26%) |

Sample row (event counters valid, total cycles zero):

0,0,0x00000001000001e1,0,0,0,0,274,38,171,0,0,6,0,0,2

After switching to the block-local fetch shape, the same test yields 0 / 1024 zero rows.

The dependency on log level is informative: AICPU log throughput changes dispatch timing, which changes per-core task density, which in turn changes how often the cold cache line gets evicted between reads. More eviction means more CNT_TOTAL == 0 rows. This suggests the failure is driven by the cache-miss rate, not by any deterministic counter-clear behavior.

Git Commit ID

N/A — the broken intermediate state is no longer on main. The pattern is reproducible by deliberately introducing a cold per-record GM read between pmu_aicore_end() and ld_dev(CNT_TOTAL0).

CANN Version

N/A.

Driver Version

N/A.

Host Platform

Linux (aarch64)

Additional Context

This issue exists as a hardware-behavior record, not an open repo bug. The software-side fix is already in place (resolve reg_base into block-local storage at kernel entry).
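
The ordering that the fix relies on can be illustrated with a host-side mock. This is a sketch under stated assumptions, not the actual kernel code: MockPmu, BlockContext, kernel_entry, and pmu_end_and_read_total are hypothetical names, the register offsets are placeholders, ld_dev and write_reg here are plain-memory stand-ins for the device intrinsics, and treating CNT_TOTAL0 as the low 32 bits and CNT_TOTAL1 as the high 32 bits is an assumption. The mock cannot reproduce the hardware timing behavior; it only shows where the slow lookup happens relative to the disable and the counter reads.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Placeholder offsets (hypothetical; the real values are not given in this issue).
constexpr std::uint64_t CTRL_OFFSET       = 0x00;
constexpr std::uint64_t CNT_TOTAL0_OFFSET = 0x08;  // assumed: low 32 bits of the cycle counter
constexpr std::uint64_t CNT_TOTAL1_OFFSET = 0x0C;  // assumed: high 32 bits of the cycle counter
constexpr std::uint32_t PMU_ENABLE_BIT    = 1u << 0;  // the issue refers to "CTRL bit 0"

// Plain memory standing in for one core's PMU MMIO window.
struct MockPmu {
    std::array<std::uint32_t, 16> regs{};
};

// Stand-ins for the device intrinsics; on hardware these are MMIO accesses.
inline std::uint32_t ld_dev(const MockPmu& pmu, std::uint64_t offset) {
    assert(offset / 4 < pmu.regs.size());
    return pmu.regs[static_cast<std::size_t>(offset / 4)];
}
inline void write_reg(MockPmu& pmu, std::uint64_t offset, std::uint32_t value) {
    assert(offset / 4 < pmu.regs.size());
    pmu.regs[static_cast<std::size_t>(offset / 4)] = value;
}

// Stands in for the [[block_local]] value: reg_base resolved once per block.
struct BlockContext {
    MockPmu* pmu;
};

// Kernel entry: the potentially cold table[block_idx] lookup happens here,
// well before any PMU disable, so its latency does not matter.
inline BlockContext kernel_entry(MockPmu* const* table, std::uint32_t block_idx) {
    return BlockContext{table[block_idx]};
}

// Per task: disable the PMU, then read CNT_TOTAL0/1 with no memory lookup in between.
inline std::uint64_t pmu_end_and_read_total(BlockContext& ctx) {
    MockPmu& pmu = *ctx.pmu;  // already resolved; no table access on this path
    write_reg(pmu, CTRL_OFFSET, ld_dev(pmu, CTRL_OFFSET) & ~PMU_ENABLE_BIT);
    const std::uint32_t lo = ld_dev(pmu, CNT_TOTAL0_OFFSET);
    const std::uint32_t hi = ld_dev(pmu, CNT_TOTAL1_OFFSET);
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

}  // namespace sketch
```

The design point is that pmu_end_and_read_total contains no load other than the PMU registers themselves, so nothing on that path can miss in cache between the disable and the CNT_TOTAL reads.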

What we'd like the hardware team to confirm or correct:

  • Is CNT_TOTAL0/1 expected to remain readable indefinitely after PMU disable (CTRL bit 0 = 0), or is there a defined valid-read window after disable?
  • If a window exists: is it specified in cycles, or in terms of "next access on the MMIO interface after disable"?
  • Is the cycle counter's post-disable behavior expected to differ from event counters' (which are clearly sticky)?

If this is expected hardware behavior, then software has a hard constraint: after pmu_aicore_end(), nothing slow (cache miss, long scalar dependency, etc.) is allowed before ld_dev(CNT_TOTAL). The current fix relies on that constraint informally; a documented spec would let us assert it.

If this is unexpected / a hardware bug, please advise on a hardware-side guard so software does not have to manage this timing window.
