Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
Summary
st-onboard-a2a3 can fail in spmd_paged_attention because the observed
hardware numerical drift is slightly above the current golden tolerance.
Failure
Observed in PR #839 CI run 26282368753, job st-onboard-a2a3
(77361524802):
FAILED tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py::TestPagedAttentionUnrollTpushPop::test_run
AssertionError: Golden mismatch on 'out': max_diff=0.005540801212191582, rtol=0.005, atol=0.005
The current test already documents a relaxed tolerance for this AIC/AIV
cooperative TPUSH/TPOP pipeline, but the latest onboard run exceeded the
5e-3 bound by roughly 5.4e-4 (max_diff 5.54e-3).
Notes
- A targeted follow-up should decide whether to relax this case's tolerance,
improve the golden comparison strategy, or investigate the hardware numeric
drift source.
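One possible shape for an "improved golden comparison strategy" is a mismatch budget: keep rtol/atol as they are, but let a small fraction of elements fall outside the bound before failing. The sketch below is illustrative only. The helper name `compare_with_mismatch_budget` and the `max_mismatch_ratio` parameter are hypothetical and do not exist in the repository, and the per-element bound `atol + rtol * |golden|` is an assumption about how the current check works.

```python
def compare_with_mismatch_budget(out, golden, rtol=0.005, atol=0.005,
                                 max_mismatch_ratio=0.001):
    """Hypothetical comparison: pass if few enough elements exceed the bound.

    Assumes an allclose-style per-element bound: atol + rtol * |golden|.
    Returns (passed, mismatch_ratio).
    """
    if len(out) != len(golden):
        raise ValueError("out and golden must have the same length")
    if not golden:
        return True, 0.0
    mismatches = sum(
        1 for o, g in zip(out, golden)
        if abs(o - g) > atol + rtol * abs(g)
    )
    ratio = mismatches / len(golden)
    return ratio <= max_mismatch_ratio, ratio


# Synthetic data: one outlier among 2000 elements, sized like the CI max_diff.
golden = [0.05] * 2000
out = list(golden)
out[0] = golden[0] + 0.005540801212191582

passed, ratio = compare_with_mismatch_budget(out, golden)
strict_passed, _ = compare_with_mismatch_budget(out, golden,
                                                max_mismatch_ratio=0.0)
print(passed, ratio, strict_passed)
```

A budget like this hides isolated outliers, so it only makes sense if the drift is confirmed to be benign rounding rather than a pipeline bug.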
Steps to Reproduce
1. Run the A2/A3 onboard scene tests for `spmd_paged_attention`:
python -m pytest \
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py \
--platform a2a3 \
--device <a2a3-device-range> \
-v \
--clone-protocol ssh \
--require-pto-isa
2. Observe that TestPagedAttentionUnrollTpushPop::test_run may fail golden
comparison with output drift slightly above the current tolerance.
Observed in CI:
AssertionError: Golden mismatch on 'out':
max_diff=0.005540801212191582, rtol=0.005, atol=0.005
Expected Behavior
A2/A3 onboard spmd_paged_attention should pass golden comparison; output
difference should stay within the configured tolerance rtol=0.005, atol=0.005,
or the test tolerance/golden strategy should cover expected BF16 hardware drift.
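For a sense of scale on "expected BF16 hardware drift": BF16 stores 7 explicit mantissa bits, so the gap between adjacent representable values grows with magnitude. The sketch below computes that gap in plain Python. It assumes the compared output is stored in BF16, which this report implies but does not state, and the helper name `bf16_spacing` is made up for illustration.

```python
import math


def bf16_spacing(x):
    """Gap between adjacent BF16 values around x (normal range only)."""
    if x == 0.0:
        raise ValueError("spacing near zero depends on subnormal handling")
    # frexp returns m in [0.5, 1) with |x| = m * 2**e, so floor(log2|x|) == e - 1.
    _, e = math.frexp(abs(x))
    # BF16 has 7 explicit mantissa bits, so one step is 2**(exponent - 7).
    return 2.0 ** (e - 1 - 7)


for magnitude in (0.25, 0.5, 1.0, 2.0):
    print(f"|x| = {magnitude}: one BF16 step = {bf16_spacing(magnitude)}")
```

One step is about 3.9e-3 for values in [0.5, 1) and about 7.8e-3 for values in [1, 2), which is the same order of magnitude as the configured 5e-3 tolerance.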
Actual Behavior
TestPagedAttentionUnrollTpushPop::test_run fails golden comparison on A2/A3
onboard hardware.
Observed failures:
- first run: max_diff=0.005348655860871077, rtol=0.005, atol=0.005
- retry with pinned PTO-ISA: max_diff=0.005540801212191582, rtol=0.005, atol=0.005
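To put the two observed values in context, the sketch below works out how far each sits above atol and how large the golden value would need to be for the difference to pass. It assumes the check follows the common allclose convention, `|out - golden| <= atol + rtol * |golden|`, and that the element with the largest difference is the one that failed. Neither is confirmed by the log, and the helper names are illustrative.

```python
# Assumption: allclose-style bound, |out - golden| <= atol + rtol * |golden|.
# The repository's actual golden helper may use a different rule.
RTOL = 0.005
ATOL = 0.005

observed_max_diff = {
    "first run": 0.005348655860871077,
    "retry with pinned PTO-ISA": 0.005540801212191582,
}


def excess_over_atol(max_diff, atol=ATOL):
    """How far max_diff sits above the absolute tolerance alone."""
    return max_diff - atol


def golden_magnitude_needed(max_diff, rtol=RTOL, atol=ATOL):
    """Smallest |golden| at which max_diff would pass the assumed bound."""
    return max(0.0, (max_diff - atol) / rtol)


for label, diff in observed_max_diff.items():
    print(f"{label}: excess over atol = {excess_over_atol(diff):.3e}, "
          f"would pass only if |golden| >= {golden_magnitude_needed(diff):.4f}")
```

Under these assumptions the excess is about 3.5e-4 and 5.4e-4, and the failing elements would have golden magnitudes below roughly 0.07 and 0.11 respectively, so the rtol term contributes little headroom there.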
Git Commit ID
CI merge commit: 27ad98b; PR head commit: 75d9d73
CANN Version
9.0.0
Driver Version
No response
Host Platform
Linux (aarch64)
Additional Context
No response