[Bug] Track spmd_paged_attention A2/A3 golden tolerance drift #848

@puddingfjz

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

Summary

st-onboard-a2a3 can fail in spmd_paged_attention because the observed
hardware numerical drift is slightly above the current golden tolerance.

Failure

Observed in PR #839 CI run 26282368753, job st-onboard-a2a3
(77361524802):

FAILED tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py::TestPagedAttentionUnrollTpushPop::test_run
AssertionError: Golden mismatch on 'out': max_diff=0.005540801212191582, rtol=0.005, atol=0.005

The current test already documents relaxed tolerance for this AIC/AIV
cooperative TPUSH/TPOP pipeline, but the latest onboard run exceeded the
5e-3 bound by roughly 5.4e-4 (0.0055408 - 0.005).
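
To make the failure message easier to interpret, here is a minimal sketch of a combined rtol/atol check. It assumes the harness applies a NumPy-style criterion, `|actual - expected| <= atol + rtol * |expected|`; the issue does not show the actual comparison code, and `golden_check` and the sample arrays are hypothetical.

```python
import numpy as np

def golden_check(actual, expected, rtol=5e-3, atol=5e-3):
    """Hypothetical golden check using a NumPy-style elementwise bound."""
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    diff = np.abs(actual - expected)
    bound = atol + rtol * np.abs(expected)   # assumed criterion, not confirmed
    excess = diff - bound                    # > 0 means the element fails
    worst = int(np.argmax(excess))
    return {
        "ok": bool(np.all(diff <= bound)),
        "max_diff": float(diff.max()),
        "worst_index": worst,
        "worst_excess": float(excess.flat[worst]),
    }

# Made-up expected values; the offset is the max_diff reported by CI.
expected = np.array([0.0, 0.05, 1.0])
actual = expected + 0.005540801212191582
report = golden_check(actual, expected)
print(report)
```

Under that assumed criterion, an element with a difference of about 0.00554 can only fail when its expected magnitude is below roughly 0.108, since the relative term adds headroom for larger values. If the real harness compares differently, this reading does not apply.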

Notes

  • A targeted follow-up should decide whether to relax this case's tolerance,
    improve the golden comparison strategy, or investigate the hardware numeric
    drift source.
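
As one possible shape for an improved golden comparison strategy, the sketch below tolerates a small fraction of elements slightly over the bound while still rejecting large errors. It is an illustration only: the function name, `max_outlier_fraction`, and `hard_factor` are hypothetical and are not part of the existing test harness.

```python
import numpy as np

def compare_with_outlier_budget(actual, expected, rtol=5e-3, atol=5e-3,
                                max_outlier_fraction=1e-3, hard_factor=2.0):
    """Pass if few elements exceed the bound and none exceed it badly."""
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    diff = np.abs(actual - expected)
    bound = atol + rtol * np.abs(expected)
    outliers = diff > bound                  # slightly-over elements
    hard_fail = diff > hard_factor * bound   # errors too large to excuse
    fraction = float(outliers.mean())
    passed = (not hard_fail.any()) and fraction <= max_outlier_fraction
    return passed, fraction

# Made-up data: one element out of 10000 drifts just past the bound.
expected = np.zeros(10000)
actual = expected.copy()
actual[0] = 0.0055
print(compare_with_outlier_budget(actual, expected))
```

The trade-off is that a budget like this can hide a genuine regression confined to a few elements, so the thresholds would need to be justified against observed hardware behavior.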

Steps to Reproduce

1. Run the A2/A3 onboard scene tests for `spmd_paged_attention`:

     
     python -m pytest \
       tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py \
       --platform a2a3 \
       --device <a2a3-device-range> \
       -v \
       --clone-protocol ssh \
       --require-pto-isa

2. Observe that TestPagedAttentionUnrollTpushPop::test_run may fail golden
   comparison with output drift slightly above the current tolerance.

Observed in CI:

AssertionError: Golden mismatch on 'out':
max_diff=0.005540801212191582, rtol=0.005, atol=0.005

Expected Behavior

A2/A3 onboard spmd_paged_attention should pass golden comparison; output
difference should stay within the configured tolerance rtol=0.005, atol=0.005,
or the test tolerance/golden strategy should cover expected BF16 hardware drift.
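
For a sense of scale on the BF16 drift mentioned here, the sketch below computes the spacing between adjacent BF16 values (one ULP) at a few magnitudes and expresses the observed max_diff in those units. It assumes the compared output is BF16 with 7 explicit fraction bits and normal-range values; `bf16_ulp` is a hypothetical helper, and the magnitudes are arbitrary examples, since the issue does not state the output value range.

```python
import math

BF16_MANTISSA_BITS = 7  # explicit fraction bits in bfloat16

def bf16_ulp(x: float) -> float:
    """Spacing between adjacent BF16 values around |x| (normal range only)."""
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("x must be finite and non-zero")
    _, exp = math.frexp(abs(x))  # abs(x) = m * 2**exp with 0.5 <= m < 1
    return math.ldexp(1.0, exp - 1 - BF16_MANTISSA_BITS)

observed_max_diff = 0.005540801212191582  # from the CI failure above
for magnitude in (0.125, 0.5, 1.0):
    ulp = bf16_ulp(magnitude)
    print(f"|x|={magnitude}: ulp={ulp}, max_diff = {observed_max_diff / ulp:.2f} ULP")
```

At magnitude 1.0 a single BF16 step is 0.0078125, so the observed difference is smaller than one rounding step there; at smaller magnitudes it spans several steps. Whether the drift is expected therefore depends on the magnitudes of the affected outputs.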

Actual Behavior

TestPagedAttentionUnrollTpushPop::test_run fails golden comparison on A2/A3
onboard hardware.

Observed failures:

  • first run: max_diff=0.005348655860871077, rtol=0.005, atol=0.005
  • retry with pinned PTO-ISA: max_diff=0.005540801212191582, rtol=0.005, atol=0.005

Git Commit ID

CI merge commit: 27ad98b; PR head commit: 75d9d73

CANN Version

9.0.0

Driver Version

No response

Host Platform

Linux (aarch64)

Additional Context

No response
