Skip to content

tensor_avgpool_kernel: ZeroDivisionError at trace time when pool_size == 0 #127

@lucasccordeiro

Description

@lucasccordeiro

Describe the bug

Dear AWS Neuron Team,

This is Lucas Cordeiro (https://ssvlab.github.io/lucasccordeiro/) again; thank you for merging #126 (closing #125). This is a small, lower-severity follow-on of the same class (host-side arithmetic with an admissible-but-unhandled input), surfaced by the same mechanical sweep of // <param> expressions across the kernel set.

tutorials/average_pool2d/tensor_avgpool_kernel(in_tensor, pool_size) divides by zero at trace time if pool_size is set to 0. The pool_size parameter is untyped, has no default, and the body adds no pool_size >= 1 precondition, so pool_size == 0 is admissible from the public signature.

The kernel is unchanged on the current main branch (commit 0dd1b48b612c6a1604a50bc6d034b010739e508d, 2026-05-19) — PR #126 fixed only the interpolate kernels and left this one untouched. It is byte-identical to the NKI 0.3.0 / Neuron SDK 2.29 release commit a87aaa44f7b26241bdb152af8838e287669c3947. In src/nki_samples/tutorials/average_pool2d/average_pool2d_nki_kernels.py:

@nki.jit
def tensor_avgpool_kernel(in_tensor, pool_size):          # line 15 — pool_size untyped, no default
    sz_cin, sz_hin, sz_win = in_tensor.shape
    sz_hout = sz_hin // pool_size                         # line 28 — ZeroDivisionError when pool_size == 0
    sz_wout = sz_win // pool_size                         # line 29 — same
    ...
    sz_pool = pool_size
    pool_view = in_tile.ap([
        [sz_hin * sz_win, sz_p],
        [sz_pool * sz_win, sz_hin // sz_pool],            # lines 51–52 — also divide by sz_pool
        [sz_pool,          sz_win // sz_pool],
        ...
    ])
    ...
    nisa.tensor_scalar(dst=out_tile, data=sum_tile, op0=nl.multiply,
                       operand0=1.0 / (pool_size * pool_size))   # line 59 — also unsafe at pool_size == 0

The first floor-div (sz_hout, line 28) is the trace-time crash point; the sz_wout floor-div (line 29), the two .ap() stride divisions (lines 51–52), and the 1.0 / (pool_size * pool_size) reciprocal (line 59) are all equally unsafe at pool_size == 0 and would fire if line 28 were guarded individually.

Expected Behavior

pool_size <= 0 is rejected at the API boundary with a clear message specifying the violated precondition, e.g., AssertionError: pool_size must be >= 1, rather than an uncaught ZeroDivisionError in the kernel body.

Current Behavior

pool_size == 0 raises ZeroDivisionError: division by zero at JIT trace time, from sz_hout = sz_hin // pool_size (line 28), before any device-side codegen. In a full @nki.jit trace, the exception propagates out of tensor_avgpool_kernel.

Reproduction Steps

Pure host-side Python; reproduces in plain CPython without a Trainium device or neuronx-cc:

sz_hin    = 8
pool_size = 0                       # the bug

sz_hout = sz_hin // pool_size       # ZeroDivisionError: division by zero

Actual traceback (CPython 3.14):

Traceback (most recent call last):
  File "<string>", line 4, in <module>
    sz_hout = sz_hin // pool_size
              ~~~~~~~^^~~~~~~~~~~
ZeroDivisionError: division by zero

This is the exact expression at average_pool2d_nki_kernels.py:28; the kernel body never executes under pool_size == 0.

A symbolic counterexample (witness from ESBMC-Python against a stubbed port) that enumerates both floor-div sites in a single run via --multi-property is at:

ESBMC output (trimmed; reproducer lines 24/25 mirror upstream sz_hout/sz_wout at lines 28/29):

[Counterexample]
Violated property:
  file harness/avgpool_hostarith_unguarded.py line 24 column 4 function output_shape
  division by zero
  CWE: CWE-369
  pool_size = 0

✗ FAILED: 'division by zero at ... line 24 ... function output_shape'
✗ FAILED: 'division by zero at ... line 25 ... function output_shape'
VERIFICATION FAILED

Regression Issue

  • Select this option if this issue appears to be a regression.

Possible Solution

Add a one-line precondition (and, optionally, a type annotation documenting it), mirroring the precondition style adopted in #126 for the interpolate kernels:

@nki.jit
def tensor_avgpool_kernel(in_tensor, pool_size: int):
    assert pool_size >= 1, "pool_size must be >= 1"
    sz_cin, sz_hin, sz_win = in_tensor.shape
    sz_hout = sz_hin // pool_size
    ...

Additional Information/Context

Found while extending our formal verifier ESBMC-Python (https://github.com/esbmc/esbmc) with a proof-of-concept verification flow for NKI kernels. The same // <param> sweep that surfaced #125 (interpolate, chunk_size == 1) flags this site. The flow compiles a thin stub of nki.language and runs symbolic verification against host-side shape and trip-count arithmetic; ESBMC's integer division-by-zero check fires at pool_size == 0, and --multi-property enumerates both the sz_hout and sz_wout floor-divs in one run (witness pool_size = 0, CWE-369).

Full methodology, harness, regression suite, and audit notes (Finding 16):

https://github.com/lucasccordeiro/AWS-Neuron

neuronx-cc version used

Not applicable — the reproducer is pure CPython against nki-samples source at commit 0dd1b48b612c6a1604a50bc6d034b010739e508d (current main; avgpool kernel unchanged since the NKI 0.3.0 / Neuron SDK 2.29 release commit a87aaa44f7b26241bdb152af8838e287669c3947). No neuronx-cc invocation is required to trigger the defect.

Framework(s) and their versions used (JAX, PyTorch, etc..)

  • Python 3.14.5 (latest stable; the standalone reproducer is version-independent) - ESBMC 8.3.0 (Python frontend, default solver Bitwuzla 0.9.0) — used to produce the symbolic counterexample - nki-samples @ 0dd1b48b612c6a1604a50bc6d034b010739e508d (current main HEAD, 2026-05-19)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions