Describe the bug
Dear AWS Neuron Team,
This is Lucas Cordeiro (https://ssvlab.github.io/lucasccordeiro/). Thank you for the NKI samples repository — it has been a valuable resource for extending our verifier ESBMC-Python (https://github.com/esbmc/esbmc) to handle this codebase.
While reviewing the contributed interpolation kernels, I found that both interpolate_bilinear_2x_fwd and interpolate_trilinear_2x_fwd will divide by zero at trace time if chunk_size is set to 1.
The defect appears in the following files, based on upstream commit a87aaa44f7b26241bdb152af8838e287669c3947 (NKI 0.3.0 / Neuron SDK 2.29 release, 22 April 2026):
contributed/interpolate_bilinear_fwd.py, lines 59–62
contributed/interpolate_trilinear_fwd.py, lines 67–69
Both kernels use the same host-side trip-count expression:
wdw_size = chunk_size
step_size = wdw_size - 1 # == 0 when chunk_size == 1
for _ in nl.static_range(math.ceil((h_src - wdw_size) / step_size) + 1):
...
When chunk_size is 1, step_size becomes zero, and the division inside math.ceil(...) raises ZeroDivisionError.
Expected Behavior
Either:
chunk_size == 1 is rejected at the API boundary with a clear AssertionError / ValueError naming the violated precondition (chunk_size >= 2), or
chunk_size == 1 is supported as a meaningful "no-overlap" mode and the trip-count expression is rewritten to avoid the zero divisor.
The public type signatures (chunk_size: int = 10) admit chunk_size == 1, so the current behaviour — an uncaught ZeroDivisionError from inside the kernel body — might be surprising.
Current Behavior
chunk_size == 1 raises ZeroDivisionError: division by zero at JIT trace time, from inside the loop-bound expression. Stack trace from the minimal reproducer below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
In a full @nki.jit trace the same exception propagates out of interpolate_bilinear_2x_fwd / interpolate_trilinear_2x_fwd before any device-side codegen takes place.
Reproduction Steps
The defect is in pure host-side Python and reproduces in plain CPython without a Trainium device or neuronx-cc:
import math
chunk_size = 1 # the bug
x_src = 5 # any positive int
wdw_size = chunk_size
step_size = wdw_size - 1 # 0
n = math.ceil((x_src - wdw_size) / step_size) + 1
# ZeroDivisionError: division by zero
This is the exact expression at contributed/interpolate_bilinear_fwd.py:62 and contributed/interpolate_trilinear_fwd.py:69; the bodies of both kernels never execute under chunk_size == 1.
A symbolic counterexample (witness produced by ESBMC-Python against a stubbed port of the kernels) and the corresponding @nki.jit entry-point harnesses are at:
Regression Issue
Possible Solution
Add the missing precondition early in each kernel and (optionally) handle the single-window regime explicitly:
assert chunk_size >= 2, "chunk_size must be >= 2 (step_size must be positive)"
assert h_src >= 1, "h_src must be positive"
wdw_size = chunk_size
step_size = wdw_size - 1
n_iter = 1 if h_src <= wdw_size else math.ceil((h_src - wdw_size) / step_size) + 1
for h in nl.static_range(n_iter):
...
(Equivalent change for d_src in interpolate_trilinear_fwd.py.) If the design intent is that chunk_size == 1 should be a meaningful "no-overlap" mode, the right fix is different — replace step_size = chunk_size - 1 with the desired stride — but as the code stands today chunk_size == 1 cannot work under any reading.
Additional Information/Context
Found while extending our formal verifier ESBMC-Python (https://github.com/esbmc/esbmc) with a proof-of-concept verification flow for NKI kernels. The flow compiles a thin stub of nki.language and runs symbolic verification against host-side shape and trip-count arithmetic.
A related but distinct boundary case — h_src == 1 (or d_src == 1) with chunk_size >= 2 — causes the same outer loop to be zero-trip: math.ceil((1 - chunk_size) / (chunk_size - 1)) + 1 == 0. The kernel allocates dst_arr via nl.ndarray(...) but never writes to it, so the caller receives a correctly-shaped tensor with no kernel writes (no exception). I have not filed this as a separate issue yet because shape-and-bounds verification does not flag it, but the precondition h_src >= chunk_size (or an explicit single-window branch as above) would close it.
Full methodology, harnesses, regression suite (interpolate_bilinear_chunk1, interpolate_trilinear_chunk1 pin chunk_size == 1), and audit notes (Finding 15):
https://github.com/lucasccordeiro/AWS-Neuron
neuronx-cc version used
Not applicable — the reproducer is pure CPython against nki-samples source at commit a87aaa44f7b26241bdb152af8838e287669c3947 (NKI 0.3.0, Neuron SDK 2.29). No neuronx-cc invocation is required to trigger the defect.
Framework(s) and their versions used (JAX, PyTorch, etc..)
- Python 3.14.4 (for the standalone reproducer) - ESBMC 8.3.0 (Python frontend) — used to produce the symbolic counterexample - nki-samples @
a87aaa44f7b26241bdb152af8838e287669c3947 (NKI 0.3.0)
Describe the bug
Dear AWS Neuron Team,
This is Lucas Cordeiro (https://ssvlab.github.io/lucasccordeiro/). Thank you for the NKI samples repository — it has been a valuable resource for extending our verifier ESBMC-Python (https://github.com/esbmc/esbmc) to handle this codebase.
While reviewing the contributed interpolation kernels, I found that both
interpolate_bilinear_2x_fwdandinterpolate_trilinear_2x_fwdwill divide by zero at trace time ifchunk_sizeis set to1.The defect appears in the following files, based on upstream commit
a87aaa44f7b26241bdb152af8838e287669c3947(NKI 0.3.0 / Neuron SDK 2.29 release, 22 April 2026):contributed/interpolate_bilinear_fwd.py, lines 59–62contributed/interpolate_trilinear_fwd.py, lines 67–69Both kernels use the same host-side trip-count expression:
When
chunk_sizeis1,step_sizebecomes zero, and the division insidemath.ceil(...)raisesZeroDivisionError.Expected Behavior
Either:
chunk_size == 1is rejected at the API boundary with a clearAssertionError/ValueErrornaming the violated precondition (chunk_size >= 2), orchunk_size == 1is supported as a meaningful "no-overlap" mode and the trip-count expression is rewritten to avoid the zero divisor.The public type signatures (
chunk_size: int = 10) admitchunk_size == 1, so the current behaviour — an uncaughtZeroDivisionErrorfrom inside the kernel body — might be surprising.Current Behavior
chunk_size == 1raisesZeroDivisionError: division by zeroat JIT trace time, from inside the loop-bound expression. Stack trace from the minimal reproducer below:In a full
@nki.jittrace the same exception propagates out ofinterpolate_bilinear_2x_fwd/interpolate_trilinear_2x_fwdbefore any device-side codegen takes place.Reproduction Steps
The defect is in pure host-side Python and reproduces in plain CPython without a Trainium device or
neuronx-cc:This is the exact expression at
contributed/interpolate_bilinear_fwd.py:62andcontributed/interpolate_trilinear_fwd.py:69; the bodies of both kernels never execute underchunk_size == 1.A symbolic counterexample (witness produced by ESBMC-Python against a stubbed port of the kernels) and the corresponding
@nki.jitentry-point harnesses are at:Regression Issue
Possible Solution
Add the missing precondition early in each kernel and (optionally) handle the single-window regime explicitly:
(Equivalent change for
d_srcininterpolate_trilinear_fwd.py.) If the design intent is thatchunk_size == 1should be a meaningful "no-overlap" mode, the right fix is different — replacestep_size = chunk_size - 1with the desired stride — but as the code stands todaychunk_size == 1cannot work under any reading.Additional Information/Context
Found while extending our formal verifier ESBMC-Python (https://github.com/esbmc/esbmc) with a proof-of-concept verification flow for NKI kernels. The flow compiles a thin stub of
nki.languageand runs symbolic verification against host-side shape and trip-count arithmetic.A related but distinct boundary case —
h_src == 1(ord_src == 1) withchunk_size >= 2— causes the same outer loop to be zero-trip:math.ceil((1 - chunk_size) / (chunk_size - 1)) + 1 == 0. The kernel allocatesdst_arrvianl.ndarray(...)but never writes to it, so the caller receives a correctly-shaped tensor with no kernel writes (no exception). I have not filed this as a separate issue yet because shape-and-bounds verification does not flag it, but the preconditionh_src >= chunk_size(or an explicit single-window branch as above) would close it.Full methodology, harnesses, regression suite (
interpolate_bilinear_chunk1,interpolate_trilinear_chunk1pinchunk_size == 1), and audit notes (Finding 15):https://github.com/lucasccordeiro/AWS-Neuron
neuronx-cc version used
Not applicable — the reproducer is pure CPython against nki-samples source at commit
a87aaa44f7b26241bdb152af8838e287669c3947(NKI 0.3.0, Neuron SDK 2.29). Noneuronx-ccinvocation is required to trigger the defect.Framework(s) and their versions used (JAX, PyTorch, etc..)
a87aaa44f7b26241bdb152af8838e287669c3947(NKI 0.3.0)