A small, runnable demonstration that the AWS Neuron Kernel Interface (NKI) kernels can be partially verified by ESBMC, using a thin Python stub library that models NKI tile shapes and bounds — without executing on a real NeuronCore.
The verifier runs in two phases:
- Phase 1 (default flags) — shape and bounds contracts on every NKI kernel call (partition-dim limits, in-bounds slicing, shape-equality on DMA / elementwise / matmul, hardware constraints on the matmul unit), via stub asserts. Contract violations surface as precise counterexamples.
- Phase 2 (
--overflow-check, default div-by-zero) — safety properties on host-side index arithmetic. Signed-integer overflow and integer division-by-zero, mapped to CWE-190 / CWE-369. Reproduces theZeroDivisionError(CWE-369) on the upstream trip-count expression atchunk_size = 1that motivated aws-neuron/nki-samples#125 — now fixed upstream by PR #126, which adopted the preconditions this PoC proposed. Audit-style targets (today:audit15_hostarith_unguarded) additionally pass--multi-propertyso ESBMC enumerates every violated property in one run rather than stopping at the first — useful when a host-arithmetic reproducer exercises multiple at-risk divisors. Standard phase-2 stays on plain--overflow-checkbecause--multi-propertymeasurably slows symbolic-shape verification (~5 min vs <1 min per symbolic target) without changing the verdict.
The PoC reproduces a real shipped bug from
aws-neuron/nki-samples PR #74
("matmul sbuf allocation dimension fix"). The pre-PR nki_matmul_hoist_load_
allocated lhsT_tiles with free-dim TILE_N (=512) when it should have been
TILE_M (=128). ESBMC catches the resulting shape mismatch retroactively, in
about two seconds on a laptop:
(The figure regenerates from
figures/bug_pr74.html — open it in a browser
for the same rendering at any zoom level.)
The same PoC also catches host-side arithmetic bugs prospectively.
aws-neuron/nki-samples#125
(ZeroDivisionError at JIT trace time in the
interpolate_{bi,tri}linear_2x_fwd chunking math when chunk_size == 1)
was reported from this PoC and closed by upstream
PR #126, which
applied the chunk_size >= 2 / dim >= chunk_size preconditions
proposed in the AUDIT-15 write-up verbatim.
A second, lower-severity host-arithmetic gap of the same class —
tensor_avgpool_kernel dividing by zero when pool_size == 0
(AUDIT Finding 16) — was subsequently filed as
aws-neuron/nki-samples#127,
framed explicitly as a defensive-programming follow-on rather than a
security finding.
.
├── verify.py # manifest of (entry script, ESBMC args, expected verdict)
├── Makefile # `make verify`
├── REPORT.md # full method, soundness analysis, scope, coverage
├── RETROSPECTIVE.md # narrative summary aimed at the ESBMC team
├── AUDIT.md # stub-correctness audit + incident log
└── harness/ # everything ESBMC sees
├── stubs.py # canonical stub library — single source of truth
├── kernels/ # ported NKI kernels, each `from stubs import *`
│ └── <name>.py
└── <name>.py # entry scripts; import stubs + kernels.<name>
ESBMC's Python frontend searches the entry script's directory for modules,
so stubs.py and kernels/ live next to the entry scripts under
harness/. Each entry script imports the stub names and the kernel
function it exercises, sets up concrete (or nondet) input shapes, and
asserts the kernel's output contract.
| Target | Entry script (harness/) | Kernel module | Expected |
|---|---|---|---|
tensor_add |
tensor_add.py |
kernels/tensor_add.py |
SUCCESSFUL |
tensor_add_buggy |
tensor_add_buggy.py |
kernels/tensor_add_buggy.py |
FAILED |
tensor_add_symbolic |
tensor_add_symbolic.py |
kernels/tensor_add.py |
SUCCESSFUL (--unwind 6) |
transpose2d |
transpose2d.py |
kernels/transpose2d.py |
SUCCESSFUL |
transpose2d_buggy |
transpose2d_buggy.py |
kernels/transpose2d_buggy.py |
FAILED |
matmul |
matmul.py |
kernels/matmul.py |
SUCCESSFUL |
matmul_big |
matmul_big.py |
kernels/matmul.py |
SUCCESSFUL |
matmul_buggy |
matmul_buggy.py |
kernels/matmul_buggy.py |
FAILED |
maxpooling |
maxpooling.py |
kernels/maxpooling.py |
SUCCESSFUL |
maxpooling_buggy |
maxpooling_buggy.py |
kernels/maxpooling_buggy.py |
FAILED |
interpolate_bilinear |
interpolate_bilinear.py |
kernels/interpolate_bilinear.py |
SUCCESSFUL |
interpolate_bilinear_buggy |
interpolate_bilinear_buggy.py |
kernels/interpolate_bilinear_buggy.py |
FAILED |
interpolate_trilinear |
interpolate_trilinear.py |
kernels/interpolate_trilinear.py |
SUCCESSFUL |
interpolate_trilinear_buggy |
interpolate_trilinear_buggy.py |
kernels/interpolate_trilinear_buggy.py |
FAILED |
interpolate_bilinear_chunk1 |
interpolate_bilinear_chunk1.py |
kernels/interpolate_bilinear.py |
FAILED — chunk_size=1 boundary input (AUDIT Finding 15) |
interpolate_trilinear_chunk1 |
interpolate_trilinear_chunk1.py |
kernels/interpolate_trilinear.py |
FAILED — same boundary as bilinear |
audit15_hostarith_unguarded |
audit15_hostarith_unguarded.py |
— (standalone) | phase-2 only: FAILED with division by zero (CWE-369) on the upstream trip-count expression at chunk_size = 1 (now regressed against upstream PR #126) |
avgpool_hostarith_unguarded |
avgpool_hostarith_unguarded.py |
— (standalone) | phase-2 only: FAILED with division by zero (CWE-369) on the upstream tensor_avgpool_kernel floor-divs at pool_size = 0 — --multi-property surfaces both sz_hout and sz_wout sites in one run (AUDIT Finding 16; filed upstream as #127) |
matmul_basic |
matmul_basic.py |
kernels/matmul_basic.py |
SUCCESSFUL |
matmul_basic_buggy |
matmul_basic_buggy.py |
kernels/matmul_basic_buggy.py |
FAILED |
matmul_basic_symbolic |
matmul_basic_symbolic.py |
kernels/matmul_basic.py |
SUCCESSFUL (dtype sweep; input dtype ∈ {BF16, F16, F32}; no --unwind) |
mamba_v1 |
mamba_v1.py |
kernels/mamba_v1.py |
SUCCESSFUL |
mamba_v1_buggy |
mamba_v1_buggy.py |
kernels/mamba_v1_buggy.py |
FAILED |
transpose2d_symbolic |
transpose2d_symbolic.py |
kernels/transpose2d.py |
SUCCESSFUL (--unwind 5; F1, F2 ∈ [1, 4]) |
maxpooling_symbolic |
maxpooling_symbolic.py |
kernels/maxpooling.py |
SUCCESSFUL (--unwind 5; H = k·128, k ∈ [1, 4]) |
mamba_v1_symbolic |
mamba_v1_symbolic.py |
kernels/mamba_v1.py |
SUCCESSFUL (--unwind 5; state ∈ [1, 4], seq ∈ [2, 8]) |
interpolate_bilinear_symbolic |
interpolate_bilinear_symbolic.py |
kernels/interpolate_bilinear.py |
SUCCESSFUL (--unwind 5; H_src, W_src ∈ {10, 19, 28}) |
interpolate_trilinear_symbolic |
interpolate_trilinear_symbolic.py |
kernels/interpolate_trilinear.py |
SUCCESSFUL (--unwind 5; D_src, H_src, W_src ∈ {10, 19}) |
matmul_tiled |
matmul_tiled.py |
kernels/matmul_tiled.py |
SUCCESSFUL |
matmul_tiled_buggy |
matmul_tiled_buggy.py |
kernels/matmul_tiled_buggy.py |
FAILED |
matmul_tiled_symbolic |
matmul_tiled_symbolic.py |
kernels/matmul_tiled.py |
SUCCESSFUL (--unwind 4; M ∈ {128, 256, 384}, N ∈ {512, 1024}) |
matmul_hoist_load |
matmul_hoist_load.py |
kernels/matmul_hoist_load.py |
SUCCESSFUL |
matmul_hoist_load_buggy |
matmul_hoist_load_buggy.py |
kernels/matmul_hoist_load_buggy.py |
FAILED |
matmul_hoist_load_symbolic |
matmul_hoist_load_symbolic.py |
kernels/matmul_hoist_load.py |
SUCCESSFUL (--unwind 3; K, M ∈ {128, 256}, N ∈ {512, 1024}) |
matmul_hoist_load_historical |
matmul_hoist_load_historical.py |
kernels/matmul_hoist_load_historical.py |
FAILED — reproduces the pre-fix bug from upstream aws-neuron/nki-samples#74 |
matmul_block_free |
matmul_block_free.py |
kernels/matmul_block_free.py |
SUCCESSFUL |
matmul_block_free_buggy |
matmul_block_free_buggy.py |
kernels/matmul_block_free_buggy.py |
FAILED |
matmul_block_free_symbolic |
matmul_block_free_symbolic.py |
kernels/matmul_block_free.py |
SUCCESSFUL (--unwind 3; K ∈ {128, 256}, M ∈ {256, 512}, N ∈ {1024, 2048}) |
matmul_fully_optimized |
matmul_fully_optimized.py |
kernels/matmul_fully_optimized.py |
SUCCESSFUL |
matmul_fully_optimized_buggy |
matmul_fully_optimized_buggy.py |
kernels/matmul_fully_optimized_buggy.py |
FAILED |
matmul_fully_optimized_symbolic |
matmul_fully_optimized_symbolic.py |
kernels/matmul_fully_optimized.py |
SUCCESSFUL (--unwind 3; K, M ∈ {256, 512}, N ∈ {1024, 2048}) |
mamba_v2 |
mamba_v2.py |
kernels/mamba_v2.py |
SUCCESSFUL |
mamba_v2_buggy |
mamba_v2_buggy.py |
kernels/mamba_v2_buggy.py |
FAILED |
mamba_v3 |
mamba_v3.py |
kernels/mamba_v3.py |
SUCCESSFUL |
mamba_v3_buggy |
mamba_v3_buggy.py |
kernels/mamba_v3_buggy.py |
FAILED |
mamba_v3_symbolic |
mamba_v3_symbolic.py |
kernels/mamba_v3.py |
SUCCESSFUL (--unwind 5; STATE ∈ [1, 4], num_seq_tiles ∈ [1, 3]) |
avgpool |
avgpool.py |
kernels/avgpool.py |
SUCCESSFUL |
avgpool_buggy |
avgpool_buggy.py |
kernels/avgpool_buggy.py |
FAILED |
avgpool_symbolic |
avgpool_symbolic.py |
kernels/avgpool.py |
SUCCESSFUL (--unwind 5; H, W ∈ {6, 8, 10}) |
attn_fwd_v1 |
attn_fwd_v1.py |
kernels/attn_fwd_v1.py |
SUCCESSFUL |
attn_fwd_v1_buggy |
attn_fwd_v1_buggy.py |
kernels/attn_fwd_v1_buggy.py |
FAILED |
attn_fwd_v2 |
attn_fwd_v2.py |
kernels/attn_fwd_v2.py |
SUCCESSFUL |
attn_fwd_v2_buggy |
attn_fwd_v2_buggy.py |
kernels/attn_fwd_v2_buggy.py |
FAILED |
attn_fwd_v3 |
attn_fwd_v3.py |
kernels/attn_fwd_v3.py |
SUCCESSFUL (--unwind 5; seqlen = 512) |
attn_fwd_v3_buggy |
attn_fwd_v3_buggy.py |
kernels/attn_fwd_v3_buggy.py |
FAILED (--unwind 5) |
attn_fwd_v3_symbolic |
attn_fwd_v3_symbolic.py |
kernels/attn_fwd_v3.py |
SUCCESSFUL (--unwind 9; SEQLEN ∈ {512, 1024}) |
pipelined_attention |
pipelined_attention.py |
kernels/pipelined_attention.py |
SUCCESSFUL (shape-skeleton only; inner Flash Attention pipeline deferred — see ROADMAP) |
pipelined_attention_load_q |
pipelined_attention_load_q.py |
kernels/pipelined_attention.py |
SUCCESSFUL (shape skeleton + upstream load_q inner helper; six remaining inner helpers deferred — see ROADMAP) |
verify.py is the single source of truth for these pairings, the ESBMC
flags, and the expected verdicts.
Requires ESBMC 8.3.0 or later with the Python frontend. The list of
upstream issues this PoC depends on (and the PRs that closed each one)
is in RETROSPECTIVE.md.
make verify # run both phases on every target, tally results
python3 verify.py NAME # run a single target (both phases if it opts into both)
python3 verify.py --phase=1 # phase-1 only (shape and bounds)
python3 verify.py --phase=2 # phase-2 only (safety properties)
make dashboard # rebuild dashboard.html from current datadashboard.html is a single self-contained HTML page
that pulls live data from verify.py and RETROSPECTIVE.md — headline
stats, all targets with kind/family/flags, every ESBMC issue filed with
its status, kernel coverage tree, and source-rewriting status. Open it
locally after make dashboard to see the current state of the PoC.
Concrete-shape targets complete in 1–3 seconds wall-clock each on a
stock laptop. The thirteen symbolic-shape targets run for ~5–90
seconds depending on the size of the shape family they sweep —
eleven of the thirteen --unwind values are k-induction-certified
complete (see REPORT.md for the table). One further symbolic target,
matmul_basic_symbolic, sweeps the dtype axis rather than shape (the
basic kernel is fixed at 64×128×512, so a shape sweep is vacuous): it
certifies the dtype-passthrough contract for every supported input
dtype in one run and needs no --unwind since the kernel has no loops.
Phase-1 (59 runs) finishes in about 14 minutes; phase-2 (39 runs —
every concrete- and symbolic-shape good kernel plus the AUDIT-15
reproducer) finishes in about 11 minutes; the combined two-phase sweep
is ~26 minutes end-to-end.
REPORT.md— full method, contract provenance, soundness analysis, stub-library catalogue, kernel coverage, what doesn't yet work.RETROSPECTIVE.md— narrative summary aimed at the ESBMC team: what the PoC exercised, the upstream issues filed and how each retired, source-rewriting history, verification patterns.AUDIT.md— stub-correctness audit (the original duplicated-stubs diff plus the per-port contract-tightness incidents).ROADMAP.md— what's left to port, grouped by stub-library effort.
- ESBMC: https://github.com/esbmc/esbmc — 8.3.0, default Bitwuzla solver.
- NKI samples: https://github.com/aws-neuron/nki-samples — the
tensor_addition,transpose2d,matrix_multiplication,fused_mamba,average_pool2d,attention_fwd_performance(v1) tutorials and thecontributed/{matmul, maxpooling, interpolate_bilinear_fwd, interpolate_trilinear_fwd}community kernels. Pinned snapshota87aaa44.
