yolo26l on inf2/trn1 (NCv2): numerical error in C2PSA Attention.forward + LegalizeSundaMacro internal error at bs>=4 #1323

@jaggiK

Description

Bug report — AWS Neuron / NeuronCore-v2

Affected SDK: neuronx-cc 2.21.33363.0 (inf2 DLAMI 2.8) and neuronx-cc 2.24.8799.0 (trn1 DLAMI 2.9)
Affected silicon: NeuronCore-v2 — confirmed reproducible on inf2.xlarge and trn1.2xlarge. Not affecting NeuronCore-v1 (inf1.xlarge compiles and runs the same model correctly).


TL;DR

We have two distinct compiler bugs on NeuronCore-v2 affecting Ultralytics' YOLO26 (a successor to YOLOv8 in the Ultralytics line). Together they make yolo26l unusable for production on inf2/trn1 even though the silicon clearly has the throughput. We need help with:

  1. Bug A — silent numerical error: Attention.forward (used in YOLO26's C2PSA block) is lowered to a NEFF that produces mathematically wrong outputs. Match rate against CPU FP32 reference drops from 96% (correct, on inf1) to 44.7% on inf2/trn1 — the model silently misses ~55% of high-confidence detections at production thresholds. Reproducible across BF16 / FP16 / FP32 / --auto-cast matmult / --model-type=generic / --model-type=unet-inference / --optlevel 1-3 / explicit re-write of the attention math with einsum.

  2. Bug B — internal compiler error at batch ≥ 4 in the YOLO26 backbone: compiling layers 0-9 of yolo26l at batch ≥ 4 reproducibly fails with [TEN404] LegalizeSundaMacro: Incorrect IR / RuntimeError: neuronx-cc failed with 70. Affects every flag combination tested. bs=3 compiles cleanly; bs=4, 8, 16 all fail.

These two bugs interact: Bug A forces us to bypass C2PSA (run it on CPU as a hybrid), and Bug B prevents us from amortizing the resulting two-call Neuron pipeline at higher batch sizes. The combined impact is that we hit 158 FPS hybrid throughput on inf2.xlarge with CPU C2PSA (or 199 FPS with the C2PSA offloaded to a remote GPU sidecar) — instead of the ~480 FPS the silicon can demonstrably achieve. We've separately measured the broken-accuracy full-yolo26l at bs=8 hitting 479 FPS in pure-Neuron mode; the throughput is there, only the compiler is wrong.


What we ran

  • Model: Ultralytics YOLO26-large, official weights from ultralytics-pytorch 8.4.47.
  • Instances: inf2.xlarge (i-0757abb324da395b0) and trn1.2xlarge (i-0a077315608c00165, separately verified) in us-west-2.
  • Framework: torch_neuronx 2.9.0.2.13 / neuronx-cc 2.21-2.24 / PyTorch 2.8 / Ubuntu 24.04 DLAMI.
  • Compile recipe: Jim Burtoft's contrib script (jimburtoft/neuronx-distributed-inference) — sets detect.end2end = False, calls model.fuse(), m.export = True, then torch_neuronx.trace(model, dummy, compiler_args=…).
  • Validation: 51-image production CCTV dataset (person_dataset), conf>=0.25 threshold, IoU>=0.5 box matching, same-class detections vs CPU FP32 reference. Match rate is the fraction of CPU detections that have a corresponding Neuron detection.
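
For readers who want to reproduce the metric, here is a minimal sketch of the match-rate computation described above. It is not the exact script from our repo: the greedy one-to-one matching, the tuple layout, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Det = Tuple[Box, int, float]             # (box, class_id, score), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_rate(cpu_dets: List[Det], nrn_dets: List[Det],
               cpu_conf: float = 0.25, nrn_conf: float = 0.25,
               iou_thr: float = 0.5) -> float:
    """Fraction of CPU detections that have a same-class Neuron match.

    Assumption: greedy one-to-one matching, highest-scoring CPU detection
    first; each Neuron detection can be claimed at most once.
    """
    cpu = [d for d in cpu_dets if d[2] >= cpu_conf]
    nrn = [d for d in nrn_dets if d[2] >= nrn_conf]
    used = set()
    matched = 0
    for box, cls, _ in sorted(cpu, key=lambda d: -d[2]):
        best_j, best_iou = None, iou_thr
        for j, (nbox, ncls, _) in enumerate(nrn):
            if j in used or ncls != cls:
                continue
            v = iou(box, nbox)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            used.add(best_j)
            matched += 1
    return matched / len(cpu) if cpu else 1.0


# Toy example: two CPU detections, only one has a Neuron counterpart.
cpu = [((0, 0, 10, 10), 0, 0.9), ((20, 20, 30, 30), 0, 0.8)]
nrn = [((1, 1, 10, 10), 0, 0.85)]
print(match_rate(cpu, nrn))
```

Keeping the CPU and Neuron thresholds as separate parameters makes it possible to lower only the Neuron threshold while holding the CPU reference set fixed.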

Reproducer scripts and minimal-repro artifacts are in our public-internal repo aegissystems/hw-cost_optimization. Specific repro paths:

  • inf1_vs_inf2/yolo26/scripts/compile_yolo26_no_psa.py — diagnostic that proves Bug A
  • inf1_vs_inf2/yolo26/scripts/test_rewritten_attention.py — proves Bug A is op-pattern-specific in lowering, not in user-PyTorch code
  • inf1_vs_inf2/yolo26/scripts/compile_yolo26_hybrid_bs.py — minimal repro of Bug B (parameterized batch size)

Reference baseline — inf1.xlarge (NeuronCore-v1) compiles and runs YOLO26L correctly

The same YOLO26-large weights, the same input shape (384×640), and the equivalent compile recipe work correctly on NeuronCore-v1. This is the strongest evidence that the issues below are NCv2-side compiler bugs, not model bugs or recipe bugs.

| Metric | inf1.xlarge (NCv1) | inf2.xlarge (NCv2) |
| --- | --- | --- |
| Compile success at production batch sizes | ✅ bs=1 with --neuroncore-pipeline-cores 4 | ❌ bs ≥ 4 fails (Bug B) |
| Match rate vs CPU FP32 (conf≥0.25) | 95.7–99.3% | 44.7% ❌ (Bug A) |
| Mean IoU on matched detections | 0.96 | 0.93 |
| Score delta (Neuron − CPU FP32) | −0.002 (essentially zero) | +0.038 (systematic bias) |
| Peak FPS (production config) | 286 FPS @ DP×4, 2 in-flight | 158 FPS (CPU c2psa hybrid) / 199 FPS (GPU sidecar hybrid) |
| Compile recipe | prepare_yolo26(weights, dtype=torch.float32) + torch_neuron.trace(model, [dummy], compiler_args=["--neuroncore-pipeline-cores", "4"]) | same prepare_yolo26 + torch_neuronx.trace (every flag combo broken) |
| Spot price (us-west-2) | $0.034/hr | $0.143/hr |
| $/M images at production accuracy | $0.033/M | $0.21–0.25/M (hybrid only; direct compile not deployable) |
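
The cost row follows from the spot price and the sustained FPS. A minimal sketch of that arithmetic is below; it assumes the figure is instance spot price only (no sidecar host, storage, or network cost), and the function name is ours.

```python
def usd_per_million_images(spot_usd_per_hour: float, fps: float) -> float:
    """Cost in USD to process one million images at a sustained frame rate.

    Assumption: only the instance's hourly spot price is counted.
    """
    images_per_hour = fps * 3600
    return spot_usd_per_hour / images_per_hour * 1_000_000


inf1_cost = usd_per_million_images(0.034, 286)        # inf1.xlarge, production config
inf2_cpu_hybrid = usd_per_million_images(0.143, 158)  # inf2.xlarge, CPU C2PSA hybrid

print(f"inf1.xlarge:            ${inf1_cost:.3f}/M")
print(f"inf2.xlarge CPU hybrid: ${inf2_cpu_hybrid:.3f}/M")
```

With the table's numbers this comes out at about $0.033 per million images for inf1.xlarge at 286 FPS and about $0.25 per million for inf2.xlarge at 158 FPS.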

This is our current production deployment for yolo26l: inf1.xlarge with Jim Burtoft's (JB) recipe at 384×640, 286 FPS at >95% match rate, validated on a production CCTV dataset. The inf1 path is also tested at 640×640 (with --neuroncore-pipeline-cores 4 to fit, since a single v1 NeuronCore exhausts its state buffer (SB) at that resolution) and produces a 99.3% match rate at 56.7 FPS.

The ask: if NCv1 can lower yolo26's Attention.forward correctly and at any batch size, NCv2 should be able to as well. We are essentially asking AWS to bring NCv2's lowering of this attention pattern to functional parity with NCv1's.


Bug A — C2PSA / Attention numerical lowering error

Symptom

A YOLO26-large compiled with Ultralytics' standard recipe and any reasonable flag combination produces a NEFF that:

  • misses ~55% of the CPU FP32 reference's high-confidence detections at conf>=0.25 (63 of 141 matched; Neuron emits 78 detections in total)
  • produces matched detections with IoU 0.93-0.95 against CPU (geometry is close)
  • omits ~60 of CPU's 141 detections entirely, at any score, rather than merely score-shifting them

Reproducibility (numbers)

Same prepare_yolo26 model, same dataset, same decoder both sides:

| Variant | Match rate vs CPU FP32 | Mean IoU (matched) | Score delta |
| --- | --- | --- | --- |
| --optlevel 3 --auto-cast all --model-type unet-inference (BF16) | 44.68% | 0.93 | +0.038 |
| --auto-cast none (full FP32) | 44.68% | 0.95 | +0.039 |
| --auto-cast matmult --auto-cast-type bf16 (BF16 matmul, FP32 softmax) | 44.68% | 0.95 | +0.039 |
| --auto-cast all --auto-cast-type fp16 | 44.68% | 0.95 | +0.039 |
| --model-type=generic BF16 (640×640) | 42.14% | 0.92 | +0.048 |
| --optlevel 1 --auto-cast all --auto-cast-type bf16 | 44.68% | 0.95 | +0.039 |
| trn1.2xlarge with current SDK (neuronx-cc 2.24, fresh DLAMI) | 44.68% | 0.93 | +0.040 |
| inf1.xlarge native (NeuronCore-v1, same model) | 95.74% | 0.96 | −0.002 |

A confidence-threshold sweep on the BF16 unet-inference artifact rules out simple score-shift:

NRN conf  cpu_dets  nrn_dets  matched  match_rate
0.25      141       78        63       0.4468
0.20      141       90        65       0.4610
0.10      141       147       73       0.5177
0.05      141       266       81       0.5745     ← still 57% even at extremely permissive threshold

Even at conf=0.05 with 185 unmatched Neuron detections (266 total NRN − 81 matched), only 81/141 CPU detections find a Neuron equivalent. The missing detections genuinely don't appear at any coordinate in Neuron output — not just below threshold.
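
The sweep numbers can be re-derived from the raw counts. The short check below uses only the four rows printed above; nothing else is assumed.

```python
# (nrn_conf, cpu_dets, nrn_dets, matched), copied from the sweep above
rows = [
    (0.25, 141, 78, 63),
    (0.20, 141, 90, 65),
    (0.10, 141, 147, 73),
    (0.05, 141, 266, 81),
]

summary = []
for conf, cpu_dets, nrn_dets, matched in rows:
    summary.append({
        "conf": conf,
        "match_rate": round(matched / cpu_dets, 4),
        "missing_cpu": cpu_dets - matched,      # CPU detections with no Neuron match
        "unmatched_nrn": nrn_dets - matched,    # Neuron detections with no CPU match
    })

for s in summary:
    print(s)
```

Lowering the Neuron threshold from 0.25 to 0.05 adds 188 Neuron detections but recovers only 18 more CPU matches, leaving 60 CPU detections unmatched.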

Isolation — confirmed it's the attention block

We narrowed the bug to the C2PSA module (layer 10 of yolo26l) by setting c2psa.m = nn.Identity(), which skips the 2 PSABlocks inside C2PSA. We then compared no-PSA Neuron against no-PSA CPU FP32, so the architecture is the same on both sides and the lowering bug is isolated from the detection loss caused by removing the blocks:

=== no-PSA CPU FP32 vs no-PSA Neuron BF16 ===
images:                51
CPU FP32 detections:   82
Neuron BF16 detections:84
Match rate IoU>=0.5:   0.9634       <-- correct! (vs 0.4468 with PSA)
Mean IoU (matched):    0.9902
Score delta (mean):    +0.0127

Conclusion: removing the 2 PSABlocks from C2PSA fixes the numerics. The bug is in NCv2's lowering of the attention pattern inside Attention.forward.

Further isolation — even C2PSA alone is broken

Compile just model.model[10] (the C2PSA module) with bs=8 random input, compare numerically against the same module on CPU:

| Compiler args / variant | Abs error | Relative error |
| --- | --- | --- |
| --auto-cast none --optlevel 1 | 5.86 | 3.75 (output is ~3.75× off) |
| --auto-cast all --auto-cast-type bf16 --optlevel 1 | 5.87 | 3.76 |
| --auto-cast all --auto-cast-type fp16 --optlevel 1 | 5.87 | 3.76 |
| Single PSABlock alone, BF16 + optlevel 3 | 11.33 | 2.48 |
| Custom Attention.forward rewritten with einsum + manual softmax | 5.78 | 3.64 |

The error magnitude is nearly identical across the precisions tested (5.86-5.87 absolute, 3.75-3.76 relative), which points to a deterministic compiler bug in the lowering path rather than a precision or scheduling issue. Rewriting the attention math with different ops (einsum instead of @) doesn't help either (5.78 absolute, 3.64 relative), which suggests the compiler pattern-matches the QK^T·softmax·V structure regardless of how it is expressed.
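
For reference, a minimal sketch of how two such error figures could be computed for a Neuron output against a CPU reference. The definitions here are assumptions, since the exact formulas are not spelled out above: "abs error" is taken as the maximum element-wise absolute difference, and "relative error" as the L2 norm of the difference divided by the L2 norm of the reference.

```python
import math
from typing import Sequence, Tuple


def error_metrics(ref: Sequence[float], out: Sequence[float]) -> Tuple[float, float]:
    """Return (max_abs_error, relative_error) for two flattened tensors.

    Assumed definitions:
      max_abs_error  = max_i |out_i - ref_i|
      relative_error = ||out - ref||_2 / ||ref||_2
    """
    if len(ref) != len(out):
        raise ValueError("ref and out must have the same number of elements")
    diffs = [o - r for r, o in zip(ref, out)]
    max_abs = max(abs(d) for d in diffs)
    num = math.sqrt(sum(d * d for d in diffs))
    den = math.sqrt(sum(r * r for r in ref))
    rel = num / den if den > 0 else float("inf")
    return max_abs, rel


# Toy example: one element is badly wrong.
max_abs, rel = error_metrics([1.0, -2.0, 2.0], [1.0, -2.0, 5.0])
print(max_abs, rel)
```

Under this definition a relative error above 1 means the difference is larger than the reference signal itself, which is far outside what reduced-precision rounding alone would normally produce.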

What we believe is happening

Looking at the YOLO26 attention forward:

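def forward(self, x):
    B, C, H, W = x.shape
    N = H * W                      # number of spatial tokens
    qkv = self.qkv(x)              # single projection producing Q, K and V
    # Split per head along the channel axis: Q and K get key_dim each, V gets head_dim
    q, k, v = qkv.view(B, self.num_heads, self.key_dim * 2 + self.head_dim, N).split(
        [self.key_dim, self.key_dim, self.head_dim], dim=2
    )
    attn = (q.transpose(-2, -1) @ k) * self.scale     # [B, num_heads, N, N]
    attn = attn.softmax(dim=-1)
    # Attention-weighted V, plus a positional encoding computed from V
    x = (v @ attn.transpose(-2, -1)).view(B, C, H, W) + self.pe(v.reshape(B, C, H, W))
    return self.proj(x)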

At 384×640 input, the feature map at layer 10 is 12×20, so N = 240, with num_heads=4, key_dim=32, head_dim=64 (verified from model.model[10].m[0].attn on a fresh load). This is a small attention. CPU produces correct output, and so does NCv1 (inf1). NCv2 produces output with relative error ~3.8, which strongly suggests that an op or graph-rewrite pass present only on the v2 path is corrupting the result. Likely candidates (you would know better than us):

  • The combination of transpose+@+reshape on the QKV split path
  • The position encoding self.pe(v.reshape(B, C, H, W)) add interacting with the attention output
  • A fused-attention pattern detector that rewrites this into a primitive that handles other shapes correctly but mishandles this specific shape (B=1-8, num_heads=4, N=240)
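
To make the shapes involved concrete, here is a small sketch that derives the tensor shapes in the forward pass above from the input resolution. The stride of 32 is an assumption inferred from 384×640 mapping to 12×20; the function name is ours.

```python
def attention_shapes(img_h: int, img_w: int, batch: int, stride: int = 32,
                     num_heads: int = 4, key_dim: int = 32, head_dim: int = 64) -> dict:
    """Shapes seen by Attention.forward at layer 10 (stride is an assumption)."""
    h, w = img_h // stride, img_w // stride
    n = h * w                                   # N = H * W tokens
    channels = num_heads * head_dim             # C, implied by v.reshape(B, C, H, W)
    per_head = 2 * key_dim + head_dim           # Q + K + V channels per head
    return {
        "feature_map": (batch, channels, h, w),
        "tokens": n,
        "qkv_channels": num_heads * per_head,
        "q": (batch, num_heads, key_dim, n),
        "k": (batch, num_heads, key_dim, n),
        "v": (batch, num_heads, head_dim, n),
        "attn": (batch, num_heads, n, n),
        "attn_elements": batch * num_heads * n * n,
    }


shapes = attention_shapes(384, 640, batch=8)
for name, value in shapes.items():
    print(f"{name:>14}: {value}")
```

At batch 8 the attention matrix has 8 × 4 × 240 × 240 = 1,843,200 elements, so the failing pattern is small by attention standards.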

Help we need (Bug A)

  1. Confirm/identify the broken op-pattern: which lowering pass on NCv2 transforms this attention pattern, and what's it doing differently from the v1 path that handles it correctly?
  2. Provide a workaround flag if available (e.g., a way to disable a specific fusion or pattern rewrite for the attention pattern).
  3. Provide an estimated timeline for a fix in neuronx-cc. We have a real production deadline that this affects.

Bug B — LegalizeSundaMacro compile error in YOLO26 backbone at batch ≥ 4

Symptom

Compiling layers 0-9 of yolo26l (the convolutional backbone, no attention) at batch size ≥ 4 reproducibly fails:

2026-05-08T01:46:38Z [TEN404] (_add.1188) Internal tensorizer error:
LegalizeSundaMacro:Incorrect IR by
<class 'neuronxcc.starfish.penguin.targets.sunda.passes.LegalizeSundaMacro.LegalizeSundaMacro'>

Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new.
RuntimeError: neuronx-cc failed with 70

The error message asks us to open a ticket — that's literally why this report exists.

Reproducibility (numbers)

Layers 0-9 are: 5× Conv + 4× C3k2 + SPPF — all standard ultralytics modules, no attention, no exotic ops. (Verified layout: indices 0,1,3,5,7 = Conv; 2,4,6,8 = C3k2; 9 = SPPF.)

| Batch | Compile result |
| --- | --- |
| bs=1 | ✅ compiles |
| bs=2 (FP32, optlevel 2) | ✅ compiles |
| bs=3 (BF16, optlevel 3, unet-inference) | ✅ compiles |
| bs=4 | ❌ LegalizeSundaMacro |
| bs=8 | ❌ LegalizeSundaMacro |
| bs=16 | ❌ LegalizeSundaMacro |

Flag combinations tested at bs=4:

  • --auto-cast all --auto-cast-type bf16 --optlevel 3 --model-type unet-inference --enable-fast-loading-neuron-binaries
  • --auto-cast all --auto-cast-type bf16 --optlevel 2
  • --auto-cast all --auto-cast-type bf16 --optlevel 1
  • --auto-cast all --auto-cast-type fp16 --optlevel 3
  • --auto-cast none --optlevel 2
  • --auto-cast none --optlevel 1

All fail with the same LegalizeSundaMacro error.
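
The batch sweep can be structured as a small harness like the sketch below. It is illustrative rather than a copy of compile_yolo26_hybrid_bs.py: compile_fn is a hypothetical callable that would wrap the torch_neuronx.trace call for the PRE module at a given batch size, and fake_compile is a stand-in that only mimics the behaviour observed above so the sketch runs without Neuron hardware.

```python
from typing import Callable, Dict, Iterable


def sweep_batch_sizes(compile_fn: Callable[[int], object],
                      batch_sizes: Iterable[int]) -> Dict[int, str]:
    """Try to compile at each batch size and record 'ok' or the first error line."""
    results: Dict[int, str] = {}
    for bs in batch_sizes:
        try:
            compile_fn(bs)
            results[bs] = "ok"
        except Exception as exc:  # compiler failures surface as RuntimeError
            message = str(exc).splitlines()[0] if str(exc) else type(exc).__name__
            results[bs] = f"fail: {message}"
    return results


def fake_compile(bs: int) -> object:
    """Hypothetical stand-in for the real compile call; fails at bs >= 4."""
    if bs >= 4:
        raise RuntimeError("neuronx-cc failed with 70")
    return object()


results = sweep_batch_sizes(fake_compile, [1, 2, 3, 4, 8, 16])
for bs, status in results.items():
    print(f"bs={bs:<3} {status}")
```

Catching the exception per batch size keeps one failing compile from aborting the rest of the sweep, which is useful when each compile takes several minutes.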

Splitting the backbone in half doesn't help: we tried PRE_A = layers 0-4 and PRE_B = layers 5-9 separately at bs=8. Both halves trip the same bug.

Wrapping vs not wrapping doesn't help: tried nn.ModuleList-wrapped and direct model.model[:10] traversal — same error.

The bug is specific to ops in layers 0-9. The post-C2PSA portion (layers 11-23 — neck + Detect head) compiles cleanly at bs=8, 16, and 32 with the same flag set.

Why this matters

We use the no-attention-on-Neuron workaround for Bug A: compile a PRE module (layers 0-9) and a POST module (layers 11-23) separately, and run C2PSA on CPU between them. The pipeline is bottlenecked by per-call Neuron overhead because we can't go above bs=3 on the PRE leg. Effective throughput is 158 FPS with CPU C2PSA, or 199 FPS when C2PSA is offloaded to a remote T4 GPU over TCP (which moves the bottleneck off the CPU but doesn't lift the bs=3 ceiling on the Neuron side).
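
The structure of the hybrid pipeline, and why the batch ceiling matters, can be sketched as follows. This is a simplified illustration, not our production code: pre, c2psa_cpu and post are hypothetical callables, the (skips, top) return value of pre is an assumed interface for passing the backbone's skip tensors to the neck, and padding of the last partial batch (needed for fixed-shape compiled modules) is not handled.

```python
from typing import Callable, List, Sequence, Tuple


def chunked(items: Sequence, max_bs: int):
    """Yield consecutive slices of at most max_bs items."""
    for i in range(0, len(items), max_bs):
        yield items[i:i + max_bs]


def hybrid_infer(frames: Sequence, pre: Callable, c2psa_cpu: Callable,
                 post: Callable, max_bs: int = 3) -> Tuple[List, int]:
    """Run PRE (Neuron) -> C2PSA (CPU) -> POST (Neuron) in batches of max_bs.

    Returns the outputs and the number of Neuron calls issued.
    """
    outputs: List = []
    neuron_calls = 0
    for batch in chunked(frames, max_bs):
        skips, top = pre(batch)          # Neuron call 1: layers 0-9
        top = c2psa_cpu(top)             # CPU: layer 10 (C2PSA)
        outputs.extend(post(skips, top)) # Neuron call 2: layers 11-23
        neuron_calls += 2
    return outputs, neuron_calls


# Stand-in stages operating on plain numbers, only to exercise the control flow.
pre = lambda batch: (list(batch), [x * 2 for x in batch])
c2psa_cpu = lambda top: [x + 1 for x in top]
post = lambda skips, top: [s + t for s, t in zip(skips, top)]

frames = list(range(1, 9))  # 8 frames
out_bs3, calls_bs3 = hybrid_infer(frames, pre, c2psa_cpu, post, max_bs=3)
out_bs8, calls_bs8 = hybrid_infer(frames, pre, c2psa_cpu, post, max_bs=8)
print(calls_bs3, calls_bs8)
```

For 8 frames, the bs=3 ceiling costs 6 Neuron calls where bs=8 would need 2, which is the per-call overhead that cannot currently be amortized.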

If Bug B were fixed (PRE compiled at bs=8), the same hybrid pipeline projects to ~480 FPS at full accuracy, which is the silicon ceiling. We've measured the full yolo26l (with broken accuracy) at bs=8 at 479 FPS, so we know the chip can do the work; it's the compiler that's gating us.

Help we need (Bug B)

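  1. Identify what op or shape pattern in the yolo26 backbone triggers LegalizeSundaMacro at bs≥4. Since layers 0-4 alone also fail, the trigger is not specific to the SPPF block; the failing node name in the log (_add.1188) suggests an add, possibly a residual add inside a Conv/C3k2 stage.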
  2. Workaround flag, if any — even a --disable-pass LegalizeSundaMacro style escape hatch would unblock us.
  3. Fix in a future SDK release.

What we tried (and why none of it worked around the bugs)

| Approach | Result |
| --- | --- |
| Every compile flag combo on full yolo26l | All produce 42-47% match rate (Bug A) |
| Compile C2PSA in isolation, every precision/optlevel | abs error 5.8-11, relative 2.5-3.8 |
| Rewrite attention with einsum + explicit softmax | Same broken numerics |
| Confidence threshold sweep (0.25 → 0.05) | Confirms detections genuinely missing, not score-shifted |
| Drop second PSABlock (1 PSA instead of 2) | Compiles, but loses 24% of detections architecturally |
| Compile pre+post hybrid at bs=4, 8, 16 with every flag combo | LegalizeSundaMacro (Bug B) |
| Split PRE into halves (layers 0-4 + 5-9) at bs=8 | Same LegalizeSundaMacro |
| Compile pre via no-PSA wrapper at bs=8 | Same LegalizeSundaMacro |
| torch_neuronx.set_neuron_cores to pin pre→NC0, post→NC1 | 142 FPS (worse than set_multicore) |
| torch_neuronx.DataParallel instead of set_multicore | Hangs / NRT timeouts at high concurrency |
| Multi-process inf2 client (NEURON_RT_VISIBLE_CORES per process) | Neuron RT refuses 2nd init: Could not be initialized |
| GPU sidecar over TCP (offload C2PSA to remote T4) | 199 FPS, a 26% improvement over CPU c2psa, but the bottleneck moved to Neuron silicon |
| GPU sidecar (offload everything post-C2PSA to GPU) | 91 FPS; yolo26's FPN skip tensors (15 MB at bs=3 FP16) don't ship efficiently |

Production impact

We ship object detection on 100k+ CCTV cameras. We cannot deploy yolo26l on inf2.xlarge because the broken model silently drops ~55% of detections — a safety risk we won't ship. Until Bug A is fixed, our yolo26l fleet is on inf1.xlarge (NCv1, accurate at 96% match, 286 FPS, $0.033/M spot). That's working, but:

  1. inf1 is an older Inferentia generation; long-term we expect to consolidate on Inferentia2 and our roadmap assumes that path.
  2. Other workloads in our fleet (yolov8x weapon detection, 220 FPS at $0.180/M spot on inf2) are already on inf2. Maintaining two Neuron generations adds operational complexity.
  3. We would prefer a single inf2-based fleet running both models. The blocker is yolo26l.

We are happy to provide:

  • Detailed minimal reproducer artifacts (already in the GitHub repo, scripts ready to run on a fresh inf2.xlarge).
  • Direct collaboration: our ML team can debug end-to-end with whoever owns the NCv2 lowering. We have time-budget for this.
  • Test signals on candidate fixes — if you publish a SDK candidate, we can validate it against our 51-image production dataset within hours.

Contact

We would appreciate a confirmation that these bugs are reproducible on AWS's side and an estimated timeline for fixes. Thanks.
