yolo26l on inf2/trn1 (NCv2): numerical error in C2PSA Attention.forward + LegalizeSundaMacro internal error at bs>=4 #1323

# Bug report — AWS Neuron / NeuronCore-v2

**Affected SDK:** `neuronx-cc 2.21.33363.0` (inf2 DLAMI 2.8) and `neuronx-cc 2.24.8799.0` (trn1 DLAMI 2.9)
**Affected silicon:** NeuronCore-v2 — confirmed reproducible on `inf2.xlarge` and `trn1.2xlarge`. Not affecting NeuronCore-v1 (`inf1.xlarge` compiles and runs the same model correctly).

---

## TL;DR

We have **two distinct compiler bugs** on NeuronCore-v2 affecting Ultralytics' YOLO26 (the successor model to YOLOv8). Together they make yolo26l unusable for production on inf2/trn1 even though the silicon clearly has the throughput. We need help with:

1. **Bug A — silent numerical error**: `Attention.forward` (used in YOLO26's `C2PSA` block) is lowered to a NEFF that produces mathematically wrong outputs. Match rate against CPU FP32 reference drops from 96% (correct, on inf1) to **44.7%** on inf2/trn1 — the model silently misses ~55% of high-confidence detections at production thresholds. Reproducible across BF16 / FP16 / FP32 / `--auto-cast matmult` / `--model-type=generic` / `--model-type=unet-inference` / `--optlevel 1-3` / explicit re-write of the attention math with `einsum`.

2. **Bug B — internal compiler error at batch ≥ 4 in the YOLO26 backbone**: compiling layers 0-9 of yolo26l at batch ≥ 4 reproducibly fails with `[TEN404] LegalizeSundaMacro: Incorrect IR` / `RuntimeError: neuronx-cc failed with 70`. Affects every flag combination tested. bs=3 compiles cleanly; bs=4, 8, 16 all fail.

These two bugs interact: Bug A forces us to bypass C2PSA (run it on CPU as a hybrid), and Bug B prevents us from amortizing the resulting two-call Neuron pipeline at higher batch sizes. The combined impact is that we hit **158 FPS** hybrid throughput on inf2.xlarge with CPU C2PSA (or **199 FPS** with the C2PSA offloaded to a remote GPU sidecar) — instead of the ~480 FPS the silicon can demonstrably achieve. We've separately measured the broken-accuracy full-yolo26l at bs=8 hitting **479 FPS** in pure-Neuron mode; the throughput is there, only the compiler is wrong.

---

## What we ran

- **Model:** Ultralytics YOLO26-large, official weights from ultralytics-pytorch 8.4.47.
- **Instances:** `inf2.xlarge` (i-0757abb324da395b0) and `trn1.2xlarge` (i-0a077315608c00165, separately verified) in `us-west-2`.
- **Framework:** `torch_neuronx 2.9.0.2.13` / `neuronx-cc 2.21-2.24` / PyTorch 2.8 / Ubuntu 24.04 DLAMI.
- **Compile recipe:** Jim Burtoft's contrib script ([jimburtoft/neuronx-distributed-inference](https://github.com/jimburtoft/neuronx-distributed-inference/tree/contrib/yolo26)) — sets `detect.end2end = False`, calls `model.fuse()`, `m.export = True`, then `torch_neuronx.trace(model, dummy, compiler_args=…)`.
- **Validation:** 51-image production CCTV dataset (`person_dataset`), conf>=0.25 threshold, IoU>=0.5 box matching, same-class detections vs CPU FP32 reference. Match rate is the fraction of CPU detections that have a corresponding Neuron detection.
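
The match-rate definition above can be sketched as follows. This is a minimal illustration rather than the validation script from the repo: the `(x1, y1, x2, y2, class_id, score)` tuple layout, the greedy one-to-one matching, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical detection layout: (x1, y1, x2, y2, class_id, score)
Det = Tuple[float, float, float, float, int, float]


def iou(a: Det, b: Det) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_rate(cpu: List[Det], nrn: List[Det],
               conf: float = 0.25, iou_thr: float = 0.5) -> float:
    """Fraction of CPU detections that have a same-class Neuron match."""
    cpu = [d for d in cpu if d[5] >= conf]
    nrn = [d for d in nrn if d[5] >= conf]
    used = set()  # a Neuron detection may match at most one CPU detection
    matched = 0
    for c in cpu:
        best_j, best_iou = -1, iou_thr
        for j, n in enumerate(nrn):
            if j in used or n[4] != c[4]:
                continue  # already consumed, or different class
            v = iou(c, n)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            used.add(best_j)
            matched += 1
    return matched / len(cpu) if cpu else 1.0
```

Under this definition, extra Neuron detections do not raise or lower the match rate; only CPU detections without a Neuron counterpart count against it.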

Reproducer scripts and minimal-repro artifacts are in our public-internal repo [aegissystems/hw-cost_optimization](https://github.com/aegissystems/hw-cost_optimization). Specific repro paths:

- `inf1_vs_inf2/yolo26/scripts/compile_yolo26_no_psa.py` — diagnostic that proves Bug A
- `inf1_vs_inf2/yolo26/scripts/test_rewritten_attention.py` — proves Bug A is op-pattern-specific in lowering, not in user-PyTorch code
- `inf1_vs_inf2/yolo26/scripts/compile_yolo26_hybrid_bs.py` — minimal repro of Bug B (parameterized batch size)

---

## Reference baseline — inf1.xlarge (NeuronCore-v1) compiles and runs YOLO26L correctly

The same YOLO26-large weights, the same input shape (384×640), and the equivalent compile recipe **work correctly on NeuronCore-v1**. This is the strongest evidence that the issues below are NCv2-side compiler bugs, not model bugs or recipe bugs.

| Metric | inf1.xlarge (NCv1) | inf2.xlarge (NCv2) |
|---|---:|---:|
| **Compile success at production batch sizes** | ✅ bs=1 with `--neuroncore-pipeline-cores 4` | ❌ bs ≥ 4 fails (Bug B) |
| **Match rate vs CPU FP32 (conf≥0.25)** | **95.7–99.3%** ✅ | **44.7%** ❌ (Bug A) |
| Mean IoU on matched detections | 0.96 | 0.93 |
| Score delta (Neuron − CPU FP32) | −0.002 (essentially zero) | +0.038 (systematic bias) |
| Peak FPS (production config) | 286 FPS @ DP×4, 2 in-flight | 158 FPS (CPU c2psa hybrid) / 199 FPS (GPU sidecar hybrid) |
| Compile recipe | `prepare_yolo26(weights, dtype=torch.float32)` + `torch_neuron.trace(model, [dummy], compiler_args=["--neuroncore-pipeline-cores", "4"])` | same `prepare_yolo26` + `torch_neuronx.trace` (every flag combo broken) |
| Spot price (us-west-2) | $0.034/hr | $0.143/hr |
| **$/M images at production accuracy** | **$0.033/M** | $0.21–0.25/M (hybrid only — direct compile not deployable) |
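
The $/M figures in the table follow from spot price and sustained throughput. The sketch below reproduces that arithmetic under the assumptions that the instance runs at the quoted FPS for the full hour and that only the instance's own spot price counts (no sidecar or data-transfer cost); the function name is ours.

```python
def cost_per_million_images(spot_usd_per_hr: float, fps: float) -> float:
    """USD to process one million images at a sustained throughput."""
    images_per_hr = fps * 3600
    return spot_usd_per_hr / images_per_hr * 1_000_000


inf1 = cost_per_million_images(0.034, 286)         # inf1.xlarge at 286 FPS
inf2_hybrid = cost_per_million_images(0.143, 158)  # inf2.xlarge, CPU C2PSA hybrid
print(f"inf1: ${inf1:.3f}/M, inf2 CPU hybrid: ${inf2_hybrid:.2f}/M")
# prints "inf1: $0.033/M, inf2 CPU hybrid: $0.25/M"
```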

This is our current production deployment for yolo26l: **inf1.xlarge with Jim Burtoft's recipe at 384×640**, 286 FPS at >95% match rate, validated on a production CCTV dataset. The inf1 path is also tested at 640×640 (with `--neuroncore-pipeline-cores 4` to fit, since one v1 NeuronCore alone exhausts SB at that resolution) and produces **99.3% match rate at 56.7 FPS**.

**The ask:** if NCv1 can lower yolo26's `Attention.forward` correctly and at any batch size, NCv2 should be able to as well. We are essentially asking AWS to bring NCv2's lowering of this attention pattern to functional parity with NCv1's.

---

## Bug A — C2PSA / Attention numerical lowering error

### Symptom

A YOLO26-large compiled with Ultralytics' standard recipe and any reasonable flag combination produces a NEFF that:

- misses ~55% of the CPU FP32 reference detections at conf>=0.25 (78 Neuron detections vs 141 on CPU, of which 63 match)
- when matched, the matched detections have IoU 0.93-0.95 with CPU (geometry close)
- but ~60 of CPU's 141 detections are **completely missing** from the Neuron output at every threshold tested down to conf=0.05, not score-shifted

### Reproducibility (numbers)

Same `prepare_yolo26` model, same dataset, same decoder both sides:

| Variant | Match rate vs CPU FP32 | Mean IoU (matched) | Score delta |
|---|---:|---:|---:|
| `--optlevel 3 --auto-cast all --model-type unet-inference` (BF16) | 44.68% | 0.93 | +0.038 |
| `--auto-cast none` (full FP32) | 44.68% | 0.95 | +0.039 |
| `--auto-cast matmult --auto-cast-type bf16` (BF16 matmul, FP32 softmax) | 44.68% | 0.95 | +0.039 |
| `--auto-cast all --auto-cast-type fp16` | 44.68% | 0.95 | +0.039 |
| `--model-type=generic` BF16 (640×640) | 42.14% | 0.92 | +0.048 |
| `--optlevel 1 --auto-cast all --auto-cast-type bf16` | 44.68% | 0.95 | +0.039 |
| trn1.2xlarge with current SDK (neuronx-cc 2.24, fresh DLAMI) | 44.68% | 0.93 | +0.040 |
| **inf1.xlarge native (NeuronCore-v1, same model)** | **95.74%** | 0.96 | −0.002 |

A confidence-threshold sweep on the BF16 unet-inference artifact rules out simple score-shift:

```
NRN conf  cpu_dets  nrn_dets  matched  match_rate
0.25      141       78        63       0.4468
0.20      141       90        65       0.4610
0.10      141       147       73       0.5177
0.05      141       266       81       0.5745     ← still 57% even at extremely permissive threshold
```

Even at conf=0.05 with 185 unmatched Neuron detections (266 total NRN − 81 matched), only 81/141 CPU detections find a Neuron equivalent. The missing detections genuinely don't appear at any coordinate in Neuron output — not just below threshold.
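
The derived numbers quoted above can be rechecked from the sweep rows. The snippet below is only a worked calculation over the four rows as printed; the row layout and helper name are ours.

```python
# Rows copied from the sweep above: (nrn_conf, cpu_dets, nrn_dets, matched)
SWEEP = [
    (0.25, 141, 78, 63),
    (0.20, 141, 90, 65),
    (0.10, 141, 147, 73),
    (0.05, 141, 266, 81),
]


def summarize(row):
    conf, cpu_dets, nrn_dets, matched = row
    return {
        "conf": conf,
        "match_rate": matched / cpu_dets,   # share of CPU detections recovered
        "missing_cpu": cpu_dets - matched,  # CPU detections with no Neuron match
        "extra_nrn": nrn_dets - matched,    # Neuron detections with no CPU match
    }


for row in SWEEP:
    s = summarize(row)
    print(f"{s['conf']:.2f}  rate={s['match_rate']:.4f}  "
          f"missing={s['missing_cpu']}  extra={s['extra_nrn']}")
```

Lowering the threshold from 0.25 to 0.05 adds 188 Neuron detections but recovers only 18 more CPU detections, which is the pattern expected from missing detections rather than a score shift.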

### Isolation — confirmed it's the attention block

We narrowed the bug to the C2PSA module (layer 10 of yolo26l) by setting `c2psa.m = nn.Identity()` (which skips the 2 PSABlocks inside C2PSA), then comparing no-PSA Neuron against no-PSA CPU FP32 (same architecture on both sides, which isolates the lowering bug from architectural detection loss):

```
=== no-PSA CPU FP32 vs no-PSA Neuron BF16 ===
images:                51
CPU FP32 detections:   82
Neuron BF16 detections:84
Match rate IoU>=0.5:   0.9634       <-- correct! (vs 0.4468 with PSA)
Mean IoU (matched):    0.9902
Score delta (mean):    +0.0127
```

**Conclusion:** removing the 2 PSABlocks from C2PSA fixes the numerics. The bug is in NCv2's lowering of the attention pattern inside `Attention.forward`.

### Further isolation — even C2PSA alone is broken

We compiled just `model.model[10]` (the C2PSA module) with a bs=8 random input and compared its output numerically against the same module on CPU:

| Compiler args | abs error | relative error |
|---|---:|---:|
| `--auto-cast none --optlevel 1` | 5.86 | 3.75 (output is ~3.75× off) |
| `--auto-cast all --auto-cast-type bf16 --optlevel 1` | 5.87 | 3.76 |
| `--auto-cast all --auto-cast-type fp16 --optlevel 1` | 5.87 | 3.76 |
| Single PSABlock alone, BF16 + optlevel 3 | 11.33 | 2.48 |
| Custom `Attention.forward` rewritten with `einsum` + manual softmax | 5.78 | 3.64 |

**Near-identical error magnitude across precisions and optlevels** points to a deterministic compiler bug in the lowering path, not a precision/scheduling issue. Rewriting the attention math with different ops (einsum vs `@`) doesn't help, which suggests the compiler pattern-matches the QK^T·softmax·V structure regardless of how it is expressed.
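
For readers comparing against their own runs, here is one common way to compute the two error columns. The table does not state which definitions were used, so treat both as assumptions: maximum element-wise absolute error, and the L2 norm of the error divided by the L2 norm of the CPU reference, both over flattened outputs.

```python
import math
from typing import Sequence


def max_abs_error(out: Sequence[float], ref: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(o - r) for o, r in zip(out, ref))


def relative_l2_error(out: Sequence[float], ref: Sequence[float]) -> float:
    """L2 norm of the error divided by L2 norm of the reference."""
    num = math.sqrt(sum((o - r) ** 2 for o, r in zip(out, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den if den > 0 else float("inf")


ref = [1.0, 2.0, 2.0]   # toy CPU reference
out = [1.0, 2.0, 5.0]   # toy device output
print(max_abs_error(out, ref), relative_l2_error(out, ref))
```

With a norm-based definition like this one, a relative error above 1 means the error is larger than the reference signal itself, far outside what BF16 or FP16 rounding would produce.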

### What we believe is happening

Looking at the YOLO26 attention forward:

```python
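def forward(self, x):
    # x: [B, C, H, W] feature map entering the attention block
    B, C, H, W = x.shape
    N = H * W
    qkv = self.qkv(x)
    # Split the fused projection into per-head q, k (key_dim) and v (head_dim)
    q, k, v = qkv.view(B, self.num_heads, self.key_dim * 2 + self.head_dim, N).split(
        [self.key_dim, self.key_dim, self.head_dim], dim=2
    )
    attn = (q.transpose(-2, -1) @ k) * self.scale  # [B, num_heads, N, N]
    attn = attn.softmax(dim=-1)
    # Attention output plus positional encoding computed from v
    x = (v @ attn.transpose(-2, -1)).view(B, C, H, W) + self.pe(v.reshape(B, C, H, W))
    return self.proj(x)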
```

At 384×640 input, the feature map at layer 10 is 12×20, so N=240, with num_heads=4, key_dim=32, head_dim=64 (verified from `model.model[10].m[0].attn` on a fresh load). This is a small attention. CPU produces correct output. NCv1 (inf1) produces correct output. NCv2 produces output with relative error ~3.8, which strongly suggests that an op or graph-rewrite pass present only on the v2 path is corrupting the result. Likely candidates (you would know better than us):

- The combination of `transpose`+`@`+`reshape` on the QKV split path
- The position encoding `self.pe(v.reshape(B, C, H, W))` add interacting with the attention output
- A fused-attention pattern detector that rewrites this into a primitive that handles other shapes correctly but mishandles this specific shape (B=1-8, num_heads=4, N=240)
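
To make the shapes concrete, the sketch below derives the tensor dimensions for the attention at layer 10 from the numbers stated above. The stride of 32 is inferred from 384×640 mapping to 12×20; the function name and dictionary keys are ours.

```python
def attention_shapes(height: int, width: int, batch: int = 1,
                     stride: int = 32, num_heads: int = 4,
                     key_dim: int = 32, head_dim: int = 64) -> dict:
    """Tensor shapes inside Attention.forward for a given input size."""
    h, w = height // stride, width // stride
    n = h * w  # number of spatial positions attended over
    return {
        "grid": (h, w),
        "N": n,
        "channels": num_heads * head_dim,        # C in view(B, C, H, W)
        "q": (batch, num_heads, key_dim, n),
        "k": (batch, num_heads, key_dim, n),
        "v": (batch, num_heads, head_dim, n),
        "attn": (batch, num_heads, n, n),        # q^T @ k
        "attn_elems": batch * num_heads * n * n,
    }


shapes = attention_shapes(384, 640, batch=8)
print(shapes["grid"], shapes["N"], shapes["attn"], shapes["attn_elems"])
# prints "(12, 20) 240 (8, 4, 240, 240) 1843200"
```

Even at bs=8 the attention matrix holds under two million elements, so this is far from a memory-bound or unusually large case.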

### Help we need (Bug A)

1. **Confirm/identify the broken op-pattern**: which lowering pass on NCv2 transforms this attention pattern, and what's it doing differently from the v1 path that handles it correctly?
2. **Provide a workaround flag** if available (e.g., a way to disable a specific fusion or pattern rewrite for the attention pattern).
3. **Estimate of timeline** for a fix in `neuronx-cc`. We have a real production deadline this affects.

---

## Bug B — `LegalizeSundaMacro` compile error in YOLO26 backbone at batch ≥ 4

### Symptom

Compiling **layers 0-9 of yolo26l** (the convolutional backbone, no attention) at batch size ≥ 4 reproducibly fails:

```
2026-05-08T01:46:38Z [TEN404] (_add.1188) Internal tensorizer error:
LegalizeSundaMacro:Incorrect IR by
<class 'neuronxcc.starfish.penguin.targets.sunda.passes.LegalizeSundaMacro.LegalizeSundaMacro'>

Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new.
RuntimeError: neuronx-cc failed with 70
```

The error message asks us to open a ticket — that's literally why this report exists.

### Reproducibility (numbers)

Layers 0-9 are: 5× Conv + 4× C3k2 + SPPF — all standard ultralytics modules, no attention, no exotic ops. (Verified layout: indices 0,1,3,5,7 = Conv; 2,4,6,8 = C3k2; 9 = SPPF.)

| Batch | Compile result |
|---|---|
| bs=1 | ✅ compiles |
| bs=2 (FP32, optlevel 2) | ✅ compiles |
| bs=3 (BF16, optlevel 3, unet-inference) | ✅ compiles |
| bs=4 | ❌ LegalizeSundaMacro |
| bs=8 | ❌ LegalizeSundaMacro |
| bs=16 | ❌ LegalizeSundaMacro |

**Tested every flag combination at bs=4:**

- `--auto-cast all --auto-cast-type bf16 --optlevel 3 --model-type unet-inference --enable-fast-loading-neuron-binaries`
- `--auto-cast all --auto-cast-type bf16 --optlevel 2`
- `--auto-cast all --auto-cast-type bf16 --optlevel 1`
- `--auto-cast all --auto-cast-type fp16 --optlevel 3`
- `--auto-cast none --optlevel 2`
- `--auto-cast none --optlevel 1`

All fail with the same `LegalizeSundaMacro` error.

**Splitting the backbone in half doesn't help:** we tried `PRE_A` = layers 0-4 and `PRE_B` = layers 5-9 separately at bs=8. Both halves trip the same bug.

**Wrapping vs not wrapping doesn't help:** tried `nn.ModuleList`-wrapped and direct `model.model[:10]` traversal — same error.

**The bug is specific to ops in layers 0-9.** The post-C2PSA portion (layers 11-23 — neck + Detect head) compiles cleanly at bs=8, 16, and 32 with the same flag set.
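
The batch-size sweep can be organized as a small harness like the one below. This is a sketch and not the repo's `compile_yolo26_hybrid_bs.py`: `try_compile` is a hypothetical callable that, in a real run, would wrap `torch_neuronx.trace` on layers 0-9 with a dummy input of shape `(bs, 3, 384, 640)`. Here a stub stands in for it and simply mimics the results reported in the table above.

```python
from typing import Callable, Dict, Iterable, Optional


def sweep_batch_sizes(try_compile: Callable[[int], None],
                      batch_sizes: Iterable[int]) -> Dict[int, str]:
    """Attempt a compile per batch size and record 'ok' or the error text."""
    results = {}
    for bs in batch_sizes:
        try:
            try_compile(bs)
            results[bs] = "ok"
        except RuntimeError as exc:  # the reported compiler failure is a RuntimeError
            results[bs] = f"failed: {exc}"
    return results


def first_failing(results: Dict[int, str]) -> Optional[int]:
    """Smallest batch size that did not compile, or None if all passed."""
    failing = sorted(bs for bs, r in results.items() if r != "ok")
    return failing[0] if failing else None


def _stub_compile(bs: int) -> None:
    # Stand-in only: reproduces the observed pass/fail pattern, compiles nothing.
    if bs >= 4:
        raise RuntimeError("neuronx-cc failed with 70")


results = sweep_batch_sizes(_stub_compile, [1, 2, 3, 4, 8, 16])
print(results, first_failing(results))
```

The same harness can be pointed at a single layer or a layer sub-range to narrow down which op first fails at bs=4.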

### Why this matters

We use the no-attention-on-Neuron workaround for Bug A: compile a **PRE module** (layers 0-9) and a **POST module** (layers 11-23) separately, and run C2PSA on CPU between them. The pipeline is bottlenecked by per-call Neuron overhead because we can't go above bs=3 on the PRE leg. Effective throughput is **158 FPS** with CPU C2PSA, or **199 FPS** when C2PSA is offloaded to a remote T4 GPU over TCP (which moves the bottleneck off the CPU but doesn't lift the bs=3 ceiling on the NeuronCore).
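
The structure of that hybrid pipeline, and why the batch ceiling matters, can be sketched as below. The three stages are passed in as plain callables; the assumption that PRE returns both the layer-9 feature map and the skip tensors POST needs is ours, as are all function names.

```python
import math
from typing import Callable, List, Sequence


def hybrid_forward(pre: Callable, c2psa_cpu: Callable, post: Callable,
                   batch: Sequence) -> List:
    """One batch through PRE (Neuron) -> C2PSA (CPU) -> POST (Neuron)."""
    feat, skips = pre(batch)     # Neuron call 1: layers 0-9
    feat = c2psa_cpu(feat)       # CPU: layer 10
    return post(feat, skips)     # Neuron call 2: layers 11-23


def run_stream(frames: Sequence, max_bs: int, pre: Callable,
               c2psa_cpu: Callable, post: Callable) -> List:
    """Process frames in chunks no larger than the compiled batch size."""
    outputs = []
    for i in range(0, len(frames), max_bs):
        outputs.extend(hybrid_forward(pre, c2psa_cpu, post, frames[i:i + max_bs]))
    return outputs


def pre_calls(n_frames: int, max_bs: int) -> int:
    """Number of PRE invocations needed for n_frames."""
    return math.ceil(n_frames / max_bs)


print(pre_calls(24, 3), pre_calls(24, 8))  # prints "8 3"
```

For the same 24 frames, a bs=3 ceiling needs 8 PRE calls where bs=8 would need 3, so any fixed per-call overhead is paid almost three times as often.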

If Bug B were fixed (PRE compiled at bs=8), the same hybrid pipeline projects to **~480 FPS at full accuracy** — the silicon ceiling. We've measured the broken full-yolo26l at bs=8 at **479 FPS** ([reproduction](https://github.com/aegissystems/hw-cost_optimization/blob/main/inf1_vs_inf2/yolo26/scripts/compile_yolo26l_full_bs8_384x640.py)), so we know the chip can do the work; it's the compiler that's gating us.

### Help we need (Bug B)

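1. **Identify what op or shape pattern in the yolo26 backbone triggers `LegalizeSundaMacro`** at bs≥4. Because layers 0-4 and layers 5-9 each fail on their own, the trigger is probably a Conv or C3k2 pattern present in both halves rather than something specific to the SPPF block.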
2. **Workaround flag, if any** — even a `--disable-pass LegalizeSundaMacro` style escape hatch would unblock us.
3. **Fix in a future SDK release**.

---

## What we tried (and why none of it worked around the bugs)

| Approach | Result |
|---|---|
| Every compile flag combo on full yolo26l | All produce 42-47% match rate (Bug A) |
| Compile C2PSA in isolation, every precision/optlevel | abs error 5.8-11, relative 2.5-3.8 |
| Rewrite attention with `einsum` + explicit softmax | Same broken numerics |
| Confidence threshold sweep (0.25 → 0.05) | Confirms detections genuinely missing, not score-shifted |
| Drop second PSABlock (1 PSA instead of 2) | Compiles, but loses 24% of detections architecturally |
| Compile pre+post hybrid at bs=4, 8, 16 with every flag combo | LegalizeSundaMacro (Bug B) |
| Split PRE into halves (layers 0-4 + 5-9) at bs=8 | Same LegalizeSundaMacro |
| Compile pre via no-PSA wrapper at bs=8 | Same LegalizeSundaMacro |
| `torch_neuronx.set_neuron_cores` to pin pre→NC0, post→NC1 | 142 FPS (worse than `set_multicore`) |
| `torch_neuronx.DataParallel` instead of `set_multicore` | Hangs / NRT timeouts at high concurrency |
| Multi-process inf2 client (NEURON_RT_VISIBLE_CORES per process) | Neuron RT refuses 2nd init: `Could not be initialized` |
| GPU sidecar over TCP (offload C2PSA to remote T4) | 199 FPS — improved by 26% over CPU c2psa, but bottleneck moved to Neuron silicon |
| GPU sidecar (offload everything post-C2PSA to GPU) | 91 FPS — yolo26's FPN skip tensors (15 MB at bs=3 FP16) don't ship efficiently |

---

## Production impact

We ship object detection on 100k+ CCTV cameras. We cannot deploy yolo26l on inf2.xlarge because the broken model silently drops ~55% of detections — a safety risk we won't ship. Until Bug A is fixed, our yolo26l fleet is on inf1.xlarge (NCv1, accurate at 96% match, 286 FPS, $0.033/M spot). That's working, but:

1. inf1 is an older Inferentia generation; long-term we expect to consolidate on Inferentia2 and our roadmap assumes that path.
2. Other workloads in our fleet (yolov8x weapon detection, 220 FPS at $0.180/M spot on inf2) are already on inf2. Maintaining two Neuron generations adds operational complexity.
3. We would prefer a single inf2-based fleet running both models. The blocker is yolo26l.

We are happy to provide:

- **Detailed minimal reproducer artifacts** (already in the GitHub repo, scripts ready to run on a fresh inf2.xlarge).
- **Direct collaboration**: our ML team can debug end-to-end with whoever owns the NCv2 lowering. We have time-budget for this.
- **Test signals on candidate fixes** — if you publish a SDK candidate, we can validate it against our 51-image production dataset within hours.

## Contact

- **Primary contact:** Jagadish Mahendran (`jagadish.mahendran@aegissystems.ai`)
- **Repository (read-only):** [aegissystems/hw-cost_optimization](https://github.com/aegissystems/hw-cost_optimization)
- **Detailed writeups in repo:**
  - [`YOLO26_INF2_BLOCKER.md`](YOLO26_INF2_BLOCKER.md) — top-level summary
  - [`inf1_vs_inf2/yolo26/ACCURACY_NCV2.md`](inf1_vs_inf2/yolo26/ACCURACY_NCV2.md) — Bug A isolation
  - [`inf1_vs_inf2/yolo26/PERF_400FPS_INVESTIGATION.md`](inf1_vs_inf2/yolo26/PERF_400FPS_INVESTIGATION.md) — Bug B + performance investigation
  - [`inf1_vs_inf2/yolo26/GPU_SIDECAR_INVESTIGATION.md`](inf1_vs_inf2/yolo26/GPU_SIDECAR_INVESTIGATION.md) — sidecar workaround experiment

We would appreciate a confirmation that these bugs are reproducible on AWS's side and an estimated timeline for fixes. Thanks.

