Switch to neon for interleave by metascroy · Pull Request #20137 · pytorch/executorch

metascroy · 2026-06-09T01:50:00Z

Summary:
The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

See D102373165 for inspiration.

Differential Revision: D107958353

Summary: The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit. image_processor_simd.{h,cpp} provides deinterleave_to_chw(): * One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane. * NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere. * Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets. Both backends now call it. See D102373165 for inspiration. Differential Revision: D107958353

pytorch-bot · 2026-06-09T01:50:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1d1f264 with merge base 2759ef1 ():

NEW FAILURE - The following job has failed:

Lint / lintrunner (gh)
>>> Lint for extension/image/image_processor_simd.h:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-09T01:50:08Z

@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.

github-actions · 2026-06-09T01:50:51Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-09T01:52:57Z

@claude review this code

claude · 2026-06-09T01:53:37Z

Claude finished @metascroy's task in 1m 21s —— View job

Code Review: Switch to NEON for interleave

Read the PR diff and changed files
Analyze new SIMD kernel (image_processor_simd.{h,cpp})
Review integration in image_processor.cpp and image_processor_apple.cpp
Check build system changes (CMakeLists.txt, targets.bzl)
Post review feedback

Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:

Potential integer overflow in fast path

image_processor_simd.cpp:132 — src_w * src_h is computed as int32_t * int32_t:

deinterleave_run(
    src,
    src_w * src_h,  // <-- can overflow for images > ~46k x 46k
    ...

This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to int64_t before the multiply would be more defensive. Low priority.

No validation of `in_channels` or channel offsets

The function accepts any in_channels value, but the NEON path only handles 4 (vld4) and "else" (vld3). If in_channels is something unexpected (e.g. 2 or 5), the NEON branch would silently use vld3 with wrong stride semantics. Similarly, r_off/g_off/b_off are documented as < in_channels but not checked.

Since this is an internal API with exactly two call sites (both correct), this is fine as-is. An ET_DCHECK would be a cheap safety net if desired:

ET_DCHECK_MSG(in_channels == 3 || in_channels == 4, "in_channels must be 3 or 4");

`attribute((always_inline))` portability

image_processor_simd.cpp:31 — __attribute__((always_inline)) is GCC/Clang-only. This is gated behind ET_IMAGE_USE_NEON which implies ARM+Clang/GCC, so it's correct in practice. If Windows-on-ARM (MSVC) ever becomes a target, this would need a __forceinline equivalent behind a macro.

vDSP `no_offset` optimization removed

The old Apple path had a special no_offset branch using vsmul (cheaper than vsmsa) when bias terms were zero. The new NEON FMA approach always does multiply-add. On modern ARM cores with fused FMA units, vfmaq_f32 with a zero addend is essentially free, so this is not a regression. Just noting the intentional simplification.

Minor: `image_processor_simd.h` not in install list

CMakeLists.txt:68-75 — The header image_processor_simd.h is not in the install(FILES ...) lists. This is correct since it's an internal implementation header, but worth confirming the intent is that downstream consumers never call deinterleave_to_chw directly.

Summary: Looks good. The NEON kernel is idiomatic, the widen→FMA→store pipeline is efficient, and the unification eliminates meaningful code duplication. No correctness issues found. The minor points above are optional hardening.

metascroy requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 01:50

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026

meta-codesync Bot added the meta-exported label Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to neon for interleave#20137

Switch to neon for interleave#20137
metascroy wants to merge 1 commit into
pytorch:mainfrom
metascroy:export-D107958353

metascroy commented Jun 9, 2026

Uh oh!

pytorch-bot Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

meta-codesync Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

metascroy commented Jun 9, 2026

Uh oh!

claude Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

metascroy commented Jun 9, 2026

Uh oh!

pytorch-bot Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137

❌ 1 New Failure

Uh oh!

meta-codesync Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

This PR needs a release notes: label

Uh oh!

metascroy commented Jun 9, 2026

Uh oh!

claude Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Switch to NEON for interleave

Potential integer overflow in fast path

No validation of in_channels or channel offsets

__attribute__((always_inline)) portability

vDSP no_offset optimization removed

Minor: image_processor_simd.h not in install list

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pytorch-bot Bot commented Jun 9, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 9, 2026 •

edited

Loading

No validation of `in_channels` or channel offsets

`attribute((always_inline))` portability

vDSP `no_offset` optimization removed

Minor: `image_processor_simd.h` not in install list