Skip to content

Switch to neon for interleave#20137

Open
metascroy wants to merge 1 commit into
pytorch:mainfrom
metascroy:export-D107958353
Open

Switch to neon for interleave#20137
metascroy wants to merge 1 commit into
pytorch:mainfrom
metascroy:export-D107958353

Conversation

@metascroy

Copy link
Copy Markdown
Contributor

Summary:
The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

  • One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
  • NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
  • Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

See D102373165 for inspiration.

Differential Revision: D107958353

Summary:
The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.

image_processor_simd.{h,cpp} provides deinterleave_to_chw():

* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.

Both backends now call it.

See D102373165 for inspiration.

Differential Revision: D107958353
@pytorch-bot

pytorch-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1d1f264 with merge base 2759ef1 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026
@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 1m 21s —— View job


Code Review: Switch to NEON for interleave

  • Read the PR diff and changed files
  • Analyze new SIMD kernel (image_processor_simd.{h,cpp})
  • Review integration in image_processor.cpp and image_processor_apple.cpp
  • Check build system changes (CMakeLists.txt, targets.bzl)
  • Post review feedback

Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:


Potential integer overflow in fast path

image_processor_simd.cpp:132src_w * src_h is computed as int32_t * int32_t:

deinterleave_run(
    src,
    src_w * src_h,  // <-- can overflow for images > ~46k x 46k
    ...

This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to int64_t before the multiply would be more defensive. Low priority.

No validation of in_channels or channel offsets

The function accepts any in_channels value, but the NEON path only handles 4 (vld4) and "else" (vld3). If in_channels is something unexpected (e.g. 2 or 5), the NEON branch would silently use vld3 with wrong stride semantics. Similarly, r_off/g_off/b_off are documented as < in_channels but not checked.

Since this is an internal API with exactly two call sites (both correct), this is fine as-is. An ET_DCHECK would be a cheap safety net if desired:

ET_DCHECK_MSG(in_channels == 3 || in_channels == 4, "in_channels must be 3 or 4");

__attribute__((always_inline)) portability

image_processor_simd.cpp:31__attribute__((always_inline)) is GCC/Clang-only. This is gated behind ET_IMAGE_USE_NEON which implies ARM+Clang/GCC, so it's correct in practice. If Windows-on-ARM (MSVC) ever becomes a target, this would need a __forceinline equivalent behind a macro.

vDSP no_offset optimization removed

The old Apple path had a special no_offset branch using vsmul (cheaper than vsmsa) when bias terms were zero. The new NEON FMA approach always does multiply-add. On modern ARM cores with fused FMA units, vfmaq_f32 with a zero addend is essentially free, so this is not a regression. Just noting the intentional simplification.

Minor: image_processor_simd.h not in install list

CMakeLists.txt:68-75 — The header image_processor_simd.h is not in the install(FILES ...) lists. This is correct since it's an internal implementation header, but worth confirming the intent is that downstream consumers never call deinterleave_to_chw directly.


Summary: Looks good. The NEON kernel is idiomatic, the widen→FMA→store pipeline is efficient, and the unification eliminates meaningful code duplication. No correctness issues found. The minor points above are optional hardening.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant