Switch to neon for interleave #20137
Summary:
The BGRA/RGB → planar-CHW-float deinterleave + normalization step was implemented twice and sub-optimally: the Apple backend used a strided vDSP gather (vDSP_vfltu8 ×3 + vDSP_vsmsa, ~6 passes over the input), and the portable/Android backend used a scalar triple-nested loop. This replaces both with a single hand-vectorized kernel in a new shared translation unit.
image_processor_simd.{h,cpp} provides deinterleave_to_chw():
* One vld4q_u8 (BGRA/RGBA) or vld3q_u8 (RGB) read, widen uint8→float in-register, fused per-channel affine out = in*(scale/std) + (-mean/std) via vfmaq_f32, single write per plane.
* NEON on ARM (all shipping iOS/Apple-silicon targets and Android arm64), scalar fallback elsewhere.
* Handles the fast (contiguous) path plus a row-by-row slow path for stride padding and letterbox offsets.
Both backends now call it.
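As a rough illustration of the kernel shape described above (not the code from this PR), here is a minimal sketch of the 4-channel case. The function name `sketch_deinterleave4`, its parameters, and the `order` channel map are hypothetical; the sketch assumes the caller precomputes `a[c] = scale/std[c]` and `b[c] = -mean[c]/std[c]`, and it only covers a contiguous run of pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical sketch, not the PR's deinterleave_to_chw().
// Converts n interleaved 4-channel uint8 pixels into three float planes:
//   dst[c][i] = src[4*i + order[c]] * a[c] + b[c]
// order[c] picks the source channel for output plane c
// (e.g. {2, 1, 0} reads BGRA and writes R, G, B planes).
inline void sketch_deinterleave4(const uint8_t* src, size_t n,
                                 const float a[3], const float b[3],
                                 const int order[3], float* const dst[3]) {
  for (int c = 0; c < 3; ++c) {
    assert(order[c] >= 0 && order[c] < 4);
  }
  size_t i = 0;
#if defined(__ARM_NEON) && defined(__aarch64__)
  // 16 pixels per iteration: one structured load splits the 4 channels.
  for (; i + 16 <= n; i += 16) {
    const uint8x16x4_t px = vld4q_u8(src + 4 * i);
    for (int c = 0; c < 3; ++c) {
      const uint8x16_t v = px.val[order[c]];
      // Widen uint8 -> uint16 -> uint32 in registers.
      const uint16x8_t lo = vmovl_u8(vget_low_u8(v));
      const uint16x8_t hi = vmovl_u8(vget_high_u8(v));
      const uint32x4_t q[4] = {
          vmovl_u16(vget_low_u16(lo)), vmovl_u16(vget_high_u16(lo)),
          vmovl_u16(vget_low_u16(hi)), vmovl_u16(vget_high_u16(hi))};
      const float32x4_t va = vdupq_n_f32(a[c]);
      const float32x4_t vb = vdupq_n_f32(b[c]);
      for (int k = 0; k < 4; ++k) {
        // Fused multiply-add: vb + float(q[k]) * va, one store per plane.
        vst1q_f32(dst[c] + i + 4 * k,
                  vfmaq_f32(vb, vcvtq_f32_u32(q[k]), va));
      }
    }
  }
#endif
  // Scalar tail on ARM, and the whole loop on other targets.
  for (; i < n; ++i) {
    for (int c = 0; c < 3; ++c) {
      dst[c][i] = static_cast<float>(src[4 * i + order[c]]) * a[c] + b[c];
    }
  }
}
```

The affine form follows from `(in*scale - mean)/std = in*(scale/std) + (-mean/std)`, so the per-pixel work reduces to one multiply-add per channel. A strided or letterboxed image could call a helper like this once per row, which is presumably what the slow path in the description amounts to.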
See D102373165 for inspiration.
Differential Revision: D107958353
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20137
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 1d1f264 with merge base 2759ef1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107958353.
@claude review this code
Claude finished @metascroy's task in 1m 21s
Code Review: Switch to NEON for interleave
Overall: This is a clean refactoring that unifies two sub-optimal deinterleave implementations into a single well-vectorized kernel. The NEON intrinsics are correct and the architecture (fast path + row-by-row slow path) is sound. A few observations below:
Potential integer overflow in fast path
deinterleave_run(
src,
src_w * src_h, // <-- can overflow for images > ~46k x 46k
...
This is safe given practical image sizes going through this pipeline (the upstream already constrains dimensions), but casting to
No validation of
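To make the overflow concern in the fast path concrete, here is a minimal sketch of computing the pixel count in a wider type. The helper name `sketch_pixel_count` is hypothetical, and the assumption that `src_w` and `src_h` are 32-bit signed integers is not confirmed by the excerpt above; it is only what the "~46k x 46k" threshold suggests, since 46341² exceeds 2³¹ − 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: widen both dimensions to 64 bits *before*
// multiplying, so the product cannot wrap for any int32_t inputs.
// Assumes the dimensions are 32-bit signed ints (not confirmed by the PR).
inline uint64_t sketch_pixel_count(int32_t src_w, int32_t src_h) {
  assert(src_w >= 0 && src_h >= 0);
  return static_cast<uint64_t>(src_w) * static_cast<uint64_t>(src_h);
}
```

With 32-bit arithmetic, `50000 * 50000` would exceed the signed range; widening first yields 2500000000 as expected.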