
Optimize CPU deform_conv2d forward pass with parallel im2col #9442

Open
developer0hye wants to merge 2 commits into pytorch:main from developer0hye:feat/dcnv2-cpu-forward-optimization

Conversation

@developer0hye
Contributor

Summary

The CPU deform_conv2d forward pass spends 89–97% of its time in the deformable_im2col_kernel (confirmed via torch.profiler), yet this kernel runs entirely single-threaded. GEMM (addmm_) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

  1. Parallelize deformable_im2col_kernel with at::parallel_for.
    Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by (in_c, out_b, out_y, out_x)), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.

  2. Replace at::zeros with at::empty for the columns buffer.
    deformable_im2col_kernel writes every element of this buffer (n_in_channels × kH × kW × parallel_imgs × out_h × out_w elements total), so zero-initialization is wasted work.

  3. Replace at::zeros with at::empty for out_buf and use addmm_ with beta=0.
    Each out_buf[b][g] is written exactly once per (batch_block, weight_group) pair. Using beta=0 skips the accumulation of uninitialized values while preserving in-place operation (unlike at::mm, which allocates a new tensor).
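For intuition, the disjoint-write claim in (1) can be checked with a toy model in Python. The layout below (rows indexed by (in_c, kh, kw), columns by a flattened (out_b, out_y, out_x)) is an illustrative stand-in for the real columns buffer, not the kernel's actual code:

```python
from itertools import product

# Toy im2col layout: one row per (in_c, kh, kw), one column per
# (out_b, out_y, out_x). Each parallel iteration owns one
# (in_c, out_b, out_y, out_x) tuple and writes kH*kW entries.
def write_offsets(in_c, out_b, out_y, out_x, kH, kW, n_b, o_h, o_w):
    col = (out_b * o_h + out_y) * o_w + out_x      # flattened column index
    n_cols = n_b * o_h * o_w
    return [(in_c * kH * kW + k) * n_cols + col for k in range(kH * kW)]

def all_offsets(n_c=2, n_b=2, o_h=3, o_w=3, kH=3, kW=3):
    offs = []
    for c, b, y, x in product(range(n_c), range(n_b), range(o_h), range(o_w)):
        offs.extend(write_offsets(c, b, y, x, kH, kW, n_b, o_h, o_w))
    return offs

offs = all_offsets()
# No two iterations share an offset, and together they cover
# every element of the buffer exactly once -- so parallelizing
# over iterations needs no synchronization, and zero-init is wasted.
assert len(offs) == len(set(offs))
assert sorted(offs) == list(range(2 * 9 * (2 * 3 * 3)))
```

The full-coverage property is also what justifies change (2): since every element is written, at::empty is as safe as at::zeros.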

Benchmark

All measurements use time.perf_counter(), 10 warmup + 100 timed iterations, reporting the median.

Hardware: Apple M2, torch.get_num_threads() = 4
Dtype: float32, with mask (DCNv2 mode)
Config format: s{spatial}-b{batch}, e.g. s32-b4 = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. s64-* uses 256 in/out channels.
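A hypothetical helper (names are illustrative, not part of the benchmark script) that expands the config labels above into concrete parameters:

```python
# s{spatial}-b{batch}: s32 -> 64 in/out channels, s64 -> 256,
# always 3x3 kernel, stride 1, padding 1.
def parse_config(name):
    s_part, b_part = name.split("-")
    spatial = int(s_part[1:])
    batch = int(b_part[1:])
    channels = {32: 64, 64: 256}[spatial]
    return dict(batch=batch, in_channels=channels, out_channels=channels,
                in_h=spatial, in_w=spatial, kernel=3, stride=1, padding=1)

assert parse_config("s32-b4")["in_channels"] == 64
assert parse_config("s64-b7")["batch"] == 7
```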

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.78          0.83         3.3x
s32-b3           9.62          3.54         2.7x
s32-b4          15.99          5.01         3.2x
s32-b8          32.90         11.17         2.9x
s64-b1          76.16         30.52         2.5x
s64-b4         315.69        122.65         2.6x
s64-b7         566.37        230.67         2.5x
Profiler breakdown (baseline, s32-b1)
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
    torchvision::deform_conv2d        92.30%      25.091ms       100.00%      27.183ms       2.718ms            10
                  aten::addmm_         2.82%     766.166us         2.82%     767.458us      76.746us            10
                   aten::zeros         0.57%     154.080us         2.94%     798.875us      79.888us            10
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d

def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]

Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (torch.equal returns True). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.

All existing TestDeformConv tests pass (forward, backward, scripting, opcheck).

Related

  • #6619 — RFC noting that CPU deform_conv2d kernels are sequential and don't utilize multicore resources

cc @NicolasHug

@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9442

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 137d2b7 with merge base 8a5946e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Mar 16, 2026
Three changes to the CPU deformable convolution forward kernel:

1. Replace at::zeros with at::empty for columns and out_buf buffers.
   The deformable_im2col_kernel writes every element of the columns
   buffer, and out_buf is fully written by addmm_, so zero-initialization
   is wasted work.

2. Use addmm_ with beta=0 instead of the default beta=1. This avoids
   accumulating into uninitialized memory while preserving in-place
   operation (no extra allocation unlike at::mm).

3. Parallelize deformable_im2col_kernel with at::parallel_for. The
   im2col loop was the only single-threaded phase in the forward pass
   (GEMM is already parallelized by BLAS). Each loop iteration writes
   to a non-overlapping region of the columns buffer, so parallelization
   is safe.

Benchmark results on Apple M2 (CPU, float32):

  Config          Before (ms)   After (ms)    Change
  small-b1              9.76        2.44       -75%
  small-b8             91.77       33.88       -63%
  medium-b1           216.70       75.80       -65%
  medium-b8          1152.09      650.00       -44%
  large-b1            348.86      302.70       -13%
  large-b4           1342.75     1289.96        -4%

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@developer0hye developer0hye force-pushed the feat/dcnv2-cpu-forward-optimization branch from e653cad to 8a89fb8 on March 16, 2026 at 14:56
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@zy1git
Contributor

zy1git commented Mar 20, 2026

@developer0hye Hi, thanks a lot for this PR! May I ask what's the motivation for optimizing the CPU path for deform_conv2d? It's almost always used on GPU. Is there a specific application in your use case?

@developer0hye
Contributor Author

@zy1git Great question!

  1. The CPU path ships with torchvision and has a known issue (#6619) — it runs entirely single-threaded despite being embarrassingly parallel.

  2. The fix is minimal and risk-free: at::parallel_for over a non-overlapping loop, plus removal of redundant zero-initialization. Output is bit-for-bit identical. No new dependencies, no API change.

  3. CPU inference is real — edge deployment, CI/testing, prototyping, and environments without GPU access (containers, ARM devices) all hit this path.

On a personal note, I've been working on in-browser ML inference — things like humanblur and bgremover. These are built with Candle + WASM rather than PyTorch, so admittedly a different stack, but the experience taught me that CPU-side efficiency matters more than you'd expect — not every user has GPU acceleration available, even in a browser. That mindset carried over here: if a CPU path exists and there's a straightforward way to make it 3x faster, it's worth doing.

@zy1git
Contributor

zy1git commented Mar 26, 2026

Hi @developer0hye, thanks a lot for the explanation of the use case. I wrote the benchmark script below based on yours, but I could not reproduce the improvement. Could you please share your exact benchmark and profiler code so I can rerun them? The benchmark code you shared has no print statements or config info. Also, feel free to run my script below to see whether you get the improvement you reported.

Thank you!

Reproduction: Local Laptop (Apple M2 Pro, ARM/NEON)

4 threads (OMP_NUM_THREADS=4) — matches PR's setup

Config     Baseline (ms)   PR (ms)   Speedup
─────────────────────────────────────────────
s32-b1           2.91        3.16     0.92× ❌
s32-b3           9.10        9.11     1.00×
s32-b4          14.28       14.35     1.00×
s32-b8          29.62       28.69     1.03×
s64-b1         125.84      124.76     1.01×
s64-b4         528.03      563.90     0.94× ❌
s64-b7        1010.77     1024.78     0.99×
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d
def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]
configs = [
    ("s32-b1", 1, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b3", 3, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b4", 4, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b8", 8, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s64-b1", 1, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b4", 4, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b7", 7, 256, 256, 64, 64, 3, 3, 1, 1),
]
print(f"Hardware: {torch.backends.cpu.get_cpu_capability()}")
print(f"Threads: {torch.get_num_threads()}")
print(f"Dtype: float32, with mask (DCNv2 mode)")
print()
print(f"{'Config':<12} {'Median (ms)':>12}")
print("─" * 26)
for name, batch, in_c, out_c, h, w, kh, kw, stride, pad in configs:
    median_ms = benchmark_forward(batch, in_c, out_c, h, w, kh, kw, stride, pad)
    print(f"{name:<12} {median_ms:>12.2f}")

@developer0hye
Contributor Author

Hi @zy1git, thanks for trying to reproduce — I tracked down the root cause.

TL;DR

The speedup comes from at::parallel_for, which requires OpenMP to be linked into _C.so. When building torchvision from source on macOS, the upstream setup.py does not pass -fopenmp (compile) or -lomp (link) flags, so at::parallel_for silently falls back to a sequential loop. That's why you saw no improvement.

Once OpenMP is properly linked, I confirm the 2.5–3.0× speedup on Apple M2.

Diagnosis

I verified the problem by checking thread scaling before adding OpenMP flags:

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.99      16.76      78.12     324.52
2              2.89      16.16      75.21     313.29   ← no scaling
4              2.65      16.14      75.25     313.23
8              2.72      16.19      75.08     313.52

_C.so was not linked to libomp:

$ otool -L torchvision/_C.so | grep omp
(nothing)

After adding OpenMP flags and rebuilding:

$ otool -L torchvision/_C.so | grep omp
  /opt/homebrew/opt/libomp/lib/libomp.dylib (...)
Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.91      15.71      75.60     310.34
2              1.65       8.35      43.41     177.51   ← scales!
4              1.07       5.36      30.33     121.75
8              1.00       4.49      24.09      99.31

Reproduction results (Apple M2, 4 threads)

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.45          1.04         2.4×
s32-b3           9.48          3.38         2.8×
s32-b4          15.00          5.03         3.0×
s32-b8          33.08         10.98         3.0×
s64-b1          76.01         30.52         2.5×
s64-b4         315.50        121.90         2.6×
s64-b7         565.36        226.37         2.5×

How to reproduce (step-by-step for macOS ARM)

Prerequisites

# Python 3.12, PyTorch 2.10.0
pip install torch==2.10.0
# setuptools must be <81 for pkg_resources compatibility
pip install "setuptools<81"
# OpenMP runtime (already installed if you have torch, but just in case)
brew install libomp

1. Clone and checkout

git clone https://github.com/pytorch/vision.git
cd vision

# Baseline (main branch)
git checkout main

2. Patch setup.py for OpenMP support

The upstream setup.py doesn't include OpenMP flags. Apply this patch to both the baseline and PR branches before building:

--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
         if sysconfig.get_config_var("Py_GIL_DISABLED"):
             extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,12 +189,22 @@ def make_C_extension():
             sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRS_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )

3. CC wrapper (macOS build fix)

Apple Clang passes -std=c++17 to .c files (giflib), which causes build errors. Save this as cc_wrapper.sh and chmod +x:

#!/bin/bash
# Wrapper to strip -std=c++17 when compiling .c files
args=()
is_c_file=false
for arg in "$@"; do
    if [[ "$arg" == *.c ]] && [[ "$arg" != *.cpp ]]; then
        is_c_file=true
    fi
    args+=("$arg")
done

if $is_c_file; then
    filtered=()
    for arg in "${args[@]}"; do
        if [[ "$arg" != "-std=c++17" ]]; then
            filtered+=("$arg")
        fi
    done
    exec /usr/bin/cc "${filtered[@]}"
else
    exec /usr/bin/cc "${args[@]}"
fi

4. Build and benchmark

# Build baseline (main branch, with setup.py patch applied)
CC=./cc_wrapper.sh pip install -e . --no-build-isolation

# Verify OpenMP is linked
otool -L torchvision/_C.so | grep omp
# Should show: .../libomp.dylib

# Run baseline benchmark
python bench_deform_conv2d.py

# Switch to PR branch, rebuild, benchmark
git checkout feat/dcnv2-cpu-forward-optimization
# Apply the same setup.py patch again
CC=./cc_wrapper.sh pip install -e . --no-build-isolation
python bench_deform_conv2d.py

5. Quick sanity check (is OpenMP actually working?)

import torch
print(torch.__config__.parallel_info())
# Should show: "ATen parallel backend: OpenMP"

Complete benchmark script (bench_deform_conv2d.py)

import time
import torch
from torchvision.ops import deform_conv2d


def benchmark_forward(
    batch_sz: int,
    in_channels: int,
    out_channels: int,
    in_h: int,
    in_w: int,
    kernel_h: int,
    kernel_w: int,
    stride: int,
    padding: int,
    n_warmup: int = 10,
    n_iter: int = 100,
) -> float:
    """Returns median forward pass time in milliseconds."""
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1

    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)

    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)

    times: list[float] = []
    for _ in range(n_iter):
        start = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        end = time.perf_counter()
        times.append((end - start) * 1000)

    times.sort()
    return times[len(times) // 2]


def main() -> None:
    configs = [
        # (batch, in_ch, out_ch, H, W, kH, kW, stride, pad, label)
        (1, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b1"),
        (3, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b3"),
        (4, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b4"),
        (8, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b8"),
        (1, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b1"),
        (4, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b4"),
        (7, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b7"),
    ]

    print(f"PyTorch: {torch.__version__}")
    print(f"Threads: {torch.get_num_threads()}")
    print(f"Parallel backend: {torch.__config__.parallel_info().splitlines()[-2].strip()}")
    print()
    print(f"{'Config':<12} {'Median (ms)':>12}")
    print("-" * 26)
    for *args, label in configs:
        median_ms = benchmark_forward(*args)
        print(f"{label:<12} {median_ms:>12.2f}")


if __name__ == "__main__":
    main()

Note on official torchvision wheels

The pip/conda wheels for torchvision are built by PyTorch CI with OpenMP already enabled, so this issue only affects local source builds on macOS. On Linux source builds, -lgomp (GCC's OpenMP) is typically available without extra setup.

@developer0hye
Contributor Author

Follow-up: setup.py should be updated as part of this PR

After further investigation, I realized this is the first time at::parallel_for is used in torchvision's CPU kernels. I checked every file under torchvision/csrc/ops/cpu/:

File                         at::parallel_for        #pragma omp
──────────────────────────────────────────────────────────────────
deform_conv2d_kernel.cpp     ✅ (added by this PR)    ❌
nms_kernel.cpp               ❌                       ❌
roi_align_kernel.cpp         ❌                       commented out
roi_pool_kernel.cpp          ❌                       ❌
ps_roi_align_kernel.cpp      ❌                       ❌
ps_roi_pool_kernel.cpp       ❌                       ❌
box_iou_rotated_kernel.cpp   ❌                       ❌

Since no existing torchvision C++ code directly calls at::parallel_for or uses #pragma omp, the upstream setup.py has legitimately never needed -fopenmp / -lomp flags. The pre-built pip/conda wheels get OpenMP through the CI build scripts, and source builds had nothing to parallelize.

But now that this PR introduces at::parallel_for, source builds will silently get no speedup unless setup.py is also updated. at::parallel_for is a header-only template (ATen/Parallel.h) — the #pragma omp parallel inside it is compiled into the calling translation unit (_C.so), not into libtorch_cpu. Without -fopenmp at compile time, the compiler simply ignores the pragma.
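The silent fallback is easy to detect empirically, mirroring the thread-scaling diagnosis earlier in this thread: if the parallel backend is inactive, wall time stops improving as the thread count grows. A minimal torch-only probe (the reduction op and tensor size below are illustrative choices, not part of this PR):

```python
import time
import torch

def median_ms(fn, n_warmup=5, n_iter=10):
    """Median wall time of fn() in milliseconds."""
    for _ in range(n_warmup):
        fn()
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]

# Use an elementwise reduction that goes through ATen's internal
# parallelization rather than BLAS (matmul would scale regardless).
x = torch.randn(1 << 22)

def scaling():
    result = {}
    for n in (1, 2, 4):
        torch.set_num_threads(n)
        result[n] = median_ms(lambda: torch.logsumexp(x, dim=0))
    return result

print(scaling())  # times should drop with more threads if OpenMP is active
```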

Suggestion

Include the setup.py OpenMP patch (from my previous comment) in this PR. It's a small, self-contained addition:

  • Compile flags: -Xpreprocessor -fopenmp on macOS, -fopenmp on Linux
  • Link flags: -lomp (linking to PyTorch's bundled libomp) on macOS, -lgomp on Linux

This ensures anyone building from source benefits from the parallelization, and also unblocks future PRs that want to use at::parallel_for in other CPU kernels (e.g., roi_align, nms).

developer0hye added a commit to developer0hye/vision that referenced this pull request Mar 27, 2026
Source builds of torchvision do not pass -fopenmp (compile) or
-lomp/-lgomp (link) flags when building the _C extension. Since
at::parallel_for is a header-only template whose #pragma omp directives
are compiled into the calling translation unit (_C.so), the missing
flags cause it to silently fall back to sequential execution.

This has had no observable effect so far because no existing torchvision
C++ kernel directly uses at::parallel_for or #pragma omp. However,
upcoming changes (e.g. pytorch#9442) introduce at::parallel_for, and without
these flags source builds get 0% speedup from parallelization.

- macOS: -Xpreprocessor -fopenmp (compile) + -lomp from PyTorch's
  bundled libomp (link)
- Linux: -fopenmp (compile) + -lgomp (link)
- Windows: unchanged (uses /openmp via MSVC, already handled separately)

Fixes pytorch#2783

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@zy1git
Contributor

zy1git commented Mar 31, 2026

Hi @developer0hye,
Thanks a lot for the investigation! I am reviewing your PR and discussing with Nicolas. I tried your patched setup.py to link OpenMP and confirmed the speedup. One thing we want to clarify: in the original PR description, you reported the 2.5–3.3× speedup without mentioning that OpenMP needed to be linked separately. Was OpenMP already linked in your build environment at the time? We just want to understand your setup so we can evaluate the PR properly.

@developer0hye
Contributor Author

Hi @zy1git, good question — yes, OpenMP was already linked in my build environment when I reported the original numbers.

I had previously patched setup.py with -fopenmp / -lomp flags for unrelated experimentation, and that configuration carried over when I built and benchmarked this PR. I didn't realize at the time that it wasn't the upstream default for source builds, which is why the original PR description didn't mention it.

As I detailed in my follow-up comment, this is actually the first time at::parallel_for is used in torchvision's CPU kernels, so the upstream setup.py has legitimately never needed these flags. The pre-built pip/conda wheels get OpenMP through the CI build scripts, but local source builds will silently fall back to sequential execution without the flags.

That's exactly why I suggested including the setup.py OpenMP patch as part of this PR — so that source builds also benefit from the parallelization out of the box.
