
Optimize CPU deform_conv2d forward pass with parallel im2col #9442

Open
developer0hye wants to merge 2 commits into pytorch:main from developer0hye:feat/dcnv2-cpu-forward-optimization

Conversation

@developer0hye
Contributor

Summary

The CPU deform_conv2d forward pass spends 89–97% of its time in the deformable_im2col_kernel (confirmed via torch.profiler), yet this kernel runs entirely single-threaded. GEMM (addmm_) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

  1. Parallelize deformable_im2col_kernel with at::parallel_for.
    Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by (in_c, out_b, out_y, out_x)), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.

  2. Replace at::zeros with at::empty for the columns buffer.
    deformable_im2col_kernel writes every element of this buffer (n_in_channels × kH × kW × parallel_imgs × out_h × out_w elements total), so zero-initialization is wasted work.

  3. Replace at::zeros with at::empty for out_buf and use addmm_ with beta=0.
    Each out_buf[b][g] is written exactly once per (batch_block, weight_group) pair. Using beta=0 skips the accumulation of uninitialized values while preserving in-place operation (unlike at::mm, which allocates a new tensor).
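For intuition, the disjoint-write claim in (1) can be checked with a toy model in Python. The layout below (rows indexed by (in_c, kh, kw), columns by a flattened (out_b, out_y, out_x)) is an illustrative stand-in for the real columns buffer, not the kernel's actual code:

```python
from itertools import product

# Toy im2col layout: one row per (in_c, kh, kw), one column per
# (out_b, out_y, out_x). Each parallel iteration owns one
# (in_c, out_b, out_y, out_x) tuple and writes kH*kW entries.
def write_offsets(in_c, out_b, out_y, out_x, kH, kW, n_b, o_h, o_w):
    col = (out_b * o_h + out_y) * o_w + out_x      # flattened column index
    n_cols = n_b * o_h * o_w
    return [(in_c * kH * kW + k) * n_cols + col for k in range(kH * kW)]

def all_offsets(n_c=2, n_b=2, o_h=3, o_w=3, kH=3, kW=3):
    offs = []
    for c, b, y, x in product(range(n_c), range(n_b), range(o_h), range(o_w)):
        offs.extend(write_offsets(c, b, y, x, kH, kW, n_b, o_h, o_w))
    return offs

offs = all_offsets()
# No two iterations share an offset, and together they cover
# every element of the buffer exactly once -- so parallelizing
# over iterations needs no synchronization, and zero-init is wasted.
assert len(offs) == len(set(offs))
assert sorted(offs) == list(range(2 * 9 * (2 * 3 * 3)))
```

The full-coverage property is also what justifies change (2): since every element is written, at::empty is as safe as at::zeros.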

Benchmark

All measurements use time.perf_counter(), 10 warmup + 100 timed iterations, reporting the median.

Hardware: Apple M2, torch.get_num_threads() = 4
Dtype: float32, with mask (DCNv2 mode)
Config format: s{spatial}-b{batch}, e.g. s32-b4 = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. s64-* uses 256 in/out channels.
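A hypothetical helper (names are illustrative, not part of the benchmark script) that expands the config labels above into concrete parameters:

```python
# s{spatial}-b{batch}: s32 -> 64 in/out channels, s64 -> 256,
# always 3x3 kernel, stride 1, padding 1.
def parse_config(name):
    s_part, b_part = name.split("-")
    spatial = int(s_part[1:])
    batch = int(b_part[1:])
    channels = {32: 64, 64: 256}[spatial]
    return dict(batch=batch, in_channels=channels, out_channels=channels,
                in_h=spatial, in_w=spatial, kernel=3, stride=1, padding=1)

assert parse_config("s32-b4")["in_channels"] == 64
assert parse_config("s64-b7")["batch"] == 7
```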

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.78          0.83         3.3x
s32-b3           9.62          3.54         2.7x
s32-b4          15.99          5.01         3.2x
s32-b8          32.90         11.17         2.9x
s64-b1          76.16         30.52         2.5x
s64-b4         315.69        122.65         2.6x
s64-b7         566.37        230.67         2.5x
Profiler breakdown (baseline, s32-b1)
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
    torchvision::deform_conv2d        92.30%      25.091ms       100.00%      27.183ms       2.718ms            10
                  aten::addmm_         2.82%     766.166us         2.82%     767.458us      76.746us            10
                   aten::zeros         0.57%     154.080us         2.94%     798.875us      79.888us            10
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d

def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]

Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (torch.equal returns True). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.

All existing TestDeformConv tests pass (forward, backward, scripting, opcheck).

Related

  • #6619 — RFC noting that CPU deform_conv2d kernels are sequential and don't utilize multicore resources

cc @NicolasHug

@pytorch-bot

pytorch-bot bot commented Mar 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9442

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 137d2b7 with merge base 8a5946e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Mar 16, 2026
Three changes to the CPU deformable convolution forward kernel:

1. Replace at::zeros with at::empty for columns and out_buf buffers.
   The deformable_im2col_kernel writes every element of the columns
   buffer, and out_buf is fully written by addmm_, so zero-initialization
   is wasted work.

2. Use addmm_ with beta=0 instead of the default beta=1. This avoids
   accumulating into uninitialized memory while preserving in-place
   operation (no extra allocation unlike at::mm).

3. Parallelize deformable_im2col_kernel with at::parallel_for. The
   im2col loop was the only single-threaded phase in the forward pass
   (GEMM is already parallelized by BLAS). Each loop iteration writes
   to a non-overlapping region of the columns buffer, so parallelization
   is safe.

Benchmark results on Apple M2 (CPU, float32):

  Config          Before (ms)   After (ms)    Change
  small-b1              9.76        2.44       -75%
  small-b8             91.77       33.88       -63%
  medium-b1           216.70       75.80       -65%
  medium-b8          1152.09      650.00       -44%
  large-b1            348.86      302.70       -13%
  large-b4           1342.75     1289.96        -4%

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@developer0hye developer0hye force-pushed the feat/dcnv2-cpu-forward-optimization branch from e653cad to 8a89fb8 on March 16, 2026 at 14:56
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@zy1git
Contributor

zy1git commented Mar 20, 2026

@developer0hye Hi, thanks a lot for this PR! May I ask what's the motivation for optimizing the CPU path for deform_conv2d? It's almost always used on GPU. Is there a specific application in your use case?

@developer0hye
Contributor Author

@zy1git Great question!

  1. The CPU path ships with torchvision and has a known issue (#6619) — it runs entirely single-threaded despite being embarrassingly parallel.

  2. The fix is minimal and risk-free: at::parallel_for over a non-overlapping loop, plus removal of redundant zero-initialization. Output is bit-for-bit identical. No new dependencies, no API change.

  3. CPU inference is real — edge deployment, CI/testing, prototyping, and environments without GPU access (containers, ARM devices) all hit this path.

On a personal note, I've been working on in-browser ML inference — things like humanblur and bgremover. These are built with Candle + WASM rather than PyTorch, so admittedly a different stack, but the experience taught me that CPU-side efficiency matters more than you'd expect — not every user has GPU acceleration available, even in a browser. That mindset carried over here: if a CPU path exists and there's a straightforward way to make it 3x faster, it's worth doing.

@zy1git
Contributor

zy1git commented Mar 26, 2026

Hi @developer0hye, thanks a lot for the explanation of the use case. I wrote the benchmark script below based on yours, but I could not reproduce the improvement. Could you please share your exact benchmark and profiler code so I can rerun them? The benchmark code you shared has no print statements or config info. Also, feel free to run my script below to see whether you get the improvement you reported.

Thank you!

Reproduction: Local Laptop (Apple M2 Pro, ARM/NEON)

4 threads (OMP_NUM_THREADS=4) — matches PR's setup

Config     Baseline (ms)   PR (ms)   Speedup
─────────────────────────────────────────────
s32-b1           2.91        3.16     0.92× ❌
s32-b3           9.10        9.11     1.00×
s32-b4          14.28       14.35     1.00×
s32-b8          29.62       28.69     1.03×
s64-b1         125.84      124.76     1.01×
s64-b4         528.03      563.90     0.94× ❌
s64-b7        1010.77     1024.78     0.99×
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d
def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]
configs = [
    ("s32-b1", 1, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b3", 3, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b4", 4, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b8", 8, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s64-b1", 1, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b4", 4, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b7", 7, 256, 256, 64, 64, 3, 3, 1, 1),
]
print(f"Hardware: {torch.backends.cpu.get_cpu_capability()}")
print(f"Threads: {torch.get_num_threads()}")
print(f"Dtype: float32, with mask (DCNv2 mode)")
print()
print(f"{'Config':<12} {'Median (ms)':>12}")
print("─" * 26)
for name, batch, in_c, out_c, h, w, kh, kw, stride, pad in configs:
    median_ms = benchmark_forward(batch, in_c, out_c, h, w, kh, kw, stride, pad)
    print(f"{name:<12} {median_ms:>12.2f}")

@developer0hye
Contributor Author

Hi @zy1git, thanks for trying to reproduce — I tracked down the root cause.

TL;DR

The speedup comes from at::parallel_for, which requires OpenMP to be linked into _C.so. When building torchvision from source on macOS, the upstream setup.py does not pass -fopenmp (compile) or -lomp (link) flags, so at::parallel_for silently falls back to a sequential loop. That's why you saw no improvement.

Once OpenMP is properly linked, I confirm the 2.5–3.0× speedup on Apple M2.

Diagnosis

I verified the problem by checking thread scaling before adding OpenMP flags:

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.99      16.76      78.12     324.52
2              2.89      16.16      75.21     313.29   ← no scaling
4              2.65      16.14      75.25     313.23
8              2.72      16.19      75.08     313.52

_C.so was not linked to libomp:

$ otool -L torchvision/_C.so | grep omp
(nothing)

After adding OpenMP flags and rebuilding:

$ otool -L torchvision/_C.so | grep omp
  /opt/homebrew/opt/libomp/lib/libomp.dylib (...)
Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.91      15.71      75.60     310.34
2              1.65       8.35      43.41     177.51   ← scales!
4              1.07       5.36      30.33     121.75
8              1.00       4.49      24.09      99.31

Reproduction results (Apple M2, 4 threads)

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.45          1.04         2.4×
s32-b3           9.48          3.38         2.8×
s32-b4          15.00          5.03         3.0×
s32-b8          33.08         10.98         3.0×
s64-b1          76.01         30.52         2.5×
s64-b4         315.50        121.90         2.6×
s64-b7         565.36        226.37         2.5×

How to reproduce (step-by-step for macOS ARM)

Prerequisites

# Python 3.12, PyTorch 2.10.0
pip install torch==2.10.0
# setuptools must be <81 for pkg_resources compatibility
pip install "setuptools<81"
# OpenMP runtime (already installed if you have torch, but just in case)
brew install libomp

1. Clone and checkout

git clone https://github.com/pytorch/vision.git
cd vision

# Baseline (main branch)
git checkout main

2. Patch setup.py for OpenMP support

The upstream setup.py doesn't include OpenMP flags. Apply this patch to both the baseline and PR branches before building:

--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
         if sysconfig.get_config_var("Py_GIL_DISABLED"):
             extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,12 +189,22 @@ def make_C_extension():
             sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRS_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )

3. CC wrapper (macOS build fix)

Apple Clang passes -std=c++17 to .c files (giflib), which causes build errors. Save this as cc_wrapper.sh and chmod +x:

#!/bin/bash
# Wrapper to strip -std=c++17 when compiling .c files
args=()
is_c_file=false
for arg in "$@"; do
    if [[ "$arg" == *.c ]] && [[ "$arg" != *.cpp ]]; then
        is_c_file=true
    fi
    args+=("$arg")
done

if $is_c_file; then
    filtered=()
    for arg in "${args[@]}"; do
        if [[ "$arg" != "-std=c++17" ]]; then
            filtered+=("$arg")
        fi
    done
    exec /usr/bin/cc "${filtered[@]}"
else
    exec /usr/bin/cc "${args[@]}"
fi

4. Build and benchmark

# Build baseline (main branch, with setup.py patch applied)
CC=./cc_wrapper.sh pip install -e . --no-build-isolation

# Verify OpenMP is linked
otool -L torchvision/_C.so | grep omp
# Should show: .../libomp.dylib

# Run baseline benchmark
python bench_deform_conv2d.py

# Switch to PR branch, rebuild, benchmark
git checkout feat/dcnv2-cpu-forward-optimization
# Apply the same setup.py patch again
CC=./cc_wrapper.sh pip install -e . --no-build-isolation
python bench_deform_conv2d.py

5. Quick sanity check (is OpenMP actually working?)

import torch
print(torch.__config__.parallel_info())
# Should show: "ATen parallel backend: OpenMP"

Complete benchmark script (bench_deform_conv2d.py)

import time
import torch
from torchvision.ops import deform_conv2d


def benchmark_forward(
    batch_sz: int,
    in_channels: int,
    out_channels: int,
    in_h: int,
    in_w: int,
    kernel_h: int,
    kernel_w: int,
    stride: int,
    padding: int,
    n_warmup: int = 10,
    n_iter: int = 100,
) -> float:
    """Returns median forward pass time in milliseconds."""
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1

    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)

    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)

    times: list[float] = []
    for _ in range(n_iter):
        start = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        end = time.perf_counter()
        times.append((end - start) * 1000)

    times.sort()
    return times[len(times) // 2]


def main() -> None:
    configs = [
        # (batch, in_ch, out_ch, H, W, kH, kW, stride, pad, label)
        (1, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b1"),
        (3, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b3"),
        (4, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b4"),
        (8, 64, 64, 32, 32, 3, 3, 1, 1, "s32-b8"),
        (1, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b1"),
        (4, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b4"),
        (7, 256, 256, 64, 64, 3, 3, 1, 1, "s64-b7"),
    ]

    print(f"PyTorch: {torch.__version__}")
    print(f"Threads: {torch.get_num_threads()}")
    print(f"Parallel backend: {torch.__config__.parallel_info().splitlines()[-2].strip()}")
    print()
    print(f"{'Config':<12} {'Median (ms)':>12}")
    print("-" * 26)
    for *args, label in configs:
        median_ms = benchmark_forward(*args)
        print(f"{label:<12} {median_ms:>12.2f}")


if __name__ == "__main__":
    main()

Note on official torchvision wheels

The pip/conda wheels for torchvision are built by PyTorch CI with OpenMP already enabled, so this issue only affects local source builds on macOS. On Linux source builds, -lgomp (GCC's OpenMP) is typically available without extra setup.

@developer0hye
Contributor Author

Follow-up: setup.py should be updated as part of this PR

After further investigation, I realized this is the first time at::parallel_for is used in torchvision's CPU kernels. I checked every file under torchvision/csrc/ops/cpu/:

File                         at::parallel_for        #pragma omp
──────────────────────────────────────────────────────────────────
deform_conv2d_kernel.cpp     ✅ (added by this PR)    ❌
nms_kernel.cpp               ❌                       ❌
roi_align_kernel.cpp         ❌                       commented out
roi_pool_kernel.cpp          ❌                       ❌
ps_roi_align_kernel.cpp      ❌                       ❌
ps_roi_pool_kernel.cpp       ❌                       ❌
box_iou_rotated_kernel.cpp   ❌                       ❌

Since no existing torchvision C++ code directly calls at::parallel_for or uses #pragma omp, the upstream setup.py has legitimately never needed -fopenmp / -lomp flags. The pre-built pip/conda wheels get OpenMP through the CI build scripts, and source builds had nothing to parallelize.

But now that this PR introduces at::parallel_for, source builds will silently get no speedup unless setup.py is also updated. at::parallel_for is a header-only template (ATen/Parallel.h) — the #pragma omp parallel inside it is compiled into the calling translation unit (_C.so), not into libtorch_cpu. Without -fopenmp at compile time, the compiler simply ignores the pragma.
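The silent fallback is easy to detect empirically, mirroring the thread-scaling diagnosis earlier in this thread: if the parallel backend is inactive, wall time stops improving as the thread count grows. A minimal torch-only probe (the reduction op and tensor size below are illustrative choices, not part of this PR):

```python
import time
import torch

def median_ms(fn, n_warmup=5, n_iter=10):
    """Median wall time of fn() in milliseconds."""
    for _ in range(n_warmup):
        fn()
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]

# Use an elementwise reduction that goes through ATen's internal
# parallelization rather than BLAS (matmul would scale regardless).
x = torch.randn(1 << 22)

def scaling():
    result = {}
    for n in (1, 2, 4):
        torch.set_num_threads(n)
        result[n] = median_ms(lambda: torch.logsumexp(x, dim=0))
    return result

print(scaling())  # times should drop with more threads if OpenMP is active
```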

Suggestion

Include the setup.py OpenMP patch (from my previous comment) in this PR. It's a small, self-contained addition:

  • Compile flags: -Xpreprocessor -fopenmp on macOS, -fopenmp on Linux
  • Link flags: -lomp (linking to PyTorch's bundled libomp) on macOS, -lgomp on Linux

This ensures anyone building from source benefits from the parallelization, and also unblocks future PRs that want to use at::parallel_for in other CPU kernels (e.g., roi_align, nms).

developer0hye added a commit to developer0hye/vision that referenced this pull request Mar 27, 2026
Source builds of torchvision do not pass -fopenmp (compile) or
-lomp/-lgomp (link) flags when building the _C extension. Since
at::parallel_for is a header-only template whose #pragma omp directives
are compiled into the calling translation unit (_C.so), the missing
flags cause it to silently fall back to sequential execution.

This has had no observable effect so far because no existing torchvision
C++ kernel directly uses at::parallel_for or #pragma omp. However,
upcoming changes (e.g. pytorch#9442) introduce at::parallel_for, and without
these flags source builds get 0% speedup from parallelization.

- macOS: -Xpreprocessor -fopenmp (compile) + -lomp from PyTorch's
  bundled libomp (link)
- Linux: -fopenmp (compile) + -lgomp (link)
- Windows: unchanged (uses /openmp via MSVC, already handled separately)

Fixes pytorch#2783

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@zy1git
Contributor

zy1git commented Mar 31, 2026

Hi @developer0hye,
Thanks a lot for the investigation! I am reviewing your PR and discussing with Nicolas. I tried your patched setup.py to link OpenMP and confirmed the speedup. One thing we want to clarify: in the original PR description, you reported the 2.5–3.3× speedup without mentioning that OpenMP needed to be linked separately. Was OpenMP already linked in your build environment at the time? We just want to understand your setup so we can evaluate the PR properly.

@developer0hye
Contributor Author

Hi @zy1git, good question — yes, OpenMP was already linked in my build environment when I reported the original numbers.

I had previously patched setup.py with -fopenmp / -lomp flags for unrelated experimentation, and that configuration carried over when I built and benchmarked this PR. I didn't realize at the time that it wasn't the upstream default for source builds, which is why the original PR description didn't mention it.

As I detailed in my follow-up comment, this is actually the first time at::parallel_for is used in torchvision's CPU kernels, so the upstream setup.py has legitimately never needed these flags. The pre-built pip/conda wheels get OpenMP through the CI build scripts, but local source builds will silently fall back to sequential execution without the flags.

That's exactly why I suggested including the setup.py OpenMP patch as part of this PR — so that source builds also benefit from the parallelization out of the box.
