Optimize CPU deform_conv2d forward pass with parallel im2col #9442
developer0hye wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9442
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 137d2b7 with merge base 8a5946e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Three changes to the CPU deformable convolution forward kernel:

1. Replace `at::zeros` with `at::empty` for the `columns` and `out_buf` buffers. The `deformable_im2col_kernel` writes every element of the `columns` buffer, and `out_buf` is fully written by `addmm_`, so zero-initialization is wasted work.
2. Use `addmm_` with `beta=0` instead of the default `beta=1`. This avoids accumulating into uninitialized memory while preserving in-place operation (no extra allocation, unlike `at::mm`).
3. Parallelize `deformable_im2col_kernel` with `at::parallel_for`. The im2col loop was the only single-threaded phase in the forward pass (GEMM is already parallelized by BLAS). Each loop iteration writes to a non-overlapping region of the `columns` buffer, so parallelization is safe.

Benchmark results on Apple M2 (CPU, float32):

| Config | Before (ms) | After (ms) | Change |
|---|---|---|---|
| small-b1 | 9.76 | 2.44 | -75% |
| small-b8 | 91.77 | 33.88 | -63% |
| medium-b1 | 216.70 | 75.80 | -65% |
| medium-b8 | 1152.09 | 650.00 | -44% |
| large-b1 | 348.86 | 302.70 | -13% |
| large-b4 | 1342.75 | 1289.96 | -4% |

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
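As an aside on point 2 above (`addmm_` with `beta=0`): per the `torch.addmm` documentation, `beta=0` means the destination's prior contents are ignored and NaN/inf in them are not propagated, which is what makes pairing it with `at::empty` safe. A minimal sketch (assuming PyTorch is installed; the helper name is made up for illustration):

```python
def addmm_beta0_demo():
    """Show addmm_ with beta=0 ignoring (simulated) uninitialized output; None if torch is absent."""
    try:
        import torch
    except ImportError:  # hedge: torch may not be installed in every environment
        return None
    a = torch.ones(2, 3)
    b = torch.ones(3, 2)
    # Stand-in for an uninitialized at::empty buffer: worst-case NaN contents.
    out = torch.full((2, 2), float("nan"))
    out.addmm_(a, b, beta=0, alpha=1)  # beta=0: prior contents never enter the result
    return torch.equal(out, a @ b)


print(addmm_beta0_demo())
```

With the default `beta=1`, the NaNs would poison the result; `beta=0` makes the uninitialized buffer harmless.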
force-pushed from e653cad to 8a89fb8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
@developer0hye Hi, thanks a lot for this PR! May I ask what's the motivation for optimizing the CPU path for deform_conv2d? It's almost always used on GPU. Is there a specific application in your use case?
@zy1git Great question!
On a personal note, I've been working on in-browser ML inference — things like humanblur and bgremover. These are built with Candle + WASM rather than PyTorch, so admittedly a different stack, but the experience taught me that CPU-side efficiency matters more than you'd expect — not every user has GPU acceleration available, even in a browser. That mindset carried over here: if a CPU path exists and there's a straightforward way to make it 3x faster, it's worth doing.
Hi @developer0hye, thanks a lot for the explanation of the use case. I wrote the benchmark code below based on yours but could not reproduce the improvement. Could you please share your exact benchmark and profiler code? I can run them to try to reproduce the improvement; the benchmark code you shared has no print or config info. Also feel free to run my benchmark code below to see whether you get the improvement you reported. Thank you!

Reproduction: local laptop (Apple M2 Pro, ARM/NEON), 4 threads (OMP_NUM_THREADS=4), matching the PR's setup.
Benchmark script:

```python
import time

import torch
from torchvision.ops import deform_conv2d


def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]


configs = [
    ("s32-b1", 1, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b3", 3, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b4", 4, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s32-b8", 8, 64, 64, 32, 32, 3, 3, 1, 1),
    ("s64-b1", 1, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b4", 4, 256, 256, 64, 64, 3, 3, 1, 1),
    ("s64-b7", 7, 256, 256, 64, 64, 3, 3, 1, 1),
]

print(f"Hardware: {torch.backends.cpu.get_cpu_capability()}")
print(f"Threads: {torch.get_num_threads()}")
print("Dtype: float32, with mask (DCNv2 mode)")
print()
print(f"{'Config':<12} {'Median (ms)':>12}")
print("─" * 26)
for name, batch, in_c, out_c, h, w, kh, kw, stride, pad in configs:
    median_ms = benchmark_forward(batch, in_c, out_c, h, w, kh, kw, stride, pad)
    print(f"{name:<12} {median_ms:>12.2f}")
```
Hi @zy1git, thanks for trying to reproduce. I tracked down the root cause.

TL;DR

The speedup depends on `at::parallel_for` actually running multi-threaded, which requires OpenMP flags that a default `setup.py` source build does not pass. Once OpenMP is properly linked, I confirm the 2.5–3.0× speedup on Apple M2.

Diagnosis

I verified the problem by checking thread scaling, first before adding OpenMP flags and then again after adding the flags and rebuilding.

Reproduction results (Apple M2, 4 threads)

How to reproduce (step-by-step for macOS ARM)

Prerequisites:

```shell
# Python 3.12, PyTorch 2.10.0
pip install torch==2.10.0

# setuptools must be <81 for pkg_resources compatibility
pip install "setuptools<81"

# OpenMP runtime (already installed if you have torch, but just in case)
brew install libomp
```

1. Clone and checkout:

```shell
git clone https://github.com/pytorch/vision.git
cd vision

# Baseline (main branch)
git checkout main
```

2. Patch
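The thread-scaling check described in the diagnosis can be sketched in Python as well (a rough timing helper, assuming torch and torchvision are installed; the helper name is made up here). If the im2col phase is truly parallel, the per-iteration time should drop as the thread count grows:

```python
import time


def deform_conv2d_ms_per_iter(num_threads, n_iter=20):
    """Rough per-iteration forward time (ms) at a given thread count; None if deps are missing."""
    try:
        import torch
        from torchvision.ops import deform_conv2d
    except ImportError:  # hedge: torch/torchvision may not be installed
        return None
    torch.set_num_threads(num_threads)
    x = torch.randn(1, 64, 32, 32)
    weight = torch.randn(64, 64, 3, 3)
    # 3x3 kernel, stride 1, padding 1 -> output is also 32x32
    offset = torch.randn(1, 2 * 3 * 3, 32, 32)
    deform_conv2d(x, offset, weight, padding=1)  # warmup
    t0 = time.perf_counter()
    for _ in range(n_iter):
        deform_conv2d(x, offset, weight, padding=1)
    return (time.perf_counter() - t0) / n_iter * 1e3


for n in (1, 4):
    print(f"{n} thread(s): {deform_conv2d_ms_per_iter(n)} ms")
```

On a build where the extension silently fell back to sequential execution, the two timings come out nearly identical.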
Follow-up: I checked every CPU kernel in torchvision for existing use of `at::parallel_for` or `#pragma omp`:

| File | `at::parallel_for` | `#pragma omp` |
|---|---|---|
| `deform_conv2d_kernel.cpp` | ✅ (added by this PR) | ❌ |
| `nms_kernel.cpp` | ❌ | ❌ |
| `roi_align_kernel.cpp` | ❌ | commented out |
| `roi_pool_kernel.cpp` | ❌ | ❌ |
| `ps_roi_align_kernel.cpp` | ❌ | ❌ |
| `ps_roi_pool_kernel.cpp` | ❌ | ❌ |
| `box_iou_rotated_kernel.cpp` | ❌ | ❌ |
Since no existing torchvision C++ code directly calls at::parallel_for or uses #pragma omp, the upstream setup.py has legitimately never needed -fopenmp / -lomp flags. The pre-built pip/conda wheels get OpenMP through the CI build scripts, and source builds had nothing to parallelize.
But now that this PR introduces at::parallel_for, source builds will silently get no speedup unless setup.py is also updated. at::parallel_for is a header-only template (ATen/Parallel.h) — the #pragma omp parallel inside it is compiled into the calling translation unit (_C.so), not into libtorch_cpu. Without -fopenmp at compile time, the compiler simply ignores the pragma.
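One quick sanity check is `torch.__config__.parallel_info()`, which prints ATen's intra-op backend and thread counts (a sketch assuming PyTorch is installed; note it reports how libtorch itself was built, so it cannot tell you whether a separately compiled `_C.so` received `-fopenmp`):

```python
def aten_parallel_info():
    """Return ATen's parallel-backend report as a string, or None if torch is absent."""
    try:
        import torch
    except ImportError:  # hedge: torch may not be installed
        return None
    # Lists the intra-op backend (OpenMP / native) and configured thread counts.
    return torch.__config__.parallel_info()


info = aten_parallel_info()
if info is not None:
    print(info)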
Suggestion
Include the setup.py OpenMP patch (from my previous comment) in this PR. It's a small, self-contained addition:
- Compile flags: `-Xpreprocessor -fopenmp` on macOS, `-fopenmp` on Linux
- Link flags: `-lomp` (linking to PyTorch's bundled `libomp`) on macOS, `-lgomp` on Linux
This ensures anyone building from source benefits from the parallelization, and also unblocks future PRs that want to use at::parallel_for in other CPU kernels (e.g., roi_align, nms).
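The platform logic for such a patch could look roughly like this (a sketch with a hypothetical helper name `openmp_build_flags`; torchvision's actual `setup.py` wiring is not reproduced here):

```python
import sys


def openmp_build_flags(platform=sys.platform):
    """Pick OpenMP compile/link flags per platform (sketch, not torchvision's real code)."""
    if platform == "darwin":
        # Apple clang needs -Xpreprocessor to accept -fopenmp;
        # link against PyTorch's bundled libomp.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if platform.startswith("linux"):
        return ["-fopenmp"], ["-lgomp"]
    # Windows: MSVC's /openmp is handled separately, so no extra flags here.
    return [], []


extra_compile_args, extra_link_args = openmp_build_flags()
```

The two lists would then be appended to the `Extension`'s `extra_compile_args` and `extra_link_args`.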
Source builds of torchvision do not pass -fopenmp (compile) or -lomp/-lgomp (link) flags when building the _C extension. Since at::parallel_for is a header-only template whose #pragma omp directives are compiled into the calling translation unit (_C.so), the missing flags cause it to silently fall back to sequential execution.

This has had no observable effect so far because no existing torchvision C++ kernel directly uses at::parallel_for or #pragma omp. However, upcoming changes (e.g. pytorch#9442) introduce at::parallel_for, and without these flags source builds get 0% speedup from parallelization.

- macOS: -Xpreprocessor -fopenmp (compile) + -lomp from PyTorch's bundled libomp (link)
- Linux: -fopenmp (compile) + -lgomp (link)
- Windows: unchanged (uses /openmp via MSVC, already handled separately)

Fixes pytorch#2783
Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Hi @developer0hye,
Hi @zy1git, good question: yes, OpenMP was already linked in my build environment when I reported the original numbers. I had previously patched `setup.py` with the OpenMP compile and link flags, which is why my local build ran the new code multi-threaded.

As I detailed in my follow-up comment, this is actually the first time `at::parallel_for` is used directly in torchvision's C++ code, so a default source build silently runs it single-threaded.

That's exactly why I suggested including the `setup.py` OpenMP patch in this PR.
Summary

The CPU `deform_conv2d` forward pass spends 89–97% of its time in the `deformable_im2col_kernel` (confirmed via `torch.profiler`), yet this kernel runs entirely single-threaded. GEMM (`addmm_`) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to `torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp` that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

1. Parallelize `deformable_im2col_kernel` with `at::parallel_for`. Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by `(in_c, out_b, out_y, out_x)`), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.
2. Replace `at::zeros` with `at::empty` for the `columns` buffer. `deformable_im2col_kernel` writes every element of this buffer (`n_in_channels × kH × kW × parallel_imgs × out_h × out_w` elements total), so zero-initialization is wasted work.
3. Replace `at::zeros` with `at::empty` for `out_buf` and use `addmm_` with `beta=0`. Each `out_buf[b][g]` is written exactly once per `(batch_block, weight_group)` pair. Using `beta=0` skips the accumulation of uninitialized values while preserving in-place operation (unlike `at::mm`, which allocates a new tensor).

Benchmark
All measurements use `time.perf_counter()`, 10 warmup + 100 timed iterations, reporting the median.

- Hardware: Apple M2, `torch.get_num_threads() = 4`
- Dtype: float32, with mask (DCNv2 mode)
- Config format: `s{spatial}-b{batch}`, e.g. `s32-b4` = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. `s64-*` uses 256 in/out channels.

Profiler breakdown (baseline, s32-b1)
Benchmark script
Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (`torch.equal` returns `True`). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.

All existing `TestDeformConv` tests pass (forward, backward, scripting, opcheck).

Related

- `deform_conv2d` kernels are sequential and don't utilize multicore resources

cc @NicolasHug
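The disjoint-slice argument can be modeled in a few lines of pure Python: with the column buffer laid out as `(in_c * kH * kW) x (batch * out_h * out_w)`, every `(channel, batch, output pixel, kernel tap)` combination maps to a distinct flat offset. A sketch with tiny made-up sizes (not the real kernel's dimensions):

```python
from itertools import product

# Tiny illustrative sizes, not the benchmark configs.
C, B, H, W, KH, KW = 2, 2, 3, 3, 2, 2

offsets = set()
for c, b, y, x in product(range(C), range(B), range(H), range(W)):
    # One parallel work item writes the KH*KW column entries for (c, b, y, x).
    for i, j in product(range(KH), range(KW)):
        row = (c * KH + i) * KW + j           # row in the column buffer
        col = (b * H + y) * W + x             # column in the column buffer
        offsets.add(row * (B * H * W) + col)  # flat offset into the buffer

# Every buffer element is written exactly once, so threads never overlap.
assert len(offsets) == C * KH * KW * B * H * W
print("unique offsets:", len(offsets))
```

Because the offset sets of different work items are disjoint, each element is produced by exactly one thread with the same arithmetic as the sequential loop, which is why the output is bit-for-bit identical at any thread count.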