## Bug
When building torchvision from source, `setup.py` does not pass OpenMP compile/link flags (`-fopenmp`, `-lomp`/`-lgomp`) to the C++ extension build. As a result, any torchvision C++ kernel that calls `at::parallel_for` silently falls back to sequential execution: `at::parallel_for` is a header-only template (`ATen/Parallel.h`) whose `#pragma omp parallel` directives are compiled into the calling translation unit (`_C.so`), not into `libtorch_cpu`.
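One quick way to check whether a built `_C.so` actually links an OpenMP runtime is to inspect its dynamic dependencies (`otool -L` on macOS, `ldd` on Linux). A small sketch; the `.so` path in the usage comment is hypothetical:

```python
import subprocess
import sys

# Common OpenMP runtime library names (LLVM, GNU, Intel).
OPENMP_RUNTIMES = ("libomp", "libgomp", "libiomp5")

def openmp_runtime(linked_libs):
    """Return the first linked library that looks like an OpenMP runtime, else None."""
    for lib in linked_libs:
        if any(rt in lib for rt in OPENMP_RUNTIMES):
            return lib
    return None

def linked_libs(so_path):
    """Tokens of the dynamic-dependency listing: `otool -L` on macOS, `ldd` elsewhere."""
    cmd = ["otool", "-L", so_path] if sys.platform == "darwin" else ["ldd", so_path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return out.split()

# Hypothetical usage against an installed extension:
#   openmp_runtime(linked_libs("torchvision/_C.so"))  # None => no OpenMP runtime linked
```

With the current `setup.py`, this check comes back empty for `_C.so`, which is the symptom described above.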
## Why this hasn't been a problem until now
I checked every file under `torchvision/csrc/ops/cpu/` on the current `main` branch:
| File | `at::parallel_for` | `#pragma omp` |
| --- | --- | --- |
| `deform_conv2d_kernel.cpp` | ❌ | ❌ |
| `nms_kernel.cpp` | ❌ | ❌ |
| `roi_align_kernel.cpp` | ❌ | commented out |
| `roi_pool_kernel.cpp` | ❌ | ❌ |
| `ps_roi_align_kernel.cpp` | ❌ | ❌ |
| `ps_roi_pool_kernel.cpp` | ❌ | ❌ |
| `box_iou_rotated_kernel.cpp` | ❌ | ❌ |
No existing torchvision C++ code directly uses OpenMP parallelism, so the missing flags had no observable effect. The pre-built pip/conda wheels are built via CI scripts that handle OpenMP separately.
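The audit above can be reproduced with a short scan over the kernel sources. A hypothetical helper, to be run from a torchvision checkout:

```python
from pathlib import Path

def files_using(pattern, root):
    """Names of .cpp files under `root` whose source text contains `pattern`."""
    return sorted(
        p.name
        for p in Path(root).rglob("*.cpp")
        if pattern in p.read_text(errors="ignore")
    )

# In a torchvision checkout, this comes back empty on current main:
#   files_using("at::parallel_for", "torchvision/csrc/ops/cpu")
```

(A plain string match like this also counts commented-out usages, so `roi_align_kernel.cpp` needs a manual look, as in the table.)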
## Why it matters now
PR #9442 introduces `at::parallel_for` to the `deform_conv2d` CPU forward kernel, the first direct usage in torchvision's codebase. Without the compile/link flags, source builds get 0% of the speedup from the parallelization, while the change is designed to deliver 2.5–3.0×.
I confirmed this on Apple M2 (macOS ARM, 4 threads). Thread scaling with and without OpenMP flags:
Without `-fopenmp` (current `setup.py`):

```
Threads   s32-b1   s32-b4   s64-b1   s64-b4
----------------------------------------------------
1           2.99    16.76    78.12   324.52
4           2.65    16.14    75.25   313.23   ← no scaling
```
With `-fopenmp` + `-lomp`:

```
Threads   s32-b1   s32-b4   s64-b1   s64-b4
----------------------------------------------------
1           2.91    15.71    75.60   310.34
4           1.07     5.36    30.33   121.75   ← scales as expected
```
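A thread-scaling check along these lines can be run with a minimal harness; the sketch below is my simplified version, not the exact benchmark script (shapes are assumptions, and I read column labels like `s32-b1` as spatial size 32, batch 1):

```python
import statistics
import time

def bench(fn, repeats=10):
    """Median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

try:
    import torch
    from torchvision.ops import deform_conv2d

    # Small shapes for a quick smoke test; the table above used larger ones.
    x = torch.randn(1, 8, 32, 32)
    w = torch.randn(8, 8, 3, 3)
    offsets = torch.randn(1, 2 * 3 * 3, 30, 30)  # (N, 2*kh*kw, out_h, out_w)

    for n in (1, 4):
        torch.set_num_threads(n)
        print(f"{n} threads: {bench(lambda: deform_conv2d(x, offsets, w)):.2f} ms")
except ImportError:
    pass  # torch/torchvision not installed; the timing helper still works standalone
```

On a build without the OpenMP flags, the two timings come out nearly identical regardless of `torch.set_num_threads`.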
## Proposed fix
Add OpenMP flags to `setup.py`:
```diff
--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
     if sysconfig.get_config_var("Py_GIL_DISABLED"):
         extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,12 +189,22 @@ def make_C_extension():
         sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRC_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )
```
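The platform logic in the diff can be summarized as a standalone helper (hypothetical function, mirroring the proposed change; Windows/MSVC is deliberately left untouched, as in the diff):

```python
def openmp_args(platform, torch_lib_dir="."):
    """OpenMP compile/link args per platform, mirroring the proposed setup.py change."""
    if platform == "darwin":
        # Apple clang has no -fopenmp driver support, so the flag is passed through
        # with -Xpreprocessor, and we link the libomp copy shipped in the torch wheel.
        return ["-Xpreprocessor", "-fopenmp"], [f"-L{torch_lib_dir}", "-lomp"]
    if platform == "win32":
        return [], []  # untouched by the proposed change
    # GCC/Linux default: driver flag plus the GNU OpenMP runtime.
    return ["-fopenmp"], ["-lgomp"]
```

Reusing PyTorch's own `libomp` on macOS avoids both a Homebrew dependency and the crash-prone situation of two different OpenMP runtimes loaded into one process.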
This also unblocks future parallelization of other CPU kernels (`roi_align`, `nms`, etc.) as originally proposed in #6619.
## Related

- #9442: introduces `at::parallel_for` to the `deform_conv2d` CPU forward kernel
- `warning: ignoring #pragma omp parallel` reported in 2020 (same root cause, still open)
- #6619: `roi_align` OpenMP parallelization request, blocked by this same missing flag
## Versions
- PyTorch: 2.10.0
- torchvision: 0.26.0 (source build)
- macOS 26.3.1, Apple M2, ARM64
- Python 3.12
cc @NicolasHug