
Source builds on macOS/Linux missing OpenMP flags — at::parallel_for silently falls back to sequential #9455

@developer0hye

Bug

When building torchvision from source, setup.py does not pass OpenMP compile/link flags (-fopenmp, -lomp/-lgomp) to the C++ extension build. This means any torchvision C++ kernel that calls at::parallel_for will silently fall back to sequential execution, because at::parallel_for is a header-only template (ATen/Parallel.h) whose #pragma omp parallel directives are compiled into the calling translation unit (_C.so), not into libtorch_cpu.

Why this hasn't been a problem until now

I checked every file under torchvision/csrc/ops/cpu/ on the current main branch:

| File | `at::parallel_for` | `#pragma omp` |
|---|---|---|
| deform_conv2d_kernel.cpp | no | no |
| nms_kernel.cpp | no | no |
| roi_align_kernel.cpp | no | commented out |
| roi_pool_kernel.cpp | no | no |
| ps_roi_align_kernel.cpp | no | no |
| ps_roi_pool_kernel.cpp | no | no |
| box_iou_rotated_kernel.cpp | no | no |

No existing torchvision C++ code directly uses OpenMP parallelism, so the missing flags had no observable effect. The pre-built pip/conda wheels are built via CI scripts that handle OpenMP separately.

Why it matters now

PR #9442 introduces at::parallel_for to the deform_conv2d CPU forward kernel, the first direct usage in torchvision's codebase. Without the compile/link flags, source builds see no speedup from the parallelization, even though the change is designed to deliver 2.5–3.0×.

I confirmed this on Apple M2 (macOS ARM, 4 threads). Thread scaling with and without OpenMP flags:

Without -fopenmp (current setup.py):

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.99      16.76      78.12     324.52
4              2.65      16.14      75.25     313.23   ← no scaling

With -fopenmp + -lomp:

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.91      15.71      75.60     310.34
4              1.07       5.36      30.33     121.75   ← scales as expected

Proposed fix

Add OpenMP flags to setup.py:

--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
         if sysconfig.get_config_var("Py_GIL_DISABLED"):
             extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,12 +189,22 @@ def make_C_extension():
             sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRS_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )

This also unblocks future parallelization of other CPU kernels (roi_align, nms, etc.) as originally proposed in #6619.

Versions

  • PyTorch: 2.10.0
  • torchvision: 0.26.0 (source build)
  • macOS 26.3.1, Apple M2, ARM64
  • Python 3.12

cc @NicolasHug
