Add OpenMP compile/link flags to setup.py for source builds #9456
developer0hye wants to merge 1 commit into pytorch:main
Conversation
Source builds of torchvision do not pass `-fopenmp` (compile) or `-lomp`/`-lgomp` (link) flags when building the `_C` extension. Since `at::parallel_for` is a header-only template whose `#pragma omp` directives are compiled into the calling translation unit (`_C.so`), the missing flags cause it to silently fall back to sequential execution. This has had no observable effect so far because no existing torchvision C++ kernel directly uses `at::parallel_for` or `#pragma omp`. However, upcoming changes (e.g. pytorch#9442) introduce `at::parallel_for`, and without these flags source builds get 0% speedup from parallelization.

- macOS: `-Xpreprocessor -fopenmp` (compile) + `-lomp` from PyTorch's bundled `libomp` (link)
- Linux: `-fopenmp` (compile) + `-lgomp` (link)
- Windows: unchanged (uses `/openmp` via MSVC, already handled separately)

Fixes pytorch#2783

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
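The per-platform flag selection described above can be sketched as follows. This is a minimal illustration, assuming a helper in `setup.py`; the function name and structure are hypothetical, not the PR's actual code:

```python
import sys


def openmp_flags():
    """Return (extra_compile_args, extra_link_args) for the _C extension.

    Illustrative sketch of the per-platform OpenMP flags from the PR
    description; not the PR's actual setup.py code.
    """
    if sys.platform == "darwin":
        # Apple clang requires -Xpreprocessor to accept -fopenmp, and
        # links against PyTorch's bundled libomp.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if sys.platform.startswith("linux"):
        # GCC's OpenMP runtime.
        return ["-fopenmp"], ["-lgomp"]
    # Windows uses /openmp via MSVC, handled separately.
    return [], []
```

In a real `setup.py` these two lists would be passed to the `extra_compile_args` and `extra_link_args` parameters of `setuptools.Extension`.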
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9456
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures — as of commit df554de with merge base 8a5946e, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
JiwaniZakir left a comment
The Linux branch unconditionally uses -lgomp (GCC's OpenMP runtime), but users building with Clang on Linux need -lomp instead. This will cause a linker failure for any Clang-based Linux build without a clear error message. A compiler detection check (e.g., inspecting os.environ.get("CC") or the compiler being used) would be needed to choose the right runtime.
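A sketch of the compiler detection suggested here, assuming the build honors the `CC` environment variable; the helper name is illustrative:

```python
import os
import subprocess


def linux_openmp_runtime():
    """Pick -lomp for Clang and -lgomp for GCC on Linux.

    Illustrative sketch of the reviewer's suggestion; inspects the
    compiler named by CC rather than assuming GCC.
    """
    cc = os.environ.get("CC", "cc")
    try:
        version = subprocess.check_output(
            [cc, "--version"], text=True, stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        # If detection fails, fall back to GCC's runtime, matching the
        # PR's current unconditional behavior.
        return "-lgomp"
    return "-lomp" if "clang" in version.lower() else "-lgomp"
```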
On macOS, torch_lib_dir is constructed from torch.__file__ but never verified to actually contain libomp.dylib before it's passed as a linker flag. If a user has a custom or stripped PyTorch build, this produces a cryptic linker error rather than a useful diagnostic — an os.path.exists(os.path.join(torch_lib_dir, "libomp.dylib")) check with a descriptive RuntimeError would improve the experience significantly.
There's also a structural inconsistency: compile flags are added inside get_macros_and_flags(), but the corresponding link flags are constructed separately in make_C_extension(). These are two halves of the same feature and their setup is now split across functions, making it easy for a future refactor to introduce a mismatch where -fopenmp is compiled in but the runtime is never linked (or vice versa). Returning extra_link_args from get_macros_and_flags alongside the compile args would keep the OpenMP logic cohesive.
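One way to keep the two halves together, per the suggestion above. This sketch reuses the `get_macros_and_flags` name from the PR, but the body is illustrative, not the PR's actual code:

```python
import sys


def get_macros_and_flags():
    """Return (define_macros, extra_compile_args, extra_link_args) together.

    Illustrative sketch: computing the OpenMP compile flag and its runtime
    library in one place means they cannot drift apart in a refactor.
    """
    macros, compile_args, link_args = [], [], []
    if sys.platform == "darwin":
        compile_args += ["-Xpreprocessor", "-fopenmp"]
        link_args += ["-lomp"]
    elif sys.platform.startswith("linux"):
        compile_args += ["-fopenmp"]
        link_args += ["-lgomp"]
    # Windows keeps its existing /openmp handling elsewhere.
    return macros, compile_args, link_args
```

The caller in `make_C_extension()` would then pass both lists straight through to the extension, instead of reconstructing the link flags itself.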
Summary
Add the `-fopenmp` compile flag and `-lomp`/`-lgomp` link flag to `setup.py` so that `at::parallel_for` (and other OpenMP constructs) in torchvision's C++ extensions actually parallelize when built from source.

- macOS: `-Xpreprocessor -fopenmp` + `-lomp` (from PyTorch's bundled `libomp`)
- Linux: `-fopenmp` + `-lgomp`

Motivation
`at::parallel_for` is a header-only template (`ATen/Parallel.h`) — its `#pragma omp parallel` directives are compiled into the calling translation unit (`_C.so`), not into `libtorch_cpu`. Without `-fopenmp` at compile time, the compiler silently ignores the pragma and `at::parallel_for` falls back to sequential execution.

No existing torchvision C++ kernel currently calls `at::parallel_for` directly, so this has had no observable effect. However:

- #9442 adds `at::parallel_for` to the `deform_conv2d` CPU forward — source builds get 0% speedup without this fix
- the compiler emits `warning: ignoring #pragma omp parallel` during source builds — this is the root cause
- `roi_align` OpenMP parallelization — blocked by the same missing flags

Pre-built pip/conda wheels are unaffected (CI build scripts handle OpenMP separately).
Verification
Before (current `setup.py`):

Thread scaling with `at::parallel_for` in deform_conv2d (#9442):

After (this PR):
Test plan
- `otool -L torchvision/_C.so | grep omp` shows `libomp` on a macOS source build
- `at::parallel_for` scales with thread count (using the "Optimize CPU deform_conv2d forward pass with parallel im2col" #9442 benchmark)
- `python -m pytest test/test_ops.py -v` passes
- Linux source build links `-lgomp` without errors

Fixes #2783
Related: #9442, #4935, #6619, #9455
cc @NicolasHug
🤖 Generated with Claude Code