Add OpenMP compile/link flags to setup.py for source builds #9456
developer0hye wants to merge 1 commit into pytorch:main
Conversation
Source builds of torchvision do not pass `-fopenmp` (compile) or `-lomp`/`-lgomp` (link) flags when building the `_C` extension. Since `at::parallel_for` is a header-only template whose `#pragma omp` directives are compiled into the calling translation unit (`_C.so`), the missing flags cause it to silently fall back to sequential execution. This has had no observable effect so far because no existing torchvision C++ kernel directly uses `at::parallel_for` or `#pragma omp`. However, upcoming changes (e.g. pytorch#9442) introduce `at::parallel_for`, and without these flags source builds get 0% speedup from parallelization.

- macOS: `-Xpreprocessor -fopenmp` (compile) + `-lomp` from PyTorch's bundled `libomp` (link)
- Linux: `-fopenmp` (compile) + `-lgomp` (link)
- Windows: unchanged (uses `/openmp` via MSVC, already handled separately)

Fixes pytorch#2783

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
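The per-platform flag selection described above can be sketched as follows. This is a minimal illustration, assuming a helper in `setup.py`; the function name and structure are hypothetical, not the PR's actual code:

```python
import sys


def openmp_flags():
    """Return (extra_compile_args, extra_link_args) for the _C extension.

    Illustrative sketch of the per-platform OpenMP flags from the PR
    description; not the PR's actual setup.py code.
    """
    if sys.platform == "darwin":
        # Apple clang requires -Xpreprocessor to accept -fopenmp, and
        # links against PyTorch's bundled libomp.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if sys.platform.startswith("linux"):
        # GCC's OpenMP runtime.
        return ["-fopenmp"], ["-lgomp"]
    # Windows uses /openmp via MSVC, handled separately.
    return [], []
```

In a real `setup.py` these two lists would be passed to the `extra_compile_args` and `extra_link_args` parameters of `setuptools.Extension`.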
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9456
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures — as of commit df554de with merge base 8a5946e, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
JiwaniZakir left a comment
The Linux branch unconditionally uses -lgomp (GCC's OpenMP runtime), but users building with Clang on Linux need -lomp instead. This will cause a linker failure for any Clang-based Linux build without a clear error message. A compiler detection check (e.g., inspecting os.environ.get("CC") or the compiler being used) would be needed to choose the right runtime.
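A sketch of the compiler detection suggested here, assuming the build honors the `CC` environment variable; the helper name is illustrative:

```python
import os
import subprocess


def linux_openmp_runtime():
    """Pick -lomp for Clang and -lgomp for GCC on Linux.

    Illustrative sketch of the reviewer's suggestion; inspects the
    compiler named by CC rather than assuming GCC.
    """
    cc = os.environ.get("CC", "cc")
    try:
        version = subprocess.check_output(
            [cc, "--version"], text=True, stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        # If detection fails, fall back to GCC's runtime, matching the
        # PR's current unconditional behavior.
        return "-lgomp"
    return "-lomp" if "clang" in version.lower() else "-lgomp"
```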
On macOS, torch_lib_dir is constructed from torch.__file__ but never verified to actually contain libomp.dylib before it's passed as a linker flag. If a user has a custom or stripped PyTorch build, this produces a cryptic linker error rather than a useful diagnostic — an os.path.exists(os.path.join(torch_lib_dir, "libomp.dylib")) check with a descriptive RuntimeError would improve the experience significantly.
There's also a structural inconsistency: compile flags are added inside get_macros_and_flags(), but the corresponding link flags are constructed separately in make_C_extension(). These are two halves of the same feature and their setup is now split across functions, making it easy for a future refactor to introduce a mismatch where -fopenmp is compiled in but the runtime is never linked (or vice versa). Returning extra_link_args from get_macros_and_flags alongside the compile args would keep the OpenMP logic cohesive.
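One way to keep the two halves together, per the suggestion above. This sketch reuses the `get_macros_and_flags` name from the PR, but the body is illustrative, not the PR's actual code:

```python
import sys


def get_macros_and_flags():
    """Return (define_macros, extra_compile_args, extra_link_args) together.

    Illustrative sketch: computing the OpenMP compile flag and its runtime
    library in one place means they cannot drift apart in a refactor.
    """
    macros, compile_args, link_args = [], [], []
    if sys.platform == "darwin":
        compile_args += ["-Xpreprocessor", "-fopenmp"]
        link_args += ["-lomp"]
    elif sys.platform.startswith("linux"):
        compile_args += ["-fopenmp"]
        link_args += ["-lgomp"]
    # Windows keeps its existing /openmp handling elsewhere.
    return macros, compile_args, link_args
```

The caller in `make_C_extension()` would then pass both lists straight through to the extension, instead of reconstructing the link flags itself.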
Summary
Add the `-fopenmp` compile flag and `-lomp`/`-lgomp` link flag to `setup.py` so that `at::parallel_for` (and other OpenMP constructs) in torchvision's C++ extensions actually parallelize when built from source.

- macOS: `-Xpreprocessor -fopenmp` + `-lomp` (from PyTorch's bundled `libomp`)
- Linux: `-fopenmp` + `-lgomp`

Motivation
`at::parallel_for` is a header-only template (`ATen/Parallel.h`) — its `#pragma omp parallel` directives are compiled into the calling translation unit (`_C.so`), not into `libtorch_cpu`. Without `-fopenmp` at compile time, the compiler silently ignores the pragma and `at::parallel_for` falls back to sequential execution.

No existing torchvision C++ kernel currently calls `at::parallel_for` directly, so this has had no observable effect. However:

- #9442 adds `at::parallel_for` to the `deform_conv2d` CPU forward — source builds get 0% speedup without this fix
- the compiler emits `warning: ignoring #pragma omp parallel` during source builds — this is the root cause
- `roi_align` OpenMP parallelization — blocked by the same missing flags

Pre-built pip/conda wheels are unaffected (CI build scripts handle OpenMP separately).
Verification
Before (current `setup.py`):

Thread scaling with `at::parallel_for` in deform_conv2d (#9442):

After (this PR):
Test plan
- `otool -L torchvision/_C.so | grep omp` shows `libomp` on a macOS source build
- `at::parallel_for` scales with thread count (using the "Optimize CPU deform_conv2d forward pass with parallel im2col" #9442 benchmark)
- `python -m pytest test/test_ops.py -v` passes
- Linux source build links `-lgomp` without errors

Fixes #2783
Related: #9442, #4935, #6619, #9455
cc @NicolasHug
🤖 Generated with Claude Code