The lowoha_params structure includes a num_threads field that is used for optimization decisions (such as selecting tile sizes and partitioning strategies), but it is not passed to the actual thread management in several parallel code paths. When I set params.num_threads = 2, the optimization logic sees and uses this value, but the backend library still uses all available CPU cores.
Problem
I'm trying to control the thread count for LOWOHA matmul, but setting params.num_threads has no effect. The code always falls back to omp_get_max_threads() regardless of what I set.
There are many places in the code where thread control is missing. Here are 4 examples:
- zendnnl_parallel_for utility function
File: lowoha_matmul_utils.hpp:41
- Batch parallel partitioning (high FLOPS path)
File: lowoha_matmul.cpp:253
- Batch parallel partitioning (low FLOPS path)
File: lowoha_matmul.cpp:298
- AOCL-DLP and BLIS calls
File: aocl_kernel.cpp:1050-1095
The AOCL and BLIS libraries use OpenMP internally and respect omp_set_num_threads(), but we never call it before invoking their GEMM functions:
// run_blis function - no thread control!
aocl_gemm_f32f32f32of32(...);
aocl_gemm_bf16s4f32of32(...);
aocl_gemm_bf16bf16f32obf16(...);
Suggested Fix
- for OpenMP loops
#pragma omp parallel for collapse(2) num_threads(num_threads)
- for parallel-for utility
template <class F>
inline void zendnnl_parallel_for(const int64_t begin, const int64_t end,
const int64_t grain_size,
const int64_t max_num_threads, // Add this
const F &f);
- for AOCL/BLIS
void run_blis(...) {
omp_set_num_threads(lowoha_param.num_threads); // Add this
...
}
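As a rough sketch of the first two items, the thread cap can be passed into a parallel-for helper and applied with the num_threads clause. The names parallel_for_capped and square_all and the chunking logic are illustrative assumptions; this is not the actual zendnnl_parallel_for implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical variant of the parallel-for utility that takes an explicit
// thread cap. The body is an assumption made for illustration only.
template <class F>
inline void parallel_for_capped(const int64_t begin, const int64_t end,
                                const int64_t grain_size,
                                const int64_t max_num_threads,
                                const F &f) {
  if (begin >= end) return;
  const int64_t grain = std::max<int64_t>(grain_size, 1);
  const int64_t num_chunks = (end - begin + grain - 1) / grain;
  // Request at least one thread and never more threads than chunks.
  const int threads = static_cast<int>(
      std::max<int64_t>(1, std::min<int64_t>(max_num_threads, num_chunks)));
  (void)threads;  // silences the unused warning when OpenMP is disabled

  // The num_threads clause applies to this region only and leaves the global
  // OpenMP setting untouched. Without OpenMP the pragma is ignored and the
  // loop simply runs serially.
#pragma omp parallel for num_threads(threads) schedule(static)
  for (int64_t c = 0; c < num_chunks; ++c) {
    const int64_t lo = begin + c * grain;
    const int64_t hi = std::min<int64_t>(lo + grain, end);
    f(lo, hi);  // each chunk covers the half-open range [lo, hi)
  }
}

// Usage: square the indices 0..n-1 using at most max_threads threads.
inline std::vector<int64_t> square_all(int64_t n, int64_t max_threads) {
  std::vector<int64_t> out(static_cast<std::size_t>(n));
  parallel_for_capped(0, n, 4, max_threads, [&](int64_t lo, int64_t hi) {
    for (int64_t i = lo; i < hi; ++i) {
      out[static_cast<std::size_t>(i)] = i * i;
    }
  });
  return out;
}
```

Because the cap is a clause on the region rather than a call to omp_set_num_threads(), it does not leak into later parallel regions, which is what a per-call num_threads needs.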
Change default behavior
Consider changing the default num_threads value in the constructor from 0 to 1:
// in lowoha_common.hpp
lowoha_params() : dtypes(), postop_(), quant_params(), mem_format_a('n'),
mem_format_b('n'), lowoha_algo(matmul_algo_t::none), num_threads(1) {} // Changed from 0 to 1
This is a more sensible default: when the thread count isn't explicitly set, the call runs single-threaded rather than using all available cores.
Additional Info
- Component: LOWOHA matmul operators
- Affects: All backends (AOCL-DLP, BLIS, LibXSMM, oneDNN)
The purpose of this issue is to enable API-level thread control. Currently, the thread count can only be set via the OMP_NUM_THREADS environment variable; making params.num_threads work properly would allow per-call thread control.
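For the library calls, per-call control could look roughly like the scoped helper below. It sets the OpenMP thread count for one call and restores the previous value afterwards. The names scoped_omp_threads, fake_backend_call, and call_with_threads are hypothetical, and the sketch assumes, as described above, that the backend honors the OpenMP setting. Since omp_set_num_threads() expects a positive argument, a non-positive request (such as the current default of 0) is treated as "leave the setting alone".

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Fallback stubs so the sketch builds without OpenMP; they only track a value.
static int g_stub_threads = 1;
inline int omp_get_max_threads() { return g_stub_threads; }
inline void omp_set_num_threads(int n) { g_stub_threads = n; }
#endif

// Hypothetical RAII helper (not part of the library): applies a per-call
// thread count and restores the previous setting when the scope ends.
class scoped_omp_threads {
 public:
  explicit scoped_omp_threads(int requested)
      : previous_(omp_get_max_threads()) {
    // Only positive values are passed to OpenMP; 0 or less means "unset".
    if (requested > 0) omp_set_num_threads(requested);
  }
  ~scoped_omp_threads() { omp_set_num_threads(previous_); }

  scoped_omp_threads(const scoped_omp_threads &) = delete;
  scoped_omp_threads &operator=(const scoped_omp_threads &) = delete;

 private:
  int previous_;  // thread count in effect before this scope
};

// Stand-in for a backend GEMM call: reports the thread count that a new
// parallel region without a num_threads clause would be given.
inline int fake_backend_call() { return omp_get_max_threads(); }

// Runs the stand-in call under the requested thread count.
inline int call_with_threads(int requested) {
  scoped_omp_threads guard(requested);
  return fake_backend_call();
}
```

Restoring the previous value keeps one call's setting from affecting later calls made on the same thread. The setting applies to the calling thread, so this sketch does not cover callers that are already inside a parallel region.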