threadpoolctl limiting API has inconsistent behavior
1. Scope of setting limits is inconsistent
threadpoolctl's API to set limits affects different libraries in different ways:
| Library | Where is the limit set |
| --- | --- |
| OpenMP (libgomp/libomp) | Current thread only |
| OpenBLAS (pthreads) | Process-wide |
| OpenBLAS (Windows) | Process-wide |
| OpenBLAS (OpenMP) | Intended to be process-wide? But currently broken: OpenMathLib/OpenBLAS#5808 |
| MKL (Intel threading) | Process-wide (but both available) |
| BLIS | Process-wide (but both available) |
| FlexiBLAS | Depends on the backend! |
See #207
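To make the scope distinction concrete, here is a toy model in plain Python. It does not call threadpoolctl or any BLAS library; `ProcessWideSetting` and `ThreadLocalSetting` are hypothetical stand-ins for "the library stores one limit per process" and "the library stores one limit per calling thread". The sketch sets a limit from a worker thread and then reads it back from the main thread.

```python
import threading


class ProcessWideSetting:
    """Toy stand-in: the limit is a single value shared by the whole process."""

    def __init__(self, default):
        self._value = default

    def set(self, limit):
        self._value = limit

    def get(self):
        return self._value


class ThreadLocalSetting:
    """Toy stand-in: the limit is stored separately for each calling thread."""

    def __init__(self, default):
        self._default = default
        self._local = threading.local()

    def set(self, limit):
        self._local.value = limit

    def get(self):
        # Threads that never called set() still see the default.
        return getattr(self._local, "value", self._default)


def set_in_worker_then_read_in_main(setting, limit):
    """Set the limit from a worker thread, then read it from the main thread."""
    worker = threading.Thread(target=setting.set, args=(limit,))
    worker.start()
    worker.join()
    return setting.get()


process_wide_seen = set_in_worker_then_read_in_main(ProcessWideSetting(default=8), 2)
thread_local_seen = set_in_worker_then_read_in_main(ThreadLocalSetting(default=8), 2)
print("process-wide:", process_wide_seen)
print("thread-local:", thread_local_seen)
```

In this model the process-wide setting changed by the worker is visible from the main thread, while the thread-local one is not. That is the same split as in the "Where is the limit set" column above.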
2. Semantics of limits are inconsistent
There are also different semantics for limits. Given a limit of L, this limit can be:
- Per-thread: Each thread can have up to L sub-worker threads.
- Process-wide: The process will have a global thread pool with L workers.
| Library | Limit is |
| --- | --- |
| OpenMP (libgomp/libomp) | Per-thread |
| OpenBLAS (pthreads) | Process-wide |
| OpenBLAS (Windows) | Process-wide |
| OpenBLAS (OpenMP) | Per-thread |
| MKL (Intel threading) | Per-thread |
| BLIS | Per-thread, probably |
| FlexiBLAS | Depends on the backend! |
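The practical consequence of the two semantics can be sketched as simple arithmetic. The helper below is an idealized model, not a description of any library's internals. The function name `worst_case_blas_workers` is made up for illustration. It assumes that under per-thread semantics every calling thread uses its full allowance of L at the same time, and it ignores whether the calling thread itself counts toward L, which may differ between libraries.

```python
def worst_case_blas_workers(limit, calling_threads, semantics):
    """Idealized upper bound on threads doing BLAS work at once.

    limit: the value L passed to the limits API.
    calling_threads: number of application threads calling into BLAS concurrently.
    semantics: "per-thread" or "process-wide".
    """
    if semantics == "per-thread":
        # Each calling thread may fan out to up to L workers of its own.
        return limit * calling_threads
    if semantics == "process-wide":
        # All callers share a single pool of L workers.
        return limit
    raise ValueError(f"unknown semantics: {semantics!r}")


per_thread_bound = worst_case_blas_workers(10, 10, "per-thread")
process_wide_bound = worst_case_blas_workers(10, 10, "process-wide")
print("per-thread:", per_thread_bound)
print("process-wide:", process_wide_bound)
```

With the same L and the same number of calling threads, the two semantics differ by a factor equal to the number of callers in this model. Real thread counts need not match these bounds exactly.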
Demonstration
Consider an example where I start a Python thread pool with 10 threads, and then run a BLAS operation in each thread:
    import psutil
    from concurrent.futures import ThreadPoolExecutor
    import numpy as np
    import os
    import threading
    import threadpoolctl
    import sys

    A = np.ones((10_000_000,))
    num_threads = 0
    stop = False


    def count_threads():
        # Poll the process and record the highest thread count seen.
        global num_threads
        process = psutil.Process()
        while not stop:
            num_threads = max(num_threads, process.num_threads())


    def blasop(_ignore):
        # Set the limit again from inside each pool thread.
        threadpoolctl.threadpool_limits(
            limits=int(sys.argv[1]), user_api="blas"
        )
        for i in range(100):
            A.dot(A)


    # Set the limit from the main thread.
    threadpoolctl.threadpool_limits(
        limits=int(sys.argv[1]), user_api="blas"
    )
    threading.Thread(target=count_threads).start()
    POOL = ThreadPoolExecutor(10)
    list(POOL.map(blasop, range(10)))
    stop = True
    print(threadpoolctl.threadpool_info())
    print("Max threads:", num_threads)
With OpenBLAS pthreads:
    $ python nested-blas.py 10
    [{'user_api': 'blas', 'internal_api': 'openblas', 'num_threads': 20, 'prefix': 'libscipy_openblas',
      'version': '0.3.31.188.0', 'threading_layer': 'pthreads', 'architecture': 'Haswell'}]
    Max threads: 23
With MKL (Conda-Forge):
    $ python nested-blas.py 10
    [{'user_api': 'blas', 'internal_api': 'mkl', 'num_threads': 12, 'prefix': 'libmkl_rt',
      'version': '2026.0-Product', 'threading_layer': 'intel'},
     {'user_api': 'openmp', 'internal_api': 'openmp', 'num_threads': 12,
      'prefix': 'libomp', 'version': None}]
    Max threads: 102
That's a very different configuration outcome!
Also, in an earlier version I forgot to set the thread pool limit in each thread, which is necessary because of the different semantics...
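One way to avoid forgetting the per-thread call is to run it from the executor's `initializer`, which `ThreadPoolExecutor` invokes once at the start of each worker thread. The sketch below is a minimal illustration with a stand-in callback rather than a real BLAS limit. `make_pool` and `record_limit_set` are hypothetical names. With threadpoolctl, the callback could be something like `lambda: threadpoolctl.threadpool_limits(limits=1, user_api="blas")`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_pool(max_workers, set_limit):
    """Create a pool whose worker threads each call set_limit() once at startup."""
    return ThreadPoolExecutor(max_workers=max_workers, initializer=set_limit)


# Stand-in for the real limit call: remember which threads were configured.
configured_threads = set()
lock = threading.Lock()


def record_limit_set():
    with lock:
        configured_threads.add(threading.get_ident())


def task(_):
    # Report whether the thread running this task ran the initializer first.
    with lock:
        return threading.get_ident() in configured_threads


with make_pool(4, record_limit_set) as pool:
    results = list(pool.map(task, range(20)))

print("every task ran on a configured thread:", all(results))
```

For libraries where the limit is process-wide, the repeated call in each worker is redundant. For libraries where it applies to the current thread only, it is what makes the limit take effect in the workers.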
From Python process pools to Python thread pools
If the limit is set in a process that only runs Python code from its main thread, all the variations have the same effect. Because process pools have historically been more common than thread pools in Python, the current API has behaved consistently in typical usage.
As Python thread pools become more common, the current state of the API is no longer a good one.
Proposal (big picture)
Notice there are three possible API variants for a library:
- Set the size of a process-wide pool.
- Set the size of a thread-specific pool, across all threads.
- Set the size of a thread-specific pool, for the current thread only.
(The fourth quadrant, thread-local setting for a process-wide pool, makes no sense semantically.)
I do not think the current API is viable going forward; it is too inconsistent. On the other hand, it is in use. I therefore propose:
- Deprecate the current limits API, but leave it with its current semantics.
- Add a new API where the user explicitly states which of the three variants above they are requesting. That is, it should be clear to the user exactly what the outcomes will be.
- Ideally, leave room for other variations.
- Add an API for querying which API variants are available.
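To give a feel for what "explicitly states which variant" could mean, here is a purely hypothetical sketch. None of these names exist in threadpoolctl: `LimitVariant`, `SUPPORTED`, `available_variants` and `request_limit` are invented for illustration. The capability table is a tentative reading of the two tables above for three of the libraries, not verified behavior. The actual API shape is exactly what remains to be designed.

```python
from enum import Enum


class LimitVariant(Enum):
    """The three variants listed above (hypothetical names)."""

    PROCESS_WIDE_POOL = "process-wide pool"
    PER_THREAD_POOL_ALL_THREADS = "thread-specific pool, all threads"
    PER_THREAD_POOL_CURRENT_THREAD = "thread-specific pool, current thread only"


# Hypothetical capability table, loosely transcribed from the tables above.
SUPPORTED = {
    "openmp": {LimitVariant.PER_THREAD_POOL_CURRENT_THREAD},
    "openblas-pthreads": {LimitVariant.PROCESS_WIDE_POOL},
    "mkl": {
        LimitVariant.PER_THREAD_POOL_ALL_THREADS,
        LimitVariant.PER_THREAD_POOL_CURRENT_THREAD,
    },
}


def available_variants(library):
    """Query which variants a library supports (empty set if unknown)."""
    return frozenset(SUPPORTED.get(library, ()))


def request_limit(library, variant, limit):
    """Validate an explicit request; fail loudly instead of silently differing."""
    if variant not in available_variants(library):
        raise ValueError(f"{library} does not support: {variant.value}")
    # A real implementation would call into the library here.
    return {"library": library, "variant": variant, "limit": limit}


request = request_limit("mkl", LimitVariant.PER_THREAD_POOL_CURRENT_THREAD, 4)
print(request["variant"].value)
```

The design point this tries to show is that an unsupported combination raises an error rather than falling back to whatever the library happens to do.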
Next steps
If this big-picture approach is acceptable, the next step would be coming up with use cases (preventing oversaturation in thread pools, process pools, etc.), and then designing a suitable API.