Bug report
While hardening free-threaded Python support in numba-cuda, we found a workload that could wedge when a stress worker repeatedly called gc.collect() in a tight loop while other threads were doing CUDA dispatch work. Removing only the tight manual-GC worker made the workload pass.
I reduced this to a no-third-party CPython reproducer. The reduced case reproduces best on a debug free-threaded CPython main build, where the extra free-threaded GC validation work lengthens the stop-the-world window enough to trigger the stall reliably.
import faulthandler
import gc
import importlib
import os
import sys
import threading
import time

# Watchdog: dump all thread tracebacks and exit if still running after 25 s.
faulthandler.dump_traceback_later(25, exit=True)

# Import a batch of stdlib modules up front; this gives the collector
# more objects to visit.
modules = (
    "abc", "argparse", "collections", "contextlib",
    "decimal", "enum", "functools", "heapq",
    "importlib", "inspect", "itertools", "json",
    "math", "operator", "random", "re",
    "statistics", "threading", "types", "weakref",
)
for name in modules:
    importlib.import_module(name)

# Record the build configuration so reports are comparable.
print("python", sys.version, flush=True)
print("abiflags", getattr(sys, "abiflags", ""), flush=True)
print("gil", getattr(sys, "_is_gil_enabled", lambda: None)(), flush=True)
print("debug", hasattr(sys, "gettotalrefcount"), flush=True)
print("pid", os.getpid(), flush=True)

stop = threading.Event()
count = 0


def collect():
    # Tight manual-GC loop: each call requests a stop-the-world pause.
    global count
    while not stop.is_set():
        gc.collect()
        count += 1


thread = threading.Thread(target=collect, name="gc-collector")
started = time.monotonic()
thread.start()
time.sleep(5.0)  # the main thread is detached while sleeping
stop.set()
thread.join(10.0)
elapsed = time.monotonic() - started
print(f"alive={thread.is_alive()} count={count} elapsed={elapsed:.6f}",
      flush=True)
raise SystemExit(1 if thread.is_alive() else 0)
Build used for the reproducer:
./configure --with-pydebug --disable-gil --without-ensurepip
make -j
Observed on CPython main debug/free-threaded:
Python 3.16.0a0 free-threading build
abiflags td
GIL disabled
debug True
I reproduced this on a B200 Linux system with 224 CPUs at the host level. I have not yet reproduced it on a smaller local workstation. The test process was running under a Slurm allocation that reported 28 CPUs available to the process and 224 CPUs on the system.
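Since the allocation's CPU count differs from the host's, it may help to record both alongside the reproducer output. The following is a minimal sketch, not part of the original reproducer. The helper name `cpu_visibility` is hypothetical, and the sketch assumes the allocation is enforced through CPU affinity, which `os.sched_getaffinity` reports on Linux.

```python
import os


def cpu_visibility():
    """Return (host_cpus, usable_cpus) for the current process."""
    # CPUs reported for the system; may be None if it cannot be determined.
    host = os.cpu_count()
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPUs this process is allowed to run on.
        usable = len(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without affinity support (assumption:
        # treat every reported CPU as usable, or 1 if the count is unknown).
        usable = host or 1
    return host, usable


host, usable = cpu_visibility()
print(f"host_cpus={host} usable_cpus={usable}", flush=True)
```

If the allocation is enforced some other way (for example, a cgroup CPU quota without an affinity mask), the two numbers can be identical even though the process is restricted.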
The script regularly times out via the faulthandler watchdog. The Python-level traceback shows the GC worker in gc.collect() and the main thread apparently still at time.sleep(). A native stack shows the more precise state:
gc-collector:
  validate_refcounts() / validate_gc_objects()
  gc_visit_heaps()
  deduce_unreachable_heap()
  gc_collect_internal()
  _PyGC_Collect()
  gc.collect()

main thread:
  _PyParkingLot_Park()
  tstate_wait_attach()
  _PyThreadState_Attach()
  PyEval_RestoreThread()
  pysleep()
  time.sleep()
At the time of the stall, GDB showed the interpreter stop-the-world state as:
requested = true
world_stopped = true
thread_countdown = 0
requester = GC worker thread state
main thread state = _Py_THREAD_SUSPENDED
The suspected issue is that after start_the_world() moves suspended threads back to detached and unparks them, a thread running a tight gc.collect() loop can immediately request another stop-the-world pause and re-suspend a just-unparked thread before that thread gets to attach. The result is starvation of a thread trying to return from a detached operation such as time.sleep().
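The suspected interleaving can be replayed with a small deterministic model. This is only an illustrative sketch of the state transitions described above, not CPython's implementation: the `State` enum, the function `rounds_until_attach`, and the per-round "who runs first" schedule are all hypothetical names introduced here.

```python
from enum import Enum


class State(Enum):
    ATTACHED = "attached"
    DETACHED = "detached"
    SUSPENDED = "suspended"


def rounds_until_attach(waiter_wins_race):
    """Return the round in which the waiter attaches, or None if it never does.

    waiter_wins_race: iterable of bools, one per start_the_world() call.
    True means the just-unparked waiter runs before the next stop-the-world
    request; False means the requester gets there first.
    """
    state = State.SUSPENDED  # waiter is parked during a stop-the-world pause
    for round_no, waiter_first in enumerate(waiter_wins_race, start=1):
        # start_the_world(): suspended -> detached, then the waiter is unparked.
        state = State.DETACHED
        if waiter_first:
            # The waiter observes "detached" and attaches successfully.
            state = State.ATTACHED
            return round_no
        # The next stop-the-world request lands first: detached -> suspended.
        # When the waiter finally runs it sees "suspended" and parks again.
        state = State.SUSPENDED
    return None


print(rounds_until_attach([False, False, True]))  # 3
print(rounds_until_attach([False] * 1000))        # None
```

In this model nothing bounds how many consecutive rounds the requester can win, which corresponds to the starvation seen in the reproducer: the GC worker keeps making progress while the main thread never returns from `time.sleep()`.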
This was not reproduced in my minimized matrix on:
- a non-debug free-threaded CPython main build
- conda-forge Python 3.14.6t
However, the original larger stress workload was first encountered while testing Python 3.14t free-threaded package support, and the reduced CPython main debug build gives a clean way to expose the progress bug.
I have a local CPython patch that tracks thread states waiting to attach after a stop-the-world suspension and prevents a subsequent stop-the-world requester from immediately re-suspending those attach waiters. With that patch:
- the minimized reproducer passes
- the broader pure-Python GC stress matrix passes
- test_free_threading.test_gc passes under debug and non-debug free-threaded builds
- test_free_threading.test_gc test_threading -v passes under the debug free-threaded build
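The idea behind the patch can be sketched with a toy model that compares an adversarial schedule with and without attach-waiter tracking. This is not the patch itself. The `World` class, its method names, and the choice to have the requester defer its pause while attach waiters exist are assumptions made for illustration; the actual patch may prevent re-suspension differently.

```python
class World:
    """Toy stop-the-world state for one waiter thread (hypothetical model)."""

    def __init__(self, protect_attach_waiters):
        self.protect = protect_attach_waiters
        self.state = "suspended"      # waiter parked during a pause
        self.attach_waiters = set()   # threads unparked but not yet attached

    def start_the_world(self):
        # suspended -> detached, and the waiter is unparked.
        self.state = "detached"
        if self.protect:
            self.attach_waiters.add("waiter")

    def try_stop_the_world(self):
        """Return True if the pause began, False if it was deferred."""
        if self.protect and self.attach_waiters:
            return False  # assumed policy: let attach waiters attach first
        if self.state == "detached":
            self.state = "suspended"
        return True

    def waiter_runs(self):
        # The waiter can only attach if it still observes "detached".
        if self.state == "detached":
            self.state = "attached"
            self.attach_waiters.discard("waiter")


def rounds_until_waiter_attaches(protect, rounds=1000):
    world = World(protect)
    for round_no in range(1, rounds + 1):
        world.start_the_world()
        # Adversarial schedule: the requester always runs before the waiter.
        world.try_stop_the_world()
        world.waiter_runs()
        if world.state == "attached":
            return round_no
    return None


print(rounds_until_waiter_attaches(protect=False))  # None
print(rounds_until_waiter_attaches(protect=True))   # 1
```

Even when the requester always wins the scheduling race, the tracked waiter attaches in the first round once the requester is not allowed to re-suspend it immediately.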
I will prepare a PR with the fix and regression test once this issue exists.
CPython versions tested on:
- 3.14
- 3.16
- CPython main branch
Operating systems tested on:
- Linux