Free-threaded CPython can starve a thread reattaching during repeated stop-the-world GC #151518

@tpn

Bug report

While hardening free-threaded Python support in numba-cuda, we found a workload that could wedge when a stress worker repeatedly called gc.collect() in a tight loop while other threads were doing CUDA dispatch work. Removing only the tight manual-GC worker made the workload pass.

I reduced this to a CPython reproducer with no third-party dependencies. It reproduces most reliably on a debug free-threaded build of CPython main, where the extra free-threaded GC validation work makes the stop-the-world window long enough to hit the problem consistently.

import faulthandler
import gc
import importlib
import os
import sys
import threading
import time

faulthandler.dump_traceback_later(25, exit=True)

modules = (
    "abc", "argparse", "collections", "contextlib",
    "decimal", "enum", "functools", "heapq",
    "importlib", "inspect", "itertools", "json",
    "math", "operator", "random", "re",
    "statistics", "threading", "types", "weakref",
)
for name in modules:
    importlib.import_module(name)

print("python", sys.version, flush=True)
print("abiflags", getattr(sys, "abiflags", ""), flush=True)
print("gil", getattr(sys, "_is_gil_enabled", lambda: None)(), flush=True)
print("debug", hasattr(sys, "gettotalrefcount"), flush=True)
print("pid", os.getpid(), flush=True)

stop = threading.Event()
count = 0

def collect():
    global count
    while not stop.is_set():
        gc.collect()
        count += 1

thread = threading.Thread(target=collect, name="gc-collector")
started = time.monotonic()
thread.start()

time.sleep(5.0)
stop.set()
thread.join(10.0)

elapsed = time.monotonic() - started
print(f"alive={thread.is_alive()} count={count} elapsed={elapsed:.6f}",
      flush=True)

raise SystemExit(1 if thread.is_alive() else 0)

Build used for the reproducer:

./configure --with-pydebug --disable-gil --without-ensurepip
make -j

Observed on CPython main debug/free-threaded:

Python 3.16.0a0 free-threading build
abiflags td
GIL disabled
debug True

I reproduced this on a B200 Linux system with 224 CPUs at the host level. The test process ran under a Slurm allocation that reported 28 CPUs available to the process. I have not yet reproduced it on a smaller local workstation.
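For anyone trying to reproduce this on other hardware, it may help to record both the host CPU count and the number of CPUs the process is actually allowed to use, since the two can differ under a scheduler allocation. The sketch below is only an illustration; the helper name cpu_counts is made up for this example, and os.sched_getaffinity is only available on some platforms such as Linux.

```python
import os


def cpu_counts():
    """Return (host_cpus, usable_cpus) for the current process.

    Hypothetical helper for illustration; not part of the reproducer.
    """
    host = os.cpu_count()
    if hasattr(os, "sched_getaffinity"):
        # Number of CPUs this process is permitted to run on (Linux).
        usable = len(os.sched_getaffinity(0))
    else:
        # No affinity API on this platform: fall back to the host count.
        usable = host
    return host, usable


host, usable = cpu_counts()
print(f"host_cpus={host} usable_cpus={usable}", flush=True)
```

Printing these next to the other diagnostics in the reproducer makes it easier to compare runs across machines.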

The script regularly times out via the faulthandler watchdog. The Python-level traceback shows the GC worker in gc.collect() and the main thread apparently still at time.sleep(). A native stack shows the more precise state:

gc-collector:
  validate_refcounts() / validate_gc_objects()
  gc_visit_heaps()
  deduce_unreachable_heap()
  gc_collect_internal()
  _PyGC_Collect()
  gc.collect()

main thread:
  _PyParkingLot_Park()
  tstate_wait_attach()
  _PyThreadState_Attach()
  PyEval_RestoreThread()
  pysleep()
  time.sleep()

At the time of the stall, GDB showed the interpreter stop-the-world state as:

requested = true
world_stopped = true
thread_countdown = 0
requester = GC worker thread state
main thread state = _Py_THREAD_SUSPENDED

The suspected issue is that after start_the_world() moves suspended threads back to detached and unparks them, a thread running a tight gc.collect() loop can immediately request another stop-the-world pause and re-suspend a just-unparked thread before that thread gets to attach. The result is starvation of a thread trying to return from a detached operation such as time.sleep().
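The suspected interleaving can be sketched with a deterministic toy model. This is not CPython's implementation: the class, the string states, and the requester_wins_race flag are all assumptions made for illustration, and the model has only one waiter thread and one requester.

```python
# Toy model only: names and logic are hypothetical, not CPython internals.
DETACHED, SUSPENDED, ATTACHED = "detached", "suspended", "attached"


class ToyWorld:
    """One waiter thread that was suspended by an earlier stop-the-world."""

    def __init__(self):
        self.state = SUSPENDED

    def start_the_world(self):
        # Suspended threads go back to detached and are unparked.
        if self.state == SUSPENDED:
            self.state = DETACHED

    def stop_the_world(self):
        # A detached thread is marked suspended so it cannot attach.
        if self.state == DETACHED:
            self.state = SUSPENDED

    def try_attach(self):
        # Attaching only succeeds from the detached state.
        if self.state == DETACHED:
            self.state = ATTACHED
            return True
        return False  # still suspended: the waiter parks again


def cycles_until_attach(max_cycles, requester_wins_race):
    """Return the cycle in which the waiter attaches, or None if it never does."""
    world = ToyWorld()
    for cycle in range(max_cycles):
        world.start_the_world()
        if requester_wins_race:
            # The next gc.collect() requests a new pause before the waiter runs.
            world.stop_the_world()
        if world.try_attach():
            return cycle
    return None


print(cycles_until_attach(1000, requester_wins_race=True))   # None
print(cycles_until_attach(1000, requester_wins_race=False))  # 0
```

In the model, the waiter never attaches as long as the requester wins the race on every cycle, which matches the observed state of a suspended main thread parked in tstate_wait_attach().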

This was not reproduced in my minimized matrix on:

  • a non-debug free-threaded CPython main build
  • conda-forge Python 3.14.6t

However, the original larger stress workload was first encountered while testing Python 3.14t free-threaded package support, and the reduced CPython main debug build gives a clean way to expose the progress bug.

I have a local CPython patch that tracks thread states waiting to attach after a stop-the-world suspension and prevents a subsequent stop-the-world requester from immediately re-suspending those attach waiters. With that patch:

  • the minimized reproducer passes
  • the broader pure-Python GC stress matrix passes
  • test_free_threading.test_gc passes under debug and non-debug free-threaded builds
  • test_free_threading.test_gc test_threading -v passes under the debug free-threaded build
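The idea behind the patch can be illustrated with a small, self-contained toy model. This is only a guess at the general shape and not the patch itself: the attach_pending flag and every name below are hypothetical, and the model ignores how a thread that has attached is later paused at a safe point.

```python
# Toy model only: names and logic are hypothetical, not the CPython patch.
DETACHED, SUSPENDED, ATTACHED = "detached", "suspended", "attached"


class ToyWorldWithWaiters:
    """One waiter thread, plus a marker for "unparked but not yet attached"."""

    def __init__(self):
        self.state = SUSPENDED        # suspended by an earlier pause
        self.attach_pending = False   # hypothetical attach-waiter marker

    def start_the_world(self):
        if self.state == SUSPENDED:
            self.state = DETACHED
            self.attach_pending = True  # unparked, has not attached yet

    def stop_the_world(self):
        # Do not re-suspend a thread that is still waiting to attach.
        if self.state == DETACHED and not self.attach_pending:
            self.state = SUSPENDED

    def try_attach(self):
        if self.state == DETACHED:
            self.state = ATTACHED
            self.attach_pending = False
            return True
        return False


def cycles_until_attach(max_cycles):
    """Return the cycle in which the waiter attaches, or None if it never does."""
    world = ToyWorldWithWaiters()
    for cycle in range(max_cycles):
        world.start_the_world()
        world.stop_the_world()  # the requester wins the race every time
        if world.try_attach():
            return cycle
    return None


print(cycles_until_attach(1000))  # 0
```

Even when the requester wins the race on every cycle, the waiter in this model attaches on the first cycle because the new pause skips it while it is marked as waiting to attach.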

I will prepare a PR with the fix and regression test once this issue exists.

CPython versions tested on:

  • 3.14
  • 3.16
  • CPython main branch

Operating systems tested on:

  • Linux
