I recently had an issue where some tests I have which were invoking sh with _bg=True were deadlocking.
After investigating, I noticed this was happening inside the sh library, and related to logging inside of an at-fork handler. This is a minimal reproducible example of the issue:
# Run using python3.9 -mpytest --full-trace -o log_cli=true -s -vv --log-cli-level=DEBUG repro.py
import logging
import os
import sh
def test_sh_log_in_child_fork():
logger = logging.getLogger()
os.register_at_fork(after_in_child=lambda: logger.debug("in child"))
procs = [sh.Command("true")(_bg=True) for _ in range(10)]
for proc in procs:
proc.wait()
Hitting this issue does require a bit of scheduling bad luck, I think the background thread needs to be interrupted while the lock is held. My guess is that the log line is being formatted. Then fork() needs to be called from the main thread.
In practice though this seems to not be that rare. In my testing the above reproduction hits it every time, with only 10 iterations. I'm guessing some of the objects being logged take some time to format? I also think instead of using os.register_at_fork() you could shoot yourself in the foot via the preexec_fn argument.
I am not an expert on python internals - but generally calling fork() from a multi-threaded application has some pretty concerning risks. It's realistically not going to be possible to avoid the issue entirely since users will always be able to cause this themselves, but I think the sh library's background thread is making this problem far more likely than it needs to be.
Can this thread avoid logging / other lock-holding-operations?
I think it's also worth noting that at-fork handlers were only added in python3.7 - so they may not be that common yet
I recently had an issue where some tests I have which were invoking
shwith_bg=Truewere deadlocking.After investigating, I noticed this was happening inside the
shlibrary, and related to logging inside of an at-fork handler. This is a minimal reproducible example of the issue:Hitting this issue does require a bit of scheduling bad luck, I think the background thread needs to be interrupted while the lock is held. My guess is that the log line is being formatted. Then
fork()needs to be called from the main thread.In practice though this seems to not be that rare. In my testing the above reproduction hits it every time, with only 10 iterations. I'm guessing some of the objects being logged take some time to format? I also think instead of using
os.register_at_fork()you could shoot yourself in the foot via thepreexec_fnargument.I am not an expert on python internals - but generally calling
fork()from a multi-threaded application has some pretty concerning risks. It's realistically not going to be possible to avoid the issue entirely since users will always be able to cause this themselves, but I think theshlibrary's background thread is making this problem far more likely than it needs to be.Can this thread avoid logging / other lock-holding-operations?
I think it's also worth noting that at-fork handlers were only added in python3.7 - so they may not be that common yet