Race condition in test_traceback_when_child_process_terminates_abruptly #143620

@a12k

Bug report

Bug description:

When I run the test suite on my local build of the main branch, one test occasionally fails. The failure:

0:01:56 load avg: 2.52 [ 39/498] test.test_concurrent_futures.test_interpreter_pool passed
0:01:56 load avg: 2.52 [ 40/498] test.test_concurrent_futures.test_process_pool
test test.test_concurrent_futures.test_process_pool failed -- Traceback (most recent call last):
  File "/Users/a12k/opt/cpython/Lib/test/test_concurrent_futures/test_process_pool.py", line 119, in test_traceback_when_child_process_terminates_abruptly
    self.assertIsInstance(cause, futures.process._RemoteTraceback)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: None is not an instance of <class 'concurrent.futures.process._RemoteTraceback'>
0:02:08 load avg: 2.97 [ 40/498/1] test.test_concurrent_futures.test_process_pool failed (1 failure)
0:02:08 load avg: 2.97 [ 41/498/1] test.test_concurrent_futures.test_shutdown

I tried a few different ways of reproducing this deterministically, mostly by letting the test and the few tests that precede it run in a loop for an hour until one failed (./python.exe -m test -v -F test.test_concurrent_futures.test_process_pool, or ./python -m test -v -F test_concurrent_futures.test_deadlock test_concurrent_futures.test_interpreter_pool test_concurrent_futures.test_process_pool).

I ended up forcing the failure with the following change to Lib/concurrent/futures/process.py (inserted at line 486, right after errors = [], and replacing everything from there up to # Mark pending tasks as failed.):

            if any(fn == os._exit for fn in [w.fn for w in self.pending_work_items.values()]):
                print("~~~ ARTIFICIAL DELAY ~~~")
                for p in list(self.processes.values()):
                    # set exit code to None to simulate it not ready yet
                    object.__setattr__(p, "_exitcode", None)

            for p in self.processes.values():
                if p.exitcode is not None and p.exitcode != 0:
                    errors.append(f"Process {p.pid} terminated abruptly "
                                  f"with exit code {p.exitcode}")
            if errors:
                cause_str = "\n".join(errors)

        if cause_str and any(fn == os._exit for fn in [w.fn for w in self.pending_work_items.values()]):
            print("~~~ ARTIFICIAL DELAY ~~~ Waiting to set __cause__ for 3 seconds")
            def delayed_set_cause():
                import time
                time.sleep(3)
                print("~~~ ARTIFICIAL DELAY COMPLETE ~~~ setting __cause__")
                nonlocal bpe
                bpe.__cause__ = _RemoteTraceback(f"\n'''\n{cause_str}'''")

            # Set cause after delay
            threading.Thread(target=delayed_set_cause, daemon=True).start()
        elif cause_str:
            bpe.__cause__ = _RemoteTraceback(f"\n'''\n{cause_str}'''")

This forces the race condition: each worker's exit code is reset to None to simulate it not being ready yet, and the assignment of __cause__ is delayed. I am not sure all of this was necessary, but it is how I got the test to fail deterministically.
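To illustrate why an exit code that is not yet available leaves __cause__ unset, here is a minimal standalone sketch of the exit-code loop quoted above. The function name build_cause_str and the use of (pid, exitcode) pairs in place of real Process objects are assumptions made for illustration only; this is not the code in process.py.

```python
def build_cause_str(processes):
    """Build the cause string from (pid, exitcode) pairs.

    Hypothetical helper that mirrors the loop quoted in this report:
    a process whose exit code is still None contributes no error line.
    """
    errors = []
    for pid, exitcode in processes:
        if exitcode is not None and exitcode != 0:
            errors.append(f"Process {pid} terminated abruptly "
                          f"with exit code {exitcode}")
    # With no error lines there is nothing to attach as __cause__.
    return "\n".join(errors) if errors else None


# Exit code not collected yet: no cause string, so __cause__ would stay None.
not_ready = build_cause_str([(101, None)])

# Exit code available: the cause string is produced.
ready = build_cause_str([(101, 1)])
```

In this sketch not_ready is None while ready holds the error message, which matches the "None is not an instance of _RemoteTraceback" assertion failure seen in the log.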

I updated the test to account for this race condition by waiting for __cause__ to be populated, and the test now passes. A PR is incoming for review.
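As a rough idea of what waiting for __cause__ could look like, here is a minimal sketch. It is not the code from the PR: the helper wait_for_cause, the FakeRemoteTraceback stand-in, and the timeout values are all hypothetical, and the background thread only imitates the delayed assignment from the reproducer above.

```python
import threading
import time


class FakeRemoteTraceback(Exception):
    """Hypothetical stand-in for concurrent.futures.process._RemoteTraceback."""


def wait_for_cause(exc, timeout=5.0, interval=0.01):
    """Poll exc.__cause__ until it is set or the timeout elapses.

    Returns the cause, which is still None if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while exc.__cause__ is None and time.monotonic() < deadline:
        time.sleep(interval)
    return exc.__cause__


def demo():
    exc = RuntimeError("pool broken")
    go = threading.Event()

    def set_cause_later():
        # Imitates the delayed assignment in the artificial reproducer.
        go.wait()
        time.sleep(0.05)
        exc.__cause__ = FakeRemoteTraceback("Process terminated abruptly")

    threading.Thread(target=set_cause_later, daemon=True).start()
    immediate = exc.__cause__          # read before the cause is assigned
    go.set()
    eventual = wait_for_cause(exc)     # poll until the cause shows up
    return immediate, eventual


immediate, eventual = demo()
```

Here immediate is None, which corresponds to the failing assertion, while eventual is the FakeRemoteTraceback instance once the delayed assignment has happened. A bounded timeout keeps a real failure from hanging the test.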

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS

Labels: tests (Tests in the Lib/test dir), type-bug (An unexpected behavior, bug, or error)