[Bug]: Multi GPU tests failing silently #11614

@greg-kwasniewski1

Issue

Running any multi-GPU test via spawn_multiprocess_job (e.g.
test_sharding[MLP-torch_dist_all_reduce-False-False-2]) reports 1 passed
even though both child processes crash during initialization with:

File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 115, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

The test body never executes — the processes die before reaching
init_and_run_process — yet pytest shows a green result.
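
One way a launcher could surface this kind of failure is to inspect each child's exit code after joining and raise in the parent when any rank did not exit cleanly. The following is a minimal sketch under that assumption, not the actual spawn_multiprocess_job implementation: failed_ranks and run_job_checked are hypothetical names, and the 60-second join timeout is an arbitrary choice.

```python
import multiprocessing as mp
from typing import Callable, Mapping, Optional


def failed_ranks(exit_codes: Mapping[int, Optional[int]]) -> list[int]:
    """Return the ranks whose process did not exit cleanly (exit code != 0)."""
    # None means the process had not finished when its exit code was read,
    # e.g. it was still running after the join timeout.
    return sorted(rank for rank, code in exit_codes.items() if code != 0)


def run_job_checked(
    target: Callable[[int, int], None],
    world_size: int,
    timeout: float = 60.0,
) -> None:
    """Hypothetical launcher: run target(rank, world_size) once per rank
    and raise in the parent if any rank fails."""
    ctx = mp.get_context("spawn")
    procs = {
        rank: ctx.Process(target=target, args=(rank, world_size))
        for rank in range(world_size)
    }
    for proc in procs.values():
        proc.start()
    for proc in procs.values():
        proc.join(timeout)

    # Read exit codes before cleanup so hung ranks are recorded as None.
    exit_codes = {rank: proc.exitcode for rank, proc in procs.items()}

    # Stop any rank that is still running so the test does not leak processes.
    for proc in procs.values():
        if proc.is_alive():
            proc.terminate()
            proc.join()

    bad = failed_ranks(exit_codes)
    if bad:
        raise RuntimeError(f"ranks {bad} failed; exit codes: {exit_codes}")
```

With a check like this, an uncaught exception in a child (including an AssertionError) should show up as a non-zero exit code and turn into a RuntimeError in the pytest process, so the test would be reported as failed instead of passed.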

Steps to reproduce:
Add assert False at the top of the job function of any multi-GPU test (here, _run_sharding_execution_job) and run the test. It is still reported as passed, even though the output log shows the failed assertion.

$ git diff tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py
@@ -417,6 +417,7 @@ def _run_sharding_execution_job(
     rank: int,
     world_size: int,
 ) -> None:
+    assert False
     # init model and input
     batch_size = 4
     sequence_len = 8

$ pytest tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py::test_sharding[MLP-torch_dist_all_reduce-False-False-2]

Output:

============== 1 passed, 4 warnings in 19.89s =======

Labels: AutoDeploy, Scale-out, Testing, bug, feature request

Status: Backlog