Labels
AutoDeploy, Scale-out (multi-GPU and distributed inference scaling, tensor/pipeline/data parallelism), Testing (continuous integration, build system, and testing infrastructure), bug, feature request
Description
🚀 The feature, motivation and pitch
Issue
Running any multi-GPU test via spawn_multiprocess_job (e.g.
test_sharding[MLP-torch_dist_all_reduce-False-False-2]) reports 1 passed
even though both child processes crash during initialization with:
File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 115, in __setstate__
self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
The test body never executes — the processes die before reaching
init_and_run_process — yet pytest shows a green result.
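For context, here is a minimal standalone sketch of the suspected failure mode. This is hypothetical and not the actual `spawn_multiprocess_job` implementation (whose internals are an assumption here): if the parent joins its children but never inspects their exit codes, pytest sees no exception and reports the test as passed.

```python
# Hypothetical minimal reproduction, independent of spawn_multiprocess_job:
# the parent joins its children but never checks their exit codes.
import multiprocessing as mp


def worker(rank: int) -> None:
    raise AssertionError(f"rank {rank} crashed")  # stands in for the child-side crash


def test_silently_green() -> None:
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Each p.exitcode is 1 here, but nothing asserts on it, so pytest
    # reports "1 passed" even though every worker died.
```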
Steps to reproduce:
Add assert False at the top of any multi-GPU test body and run the test. It is still reported as passed, even though the output log shows the failed assertion.
$ git diff tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py
@@ -417,6 +417,7 @@ def _run_sharding_execution_job(
rank: int,
world_size: int,
) -> None:
+ assert False
# init model and input
batch_size = 4
sequence_len = 8
pytest tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py::test_sharding[MLP-torch_dist_all_reduce-False-False-2]
Output:
============== 1 passed, 4 warnings in 19.89s =======
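One possible direction for a fix, sketched under the assumption that the runner drives multiprocessing.Process objects directly (run_job and its signature are hypothetical, not the existing API): have the runner propagate child exit codes so the parent fails the test when any worker dies.

```python
# Hypothetical runner sketch: join all workers, then fail loudly if any
# worker exited with a nonzero code, so pytest marks the test as failed.
import multiprocessing as mp
from typing import Callable


def run_job(target: Callable[[int, int], None], world_size: int) -> None:
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=target, args=(rank, world_size)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    bad = {p.pid: p.exitcode for p in procs if p.exitcode != 0}
    if bad:  # surface child crashes instead of returning silently
        raise RuntimeError(f"worker process(es) failed with exit codes: {bad}")
```

Alternatively, torch.multiprocessing.spawn(fn, nprocs=world_size, join=True) provides this behavior out of the box: it raises in the parent process when any child exits abnormally.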
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.