Script:
#!/usr/bin/env bash
set -e
# =========================
# Parameters
# =========================
CKPT_PATH="/root/paddlejob/workspace/env_run/model/MOVA"
REF_PATH="/root/paddlejob/workspace/env_run/data/evaldata_1106/5s_video_images/1.png"
OUTPUT_PATH="./output.mp4"
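# Prompt in English: "A young woman sits on an ornate throne, gracefully playing the pipa in her hands, with purple lightning flashing in the background. The camera captures her fingertips plucking the strings, showing the sound of the pipa gradually rising."
PROMPT="年轻女子坐在华丽王座上,优雅弹奏手中的琵琶,背景紫色闪电闪耀。镜头捕捉她的指尖轻拨琴弦,展示琵琶声逐渐响起的动态过程。"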
# =========================
# Launch (single GPU)
# =========================
torchrun \
  --nproc_per_node=8 \
  scripts/inference_single.py \
  --ckpt_path "${CKPT_PATH}" \
  --prompt "${PROMPT}" \
  --ref_path "${REF_PATH}" \
  --output_path "${OUTPUT_PATH}" \
  --cp_size 8
Machine: 8 × A800
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Error:
E0130 12:37:45.469000 45127 site-packages/torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: -11) local_rank: 5 (pid: 45270) of binary: /root/miniconda3/envs/mova/bin/python3.13
Traceback (most recent call last):
File "/root/miniconda3/envs/mova/bin/torchrun", line 7, in <module>
sys.exit(main())
~~~~^^
File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/run.py", line 991, in main
run(args)
~~~^^^^^^
File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/run.py", line 982, in run
elastic_launch(
~~~~~~~~~~~~~~~
config=config,
~~~~~~~~~~~~~~
entrypoint=cmd,
~~~~~~~~~~~~~~~
)(*cmd_args)
~^^^^^^^^^^^
File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
raise ChildFailedError(
...<2 lines>...
)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/inference_single.py FAILED
Failures:
[1]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 45265)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45265
[2]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 45266)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45266
[3]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 2 (local_rank: 2)
exitcode : -11 (pid: 45267)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45267
[4]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 3 (local_rank: 3)
exitcode : -11 (pid: 45268)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45268
[5]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 4 (local_rank: 4)
exitcode : -11 (pid: 45269)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45269
[6]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 6 (local_rank: 6)
exitcode : -11 (pid: 45271)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45271
[7]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 7 (local_rank: 7)
exitcode : -11 (pid: 45272)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45272
Root Cause (first observed failure):
[0]:
time : 2026-01-30_12:37:43
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 5 (local_rank: 5)
exitcode : -11 (pid: 45270)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45270
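The log above only records that each rank received SIGSEGV; it does not show where the crash happened. A minimal sketch of environment settings that might surface more detail on a rerun, assuming the same script and arguments as above (these are standard Python, PyTorch, and NCCL environment variables, not options of this repository):

```shell
#!/usr/bin/env bash
# Diagnostic settings for a rerun; nothing here is specific to this repository.
export PYTHONFAULTHANDLER=1          # Python dumps its traceback when a fatal signal such as SIGSEGV arrives
export TORCH_SHOW_CPP_STACKTRACES=1  # PyTorch includes C++ stack traces in its error messages
export NCCL_DEBUG=INFO               # NCCL logs its initialization and communicator setup

# After exporting these, rerun the same torchrun command from the script above.
```

With `PYTHONFAULTHANDLER` set, each crashing rank should print the Python frame that was active at the time of the segfault, which would help tell whether the fault occurs during import, model loading, or the first collective call.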