Immediate crash on A800 GPUs #16

@artless-spirit

Script:
#!/usr/bin/env bash
set -e

# =========================
# Parameters
# =========================

CKPT_PATH="/root/paddlejob/workspace/env_run/model/MOVA"
REF_PATH="/root/paddlejob/workspace/env_run/data/evaldata_1106/5s_video_images/1.png"
OUTPUT_PATH="./output.mp4"

PROMPT="年轻女子坐在华丽王座上,优雅弹奏手中的琵琶,背景紫色闪电闪耀。镜头捕捉她的指尖轻拨琴弦,展示琵琶声逐渐响起的动态过程。"

# =========================
# Launch (single card)
# =========================

torchrun \
  --nproc_per_node=8 \
  scripts/inference_single.py \
  --ckpt_path "${CKPT_PATH}" \
  --prompt "${PROMPT}" \
  --ref_path "${REF_PATH}" \
  --output_path "${OUTPUT_PATH}" \
  --cp_size 8

Machine: A800 × 8
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Error:
E0130 12:37:45.469000 45127 site-packages/torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: -11) local_rank: 5 (pid: 45270) of binary: /root/miniconda3/envs/mova/bin/python3.13
Traceback (most recent call last):
  File "/root/miniconda3/envs/mova/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ~~~~^^
  File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/run.py", line 991, in main
    run(args)
    ~~~^^^^^^
  File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/run.py", line 982, in run
    elastic_launch(
    ~~~~~~~~~~~~~~~
        config=config,
        ~~~~~~~~~~~~~~
        entrypoint=cmd,
        ~~~~~~~~~~~~~~~
    )(*cmd_args)
    ~^^^^^^^^^^^
  File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/mova/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
    raise ChildFailedError(
    ...<2 lines>...
    )
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference_single.py FAILED

Failures:
[1]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 45265)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45265
[2]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 45266)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45266
[3]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 2 (local_rank: 2)
exitcode : -11 (pid: 45267)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45267
[4]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 3 (local_rank: 3)
exitcode : -11 (pid: 45268)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45268
[5]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 4 (local_rank: 4)
exitcode : -11 (pid: 45269)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45269
[6]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 6 (local_rank: 6)
exitcode : -11 (pid: 45271)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45271
[7]:
time : 2026-01-30_12:37:45
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 7 (local_rank: 7)
exitcode : -11 (pid: 45272)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45272

Root Cause (first observed failure):
[0]:
time : 2026-01-30_12:37:43
host : yq02-bcc-sci-a800-25556-010.bcc-yq02.baidu.com
rank : 5 (local_rank: 5)
exitcode : -11 (pid: 45270)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 45270
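
The log above only says that every rank received SIGSEGV; it does not show where in the Python code each worker was when it crashed. One possible way to narrow this down is Python's standard `faulthandler` module, which dumps the Python stack when a fatal signal arrives. The following is a minimal sketch, not part of the repository: the helper name `enable_crash_traces` and the `crash_logs` directory are hypothetical, and it assumes the helper would be called at the very top of `scripts/inference_single.py`. It relies on `RANK`, which torchrun sets for each worker.

```python
import faulthandler
import os
from pathlib import Path


def enable_crash_traces(log_dir="crash_logs"):
    """Write the Python stack of all threads to a per-rank file on a fatal signal."""
    # torchrun exports RANK for each worker; fall back to "0" when run directly.
    rank = os.environ.get("RANK", "0")
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    log_file = open(Path(log_dir) / f"rank{rank}.log", "w")
    # faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file open: faulthandler writes to its file descriptor.
    return log_file
```

Calling `crash_log = enable_crash_traces()` before any other import in the inference script and keeping `crash_log` alive for the lifetime of the process should leave one `rank<N>.log` per worker. Since rank 5 is reported as the first observed failure, its file would be the first one to read.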
