Sudden Termination During sp_vae Decode on 8*Ascend 910B #71

@BlossomsGarden

Environment

  • Hardware: 8x Ascend NPU 910B (64G each)
  • Settings: sp_size=8, tile=384

Script

export TORCHDYNAMO_DISABLE=1
export PYTORCH_TRITON_DISABLE=1
export ALBUMENTATIONS_DISABLE_VERSION_CHECK=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True  # for NPU
export HCCL_CONNECT_TIMEOUT=600

torchrun --standalone --nproc_per_node 8 \
    scripts/inference_magicdrive.py \
    configs/magicdrive/test/17-16x848x1600_stdit3_CogVAE_boxTDS_wCT_xCE_wSST_map0_fsp8_cfg2.0.py \
    --cfg-options model.from_pretrained=./ckpts/MagicDriveDiT-stage3-40k-ft/ema.pt

Issue Description
During VAE decoding, the script terminates abruptly with no error message or stack trace. The processes simply exit, which makes the failure very hard to debug :(
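When a process disappears without a traceback, the exit status is often the only clue left. The sketch below is a hypothetical helper (describe_exit is not part of the repository) that turns a return code, as reported by Python's subprocess module or by a shell, into a readable cause. It assumes a Linux host; the 128 + N rule is a shell convention and only a heuristic.

```python
import signal


def describe_exit(returncode):
    """Translate a process return code into a short human-readable cause.

    Hypothetical helper for illustration; not part of the project.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # Python's subprocess reports death by signal N as -N.
        sig = -returncode
    elif returncode > 128:
        # Shells conventionally report death by signal N as 128 + N (heuristic).
        sig = returncode - 128
    else:
        return f"exited with error code {returncode}"
    try:
        name = signal.Signals(sig).name
    except ValueError:
        # Not a known signal number on this platform.
        name = f"signal {sig}"
    return f"killed by {name}"
```

For example, a status that maps to SIGKILL would be consistent with the kernel's out-of-memory killer ending the process, while SIGSEGV or SIGABRT would point to a crash inside native code. Neither reading is confirmed for this issue.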

I traced the issue starting from the sp_vae() function in inference_magicdrive.py and narrowed it down to the AutoencoderKLCogVideoX.tiled_decode() method in vae_cogvideox.py, specifically the line: tile = self.decoder(tile).

The program crashes at this call without logging any details.
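A minimal sketch of how more information could be captured around that call, using only the Python standard library. The helper names (enable_crash_dumps, log_peak_rss) and the crash_logs directory are hypothetical; RANK is the environment variable that torchrun sets for each worker. faulthandler can dump a Python traceback on fatal signals such as SIGSEGV or SIGABRT, but not on SIGKILL, so the memory log line printed just before the decoder call would be the last trace in that case.

```python
import faulthandler
import os
import resource
import sys


def enable_crash_dumps(log_dir="crash_logs"):
    """Dump Python tracebacks to a per-rank file when a fatal signal arrives.

    Hypothetical helper; RANK is set by torchrun and defaults to "0" here.
    """
    os.makedirs(log_dir, exist_ok=True)
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, f"rank{rank}.log")
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


def log_peak_rss(tag, stream=sys.stderr):
    """Print this process's peak host memory; ru_maxrss is in KiB on Linux."""
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # flush=True so the line survives even if the process is killed right after.
    print(f"[{tag}] peak RSS: {peak_kib / 1024:.1f} MiB", file=stream, flush=True)
    return peak_kib
```

One possible use is to call enable_crash_dumps() once at the top of the inference script and log_peak_rss("before decoder") immediately before tile = self.decoder(tile). This only tracks host memory, not NPU memory, so it would help mainly if the host is running out of RAM.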

Has anyone encountered a similar issue, or does anyone have ideas about what might be causing this sudden termination? Any suggestions for debugging or potential fixes would be appreciated!
