[BUG] Cosmos3 on ARM64 GB300: Nano Invalid Outputs, Super Stalls Before Generation #42

@nkumarrai-nv

Bug Description

Cosmos3-Nano

I can build and run NVIDIA/cosmos-framework on an ARM64 GB300 system, but Cosmos3-Nano inference completes with invalid outputs.

Text-to-image writes a valid JPEG, but the image is gray/noisy texture instead of matching the prompt. The reasoner sample also completes and writes output files, but the generated text is invalid/gibberish.

Across the successful-but-invalid runs, I observed this warning:

Failed to initialize the CUTLASS kernel. Last CUDA error is: no error

This looks like an ARM64/GB300 attention backend compatibility issue rather than a command-line usage issue.
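To check how often the warning appears in a given run, the logs can be scanned for it. The following is a minimal sketch, not part of the repo: the helper names are hypothetical, and the marker strings are copied from the messages quoted in this issue.

```python
from pathlib import Path

# Marker strings copied from the messages quoted in this issue.
MARKERS = (
    "Failed to initialize the CUTLASS kernel",
    "Could not find a compatible Attention backend",
)


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker substring."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def count_markers_in_file(path, markers=MARKERS):
    """Hypothetical convenience wrapper: scan one log file line by line."""
    with Path(path).open(errors="replace") as handle:
        return count_markers(handle, markers)
```

For example, `count_markers_in_file("console_nano.log")` would report the counts for the attached Nano console log, assuming the file is in the current directory.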

Cosmos3-Super

As a separate follow-up, I ran the repo's text-to-image sample unchanged with Cosmos3-Super on the same ARM64 GB300 host. That run did not reach generation: it initialized the tokenizers and model, allocated approximately 126630 MiB (about 124 GiB) of GPU memory, and then made no further log progress before I stopped it to free the GPU. The Cosmos3-Super run produced no output image and no CUTLASS warning.

Common Setup

Repo:

git clone https://github.com/NVIDIA/cosmos-framework.git
cd cosmos-framework
git checkout 82f8229
docker build --network=host -t cosmos-framework:arm64 .

Cosmos3-Nano Reproduction

Run the official Cosmos 3 text-to-image sample:

docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  --no-guardrails \
  -i inputs/omni/t2i.json \
  -o outputs/cosmos3_t2i_test \
  --checkpoint-path Cosmos3-Nano \
  --seed=0 \
  --benchmark

I also tried:

--no-use-torch-compile

The run still completed but produced the same gray/noisy output.

I also tried forcing the FlashAttention path, but that failed with:

ValueError: Could not find a compatible Attention backend for this use case / device.

Cosmos3-Super Reproduction

I also ran the same official text-to-image sample unchanged with Cosmos3-Super:

docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  --no-guardrails \
  -i inputs/omni/t2i.json \
  -o outputs/cosmos3_super_t2i_test \
  --checkpoint-path Cosmos3-Super \
  --seed=0 \
  --benchmark

That Cosmos3-Super run reached:

Time spent on OmniMoTModel: set_up_model: 8.19 s

Then it stalled before generation. It allocated approximately 126630 MiB of GPU memory, stayed CPU-active at roughly one core with low GPU utilization, did not update logs further, and did not write vision.jpg or benchmark.json.
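The stall symptoms above (constant GPU memory, low GPU utilization) can be recorded with a small polling script. This is a rough sketch under assumptions: it relies on `nvidia-smi` being on the PATH and reporting plain integers for `memory.used` and `utilization.gpu`, and the 5% utilization threshold and function names are illustrative choices, not anything defined by the repo.

```python
import subprocess
import time


def parse_gpu_sample(line: str) -> tuple[int, int]:
    """Parse one 'memory.used, utilization.gpu' CSV row into (MiB, percent)."""
    memory_mib, utilization = (field.strip() for field in line.split(","))
    return int(memory_mib), int(utilization)


def sample_gpus() -> list[tuple[int, int]]:
    """Query nvidia-smi once and return one (MiB, percent) pair per GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_sample(line) for line in output.splitlines() if line.strip()]


def looks_stalled(samples, max_utilization=5):
    """True if memory never changed and utilization stayed at or below the threshold."""
    if len(samples) < 2:
        return False
    memory_values = {memory for memory, _ in samples}
    return len(memory_values) == 1 and all(
        util <= max_utilization for _, util in samples
    )


def watch_first_gpu(samples_wanted=30, interval_s=10.0):
    """Poll GPU 0 at a fixed interval, print each sample, and report the verdict."""
    history = []
    for _ in range(samples_wanted):
        history.append(sample_gpus()[0])
        print(time.strftime("%H:%M:%S"), history[-1])
        time.sleep(interval_s)
    return looks_stalled(history)
```

Running `watch_first_gpu()` on the host while the container is active gives a timestamped record that can be attached alongside the logs.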

Reproducibility:

  • Cosmos3-Nano: always produces invalid/noisy output on this stack.
  • Cosmos3-Super: observed pre-generation stall on this stack.

Expected vs. Actual Behavior

| Model | Expected | Actual |
| --- | --- | --- |
| Cosmos3-Nano | Text-to-image output should match the prompt; reasoner should emit valid text. | T2I writes a valid JPEG, but it is gray/noisy texture. The reasoner writes invalid/gibberish text. Logs show CUTLASS kernel initialization warnings. |
| Cosmos3-Super | Same unchanged T2I sample should reach generation and write vision.jpg. | Stalls after model setup and does not write an output image or benchmark file. |

Outputs

Cosmos3-Nano Error / Warning

Failed to initialize the CUTLASS kernel. Last CUDA error is: no error

Cosmos3-Nano Observed Output

The T2I output file is a valid 960x960 RGB JPEG, but visually appears as gray/noisy texture rather than the requested scene.
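"Gray/noisy" can be made a little more objective with per-channel statistics. The sketch below is only an illustration: it assumes Pillow is installed for loading the file, the function names are hypothetical, and it prints no verdict because any threshold for "gray" would be a guess.

```python
import statistics


def summarize_rgb(raw: bytes) -> dict:
    """Summarize interleaved RGB bytes (R, G, B, R, G, B, ...).

    Returns per-channel mean and population standard deviation, plus the
    mean per-pixel spread between the largest and smallest channel value.
    A colorless (gray) image has a spread close to 0.
    """
    if not raw or len(raw) % 3 != 0:
        raise ValueError("expected a non-empty multiple of 3 bytes")
    channels = {name: raw[index::3] for index, name in enumerate("RGB")}
    return {
        "mean": {name: statistics.fmean(values) for name, values in channels.items()},
        "stdev": {name: statistics.pstdev(values) for name, values in channels.items()},
        "channel_spread": statistics.fmean(
            max(pixel) - min(pixel) for pixel in zip(*channels.values())
        ),
    }


def load_rgb_bytes(path) -> bytes:
    """Load an image file as raw RGB bytes. Assumes Pillow is available."""
    from PIL import Image  # imported lazily so summarize_rgb works without Pillow

    with Image.open(path) as image:
        return image.convert("RGB").tobytes()
```

For instance, `summarize_rgb(load_rgb_bytes("outputs/cosmos3_t2i_test/vision.jpg"))` would summarize the Nano output, assuming the image is written under that output directory with the `vision.jpg` name mentioned above. Channel means near mid-gray combined with a small channel spread would be consistent with the gray texture described here.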

Cosmos3-Super Follow-Up

The Cosmos3-Super run did not produce an image. Last real log line:

[06-14 17:17:35|job=|INFO|cosmos_framework/utils/timer.py:138:_log] Time spent on OmniMoTModel: set_up_model: 8.19 s

Only log files were written:

console.log
debug.log
host_run.log

No vision.jpg, no benchmark.json, and no observed CUTLASS/NATTEN warning before manual cleanup.

System Information

| Field | Value |
| --- | --- |
| Environment | Docker, image built from repo Dockerfile |
| Hardware | Single NVIDIA GB300 |
| Architecture | aarch64 / ARM64 |
| GPU Driver | 610.43.02 |
| Container PyTorch | 2.10.0+cu130 |
| CUDA | CUDA 13 stack from container |
| Package Version / Commit | 82f8229 |
| Model | Cosmos3-Nano; follow-up also tested Cosmos3-Super |
| Observed NATTEN | 0.21.6.dev6 |
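The values above can be collected from inside the container with a short script, which makes it easier to compare stacks across machines. This is a sketch under assumptions: the distribution names `torch`, `natten`, and `flash-attn` are assumed to be how these packages are installed in the image, and the function names are hypothetical.

```python
import importlib.metadata as metadata
import platform

# Assumed distribution names; adjust if the image installs them differently.
PACKAGES = ("torch", "natten", "flash-attn")


def environment_report(packages=PACKAGES) -> dict:
    """Collect the CPU architecture, Python version, and installed package versions."""
    report = {
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def cuda_details() -> dict:
    """Report GPU details through PyTorch, or an empty dict if torch is missing."""
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {"cuda_available": False}
    major, minor = torch.cuda.get_device_capability(0)
    return {
        "cuda_available": True,
        "device": torch.cuda.get_device_name(0),
        "compute_capability": f"{major}.{minor}",
        "torch_cuda": torch.version.cuda,
    }


if __name__ == "__main__":
    for key, value in {**environment_report(), **cuda_details()}.items():
        print(f"{key}: {value}")
```

The compute capability reported by `cuda_details()` is useful here because kernel support in prebuilt wheels is typically tied to specific GPU architectures.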

Additional Context

The Cosmos3-Nano commands launch successfully and produce output files, so this is not a startup/download failure. The suspicious part is the attention backend path on ARM64 GB300/Blackwell. The source appears to contain Blackwell-specific NATTEN/CUTLASS handling, but the available ARM64 NATTEN wheel may not match the support level needed by this path.

For Cosmos3-Super, the unchanged text-to-image sample did not reach the point where I could evaluate output quality. It appears to be a separate pre-generation stall on the same ARM64 GB300 stack.

Please let me know if there is a recommended ARM64/GB300 dependency stack or NATTEN wheel version for Cosmos3-Nano and Cosmos3-Super inference.

Image

debug_super.log
console_super.log
debug_nano.log
console_nano.log
