[BUG] Cosmos3 on ARM64 GB300: Nano Invalid Outputs, Super Stalls Before Generation

## Bug Description

### Cosmos3-Nano

I can build and run `NVIDIA/cosmos-framework` on an ARM64 GB300 system, but `Cosmos3-Nano` inference completes with invalid outputs.

Text-to-image writes a valid JPEG, but the image shows a gray, noisy texture instead of matching the prompt. The reasoner sample also completes and writes output files, but the generated text is gibberish.

Across the successful-but-invalid runs, I observed this warning:

```text
Failed to initialize the CUTLASS kernel. Last CUDA error is: no error
```

This looks like an ARM64/GB300 attention backend compatibility issue rather than a command-line usage issue.
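
To confirm which runs emit the warning and how often, the log files can be scanned for it. The sketch below is a hypothetical helper, not part of `cosmos-framework`; it assumes the logs are plain-text `*.log` files somewhere under the directory passed in.

```python
from pathlib import Path

# Warning text copied from the run logs; the helper itself is hypothetical.
CUTLASS_WARNING = "Failed to initialize the CUTLASS kernel"


def count_warning(log_dir, needle=CUTLASS_WARNING):
    """Return {log path: occurrence count} for logs that contain `needle`."""
    counts = {}
    for path in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log has stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = text.count(needle)
        if hits:
            counts[str(path)] = hits
    return counts


if __name__ == "__main__":
    import sys

    for log_path, hits in count_warning(sys.argv[1]).items():
        print(f"{hits:4d}  {log_path}")
```

An empty result for a run means the warning never appeared in its logs, which is what I observed for `Cosmos3-Super`.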

### Cosmos3-Super

As a separate follow-up, I ran the same repo text-to-image sample unchanged with `Cosmos3-Super` on the same local ARM64 GB300 host. That run did not reach generation: it initialized the tokenizers and model, allocated roughly `126630 MiB` (about 124 GiB) of GPU memory, then logged nothing further until I stopped it to free the GPU. The `Cosmos3-Super` run produced no output image and no CUTLASS warning.

## Common Setup

Repo:

```bash
git clone https://github.com/NVIDIA/cosmos-framework.git
cd cosmos-framework
git checkout 82f8229
docker build --network=host -t cosmos-framework:arm64 .
```

## Cosmos3-Nano Reproduction

Run the official Cosmos 3 text-to-image sample:

```bash
docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  --no-guardrails \
  -i inputs/omni/t2i.json \
  -o outputs/cosmos3_t2i_test \
  --checkpoint-path Cosmos3-Nano \
  --seed=0 \
  --benchmark
```

I also tried adding this flag to the command above:

```bash
--no-use-torch-compile
```

The run still completed but produced the same gray/noisy output.

I also tried forcing the FlashAttention path, but that failed with:

```text
ValueError: Could not find a compatible Attention backend for this use case / device.
```

## Cosmos3-Super Reproduction

I ran the same official text-to-image sample with `Cosmos3-Super`, changing only `--checkpoint-path` and the output directory:

```bash
docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
  --parallelism-preset=latency \
  --no-guardrails \
  -i inputs/omni/t2i.json \
  -o outputs/cosmos3_super_t2i_test \
  --checkpoint-path Cosmos3-Super \
  --seed=0 \
  --benchmark
```

That `Cosmos3-Super` run reached:

```text
Time spent on OmniMoTModel: set_up_model: 8.19 s
```

Then it stalled before generation. It allocated approximately `126630 MiB` of GPU memory, stayed CPU-active at roughly one core with low GPU utilization, did not update logs further, and did not write `vision.jpg` or `benchmark.json`.
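
To locate where the process is stuck, Python's standard `faulthandler` module can dump all thread stacks at a fixed interval. This is a minimal sketch: `enable_stall_traces` and `disable_stall_traces` are hypothetical helpers, and where to call them in the inference entry point is an assumption, since I have not checked whether the framework offers a hook for this.

```python
import faulthandler
import sys


def enable_stall_traces(timeout_s=300.0, stream=sys.stderr):
    """Dump every thread's Python stack to `stream` every `timeout_s` seconds.

    `stream` must be backed by a real file descriptor, because faulthandler
    writes to it directly. Dumps repeat until disable_stall_traces() runs.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_stall_traces():
    """Cancel the periodic dump once generation is progressing normally."""
    faulthandler.cancel_dump_traceback_later()
```

The dump only shows Python frames, so a hang inside a native kernel would appear as the Python call that entered it. That should still be enough to tell a stuck attention or compile step apart from a data-loading wait.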

**Reproducibility:**

- [x] `Cosmos3-Nano`: always produces invalid/noisy output on this stack.
- [x] `Cosmos3-Super`: observed pre-generation stall on this stack.

## Expected vs. Actual Behavior

| Model | Expected | Actual |
| ----- | -------- | ------ |
| `Cosmos3-Nano` | Text-to-image output should match the prompt; reasoner should emit valid text. | T2I writes a valid JPEG, but it is gray/noisy texture. The reasoner writes invalid/gibberish text. Logs show CUTLASS kernel initialization warnings. |
| `Cosmos3-Super` | Same unchanged T2I sample should reach generation and write `vision.jpg`. | Stalls after model setup and does not write an output image or benchmark file. |

## Outputs

<details>
<summary>Cosmos3-Nano Error / Warning</summary>

```text
Failed to initialize the CUTLASS kernel. Last CUDA error is: no error
```

</details>

<details>
<summary>Cosmos3-Nano Observed Output</summary>

The T2I output file is a valid `960x960` RGB JPEG, but visually appears as gray/noisy texture rather than the requested scene.

</details>

<details>
<summary>Cosmos3-Super Follow-Up</summary>

The `Cosmos3-Super` run did not produce an image. Last real log line:

```text
[06-14 17:17:35|job=|INFO|cosmos_framework/utils/timer.py:138:_log] Time spent on OmniMoTModel: set_up_model: 8.19 s
```

Only log files were written:

```text
console.log
debug.log
host_run.log
```

No `vision.jpg` or `benchmark.json` was written, and no CUTLASS/NATTEN warning appeared before I stopped the run.

</details>

## System Information

| Field | Value |
| ----- | ----- |
| **Environment** | Docker, image built from repo Dockerfile |
| **Hardware** | Single NVIDIA GB300 |
| **Architecture** | aarch64 / ARM64 |
| **GPU Driver** | 610.43.02 |
| **Container PyTorch** | 2.10.0+cu130 |
| **CUDA** | CUDA 13 stack from container |
| **Package Version / Commit** | `82f8229` |
| **Model** | `Cosmos3-Nano`; follow-up also tested `Cosmos3-Super` |
| **Observed NATTEN** | `0.21.6.dev6` |
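
The versions above can be collected in one pass inside the container. The following is a minimal sketch, assuming the packages are installed under the distribution names `torch` and `natten`; `collect_env` and `package_version` are hypothetical helpers, not part of the framework.

```python
import importlib.metadata
import platform


def package_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return None


def collect_env():
    """Gather the fields reported in the System Information table."""
    info = {
        "machine": platform.machine(),
        "python": platform.python_version(),
        "torch": package_version("torch"),    # assumed distribution name
        "natten": package_version("natten"),  # assumed distribution name
    }
    try:
        import torch
    except ImportError:
        return info  # no torch in this environment; report what we have
    info["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The compute capability reported here is useful alongside the NATTEN version, since the question is whether the installed ARM64 wheel was built for this GPU architecture.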

## Additional Context

The `Cosmos3-Nano` commands launch successfully and produce output files, so this is not a startup or download failure. The suspect is the attention backend path on ARM64 GB300 (Blackwell). The source appears to contain Blackwell-specific NATTEN/CUTLASS handling, but the available ARM64 wheel may not provide the support this path needs.

For `Cosmos3-Super`, the unchanged text-to-image sample did not reach the point where I could evaluate output quality. It appears to be a separate pre-generation stall on the same ARM64 GB300 stack.

Please let me know if there is a recommended ARM64/GB300 dependency stack or NATTEN wheel version for `Cosmos3-Nano` and `Cosmos3-Super` inference.

<img width="960" height="960" alt="Image" src="https://github.com/user-attachments/assets/7c874b1c-3fb4-4175-b2a6-1ca4ceaf52ca" />

[debug_super.log](https://github.com/user-attachments/files/28931677/debug_super.log)
[console_super.log](https://github.com/user-attachments/files/28931676/console_super.log)
[debug_nano.log](https://github.com/user-attachments/files/28931678/debug_nano.log)
[console_nano.log](https://github.com/user-attachments/files/28931679/console_nano.log)

