For the DeepSeek FP8 blockwise recipe, the scaling block size for weights is 128x128. With 2D block scaling, sharding the weight by blocks makes the memory layout non-contiguous. Does that mean we have to run the original [N, K] weight tensor through the following function to "permute" it into a blocked layout? And then, after the parameter all-gather (AG), do we also need to "unpermute" it back so that we can launch a regular FP8 GEMM?
import torch

def to_blocked_128x128(a: torch.Tensor) -> torch.Tensor:
    # View as (N//128, 128, K//128, 128), then swap the two inner axes so that
    # each 128x128 scaling block becomes contiguous in memory.
    N, K = a.shape
    return a.view(N // 128, 128, K // 128, 128).permute(0, 2, 1, 3).contiguous()
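
For reference, the "unpermute" after the all-gather would be the inverse rearrangement. A minimal sketch (the helper name `from_blocked_128x128` is my own, not from the recipe):

```python
import torch

def to_blocked_128x128(a: torch.Tensor) -> torch.Tensor:
    # View as (N//128, 128, K//128, 128), then swap the two inner axes so that
    # each 128x128 scaling block becomes contiguous in memory.
    N, K = a.shape
    return a.view(N // 128, 128, K // 128, 128).permute(0, 2, 1, 3).contiguous()

def from_blocked_128x128(a_blocked: torch.Tensor, N: int, K: int) -> torch.Tensor:
    # Inverse of to_blocked_128x128: undo the block permutation and restore
    # the original row-major [N, K] layout expected by a regular GEMM.
    return (
        a_blocked.view(N // 128, K // 128, 128, 128)
        .permute(0, 2, 1, 3)
        .contiguous()
        .view(N, K)
    )
```

Round-tripping a [256, 256] tensor through `to_blocked_128x128` and then `from_blocked_128x128` reproduces the original tensor bit-for-bit, which is the property the unpermute step would need after the AG.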
