[Bug] int32 overflow in element_space_size causes silent zero outputs for large tensors (> 8.4M pages) #3722

## Description

`element_space_size` is computed as 1 plus the sum of `(lengths[i] - 1) * strides[i]` over all dimensions. When one of those products no longer fits in 32 bits, `calculate_element_space_size_impl` overflows, because `lengths` and `strides` are `index_t` (`int32_t`). The case reported here exceeds `2^32`. The wrong value is stored as `buffer_size_` in `buffer_view`, so AMD GPU buffer loads (which are bounds-checked via the SRD) silently return zeros for valid memory accesses.

Discovered in `mha_batch_prefill` kernel via [ROCm/aiter#2517](https://github.com/ROCm/aiter/issues/2517), but the bug is in CK's core tensor infrastructure.

## Affected Code

`include/ck_tile/core/tensor/tensor_descriptor.hpp`, `calculate_element_space_size_impl`:

```cpp
auto acc_new = acc_old + (lengths[i] - number<1>{}) * strides[i];
//                        ^^^^^^^^^ int32  ×  ^^^^^^^^^ int32  →  overflow!
```

The accumulator `acc_old` starts as `long_number<1>` (int64), but `(lengths[i] - 1) * strides[i]` is evaluated as **int32 × int32 = int32** before it is promoted for the addition. The product stops fitting once it exceeds `2^31 - 1`; in the case reported here it exceeds `2^32` and wraps to a small positive value.
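The difference between the two evaluation orders can be reproduced outside CK. The sketch below is standalone and uses hypothetical helper names (`product_32bit`, `product_64bit`); the `index_t` and `long_index_t` aliases are assumptions that mirror the types described in this report. Signed overflow is undefined behavior in C++, so the 32-bit case is emulated with `uint32_t`, where wraparound is well defined and matches what is typically observed in practice.

```cpp
#include <cstdint>

using index_t      = std::int32_t; // assumption: mirrors ck_tile's index_t
using long_index_t = std::int64_t; // assumption: mirrors ck_tile's long_index_t

// Emulates the 32-bit evaluation of (length - 1) * stride.
// The multiply is done in uint32_t so the wraparound is well defined.
constexpr std::uint32_t product_32bit(index_t length, index_t stride)
{
    return static_cast<std::uint32_t>(length - 1) * static_cast<std::uint32_t>(stride);
}

// Widens both operands first, so the full product is kept.
constexpr long_index_t product_64bit(index_t length, index_t stride)
{
    return static_cast<long_index_t>(length - 1) * static_cast<long_index_t>(stride);
}

// dim0 = 8401239, stride0 = 512 (the example shape from this report)
static_assert(product_32bit(8401239, 512) == 6466560u, "wrapped 32-bit product");
static_assert(product_64bit(8401239, 512) == 4301433856LL, "full 64-bit product");
```

The two results differ by exactly `2^32`, which is why the corrupted size looks like a plausible, small element count rather than an obviously invalid one.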

## Impact Chain

1. `element_space_size` wraps to a small value (e.g., ~6.5M elements instead of ~4.3B)
2. `buffer_view::buffer_size_` stores the wrong value
3. SRD `range` = `buffer_size_ × sizeof(T)` comes out as ~12 MB instead of ~8.6 GB (with a 2-byte `T`)
4. AMD GPU: any buffer load at a byte offset beyond ~12 MB is treated as out of bounds and **silently returns zero**
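The numbers in this chain can be checked with a small model of the range check. This is only a sketch: `range_bytes` and `load_in_range` are hypothetical helpers, not CK or hardware APIs, and the 2-byte element size is an assumption inferred from the 12 MB and 8.6 GB figures. The element counts are those of the example shape `[8401239, 8, 64]` with strides `[512, 64, 1]`.

```cpp
#include <cstdint>

// Element counts for shape [8401239, 8, 64], strides [512, 64, 1]:
//   1 + 8401238*512 + 7*64 + 63*1
constexpr std::uint64_t kTrueElements    = 4301434368ull; // computed in 64 bits
constexpr std::uint64_t kWrappedElements = 6467072ull;    // first term wrapped mod 2^32

// Byte range that would be written into the descriptor.
constexpr std::uint64_t range_bytes(std::uint64_t num_elements, std::uint64_t elem_size)
{
    return num_elements * elem_size;
}

// Simplified bounds check: loads at or beyond the range are out of bounds.
constexpr bool load_in_range(std::uint64_t byte_offset, std::uint64_t range)
{
    return byte_offset < range;
}

static_assert(range_bytes(kWrappedElements, 2) == 12934144ull, "about 12 MB");
static_assert(range_bytes(kTrueElements, 2) == 8602868736ull, "about 8.6 GB");

// A load 1 GiB into the tensor is valid memory, but falls outside the wrapped range.
static_assert(!load_in_range(1ull << 30, range_bytes(kWrappedElements, 2)), "rejected");
static_assert(load_in_range(1ull << 30, range_bytes(kTrueElements, 2)), "accepted");
```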

## Minimal Trigger

Any 3D tensor with `(dim0 - 1) × stride0 > 2^32`:
- Example: shape = `[8401239, 8, 64]`, stride = `[512, 64, 1]` → `8,401,238 × 512 = 4,301,433,856 > 2^32 = 4,294,967,296`
- Threshold: `dim0 - 1 > 2^32 / stride0` (e.g., `dim0 - 1 > 8,388,608` for `stride0 = 512`)
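A host-side check for whether a given dimension hits this condition could look like the sketch below. The helper names are hypothetical and not part of CK. Two predicates are shown: one for the `2^32` threshold quoted above, and a stricter one for whether the term fits in `int32_t` at all.

```cpp
#include <cstdint>
#include <limits>

// (length - 1) * stride evaluated in 64 bits, so it cannot overflow for int32 inputs.
constexpr std::int64_t wide_term(std::int32_t length, std::int32_t stride)
{
    return (static_cast<std::int64_t>(length) - 1) * static_cast<std::int64_t>(stride);
}

// True when the term exceeds 2^32, the trigger condition described in this report.
constexpr bool term_exceeds_2_32(std::int32_t length, std::int32_t stride)
{
    return wide_term(length, stride) > (std::int64_t{1} << 32);
}

// True when the term is representable as int32_t, i.e. safe for a 32-bit multiply.
constexpr bool term_fits_int32(std::int32_t length, std::int32_t stride)
{
    return wide_term(length, stride) <= std::numeric_limits<std::int32_t>::max();
}

static_assert(term_exceeds_2_32(8401239, 512), "example shape triggers the bug");
static_assert(!term_exceeds_2_32(8388609, 512), "exactly 2^32 is not above it");
static_assert(!term_fits_int32(8388609, 512), "but it no longer fits in int32");
```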

## Suggested Fix

Cast to `long_index_t` before multiplication:

```cpp
auto acc_new = acc_old + static_cast<long_index_t>(lengths[i] - number<1>{})
                       * static_cast<long_index_t>(strides[i]);
```

Also: `buffer_view::init_raw()` and the non-raw load paths in `amd_buffer_addressing.hpp` pass `element_space_size * sizeof(T)` as the SRD range, which is a `uint32_t`. For byte sizes above 4 GiB the value needs to be capped at `0xFFFFFFFF` so that it does not wrap a second time.
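The capping could be sketched as follows. `clamp_srd_range` is a hypothetical helper name, not an existing CK function; the point is only that the byte size is computed in 64 bits and saturated before it is narrowed to the 32-bit field.

```cpp
#include <cstdint>

// Computes the byte range in 64 bits and saturates it to the uint32_t field.
// Hypothetical helper; CK's actual call sites and types may differ.
constexpr std::uint32_t clamp_srd_range(std::uint64_t num_elements, std::uint64_t elem_size)
{
    return (num_elements * elem_size > 0xFFFFFFFFull)
               ? 0xFFFFFFFFu
               : static_cast<std::uint32_t>(num_elements * elem_size);
}

// Small buffers pass through unchanged.
static_assert(clamp_srd_range(1000, 2) == 2000u, "no clamping needed");
// The example tensor (about 8.6 GB with 2-byte elements) saturates.
static_assert(clamp_srd_range(4301434368ull, 2) == 0xFFFFFFFFu, "clamped");
```

Without the clamp, narrowing `8,602,868,736` bytes to `uint32_t` would give `12,934,144`, the same ~12 MB range as in the impact chain, even after the multiplication itself is fixed.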

## Versions Affected

Verified on CK commit `eb033ef20` (aiter main) and tag `rocm-7.2.1`.
