Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 201 additions & 0 deletions knowledge_base/sycl/xpu/SYCL_lessons.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# SYCL Lessons

## Lesson: Sub-group Coalesced Memory Access for Intel Xe GPUs

### Symptom

A kernel performs element-wise operations on large tensors but runs at only 0.66× PyTorch speed, despite using vectorized loads and a reasonable work-group size.

### Root Cause

**Poor memory coalescing**. If each work-item processes a contiguous block of N elements independently, lanes within a sub-group access memory N elements apart. With N=32 and sub-group size=16, consecutive lanes load from offsets 0, 32, 64, ... — hitting 16 separate cache lines per SIMD instruction, wasting 75% of fetched data.

### Fix: Sub-group Aware Layout

Reorganize the computation so **consecutive lanes access consecutive memory addresses**. For bf16 pairs, treat them as `uint32` and distribute work by lane:

```cpp
enum { WG = 512, ITERS = 128, SG_SZ = 16, SGS_PER_WG = WG / SG_SZ };
enum { PAIRS_PER_SG = SG_SZ * ITERS }; // 2048 pairs per sub-group

q->submit([&](sycl::handler& h) {
h.parallel_for<class MyKernel>(
sycl::nd_range<1>(global, WG),
[=](sycl::nd_item<1> item) [[sycl::reqd_sub_group_size(16)]] {
auto sg = item.get_sub_group();
int lane = sg.get_local_linear_id();
int64_t sg_id = item.get_group(0) * SGS_PER_WG + sg.get_group_linear_id();
int64_t sg_base = sg_id * PAIRS_PER_SG;

#pragma unroll
for (int i = 0; i < ITERS; i++) {
int64_t idx = sg_base + i * SG_SZ + lane;
// Lane 0 reads idx=sg_base+0
// Lane 1 reads idx=sg_base+1
// ... Lane 15 reads idx=sg_base+15
// All lanes read from same cache line!
uint32_t data = load_u32(idx);
process_and_store(data, idx);
}
});
});
```

**Key points:**
1. All 16 lanes in a sub-group load from consecutive offsets: `[sg_base, sg_base+1, ..., sg_base+15]`.
2. Together they fill exactly **one 64-byte cache line** per iteration.
3. Use `[[sycl::reqd_sub_group_size(16)]]` to guarantee the lane mapping.
4. Use `enum` instead of `constexpr` for compile-time constants to avoid DPC++ host/device capture mismatch.

### Expected Impact

- **Memory bandwidth utilization**: ~100% (vs. 25% with poor coalescing).
- **Performance gain**: 0.66× → 0.97× (47% improvement toward parity with optimized reference).

### Reference (oneAPI GPU Optimization Guide)

- Thread Mapping and GPU Occupancy (Section 6)
- Memory Hierarchy and Access Patterns (Section 7)
- Sub-group sizes: 16 or 32 for most Xe GPUs; use `get_sub_group()` API to query and adapt at runtime.

---

## Lesson: Speeding Up Row-Scale Kernels (diag(A) * B)

### Symptom

The row-wise broadcast multiply kernel `C[i,j] = A[i] * B[i,j]` is correct but only about 0.62x as fast as `torch.compile()`.

### Root Causes

- Per-element integer division (`row = idx / M`) in the hot loop
- Scalar bf16 loads/stores and scalar conversion overhead
- Weak memory coalescing pattern across lanes

### Fix Pattern

1. Map one work-group to one row
- Use `row = item.get_group(0)`.
- Remove integer division from the inner loop entirely.

2. Process bf16 pairs as `uint32`
- Load/store two bf16 values at once.
- This doubles useful bytes per memory op with naturally aligned 32-bit accesses.

3. Use coalesced lane mapping along columns
- Let consecutive local IDs handle consecutive pair indices.
- Stride by work-group size in the column loop.

4. Use explicit sub-group size on Xe
- Add `[[sycl::reqd_sub_group_size(16)]]` for predictable lane behavior.

5. Convert bf16 manually with bit operations
- `bf16 -> f32`: left-shift by 16 and bit-cast.
- `f32 -> bf16`: round-to-nearest-even using bias and sticky-bit logic.

### Minimal Kernel Structure

```cpp
const int pairs_per_row = M / 2;
enum { WG = 256 };

q.submit([&](sycl::handler& h) {
h.parallel_for<class RowScaleKernel>(
sycl::nd_range<1>((size_t)N * WG, WG),
[=](sycl::nd_item<1> item) [[sycl::reqd_sub_group_size(16)]] {
int row = item.get_group(0);
int lid = item.get_local_id(0);

float a_val = bf16_to_f32(a[row]);
const uint32_t* b_row = b32 + row * pairs_per_row;
uint32_t* c_row = c32 + row * pairs_per_row;

for (int pi = lid; pi < pairs_per_row; pi += WG) {
uint32_t p = b_row[pi];
c_row[pi] = mul_pair_bf16(a_val, p);
}
});
});
```

### Measured Impact

- Versus eager PyTorch: improved to about 1.46x
- Versus `torch.compile()`: improved from about 0.62x to about 0.93x

### oneAPI Guide Alignment

- Section 6: Thread mapping and occupancy (work-group decomposition)
- Section 7: Memory hierarchy and coalesced access
- Section 11: Prefer wide/coalesced accesses and remove expensive operations from inner loops

---

## Lesson: Never Use `-fsycl-unnamed-lambda`

### Symptom

Compilation fails with:

```text
icpx: error: cannot specify '-fsycl-unnamed-lambda' along with '-fsycl-host-compiler'
```

### Cause

`-fsycl-unnamed-lambda` conflicts with the `-fsycl-host-compiler` flag that the
PyTorch `load_inline` build system injects automatically. The two flags cannot
coexist.

### Fix

Always use **named kernel classes** in `parallel_for`:

```cpp
h.parallel_for<class MyKernelName>(
sycl::nd_range<1>(global_size, WG_SIZE),
[=](sycl::nd_item<1> item) {
// kernel body
}
);
```

Never add `-fsycl-unnamed-lambda` to `extra_sycl_cflags` in `load_inline`.

### Additional: `constexpr` Captures Cause Kernel Size Mismatch

Capturing `constexpr` float variables from the enclosing host scope causes a
static assertion failure at compile time:

```text
error: static assertion failed: Unexpected kernel lambda size.
This can be caused by an external host compiler producing a lambda with an
unexpected layout. In many cases the difference is related to capturing
constexpr variables.
```

**Fix**: define float constants **inside** the kernel lambda, not outside it:

```cpp
// WRONG — constexpr captured from host scope
constexpr float S2PI = 0.7978845608028654f;
q->submit([&](sycl::handler& h) {
h.parallel_for<class K>(range, [=](sycl::nd_item<1> item) {
float x = S2PI * val; // S2PI captured — size mismatch!
});
});

// CORRECT — constant defined inside the lambda
q->submit([&](sycl::handler& h) {
h.parallel_for<class K>(range, [=](sycl::nd_item<1> item) {
const float S2PI = 0.7978845608028654f;
float x = S2PI * val; // no capture, no mismatch
});
});
```

Use `enum` for integer compile-time constants (safe to define in host scope):

```cpp
enum { WG = 256, ITERS = 16, SG_SZ = 16 }; // safe — integer enums, not floats
```
Loading
Loading