diff --git a/.claude/skills/ck-debugging/SKILL.md b/.claude/skills/ck-debugging/SKILL.md new file mode 100644 index 000000000..467046c69 --- /dev/null +++ b/.claude/skills/ck-debugging/SKILL.md @@ -0,0 +1,337 @@ +--- +name: ck-debugging +description: Triage, investigate, debug, and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues. +--- + +# CK Fused Attention Debugging Guide (TransformerEngine, ROCm) + +Use this playbook to quickly answer one question: +**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?** + +--- + +## 1) File layout and integration surface + +### Backend selection and dispatch (hipified — edit CUDA source, not `*_hip.cpp`) +| File | Role | +|---|---| +| `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` | Runtime backend selection (`nvte_get_fused_attn_backend`), all `nvte_fused_attn_{fwd,bwd}*` entry points that dispatch to CK or AOTriton | +| `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` | CK eligibility (`is_ck_backend_supported`), type/mask/stride conversions, workspace allocation, calls into `ck_fused_attn::ck_attn_{fwd,bwd}` and `ck_attn_varlen_{fwd,bwd}` | +| `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` | AOTriton equivalent (useful for comparison when CK fails but AOTriton passes) | +| `transformer_engine/common/fused_attn_rocm/utils.{h,cpp}` | `generateMatrixStrides`, `NVTE_QKV_Matrix` enum — stride computation shared by CK and AOTriton | + +### CK kernel wrappers (native ROCm — edit directly, NOT hipified) +| File | Role | +|---|---| +| `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp` | Public API: `ck_attn_fwd`, `ck_attn_varlen_fwd`, `ck_attn_bwd`, `ck_attn_varlen_bwd` + `DType`, `MaskType`, `BiasType` enums | +| `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_fwd.cpp` | Forward kernel dispatch (calls `fmha_fwd` from ck_tile) | +| 
`transformer_engine/common/ck_fused_attn/src/ck_fused_attn_bwd.cpp` | Backward kernel dispatch | +| `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_utils.{hpp,cpp}` | dtype/bias helpers, `DISPATCH_DTYPE_16BIT` macro | + +### Python-level attention dispatch +| File | Role | +|---|---| +| `transformer_engine/pytorch/attention/dot_product_attention/backends.py` | `get_available_attention_backends()` — Python-level backend selection | +| `transformer_engine/pytorch/attention/dot_product_attention/utils.py` | `AttentionLogging` class, `_NVTE_DEBUG`, `_NVTE_DEBUG_LEVEL` | +| `transformer_engine/pytorch/cpp_extensions/fused_attn.py` | `FusedAttnBackend` enum, `fused_attn_fwd`/`fused_attn_bwd` Python wrappers | + +### Build system +| File | Role | +|---|---| +| `transformer_engine/common/CMakeLists.txt` | Top-level C++ build, `USE_FUSED_ATTN_CK` compile flag | +| `transformer_engine/common/ck_fused_attn/CMakeLists.txt` | CK wrapper build, links ck_tile and aiter | +| `transformer_engine/common/ck_fused_attn/aiter_prebuilt.cmake` | AITER prebuilt SO linking | + +### Tests +| File | Role | +|---|---| +| `tests/pytorch/attention/test_attention.py` | Main attention tests: `test_dot_product_attention`, `test_dpa_mask`, `test_dpa_bias`, `test_dpa_sliding_window`, `test_dpa_alibi_slopes`, `test_dpa_qkv_layout`, `test_dpa_qkv_layout_thd` | +| `3rdparty/aiter/op_tests/test_mha.py` | AITER standalone Python MHA tests | +| `3rdparty/aiter/op_tests/cpp/mha/` | Standalone C++ MHA executables: `benchmark_mha_fwd`, `benchmark_mha_bwd` | + +--- + +## 2) Environment variables reference + +### Backend selection +| Env var | Default | Effect | +|---|---|---| +| `NVTE_FUSED_ATTN` | `1` (enabled) | Master toggle for all fused attention; set `0` to disable | +| `NVTE_FUSED_ATTN_CK` | follows `NVTE_FUSED_ATTN` | CK backend toggle | +| `NVTE_FUSED_ATTN_AOTRITON` | follows `NVTE_FUSED_ATTN` | AOTriton backend toggle | +| `NVTE_FLASH_ATTN` | `1` (enabled) | Flash attention toggle | + 
+### CK kernel tuning +| Env var | Default | Effect | +|---|---|---| +| `NVTE_CK_USES_FWD_V3` | `1` | Use ASM v3 forward kernel (faster, narrower config support) | +| `NVTE_CK_USES_BWD_V3` | `1` | Use ASM v3 backward kernel | +| `NVTE_CK_IS_V3_ATOMIC_FP32` | `1` | Use fp32 atomics in bwd v3 (more accurate, slower) | +| `NVTE_CK_HOW_V3_BF16_CVT` | `1` | bf16 conversion method for v3 kernels | +| `NVTE_CK_ZERO_OUT_PAD` | `1` | Zero out padded positions in output | + +### Debug/logging (all layers, use together for full trace) +| Env var | Layer | What it logs | +|---|---|---| +| `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL=2` | Python (PyTorch) | Backend selection decisions, attention config | +| `NVTE_LOG_FUSED_ATTN_CONFIG=1` | C++ dispatch (`fused_attn.cpp`) | Shape, dtype, layout, mask, window for each fwd/bwd call | +| `NVTE_LOG_CK_CONFIG=1` | C++ CK glue (`fused_attn_ck.cpp`) | CK eligibility filter results, workspace sizes, strides, v3 flags | +| `NVTE_LOG_AOTRITON_CONFIG=1` | C++ AOTriton glue | AOTriton-specific dispatch logging | +| `CK_FUSED_ATTN_LOG_CONFIG=1` | CK kernel wrapper (`ck_fused_attn_fwd/bwd.cpp`) | fmha_traits, fmha_args, kernel name selected | + +**Full debug command prefix:** +```bash +NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 +``` + +### CI backend configs (`ci/_utils.sh::configure_fused_attn_env`) +| Mode | `NVTE_FUSED_ATTN` | `NVTE_FUSED_ATTN_CK` | `NVTE_FUSED_ATTN_AOTRITON` | `NVTE_FLASH_ATTN` | +|---|---|---|---|---| +| `auto` | unset | unset | unset | unset | +| `ck` | unset | unset | `0` | `0` | +| `aotriton` | unset | `0` | unset | `0` | +| `flash` | `0` | `0` | `0` | unset | +| `unfused` | `0` | unset | unset | `0` | + +--- + +## 3) Gather minimum reproducibility context (before changing code) + +Capture these from logs or user report: +- Forward vs backward failure (`fwd` / `bwd`) +- Exact shape/config: batch (`b`), seq lengths (`s_q`, `s_kv`), num heads (`h`), num kv heads 
(`hg`), head dim (`d_qk`, `d_v`) +- Data type(s): fp16 / bf16 (CK does not support fp8 in fused attn) +- QKV layout: which `NVTE_QKV_Layout` (e.g., `BSHD_BSHD_BSHD`, `BS3HD`, `THD_THD_THD`) +- Mask type: `NO_MASK`, `CAUSAL_MASK`, `PADDING_CAUSAL_MASK`, `CAUSAL_BOTTOM_RIGHT_MASK`, etc. +- Bias type: `NO_BIAS`, `POST_SCALE_BIAS`, `ALIBI` +- Dropout probability +- Sliding window size: `(window_size_left, window_size_right)` +- GQA/MQA details: `h` vs `hg` ratio +- GPU architecture (`gfx942`, `gfx950`, etc.) + ROCm version + TE commit +- Whether fallback backend (AOTriton/unfused) succeeds with same config + +If config info is incomplete, request it first; otherwise debugging is noisy and slow. + +When self-collecting logs, enable full logging: +```bash +NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 +``` + +If a segfault occurs, rerun under `rocgdb`: +```bash +rocgdb --args python -m pytest -x -s +# then: (gdb) run, wait for crash, (gdb) bt +``` + +--- + +## 4) CK eligibility checks (`is_ck_backend_supported`) + +The function in `fused_attn_ck.cpp:23-152` applies these filters in order. When CK is rejected, `NVTE_LOG_CK_CONFIG=1` prints the reason. The filters are: + +1. **GQA groups**: `num_gqa_groups > 0` and `num_attn_heads % num_gqa_groups == 0` +2. **Data type**: `q_dtype == kv_dtype` and both are fp16 or bf16 (no fp8) +3. **Bias type**: only `NO_BIAS`, `ALIBI`, or `POST_SCALE_BIAS` (no `PRE_SCALE_BIAS`) +4. **Head dim**: `head_dim_qk < 512` and `head_dim_v < 512` +5. **Causal + window**: if causal mask, window must be `(-1, 0)` or `(>=0, 0)` +6. **No mask + window**: if no mask, window must be `(-1, -1)` or `(>=0, >=0)` +7. **QKV packed + GQA**: MQA/GQA cannot use qkvpacked layouts (`3HD`, `H3D`) +8. **QKV packed + seqlen**: qkvpacked requires `s_q == s_kv` +9. **THD + padding**: ragged (THD) format requires a padding mask type +10. 
**Padding + bias**: padding mask cannot combine with `POST_SCALE_BIAS` or `ALIBI` + +If CK is rejected, the runtime falls through to AOTriton, then to `NVTE_No_Backend` (which causes `NVTE_ERROR`). + +--- + +## 5) Common error signatures and where they come from + +### From dispatch layer (`fused_attn.cpp`) +- `"Invalid combination of data type and sequence length for rocm fused attention."` — no backend accepted the config. Check eligibility with `NVTE_LOG_CK_CONFIG=1`. +- `"qkv_layout not supported!"` — unknown layout enum value. +- `"window_size should be (-1, 0) or (>=0, 0) for attn_mask_type=..."` — window/mask mismatch. + +### From CK glue (`fused_attn_ck.cpp`) +- `"NVTE_3HD NVTE_H3D should have h=hg."` — packed layout with GQA mismatch. +- `"Unexpected Aux_CTX_Tensors->size."` — wrong number of auxiliary tensors passed to fwd/bwd. +- `"Unexpected workspace_size."` — workspace allocation mismatch between first call (size query) and second call (execute). +- `"CK fused attn backend not compiled."` — `USE_FUSED_ATTN_CK` not set at build time. + +### From CK kernel wrappers (`ck_fused_attn_fwd/bwd.cpp`) +- `"fused attn configs not supported in ck_fused_attn fwd pass."` — config doesn't match any compiled CK tile kernel. +- `"fused attn configs not supported in ck_fused_attn bwd pass."` — same for backward. +- `"Invalid dtype in ck_fused_attn."` — bad dtype conversion. +- `"Invalid bias_type in ck_fused_attn."` / `"Invalid bias_shape in ck_fused_attn."` — bias type/shape not recognized. +- `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure. + +### From HIP runtime +- `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access. 
+ +--- + +## 6) Reproduce in controlled CK-only path + +### Path A: TE pytest with CK forced +```bash +# Force CK-only backend +export NVTE_FLASH_ATTN=0 +export NVTE_FUSED_ATTN_AOTRITON=0 +# Full logging +export NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 + +pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s -k "" +``` + +### Path B: AITER Python JIT (isolates from TE integration) +1. Install aiter: `cd 3rdparty/aiter && pip install -e .` +2. Use `3rdparty/aiter/op_tests/test_mha.py` or write a minimal reproducer. +3. Call MHA functions directly (e.g. `mha_fwd`, `fmha_v3_fwd`). + +### Path C: Standalone C++ executables (maximum isolation) +1. Build: + ```bash + cd 3rdparty/aiter/op_tests/cpp/mha + bash build_mha.sh fwd # or: bwd, fwd_v3, bwd_v3, or no arg for all + ``` +2. Run with proper env: + ```bash + export LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH} + export AITER_ASM_DIR=$(realpath ../../../hsa) # or equivalent absolute path + ``` +3. Use `-?` flag to list all arguments. +4. Example commands mapping to TE configs: + ```bash + # Forward: batch=4, heads=32, kv_heads=8, dim=128, seq=4096, causal, bf16 + ./benchmark_mha_fwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 \ + -iperm=1 -operm=1 -mask=1 -lse=1 -mode=0 -kname=1 -v=1 + + # Backward (same config) + ./benchmark_mha_bwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 \ + -iperm=1 -operm=1 -mask=1 -mode=0 -kname=1 -v=1 + ``` +5. 
Key argument mappings: + - `-iperm=1 -operm=1` → BSHD layout (TE default) + - `-iperm=0 -operm=0` → SBHD layout + - `-mask=0` → no mask, `-mask=1` → causal top-left, `-mask=2` → causal bottom-right + - `-mask=t:L,R` → SWA top-left, `-mask=b:L,R` → SWA bottom-right + - `-lse=1` → store LSE (TE always does this) + - `-mode=0` → batch mode, `-mode=1` → group/varlen mode + - `-bias=n` → no bias, `-bias=e` → elementwise, `-bias=a` → alibi + - `-fwd_v3=1` / `-bwd_v3=1` → use ASM v3 kernels + - `-v3_atomic_fp32=0|1` → bwd atomic precision + +--- + +## 7) Decision tree: integration bug vs kernel bug + +### Case 1: Fails in TE, passes in standalone `benchmark_mha_{fwd,bwd}` with equivalent config +→ **Likely TE integration bug**. Focus on: +- Argument marshaling in `fused_attn_ck.cpp`: type conversions (`nvte_to_ck_dtype`, `nvte_to_ck_bias_type`, `set_ck_mask`), stride computation (`generateMatrixStrides`), workspace layout +- Backend selection conditions in `fused_attn.cpp` — is the right config reaching CK? +- Padding removal/addition logic (`remove_padding`, `add_padding`, `add_padding_softmax_lse`) +- BSHD-to-THD conversion path (`bshd_to_thd`, `generate_cu_seqlen_padded`) + +### Case 2: Fails both in TE and standalone +→ **Likely CK/AITER kernel issue** (or unsupported config). Produce a minimal standalone reproducer and hand off to AITER/CK team. + +### Case 3: Passes in TE only when fallback backend (AOTriton) is chosen +→ **CK eligibility guard likely wrong**. Inspect filters in `is_ck_backend_supported`. + +### Case 4: Numerical mismatch (passes but wrong values) +→ Compare CK output vs AOTriton output on same config. If CK-standalone also gives wrong values, kernel bug. 
If only TE-CK path gives wrong values, check: +- Stride ordering (batch vs head vs seq strides differ between batched and varlen paths) +- LSE storage format (padded vs unpadded, h×s_q vs s_q×h ordering) +- Workspace buffer reuse / overlap +- `NVTE_CK_ZERO_OUT_PAD` behavior + +--- + +## 8) High-value integration checks + +When the failure is TE-side, verify these in `fused_attn_ck.cpp`: + +### Stride computation +- `generateMatrixStrides` in `utils.cpp` computes 4-element strides `[batch, head, seq, dim]` for each matrix +- Batched CK API (`ck_attn_fwd`) expects `stride_b, stride_h, stride_s` (3 strides, dim=1 implied) +- Varlen CK API (`ck_attn_varlen_fwd`) expects `stride_h, stride_s` (2 strides, no batch stride) +- When SBHD+padding triggers pad removal, the varlen strides are recomputed: `stride_h=q_stride[1]`, `stride_s=min(q_stride[0], q_stride[2])` + +### Workspace allocation +- First call with `workspace==nullptr` queries size, second call executes +- Workspace sections are allocated sequentially: alibi slopes → softmax LSE → Q/K/V/O without-padding buffers → cu_seqlen_padded +- Mismatch between query and execute allocations causes `"Unexpected workspace_size"` errors + +### Type/mask/bias mapping +- NVTE `CAUSAL_MASK` / `PADDING_CAUSAL_MASK` → CK `mask_top_left` +- NVTE `CAUSAL_BOTTOM_RIGHT_MASK` / `PADDING_CAUSAL_BOTTOM_RIGHT_MASK` → CK `mask_bottom_right` +- NVTE `NO_MASK` / `PADDING_MASK` with SWA `(>=0, >=0)` → CK `mask_bottom_right` (not `window_generic`) +- NVTE `POST_SCALE_BIAS` → CK `elementwise_bias` +- NVTE `ALIBI` → CK `alibi` (slope array auto-generated in workspace) + +### Backward-specific +- `dq_acc_ptr` workspace for split-K accumulation: sized as `float * nsplits * h * max_tokens_q * d_qk` +- `dk_expanded_ptr` / `dv_expanded_ptr` for GQA: expanded to full head count, then reduced +- `dbias_expanded_ptr` → `dbias_ptr` reduction when bias dims differ +- Deterministic mode always set to `false` (TODO in source) +- LSE from forward must be 
passed correctly via `Aux_CTX_Tensors->tensors[0]` + +--- + +## 9) Running TE tests for fused attention + +### Single test +```bash +NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0 \ + pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s \ + -k "model_name_and_params" +``` + +### Key test functions +| Test | What it covers | +|---|---| +| `test_dot_product_attention` | Core fwd/bwd correctness across dtypes, models | +| `test_dpa_mask` | All mask types | +| `test_dpa_bias` / `test_dpa_bias_shapes` | Bias types and shapes | +| `test_dpa_sliding_window` | SWA with different window sizes | +| `test_dpa_alibi_slopes` | ALiBi attention bias | +| `test_dpa_qkv_layout` | All QKV layout variants | +| `test_dpa_qkv_layout_thd` | THD (ragged) layouts | +| `test_dpa_qkv_layout_thd_mqa_gqa` | THD + MQA/GQA | +| `test_gqa_mla_thd` | GQA/MLA with THD format, CK backend check | +| `test_mha_fp8_vs_f16` / `test_dpa_fp8_vs_f16` | FP8 vs FP16 comparison | + +### CI invocation +```bash +# From ci/pytorch.sh — runs with specific backend +ci/pytorch.sh # uses TEST_LEVEL, TEST_SGPU, TEST_FILTER +# Backend set via configure_fused_attn_env in ci/_utils.sh +``` + +--- + +## 10) Common pitfalls + +1. **Stride mismatch between batched and varlen paths**: SBHD+padding triggers pad removal which changes the varlen stride computation. The `min(stride[0], stride[2])` logic can produce unexpected results for certain layouts. +2. **Workspace size queried with different params than execute call**: Any change to config between the two calls will cause workspace size mismatch. +3. **Treating unsupported config as runtime failure instead of eligibility failure**: If CK doesn't support a config, it should be caught by `is_ck_backend_supported`, not crash at kernel launch. +4. **Missing backward-only failures**: Always test both fwd and bwd. Some configs work in fwd but fail in bwd (e.g., due to expanded gradient buffers in GQA). +5. 
**Mismatch between TE-side defaults and standalone binary defaults**: TE always stores LSE (`-lse=1`), always uses `iperm=1 operm=1` for BSHD. Standalone defaults may differ. +6. **Comparing non-equivalent configs across TE and standalone paths**: Ensure mask type, window size, dropout, and all flags match exactly. +7. **v3 kernel fallback**: v3 ASM kernels support a narrower config range than CK fallback. If `NVTE_CK_USES_FWD_V3=1` but the config isn't supported by v3, the kernel wrapper falls back to CK tile. Check `CK_FUSED_ATTN_LOG_CONFIG=1` output for which kernel was actually selected. +8. **cu_seqlen_padded generation**: For BSHD+padding→THD conversion, `generate_cu_seqlen_padded` creates synthetic padded seqlens. If actual padding pattern doesn't match assumptions, results will be wrong. +9. **Build flag `USE_FUSED_ATTN_CK` not set**: If CK backend returns `false` for everything and `NVTE_LOG_CK_CONFIG` produces no output, check that the build included CK. The `is_ck_backend_supported` function returns `false` when compiled without `USE_FUSED_ATTN_CK`. + +--- + +## 11) Output artifact requirements (always produce) + +For each investigated failure, record: + +**Concise handoff format:** +- **Config:** `B=?, Sq=?, Skv=?, H=?, Hg=?, Dqk=?, Dv=?, dtype=?, layout=?, causal=?, dropout=?, mask=?, bias=?, window=?` +- **TE result:** pass/fail + key error +- **Standalone result:** pass/fail + key error +- **Conclusion:** `integration` / `kernel` / `unsupported-config` +- **Owner:** TE vs AITER/CK + +For comprehensive output, reference [TEMPLATE.md](TEMPLATE.md). diff --git a/.claude/skills/ck-debugging/TEMPLATE.md b/.claude/skills/ck-debugging/TEMPLATE.md new file mode 100644 index 000000000..17c393f6d --- /dev/null +++ b/.claude/skills/ck-debugging/TEMPLATE.md @@ -0,0 +1,149 @@ +# CK/AITER Fused-Attn Debug Handoff Template + +Use this template when handing off a failure investigation to TE or AITER/CK owners. 
+ +--- + +## 1) Summary +- **Classification:** `integration` | `kernel` | `unsupported-config` | `unknown` +- **Direction:** `fwd` | `bwd` | `both` + +## 2) Environment +- **TE commit:** +- **AITER commit/submodule ref:** +- **ROCm version:** +- **GPU architecture (gfx):** +- **Container image/tag (if applicable):** + +## 3) Failing Configuration +- **Batch (B):** +- **Query seq (Sq):** +- **KV seq (Skv):** +- **Num Q heads (H):** +- **Num KV heads (Hg):** +- **Head dim QK (Dqk):** +- **Head dim V (Dv):** +- **DType(s):** fp16 / bf16 +- **QKV Layout:** (e.g., `BSHD_BSHD_BSHD`, `BS3HD`, `THD_THD_THD`) +- **Mask type:** (e.g., `CAUSAL_MASK`, `PADDING_CAUSAL_MASK`, `NO_MASK`) +- **Bias type:** (e.g., `NO_BIAS`, `POST_SCALE_BIAS`, `ALIBI`) +- **Dropout:** +- **Window size:** `(left, right)` +- **GQA/MQA details:** + +## 4) TE Reproducer +- **Backend intent:** CK only / auto / fallback allowed +- **Command or test entrypoint:** +- **Key env vars:** + ```bash + NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0 + NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 + ``` +- **Observed result:** pass/fail +- **First failing log line / error signature:** +- **Backend actually selected (from log):** + +## 5) Standalone AITER Reproducer (`benchmark_mha_fwd` / `benchmark_mha_bwd`) +- **Build location:** `3rdparty/aiter/op_tests/cpp/mha` +- **Build command:** `bash build_mha.sh` (or `bash build_mha.sh fwd` / `bwd`) +- **Runtime env:** + - `LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH}` + - `AITER_ASM_DIR=$(realpath ../../../hsa)` +- **Exact standalone command(s):** +- **Observed result:** pass/fail +- **First failing log line / error signature:** + +## 6) Equivalence Check (TE vs Standalone) +- **Are shape/dtype/flags exactly matched?** yes/no +- **Layout mapping verified?** (`-iperm=1 -operm=1` for BSHD, etc.) +- **LSE enabled?** (`-lse=1` — TE always stores LSE) +- **v3 kernel flags matched?** (`NVTE_CK_USES_FWD_V3` → `-fwd_v3=`, etc.) 
+- **Any default mismatch noticed?** +- **Notes:** + +## 7) Conclusion and Ownership +- **Conclusion:** integration vs kernel vs unsupported-config +- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team +- **Specific area (if integration):** + - [ ] Stride computation (`generateMatrixStrides` / stride mapping to CK API) + - [ ] Type/mask/bias conversion (`nvte_to_ck_*`, `set_ck_mask`) + - [ ] Workspace allocation/layout + - [ ] Padding removal/addition + - [ ] Backend eligibility check (`is_ck_backend_supported`) + - [ ] Aux tensor handling (LSE, rng_state, bias in `Aux_CTX_Tensors`) + - [ ] Other: ___ +- **Requested next action:** + +## 8) Artifacts +- **Logs attached:** +- **Minimal reproducer commands attached:** +- **Patch/commit links (if any):** + +--- + +# Example (Filled) + +## 1) Summary +- **Classification:** `integration` +- **Direction:** `bwd` + +## 2) Environment +- **TE commit:** `abc1234` +- **AITER commit/submodule ref:** `def5678` +- **ROCm version:** 6.2.1 +- **GPU architecture (gfx):** gfx942 + +## 3) Failing Configuration +- **Batch (B):** 4 +- **Query seq (Sq):** 4096 +- **KV seq (Skv):** 4096 +- **Num Q heads (H):** 32 +- **Num KV heads (Hg):** 8 +- **Head dim QK (Dqk):** 128 +- **Head dim V (Dv):** 128 +- **DType(s):** bf16 +- **QKV Layout:** `BSHD_BSHD_BSHD` +- **Mask type:** `CAUSAL_MASK` +- **Bias type:** `NO_BIAS` +- **Dropout:** 0.0 +- **Window size:** `(-1, 0)` +- **GQA/MQA details:** GQA with ratio 4:1 + +## 4) TE Reproducer +- **Backend intent:** CK only +- **Command or test entrypoint:** `pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s -k "test_config"` +- **Key env vars:** `NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0 NVTE_LOG_CK_CONFIG=1` +- **Observed result:** fail +- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch` +- **Backend actually selected (from log):** NVTE_CK + +## 5) Standalone AITER 
Reproducer +- **Build location:** `3rdparty/aiter/op_tests/cpp/mha` +- **Build command:** `bash build_mha.sh bwd` +- **Runtime env:** + - `LD_LIBRARY_PATH=/transformer_engine/lib:${LD_LIBRARY_PATH}` + - `AITER_ASM_DIR=$(realpath ../../../hsa)` +- **Exact standalone command(s):** + - `./benchmark_mha_bwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 -iperm=1 -operm=1 -mask=1 -mode=0 -kname=1 -v=1` +- **Observed result:** pass +- **First failing log line / error signature:** N/A + +## 6) Equivalence Check +- **Are shape/dtype/flags exactly matched?** yes +- **Layout mapping verified?** yes (`-iperm=1 -operm=1`) +- **LSE enabled?** yes (implicit in bwd) +- **v3 kernel flags matched?** yes (default v3 enabled) +- **Any default mismatch noticed?** TE-side workspace alignment default differs from standalone path +- **Notes:** Likely marshaling/normalization issue before CK call + +## 7) Conclusion and Ownership +- **Conclusion:** integration +- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation) +- **Specific area:** + - [x] Workspace allocation/layout +- **Requested next action:** Inspect workspace-size and alignment mapping in TE→CK bwd path for GQA expanded gradients + +## 8) Artifacts +- **Logs attached:** `te_fail.log`, `standalone_pass.log` +- **Minimal reproducer commands attached:** yes +- **Patch/commit links (if any):** none diff --git a/.claude/skills/ifu-merge/SKILL.md b/.claude/skills/ifu-merge/SKILL.md new file mode 100644 index 000000000..1abb72336 --- /dev/null +++ b/.claude/skills/ifu-merge/SKILL.md @@ -0,0 +1,227 @@ +--- +name: ifu-merge +description: > + Guide for performing IFU (Internal Feature Update) merges on the TransformerEngine ROCm fork. + This skill walks through resolving merge conflicts between upstream NVIDIA TransformerEngine + and the AMD ROCm fork, then fixing build and runtime errors systematically. 
Use this skill + whenever the user mentions IFU, merge upstream, merge NVIDIA, upstream merge, + IFU merge, internal feature update, or is working on integrating new NVIDIA TE changes into + the ROCm fork — even if they don't use the term "IFU" explicitly. +--- + +# TransformerEngine ROCm — IFU Merge Guide + +An IFU (Internal Feature Update) merges upstream NVIDIA TransformerEngine commits into the AMD ROCm fork. The merge is often committed with conflict markers still in place, then conflicts are resolved in follow-up commits. + +This guide captures generalized strategies for resolving these merges. + +--- + +## Strategy 1: Conflict Resolution Mental Model + +Every conflict falls into one of these categories. Recognizing the category tells you how to resolve it: + +| Category | Recognition | Resolution | +|---|---|---| +| **API Refactor** | Upstream restructured function signatures, parameter passing, or return values | Take upstream's new API structure; integrate ROCm-specific fields into the new pattern | +| **Feature Addition** | Both sides added new parameters, attributes, docstrings, or code paths | Keep both — upstream features and ROCm features are independent and coexist | +| **Guard Removal** | Upstream removed `#ifdef`/`#ifndef` platform guards (they don't need them) | Re-add the guards — they protect ROCm-specific code paths | +| **Feature Support Divergence** | Upstream adds code for features not yet supported on ROCm | Guard with `#ifndef USE_ROCM`; verify current ROCm support status first — features may become supported over time | +| **Convention Change** | Upstream changed a data format, tensor shape, or API contract (e.g., different tensor dimensions, new enum values, reordered arguments) | Adopt the new convention everywhere — then find and update every downstream consumer, including test code. 
The compiler won't catch these; they cause silent wrong results or cryptic runtime errors | +| **Copyright/Whitespace** | Trivial formatting or copyright header differences | Keep ROCm (AMD) copyright headers; take upstream for whitespace | + +**Key principle**: When upstream refactors a pattern (e.g., bundling arguments into a tuple), always adopt the new pattern and adapt ROCm-specific fields into it. Fighting the upstream structure creates maintenance debt. + +**Watch for paired sites**: If upstream refactors a function signature, there are always at least TWO places to update — where values are packed into the new structure and where they are unpacked. Missing one causes silent bugs. + +--- + +## Strategy 2: Understanding the Hipify Pipeline + +PyTorch's hipify tool automatically converts CUDA source files to HIP. Understanding what it does and doesn't convert is essential. + +**What hipify converts:** +- `#include` directives (e.g., `cuda.h` -> `hip/hip_runtime.h`) +- File extensions in generated output (`.cu` -> `.hip`, `.cpp` -> `_hip.cpp`) + +**What hipify does NOT convert:** +- C++ namespace references (e.g., `c10::cuda::CUDAGuard` stays as-is) +- Semantic differences between CUDA and HIP APIs (e.g., warp size, mask types) +- ROCm-specific device behavior (e.g., tensor device masquerading) + +**What hipify preserves faithfully:** +- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds. 
+ +**Rules that follow:** +- Never edit `*_hip.cpp` or `.hip` files — they are regenerated from source files +- Always edit the source `.cpp`/`.cu` files — hipify will process them +- When you see a build error in a `_hip.cpp` file, fix the corresponding source file +- Use conditional compilation to handle namespace/API differences that hipify can't resolve + +**Platform guard conventions by code layer:** + +| Layer | Guard Macro | Example | +|---|---|---| +| PyTorch CSRC (`.cpp` source files) | `#ifdef USE_ROCM` / `#ifndef USE_ROCM` | DeviceGuard, scale swizzling | +| Common layer (`.cu` files that get hipified) | `#ifdef __HIP_PLATFORM_AMD__` | Warp masks, kernel dispatch | +| Python code | `IS_HIP_EXTENSION` (from `torch.utils.cpp_extension`) | Workspace sizing, feature flags | + +Using the wrong guard macro for a given layer is a common source of subtle bugs — the code compiles but the guard doesn't activate correctly. + +--- + +## Strategy 3: Auditing Non-Conflicting Changes + +**This is the single most important strategy and the one most often skipped.** Conflict markers are visible and localized. The real danger lies in changes that merge cleanly — upstream modifications that `git merge` silently accepts because the ROCm fork didn't touch those exact lines. A removed platform guard, a changed convention, a new CUDA-only code path — these slip through without any marker. + +**Five patterns to watch for:** + +1. **Guard Removal**: Upstream deletes platform guards because they don't need them. If ROCm didn't modify those exact lines, the merge silently removes the guards. + +2. **New CUDA-only Code**: Upstream adds new functions or code paths using CUDA-only APIs. These compile fine on CUDA but fail on ROCm. + +3. **New Upstream Files**: Upstream introduces entirely new files (e.g., `swizzle.cpp`) that have no ROCm guards at all. These files get hipified and compiled on ROCm, potentially referencing CUDA-only symbols. 
Check every new file for functions that should be guarded. + +4. **Refactored Utilities**: Upstream moves or rewrites utility functions that had ROCm-specific behavior. The new version may lack the ROCm-specific handling, and the fork's old copy becomes stale. + +5. **Convention Changes**: Upstream changes a data format, tensor shape, or API contract without any code conflict. Every downstream consumer of that convention must be updated manually — the compiler won't catch these. + +**How to systematically audit:** + +```bash +# Get the merge parents +git cat-file -p + +# See ALL upstream changes (not just conflicting ones) +git diff .. --stat + +# Check for removed guards +git diff .. -- | grep -E "^-.*(__HIP_PLATFORM_AMD__|USE_ROCM|IS_HIP_EXTENSION)" + +# Find entirely new files from upstream +git diff .. --diff-filter=A --name-only + +# Check CMakeLists.txt for source list changes +git diff .. -- "*/CMakeLists.txt" +``` + +**What to look for in each changed file:** +- Removed platform guards (`USE_ROCM`, `__HIP_PLATFORM_AMD__`, `IS_HIP_EXTENSION`) +- New CUDA API usage (e.g., `CUDAGuard`, cuBLAS/cuDNN calls) without ROCm equivalents +- Refactored utility functions that previously had ROCm-specific behavior +- Changes to CMakeLists.txt source file lists — files moved between `if(USE_CUDA)` and shared lists, or vice versa +- Changes to workspace allocation, scale computation, or hardware-specific sizing +- Convention changes in tensor shapes, data formats, or function contracts + +**What to look for in CMakeLists.txt specifically:** +- Files that should be in `if(USE_CUDA)` but ended up in the shared source list (causes ROCm compilation of CUDA-only code) +- Files that were in `if(USE_CUDA)` on `dev` but were moved during conflict resolution +- New source files from upstream that need to be placed in the correct list + +--- + +## Strategy 4: The Guard Dependency Contract + +Platform guards form a dependency graph. 
If you guard an implementation, every unguarded call site becomes a dangling reference. If you remove a guard, the implementation must work on both platforms. + +**The rule**: guarded implementations require guarded call sites, and unguarded call sites require unguarded implementations. + +When you fix an issue by adding a guard (e.g., moving a `.cu` file into `if(USE_CUDA)`), trace all references to the symbols it defines: +1. Search for every call site of the guarded function +2. Verify each call site is also guarded, OR has an alternative code path for ROCm +3. Check header files — the declaration may still be visible even if the implementation is guarded, causing linker errors instead of compile errors +4. Check pybind registrations — a Python-exposed function that calls a guarded symbol will fail at import time + +A common pattern: the implementation is guarded in one file, a second call site is properly guarded, but a third call site in a different file (especially a new upstream file) is not. The fix for the implementation creates the linker error at the third site. + +--- + +## Strategy 5: Iterative Build-Test-Fix + +After conflict resolution and auditing, use an iterative loop: + +1. **Build** — compile the project +2. **Categorize the error** — missing include? undeclared identifier? duplicate symbol? preprocessor guard issue? API incompatibility? undefined symbol at link time? +3. **Find the root cause, not the symptom** — when a CUDA-only file fails to compile on ROCm due to a type mismatch, the surface fix is to change the type. But the root cause may be that the file shouldn't compile on ROCm at all. Always ask: "Is the right fix to make this code work here, or to prevent it from running here?" +4. **Apply the appropriate fix pattern** — update the source code with the appropriate fix, guarding as needed +5. **Trace the dependency chain** — after applying a fix, ask "what depends on what I just changed?" and verify both directions +6. 
**Rebuild** and repeat until clean + +**After a clean build, test in order of increasing scope:** +1. Basic module import — catches missing symbols, broken dynamic linking +2. Core operations (GEMM, normalization) — catches API mismatches, incorrect workspace sizing +3. Higher-level tests (attention, transformer layers) — catches integration issues +4. Full test suites — catches edge cases + +Each phase catches a different class of error, and errors from earlier phases are much cheaper to fix. + +**Runtime errors require different investigation than build errors.** A common pattern: upstream refactors a Python function and the new version works on CUDA but uses different hardware-specific parameters (workspace sizes, alignment requirements, feature flags). These only manifest at runtime with specific configurations (dtype, dimensions, GPU architecture). + +**When investigating "works on dev, fails here"**: the critical question isn't "what's different?" — it's "why did it work on dev at all?" A test might pass not because the feature is supported, but because a chain of incidental conditions causes it to be skipped. Understanding the mechanism of "working" is essential to understanding when it will break. + +--- + +## Strategy 6: Stale Fork-Specific Code + +Over successive IFUs, the ROCm fork accumulates its own implementations of functions that upstream later refactors or relocates. When upstream moves a function from file A to a new file B, the merge brings in file B but doesn't remove the fork's copy in file A. Now you have two definitions in the same namespace — a potential duplicate symbol error on CUDA, and confusion about which version is canonical. + +**After each IFU:** +1. Check if the ROCm fork has files that don't exist on upstream (e.g., `util.cpp` with `#ifndef USE_ROCM`) +2. For each fork-specific file, check if upstream added equivalent functionality in a different file +3. 
If upstream's new file supersedes the fork's copy, verify the fork's copy is either removed or properly isolated
+
+---
+
+## Strategy 7: ROCm vs CUDA Feature Parity
+
+Not all CUDA features are available on ROCm. The set of supported features evolves over time as AMD adds capabilities. When resolving an IFU:
+
+1. **Check current support** before deciding how to guard new upstream code
+2. **Look at existing codebase patterns** — if similar code is already guarded elsewhere, follow the same pattern
+3. **When in doubt, guard conservatively** and add a TODO comment indicating the feature may become available
+4. **Ask the developer** if you're unsure about the current support status of a specific feature
+
+---
+
+## Workflow
+
+### Step 1: Setup
+
+```bash
+git cat-file -p <merge-commit>  # Identify parent1 (ROCm) and parent2 (upstream)
+grep -rn "^<<<<<<<\|^=======\|^>>>>>>>" . \
+  --include="*.py" --include="*.cpp" --include="*.cu" \
+  --include="*.h" --include="*.cuh" # Count remaining conflicts
+```
+
+### Step 2: Resolve Conflict Markers
+
+Apply Strategy 1 (conflict resolution mental model) to each file. Verify zero markers remain.
+
+### Step 3: Audit Non-Conflicting Changes
+
+Apply Strategy 3 — audit ALL upstream changes, not just conflicting ones. Check new files, removed guards, convention changes, and CMakeLists.txt source list placement. This step is most often skipped and most often the source of post-merge failures.
+
+### Step 4: Build and Fix Iteratively
+
+Apply Strategy 5. Python syntax check first (`python -m py_compile`), then C++ build, then runtime tests. For every fix, apply Strategy 4 — trace the guard dependency chain to completion before rebuilding.
+
+### Step 5: Run Tests
+
+Test in order of increasing scope per Strategy 5. Investigate runtime failures by checking whether upstream changed hardware-specific parameters, utility functions, or data format conventions.
+
+### Step 6: Clean Up Stale Code
+
+Apply Strategy 6.
Check fork-specific files for functions that upstream has superseded in new files. + +--- + +## Hard Rules + +- **Never edit `*_hip.cpp` files** — they are auto-generated by hipify +- **Always edit source `.cpp`/`.cu` files** — hipify will process them +- **Use the correct platform guard for each layer** (see Strategy 2) +- **Guard unsupported features** — check current ROCm support status; guard conservatively if unsure +- **Audit all upstream changes** — not just the ones that caused conflicts +- **Trace the guard dependency chain** — guarding an implementation requires guarding every call site (see Strategy 4) +- **Test failures are informational, not directional** — trace backward from the failure to find the violated invariant; resist the urge to fix the test diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..b711be97e --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,78 @@ +# Agent instructions for TransformerEngine (ROCm fork) + +## Docker containers +- We work in Docker containers for reproducibility. +- Run build/test commands **only** inside the designated container (not on host). +- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive. +- Prefer editable installs (`pip install -e .`). +- Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state. +- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly. 
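+
+One way to capture that record, as a sketch (the ROCm version file path and the `rocminfo`-based arch probe are common defaults, not guaranteed in every container; each probe falls back to `unknown`):
+
+```bash
+# Capture the reproducibility record before debugging (adjust paths per container)
+echo "TE commit:  $(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
+echo "Submodules: $(git submodule status 2>/dev/null | wc -l) registered"
+echo "ROCm:       $(cat /opt/rocm/.info/version 2>/dev/null || echo unknown)"
+echo "GPU arch:   $(rocminfo 2>/dev/null | grep -m1 -o 'gfx[0-9a-f]*' || echo unknown)"
+```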
+ +## Architecture +- One core C++/HIP library + optional framework bindings: + - core: `transformer_engine/common` → `libtransformer_engine.so` + - PyTorch: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc` + - JAX: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions` +- Python import flow: + - framework selection: `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` = `pytorch|jax|all|none`) + - `.so` loading: `transformer_engine/common/__init__.py` (`load_framework_extension`) +- Build orchestration: `setup.py` + `build_tools/*.py` + CMake. + - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set. +- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`. + +## Hipify convention +The build auto-generates `*_hip.cpp` files from CUDA `.cpp` sources via `hipify_torch`. These are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit `*_hip.cpp` files directly** — edit the CUDA source instead. + +The following directories are **excluded** from hipify (native ROCm code — edit directly): +- `transformer_engine/common/ck_fused_attn/` — CK kernel wrappers +- `transformer_engine/common/amd_detail/` — AMD-specific utilities +- `transformer_engine/common/rocshmem_api/` — ROCshmem wrappers + +Framework bindings (`pytorch/csrc`, `jax/csrc`) are hipified separately via `build_tools/pytorch.py` and `build_tools/jax.py`. + +## Fused attention backends +Backends are gated by env vars (set to `0` to disable, unset or `1` to enable): + +| Env var | Controls | +|---|---| +| `NVTE_FUSED_ATTN` | Master toggle for all fused attention | +| `NVTE_FUSED_ATTN_CK` | CK backend | +| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend | +| `NVTE_FLASH_ATTN` | Flash attention | + +CI backend configs (`ci/_utils.sh::configure_fused_attn_env`): `auto`, `ck`, `aotriton`, `flash`, `unfused`. 
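+
+To pin a repro to one backend, the toggles above compose like this (the test file is from `tests/pytorch/attention/`; the exact selection is illustrative — substitute your failing case):
+
+```bash
+# CK only: disable the alternatives so backend selection cannot fall through
+NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=1 \
+NVTE_FUSED_ATTN_AOTRITON=0 NVTE_FLASH_ATTN=0 \
+python -m pytest tests/pytorch/attention/test_attention.py -x
+```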
+ +### ROCm fused-attn file layout +- **Runtime backend selection/dispatch**: `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` (hipified) +- **CK dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` (hipified) +- **AOTriton dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` (hipified) +- **CK kernel wrappers** (native, not hipified): + - `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp` + - `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp` + +### Debug logging env vars +- `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL={0,1,2}` — Python-level attention debug output +- `NVTE_LOG_FUSED_ATTN_CONFIG=1` — C++ backend selection logging +- `NVTE_LOG_CK_CONFIG=1` — CK-specific config logging +- `NVTE_LOG_AOTRITON_CONFIG=1` — AOTriton-specific config logging +- `CK_FUSED_ATTN_LOG_CONFIG=1` — CK kernel wrapper logging + +## Developer workflows +- Always init submodules first: `git submodule update --init --recursive`. +- Source install: `pip install . --no-build-isolation`. +- C++ tests: `qa/L0_cppunittest/test.sh` or `ci/core.sh`. +- Framework CI tests (shell scripts, not bare pytest): + - PyTorch: `ci/pytorch.sh` | JAX: `ci/jax.sh` + - Control via `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` (from `ci/_utils.sh`). +- Lint/format: `qa/format.sh` (pre-commit), `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`. + +## Code conventions +- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required. +- Keep env-var behavior stable; tests toggle flags intentionally. +- Python: Black, line length 100. C/C++: cpplint + `.clang-format`. + +## Troubleshooting pointers +- **Missing `.so` on import**: check path resolution in `transformer_engine/common/__init__.py`. +- **Framework extension won't build on ROCm**: check `build_tools/utils.py::get_frameworks()`. 
+- **Fused-attn regression**: reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`). +- **CK/AITER kernel failures**: use the `ck-debugging` skill for structured triage and isolation.
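+
+A typical verbose repro combines the debug-logging env vars from the section above at maximum verbosity (the `-k` filter and log filename are illustrative):
+
+```bash
+# Single-test repro with all fused-attn logging enabled; tee the log for triage
+NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 \
+NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 \
+python -m pytest tests/pytorch/attention/test_attention.py -k test_dpa_mask -x 2>&1 | tee fused_attn_debug.log
+```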