From 8a8ea813678a37b286dffaa201763bcf926cbd0e Mon Sep 17 00:00:00 2001
From: Meekail Zain <zainmeekail@gmail.com>
Date: Thu, 12 Feb 2026 11:26:10 -0600
Subject: [PATCH 1/3] Added initial AI Agent instructions

---
 .claude/skills/ck-debugging/SKILL.md    | 103 ++++++++++++++++++++
 .claude/skills/ck-debugging/TEMPLATE.md | 121 ++++++++++++++++++++++++
 CLAUDE.md                               |  57 +++++++++++
 3 files changed, 281 insertions(+)
 create mode 100644 .claude/skills/ck-debugging/SKILL.md
 create mode 100644 .claude/skills/ck-debugging/TEMPLATE.md
 create mode 100644 CLAUDE.md
diff --git a/.claude/skills/ck-debugging/SKILL.md b/.claude/skills/ck-debugging/SKILL.md
new file mode 100644
index 000000000..a7779990b
--- /dev/null
+++ b/.claude/skills/ck-debugging/SKILL.md
@@ -0,0 +1,103 @@
+---
+name: ck-debugging
+description: Triage, investigate, debug, and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues.
+---
+
+# CK Fused Attention Debugging Guide (TransformerEngine, ROCm)
+
+Use this playbook to quickly answer one question:
+**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**
+
+## 1) Map the integration surface first
+- Build-time CK args parsing/validation:
+  - `transformer_engine/common/CMakeLists.txt`
+  - `tools/check_aiter_mha_args_usage.py`
+- CK fused-attn kernel wrappers/entry points:
+  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
+- CK backend preprocessing and dispatch glue:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
+- Runtime backend selection / fallback path:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`
+
+## 2) Gather minimum reproducibility context (before changing code)
+Capture these from logs or user report:
+- Forward vs backward failure (`fwd` / `bwd`)
+- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
+- Data type(s): fp16/bf16/fp8
+- Mask/dropout/causal/windowing/alibi/padding settings
+- GQA/MQA/group mode details if used
+- GPU architecture + ROCm version + TE commit
+- Whether fallback backend succeeds
+
+When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`.
+
+If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...` (then run and collect `bt`).
+
+If config info is incomplete, request it first; otherwise debugging is noisy and slow.
+
+## 3) Reproduce in controlled CK-only path
+Preferred path (AITER Python JIT):
+1. Start from `3rdparty/aiter/op_tests/test_mha.py` to reproduce through the same Python JIT interface used in many real flows.
+2. Add a minimal wrapper test (for example, `test_te_reproducer`) that pins only the failing TE config.
+3. Call the Python-level MHA functions directly (e.g. `mha_fwd` and `fmha_v3_fwd`).
+4. Record the exact test invocation, pinned parameters, and first failing log line.
+
+Secondary path (native executables for isolation/confirmation):
+1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
+2. Keep env explicit when running:
+   - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+   - `AITER_ASM_DIR=$(realpath 3rdparty/aiter/hsa)` (or equivalent absolute path)
+3. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
+4. Re-encode the same failing config in `fwd.exe` / `bwd.exe` and compare behavior vs Python JIT.
+5. Keep in mind that TE always stores LSE, hence use `-lse=1`.
+6. Record full commands to include in handoff.
+
+## 4) Decision tree: integration bug vs kernel bug
+1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
+   - Likely TE integration bug.
+   - Focus on argument marshaling/normalization in:
+     - `fused_attn_ck.cpp`
+     - `ck_fused_attn_*`
+     - backend selection conditions in `fused_attn.cpp`
+
+2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`**
+   - Likely CK/AITER kernel issue (or unsupported config).
+   - Produce a minimal standalone reproducer command and hand off.
+
+3. **Passes in TE only when fallback backend is chosen**
+   - CK eligibility/selection guard likely wrong.
+   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
+
+## 5) High-value checks when it is integration-related
+- Verify all expected CK args are present and in the right order/type.
+- Check TE→CK conversions for:
+  - layout / strides
+  - sequence length semantics (`s_q` vs `s_kv`)
+  - grouped-query mapping
+  - mask/bias/dropout flags
+  - causal/windowing flags
+  - dtype/accumulator assumptions
+- Confirm no silent defaulting for missing fields.
+- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).
+
+## 6) Output artifact requirements (always produce)
+For each investigated failure, record:
+- TE reproducer summary (shapes, dtype, flags)
+- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
+- Classification: `integration` or `kernel`
+- Owning component and next action
+
+Suggested concise handoff format:
+- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
+- **TE result:** pass/fail + key error
+- **Standalone result:** pass/fail + key error
+- **Conclusion:** integration vs kernel
+- **Owner:** TE vs AITER/CK
+
+For more comprehensive output formatting, reference [TEMPLATE.md](TEMPLATE.md)
+
+## 7) Common pitfalls
+- Mismatch between TE-side defaults and standalone binary defaults.
+- Treating unsupported config as runtime failure instead of eligibility failure.
+- Comparing non-equivalent configs across TE and standalone paths.
+- Missing backward-only failures (always test both directions when applicable).
\ No newline at end of file
diff --git a/.claude/skills/ck-debugging/TEMPLATE.md b/.claude/skills/ck-debugging/TEMPLATE.md
new file mode 100644
index 000000000..1cacdf24e
--- /dev/null
+++ b/.claude/skills/ck-debugging/TEMPLATE.md
@@ -0,0 +1,121 @@
+# CK/AITER Fused-Attn Debug Handoff Template
+
+Use this template when handing off a failure investigation to TE or AITER/CK owners.
+
+---
+
+## 1) Summary
+- **Classification:** `integration` | `kernel` | `unknown`
+- **Direction:** `fwd` | `bwd` | `both`
+
+## 2) Environment
+- **TE commit:**
+- **AITER commit/submodule ref:**
+- **ROCm version:**
+- **GPU architecture (gfx):**
+
+## 3) Failing Configuration
+- **Batch (B):**
+- **Query seq (Sq):**
+- **KV seq (Skv):**
+- **Num heads (H):**
+- **Head dim (D):**
+- **DType(s):** fp16 / bf16 / fp8
+- **Causal:** true/false
+- **Dropout:**
+- **Mask/Bias mode:**
+- **Windowing/Alibi/Padding:**
+- **GQA/MQA details:**
+
+## 4) TE Reproducer
+- **Backend intent:** CK only / auto / fallback allowed
+- **Command or test entrypoint:**
+- **Key env vars:**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:**
+- **Runtime env:**
+	- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+	- `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes/no
+- **Any default mismatch noticed?**
+- **Notes:**
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration vs kernel vs unsupported-config
+- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
+- **Requested next action:**
+
+## 8) Artifacts
+- **Logs attached:**
+- **Minimal reproducer commands attached:**
+- **Patch/commit links (if any):**
+
+---
+
+# Example (Filled)
+
+## 1) Summary
+- **Classification:** `integration`
+- **Direction:** `bwd`
+
+## 2) Environment
+- **TE commit:** `abc1234`
+- **AITER commit/submodule ref:** `def5678`
+- **ROCm version:** 6.2.1
+- **GPU architecture (gfx):** gfx942
+
+## 3) Failing Configuration
+- **Batch (B):** 4
+- **Query seq (Sq):** 4096
+- **KV seq (Skv):** 4096
+- **Num heads (H):** 32
+- **Head dim (D):** 128
+- **DType(s):** bf16
+- **Causal:** true
+- **Dropout:** 0.0
+- **Mask/Bias mode:** causal mask only
+- **Windowing/Alibi/Padding:** none
+- **GQA/MQA details:** none
+
+## 4) TE Reproducer
+- **Backend intent:** CK only
+- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
+- **Key env vars:** CK backend forced; debug logging enabled
+- **Observed result:** fail
+- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:** `./mha_build.sh`
+- **Runtime env:**
+	- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+	- `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+	- `./bwd.exe <equivalent args>`
+	- `./fwd.exe <equivalent args>`
+- **Observed result:** pass (both)
+- **First failing log line / error signature:** N/A
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes
+- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
+- **Notes:** likely marshaling/normalization issue before CK call
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration
+- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
+- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path
+
+## 8) Artifacts
+- **Logs attached:** `te_fail.log`, `standalone_pass.log`
+- **Minimal reproducer commands attached:** yes
+- **Patch/commit links (if any):** none
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..a2c0ec041
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,57 @@
+# Agent instructions for TransformerEngine (ROCm fork)
+
+## Using Docker containers
+- We generally work in Docker containers for reproducibility.
+- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
+- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
+- Before debugging, record runtime context in notes/logs:
+  - container image/tag
+  - ROCm version in container
+  - GPU architecture visible in container
+  - TE commit/submodule state
+- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
+
+## Big picture
+- This repo builds **one core C++/HIP library** plus optional framework bindings:
+  - core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
+  - PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow is split:
+  - top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
+  - `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
+- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
+  - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+
+## Platform/backends
+- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
+- Fused attention backends are runtime/compile-time gated by env vars:
+  - `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
+- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
+- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.
+
+## Developer workflows you should follow
+- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
+- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
+- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
+- CI-style framework test entrypoints are shell scripts, not a single pytest command:
+  - PyTorch: `ci/pytorch.sh`
+  - JAX: `ci/jax.sh`
+  - They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
+- Lint/format workflow is repo-specific:
+  - local formatting: `qa/format.sh` (pre-commit hooks)
+  - cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`
+
+## Code conventions and change boundaries
+- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
+- Preserve dual-platform structure when modifying kernels/build logic:
+  - shared sources are often `.cu` then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
+  - never edit HIP files directly -- instead, edit the CUDA source and let the build system generate HIP variants.
+- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
+- Respect existing tooling/style:
+  - Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
+  - C/C++ style checked by cpplint and `.clang-format`
+
+## Practical pointers for AI agents
+- If import fails with missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
+- If framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
+- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.

From cdab3396b48a887db9f546e94c08486abf33e22a Mon Sep 17 00:00:00 2001
From: Meekail Zain <zainmeekail@gmail.com>
Date: Thu, 5 Mar 2026 16:19:32 -0600
Subject: [PATCH 2/3] Updated w/ Claude's refinement

---
 .claude/skills/ck-debugging/SKILL.md    | 396 +++++++++++++++++++-----
 .claude/skills/ck-debugging/TEMPLATE.md |  80 +++--
 CLAUDE.md                               | 121 +++++---
 3 files changed, 440 insertions(+), 157 deletions(-)

diff --git a/.claude/skills/ck-debugging/SKILL.md b/.claude/skills/ck-debugging/SKILL.md
index a7779990b..467046c69 100644
--- a/.claude/skills/ck-debugging/SKILL.md
+++ b/.claude/skills/ck-debugging/SKILL.md
@@ -8,96 +8,330 @@ description: Triage, investigate, debug, and isolate CK/AITER Fused Attention fa
 Use this playbook to quickly answer one question:
 **Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**
 
-## 1) Map the integration surface first
-- Build-time CK args parsing/validation:
-  - `transformer_engine/common/CMakeLists.txt`
-  - `tools/check_aiter_mha_args_usage.py`
-- CK fused-attn kernel wrappers/entry points:
-  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
-- CK backend preprocessing and dispatch glue:
-  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
-- Runtime backend selection / fallback path:
-  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`
-
-## 2) Gather minimum reproducibility context (before changing code)
+---
+
+## 1) File layout and integration surface
+
+### Backend selection and dispatch (hipified — edit CUDA source, not `*_hip.cpp`)
+| File | Role |
+|---|---|
+| `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` | Runtime backend selection (`nvte_get_fused_attn_backend`), all `nvte_fused_attn_{fwd,bwd}*` entry points that dispatch to CK or AOTriton |
+| `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` | CK eligibility (`is_ck_backend_supported`), type/mask/stride conversions, workspace allocation, calls into `ck_fused_attn::ck_attn_{fwd,bwd}` and `ck_attn_varlen_{fwd,bwd}` |
+| `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` | AOTriton equivalent (useful for comparison when CK fails but AOTriton passes) |
+| `transformer_engine/common/fused_attn_rocm/utils.{h,cpp}` | `generateMatrixStrides`, `NVTE_QKV_Matrix` enum — stride computation shared by CK and AOTriton |
+
+### CK kernel wrappers (native ROCm — edit directly, NOT hipified)
+| File | Role |
+|---|---|
+| `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp` | Public API: `ck_attn_fwd`, `ck_attn_varlen_fwd`, `ck_attn_bwd`, `ck_attn_varlen_bwd` + `DType`, `MaskType`, `BiasType` enums |
+| `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_fwd.cpp` | Forward kernel dispatch (calls `fmha_fwd` from ck_tile) |
+| `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_bwd.cpp` | Backward kernel dispatch |
+| `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_utils.{hpp,cpp}` | dtype/bias helpers, `DISPATCH_DTYPE_16BIT` macro |
+
+### Python-level attention dispatch
+| File | Role |
+|---|---|
+| `transformer_engine/pytorch/attention/dot_product_attention/backends.py` | `get_available_attention_backends()` — Python-level backend selection |
+| `transformer_engine/pytorch/attention/dot_product_attention/utils.py` | `AttentionLogging` class, `_NVTE_DEBUG`, `_NVTE_DEBUG_LEVEL` |
+| `transformer_engine/pytorch/cpp_extensions/fused_attn.py` | `FusedAttnBackend` enum, `fused_attn_fwd`/`fused_attn_bwd` Python wrappers |
+
+### Build system
+| File | Role |
+|---|---|
+| `transformer_engine/common/CMakeLists.txt` | Top-level C++ build, `USE_FUSED_ATTN_CK` compile flag |
+| `transformer_engine/common/ck_fused_attn/CMakeLists.txt` | CK wrapper build, links ck_tile and aiter |
+| `transformer_engine/common/ck_fused_attn/aiter_prebuilt.cmake` | AITER prebuilt SO linking |
+
+### Tests
+| File | Role |
+|---|---|
+| `tests/pytorch/attention/test_attention.py` | Main attention tests: `test_dot_product_attention`, `test_dpa_mask`, `test_dpa_bias`, `test_dpa_sliding_window`, `test_dpa_alibi_slopes`, `test_dpa_qkv_layout`, `test_dpa_qkv_layout_thd` |
+| `3rdparty/aiter/op_tests/test_mha.py` | AITER standalone Python MHA tests |
+| `3rdparty/aiter/op_tests/cpp/mha/` | Standalone C++ MHA executables: `benchmark_mha_fwd`, `benchmark_mha_bwd` |
+
+---
+
+## 2) Environment variables reference
+
+### Backend selection
+| Env var | Default | Effect |
+|---|---|---|
+| `NVTE_FUSED_ATTN` | `1` (enabled) | Master toggle for all fused attention; set `0` to disable |
+| `NVTE_FUSED_ATTN_CK` | follows `NVTE_FUSED_ATTN` | CK backend toggle |
+| `NVTE_FUSED_ATTN_AOTRITON` | follows `NVTE_FUSED_ATTN` | AOTriton backend toggle |
+| `NVTE_FLASH_ATTN` | `1` (enabled) | Flash attention toggle |
+
+### CK kernel tuning
+| Env var | Default | Effect |
+|---|---|---|
+| `NVTE_CK_USES_FWD_V3` | `1` | Use ASM v3 forward kernel (faster, narrower config support) |
+| `NVTE_CK_USES_BWD_V3` | `1` | Use ASM v3 backward kernel |
+| `NVTE_CK_IS_V3_ATOMIC_FP32` | `1` | Use fp32 atomics in bwd v3 (more accurate, slower) |
+| `NVTE_CK_HOW_V3_BF16_CVT` | `1` | bf16 conversion method for v3 kernels |
+| `NVTE_CK_ZERO_OUT_PAD` | `1` | Zero out padded positions in output |
+
+### Debug/logging (all layers, use together for full trace)
+| Env var | Layer | What it logs |
+|---|---|---|
+| `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL=2` | Python (PyTorch) | Backend selection decisions, attention config |
+| `NVTE_LOG_FUSED_ATTN_CONFIG=1` | C++ dispatch (`fused_attn.cpp`) | Shape, dtype, layout, mask, window for each fwd/bwd call |
+| `NVTE_LOG_CK_CONFIG=1` | C++ CK glue (`fused_attn_ck.cpp`) | CK eligibility filter results, workspace sizes, strides, v3 flags |
+| `NVTE_LOG_AOTRITON_CONFIG=1` | C++ AOTriton glue | AOTriton-specific dispatch logging |
+| `CK_FUSED_ATTN_LOG_CONFIG=1` | CK kernel wrapper (`ck_fused_attn_fwd/bwd.cpp`) | fmha_traits, fmha_args, kernel name selected |
+
+**Full debug command prefix:**
+```bash
+NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>
+```
+
+### CI backend configs (`ci/_utils.sh::configure_fused_attn_env`)
+| Mode | `NVTE_FUSED_ATTN` | `NVTE_FUSED_ATTN_CK` | `NVTE_FUSED_ATTN_AOTRITON` | `NVTE_FLASH_ATTN` |
+|---|---|---|---|---|
+| `auto` | unset | unset | unset | unset |
+| `ck` | unset | unset | `0` | `0` |
+| `aotriton` | unset | `0` | unset | `0` |
+| `flash` | `0` | `0` | `0` | unset |
+| `unfused` | `0` | unset | unset | `0` |
+
+---
+
+## 3) Gather minimum reproducibility context (before changing code)
+
 Capture these from logs or user report:
 - Forward vs backward failure (`fwd` / `bwd`)
-- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
-- Data type(s): fp16/bf16/fp8
-- Mask/dropout/causal/windowing/alibi/padding settings
-- GQA/MQA/group mode details if used
-- GPU architecture + ROCm version + TE commit
-- Whether fallback backend succeeds
+- Exact shape/config: batch (`b`), seq lengths (`s_q`, `s_kv`), num heads (`h`), num kv heads (`hg`), head dim (`d_qk`, `d_v`)
+- Data type(s): fp16 / bf16 (CK does not support fp8 in fused attn)
+- QKV layout: which `NVTE_QKV_Layout` (e.g., `BSHD_BSHD_BSHD`, `BS3HD`, `THD_THD_THD`)
+- Mask type: `NO_MASK`, `CAUSAL_MASK`, `PADDING_CAUSAL_MASK`, `CAUSAL_BOTTOM_RIGHT_MASK`, etc.
+- Bias type: `NO_BIAS`, `POST_SCALE_BIAS`, `ALIBI`
+- Dropout probability
+- Sliding window size: `(window_size_left, window_size_right)`
+- GQA/MQA details: `h` vs `hg` ratio
+- GPU architecture (`gfx942`, `gfx950`, etc.) + ROCm version + TE commit
+- Whether fallback backend (AOTriton/unfused) succeeds with same config
 
-When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`.
+If config info is incomplete, request it first; otherwise debugging is noisy and slow.
 
-If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...` (then run and collect `bt`).
+When self-collecting logs, enable full logging:
+```bash
+NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>
+```
 
-If config info is incomplete, request it first; otherwise debugging is noisy and slow.
+If a segfault occurs, rerun under `rocgdb`:
+```bash
+rocgdb --args python -m pytest <test> -x -s
+# then: (gdb) run, wait for crash, (gdb) bt
+```
+
+---
+
+## 4) CK eligibility checks (`is_ck_backend_supported`)
+
+The function in `fused_attn_ck.cpp:23-152` applies these filters in order. When CK is rejected, `NVTE_LOG_CK_CONFIG=1` prints the reason. The filters are:
+
+1. **GQA groups**: `num_gqa_groups > 0` and `num_attn_heads % num_gqa_groups == 0`
+2. **Data type**: `q_dtype == kv_dtype` and both are fp16 or bf16 (no fp8)
+3. **Bias type**: only `NO_BIAS`, `ALIBI`, or `POST_SCALE_BIAS` (no `PRE_SCALE_BIAS`)
+4. **Head dim**: `head_dim_qk < 512` and `head_dim_v < 512`
+5. **Causal + window**: if causal mask, window must be `(-1, 0)` or `(>=0, 0)`
+6. **No mask + window**: if no mask, window must be `(-1, -1)` or `(>=0, >=0)`
+7. **QKV packed + GQA**: MQA/GQA cannot use qkvpacked layouts (`3HD`, `H3D`)
+8. **QKV packed + seqlen**: qkvpacked requires `s_q == s_kv`
+9. **THD + padding**: ragged (THD) format requires a padding mask type
+10. **Padding + bias**: padding mask cannot combine with `POST_SCALE_BIAS` or `ALIBI`
+
+If CK is rejected, the runtime falls through to AOTriton, then to `NVTE_No_Backend` (which causes `NVTE_ERROR`).
+
+---
+
+## 5) Common error signatures and where they come from
+
+### From dispatch layer (`fused_attn.cpp`)
+- `"Invalid combination of data type and sequence length for rocm fused attention."` — no backend accepted the config. Check eligibility with `NVTE_LOG_CK_CONFIG=1`.
+- `"qkv_layout not supported!"` — unknown layout enum value.
+- `"window_size should be (-1, 0) or (>=0, 0) for attn_mask_type=..."` — window/mask mismatch.
+
+### From CK glue (`fused_attn_ck.cpp`)
+- `"NVTE_3HD NVTE_H3D should have h=hg."` — packed layout with GQA mismatch.
+- `"Unexpected Aux_CTX_Tensors->size."` — wrong number of auxiliary tensors passed to fwd/bwd.
+- `"Unexpected workspace_size."` — workspace allocation mismatch between first call (size query) and second call (execute).
+- `"CK fused attn backend not compiled."` — `USE_FUSED_ATTN_CK` not set at build time.
+
+### From CK kernel wrappers (`ck_fused_attn_fwd/bwd.cpp`)
+- `"fused attn configs not supported in ck_fused_attn fwd pass."` — config doesn't match any compiled CK tile kernel.
+- `"fused attn configs not supported in ck_fused_attn bwd pass."` — same for backward.
+- `"Invalid dtype in ck_fused_attn."` — bad dtype conversion.
+- `"Invalid bias_type in ck_fused_attn."` / `"Invalid bias_shape in ck_fused_attn."` — bias type/shape not recognized.
+- `"Invalid type for 16 bit.."` — `DISPATCH_DTYPE_16BIT` macro failure.
+
+### From HIP runtime
+- `hipError_t` from `NVTE_CHECK_CUDA(...)` wrapping CK calls — usually a kernel launch failure or illegal memory access.
+
+---
+
+## 6) Reproduce in controlled CK-only path
+
+### Path A: TE pytest with CK forced
+```bash
+# Force CK-only backend
+export NVTE_FLASH_ATTN=0
+export NVTE_FUSED_ATTN_AOTRITON=0
+# Full logging
+export NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1
+
+pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s -k "<filter>"
+```
+
+### Path B: AITER Python JIT (isolates from TE integration)
+1. Install aiter: `cd 3rdparty/aiter && pip install -e .`
+2. Use `3rdparty/aiter/op_tests/test_mha.py` or write a minimal reproducer.
+3. Call MHA functions directly (e.g. `mha_fwd`, `fmha_v3_fwd`).
+
+### Path C: Standalone C++ executables (maximum isolation)
+1. Build:
+   ```bash
+   cd 3rdparty/aiter/op_tests/cpp/mha
+   bash build_mha.sh fwd   # or: bwd, fwd_v3, bwd_v3, or no arg for all
+   ```
+2. Run with proper env:
+   ```bash
+   export LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}
+   export AITER_ASM_DIR=$(realpath ../../../hsa)  # or equivalent absolute path
+   ```
+3. Use `-?` flag to list all arguments.
+4. Example commands mapping to TE configs:
+   ```bash
+   # Forward: batch=4, heads=32, kv_heads=8, dim=128, seq=4096, causal, bf16
+   ./benchmark_mha_fwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 \
+     -iperm=1 -operm=1 -mask=1 -lse=1 -mode=0 -kname=1 -v=1
+
+   # Backward (same config)
+   ./benchmark_mha_bwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 \
+     -iperm=1 -operm=1 -mask=1 -mode=0 -kname=1 -v=1
+   ```
+5. Key argument mappings:
+   - `-iperm=1 -operm=1` → BSHD layout (TE default)
+   - `-iperm=0 -operm=0` → SBHD layout
+   - `-mask=0` → no mask, `-mask=1` → causal top-left, `-mask=2` → causal bottom-right
+   - `-mask=t:L,R` → SWA top-left, `-mask=b:L,R` → SWA bottom-right
+   - `-lse=1` → store LSE (TE always does this)
+   - `-mode=0` → batch mode, `-mode=1` → group/varlen mode
+   - `-bias=n` → no bias, `-bias=e` → elementwise, `-bias=a` → alibi
+   - `-fwd_v3=1` / `-bwd_v3=1` → use ASM v3 kernels
+   - `-v3_atomic_fp32=0|1` → bwd atomic precision
+
+---
+
+## 7) Decision tree: integration bug vs kernel bug
+
+### Case 1: Fails in TE, passes in standalone `benchmark_mha_{fwd,bwd}` with equivalent config
+→ **Likely TE integration bug**. Focus on:
+- Argument marshaling in `fused_attn_ck.cpp`: type conversions (`nvte_to_ck_dtype`, `nvte_to_ck_bias_type`, `set_ck_mask`), stride computation (`generateMatrixStrides`), workspace layout
+- Backend selection conditions in `fused_attn.cpp` — is the right config reaching CK?
+- Padding removal/addition logic (`remove_padding`, `add_padding`, `add_padding_softmax_lse`)
+- BSHD-to-THD conversion path (`bshd_to_thd`, `generate_cu_seqlen_padded`)
+
+### Case 2: Fails both in TE and standalone
+→ **Likely CK/AITER kernel issue** (or unsupported config). Produce a minimal standalone reproducer and hand off to AITER/CK team.
+
+### Case 3: Passes in TE only when fallback backend (AOTriton) is chosen
+→ **CK eligibility guard likely wrong**. Inspect filters in `is_ck_backend_supported`.
+
+### Case 4: Numerical mismatch (passes but wrong values)
+→ Compare CK output vs AOTriton output on same config. If CK-standalone also gives wrong values, kernel bug. If only TE-CK path gives wrong values, check:
+- Stride ordering (batch vs head vs seq strides differ between batched and varlen paths)
+- LSE storage format (padded vs unpadded, h×s_q vs s_q×h ordering)
+- Workspace buffer reuse / overlap
+- `NVTE_CK_ZERO_OUT_PAD` behavior
+
+---
+
+## 8) High-value integration checks
+
+When the failure is TE-side, verify these in `fused_attn_ck.cpp`:
+
+### Stride computation
+- `generateMatrixStrides` in `utils.cpp` computes 4-element strides `[batch, head, seq, dim]` for each matrix
+- Batched CK API (`ck_attn_fwd`) expects `stride_b, stride_h, stride_s` (3 strides, dim=1 implied)
+- Varlen CK API (`ck_attn_varlen_fwd`) expects `stride_h, stride_s` (2 strides, no batch stride)
+- When SBHD+padding triggers pad removal, the varlen strides are recomputed: `stride_h=q_stride[1]`, `stride_s=min(q_stride[0], q_stride[2])`
+
+### Workspace allocation
+- First call with `workspace==nullptr` queries size, second call executes
+- Workspace sections are allocated sequentially: alibi slopes → softmax LSE → Q/K/V/O without-padding buffers → cu_seqlen_padded
+- Mismatch between query and execute allocations causes `"Unexpected workspace_size"` errors
+
+### Type/mask/bias mapping
+- NVTE `CAUSAL_MASK` / `PADDING_CAUSAL_MASK` → CK `mask_top_left`
+- NVTE `CAUSAL_BOTTOM_RIGHT_MASK` / `PADDING_CAUSAL_BOTTOM_RIGHT_MASK` → CK `mask_bottom_right`
+- NVTE `NO_MASK` / `PADDING_MASK` with SWA `(>=0, >=0)` → CK `mask_bottom_right` (not `window_generic`)
+- NVTE `POST_SCALE_BIAS` → CK `elementwise_bias`
+- NVTE `ALIBI` → CK `alibi` (slope array auto-generated in workspace)
+
+### Backward-specific
+- `dq_acc_ptr` workspace for split-K accumulation: sized as `float * nsplits * h * max_tokens_q * d_qk`
+- `dk_expanded_ptr` / `dv_expanded_ptr` for GQA: expanded to full head count, then reduced
+- `dbias_expanded_ptr` → `dbias_ptr` reduction when bias dims differ
+- Deterministic mode always set to `false` (TODO in source)
+- LSE from forward must be passed correctly via `Aux_CTX_Tensors->tensors[0]`
+
+---
+
+## 9) Running TE tests for fused attention
+
+### Single test
+```bash
+NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0 \
+  pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s \
+  -k "model_name_and_params"
+```
+
+### Key test functions
+| Test | What it covers |
+|---|---|
+| `test_dot_product_attention` | Core fwd/bwd correctness across dtypes, models |
+| `test_dpa_mask` | All mask types |
+| `test_dpa_bias` / `test_dpa_bias_shapes` | Bias types and shapes |
+| `test_dpa_sliding_window` | SWA with different window sizes |
+| `test_dpa_alibi_slopes` | ALiBi attention bias |
+| `test_dpa_qkv_layout` | All QKV layout variants |
+| `test_dpa_qkv_layout_thd` | THD (ragged) layouts |
+| `test_dpa_qkv_layout_thd_mqa_gqa` | THD + MQA/GQA |
+| `test_gqa_mla_thd` | GQA/MLA with THD format, CK backend check |
+| `test_mha_fp8_vs_f16` / `test_dpa_fp8_vs_f16` | FP8 vs FP16 comparison |
+
+### CI invocation
+```bash
+# From ci/pytorch.sh — runs with specific backend
+ci/pytorch.sh  # uses TEST_LEVEL, TEST_SGPU, TEST_FILTER
+# Backend set via configure_fused_attn_env in ci/_utils.sh
+```
+
+---
+
+## 10) Common pitfalls
+
+1. **Stride mismatch between batched and varlen paths**: SBHD+padding triggers pad removal which changes the varlen stride computation. The `min(stride[0], stride[2])` logic can produce unexpected results for certain layouts.
+2. **Workspace size queried with different params than execute call**: Any change to config between the two calls will cause workspace size mismatch.
+3. **Treating unsupported config as runtime failure instead of eligibility failure**: If CK doesn't support a config, it should be caught by `is_ck_backend_supported`, not crash at kernel launch.
+4. **Missing backward-only failures**: Always test both fwd and bwd. Some configs work in fwd but fail in bwd (e.g., due to expanded gradient buffers in GQA).
+5. **Mismatch between TE-side defaults and standalone binary defaults**: TE always stores LSE (`-lse=1`), always uses `iperm=1 operm=1` for BSHD. Standalone defaults may differ.
+6. **Comparing non-equivalent configs across TE and standalone paths**: Ensure mask type, window size, dropout, and all flags match exactly.
+7. **v3 kernel fallback**: v3 ASM kernels support a narrower config range than CK fallback. If `NVTE_CK_USES_FWD_V3=1` but the config isn't supported by v3, the kernel wrapper falls back to CK tile. Check `CK_FUSED_ATTN_LOG_CONFIG=1` output for which kernel was actually selected.
+8. **cu_seqlen_padded generation**: For BSHD+padding→THD conversion, `generate_cu_seqlen_padded` creates synthetic padded seqlens. If actual padding pattern doesn't match assumptions, results will be wrong.
+9. **Build flag `USE_FUSED_ATTN_CK` not set**: If CK backend returns `false` for everything and `NVTE_LOG_CK_CONFIG` produces no output, check that the build included CK. The `is_ck_backend_supported` function returns `false` when compiled without `USE_FUSED_ATTN_CK`.
+
+---
+
+## 11) Output artifact requirements (always produce)
 
-## 3) Reproduce in controlled CK-only path
-Preferred path (AITER Python JIT):
-1. Start from `3rdparty/aiter/op_tests/test_mha.py` to reproduce through the same Python JIT interface used in many real flows.
-2. Add a minimal wrapper test (for example, `test_te_reproducer`) that pins only the failing TE config.
-3. Call the Python-level MHA functions directly (e.g. `mha_fwd` and `fmha_v3_fwd`).
-4. Record the exact test invocation, pinned parameters, and first failing log line.
-
-Secondary path (native executables for isolation/confirmation):
-1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
-2. Keep env explicit when running:
-   - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
-   - `AITER_ASM_DIR=$(realpath 3rdparty/aiter/hsa)` (or equivalent absolute path)
-3. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
-4. Re-encode the same failing config in `fwd.exe` / `bwd.exe` and compare behavior vs Python JIT.
-5. Keep in mind that TE always stores LSE, hence use `-lse=1`.
-6. Record full commands to include in handoff.
-
-## 4) Decision tree: integration bug vs kernel bug
-1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
-   - Likely TE integration bug.
-   - Focus on argument marshaling/normalization in:
-     - `fused_attn_ck.cpp`
-     - `ck_fused_attn_*`
-     - backend selection conditions in `fused_attn.cpp`
-
-2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`**
-   - Likely CK/AITER kernel issue (or unsupported config).
-   - Produce a minimal standalone reproducer command and hand off.
-
-3. **Passes in TE only when fallback backend is chosen**
-   - CK eligibility/selection guard likely wrong.
-   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
-
-## 5) High-value checks when it is integration-related
-- Verify all expected CK args are present and in the right order/type.
-- Check TE→CK conversions for:
-  - layout / strides
-  - sequence length semantics (`s_q` vs `s_kv`)
-  - grouped-query mapping
-  - mask/bias/dropout flags
-  - causal/windowing flags
-  - dtype/accumulator assumptions
-- Confirm no silent defaulting for missing fields.
-- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).
-
-## 6) Output artifact requirements (always produce)
 For each investigated failure, record:
-- TE reproducer summary (shapes, dtype, flags)
-- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
-- Classification: `integration` or `kernel`
-- Owning component and next action
 
-Suggested concise handoff format:
-- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
+**Concise handoff format:**
+- **Config:** `B=?, Sq=?, Skv=?, H=?, Hg=?, Dqk=?, Dv=?, dtype=?, layout=?, causal=?, dropout=?, mask=?, bias=?, window=?`
 - **TE result:** pass/fail + key error
 - **Standalone result:** pass/fail + key error
-- **Conclusion:** integration vs kernel
+- **Conclusion:** `integration` / `kernel` / `unsupported-config`
 - **Owner:** TE vs AITER/CK
 
-For more comprehensive output formatting, reference [TEMPLATE.md](TEMPLATE.md)
-
-## 7) Common pitfalls
-- Mismatch between TE-side defaults and standalone binary defaults.
-- Treating unsupported config as runtime failure instead of eligibility failure.
-- Comparing non-equivalent configs across TE and standalone paths.
-- Missing backward-only failures (always test both directions when applicable).
\ No newline at end of file
+For comprehensive output, reference [TEMPLATE.md](TEMPLATE.md).
diff --git a/.claude/skills/ck-debugging/TEMPLATE.md b/.claude/skills/ck-debugging/TEMPLATE.md
index 1cacdf24e..17c393f6d 100644
--- a/.claude/skills/ck-debugging/TEMPLATE.md
+++ b/.claude/skills/ck-debugging/TEMPLATE.md
@@ -5,7 +5,7 @@ Use this template when handing off a failure investigation to TE or AITER/CK own
 ---
 
 ## 1) Summary
-- **Classification:** `integration` | `kernel` | `unknown`
+- **Classification:** `integration` | `kernel` | `unsupported-config` | `unknown`
 - **Direction:** `fwd` | `bwd` | `both`
 
 ## 2) Environment
@@ -13,30 +13,39 @@ Use this template when handing off a failure investigation to TE or AITER/CK own
 - **AITER commit/submodule ref:**
 - **ROCm version:**
 - **GPU architecture (gfx):**
+- **Container image/tag (if applicable):**
 
 ## 3) Failing Configuration
 - **Batch (B):**
 - **Query seq (Sq):**
 - **KV seq (Skv):**
-- **Num heads (H):**
-- **Head dim (D):**
-- **DType(s):** fp16 / bf16 / fp8
-- **Causal:** true/false
+- **Num Q heads (H):**
+- **Num KV heads (Hg):**
+- **Head dim QK (Dqk):**
+- **Head dim V (Dv):**
+- **DType(s):** fp16 / bf16
+- **QKV Layout:** (e.g., `BSHD_BSHD_BSHD`, `BS3HD`, `THD_THD_THD`)
+- **Mask type:** (e.g., `CAUSAL_MASK`, `PADDING_CAUSAL_MASK`, `NO_MASK`)
+- **Bias type:** (e.g., `NO_BIAS`, `POST_SCALE_BIAS`, `ALIBI`)
 - **Dropout:**
-- **Mask/Bias mode:**
-- **Windowing/Alibi/Padding:**
+- **Window size:** `(left, right)`
 - **GQA/MQA details:**
 
 ## 4) TE Reproducer
 - **Backend intent:** CK only / auto / fallback allowed
 - **Command or test entrypoint:**
 - **Key env vars:**
+  ```bash
+  NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0
+  NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1
+  ```
 - **Observed result:** pass/fail
 - **First failing log line / error signature:**
+- **Backend actually selected (from log):**
 
-## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+## 5) Standalone AITER Reproducer (`benchmark_mha_fwd` / `benchmark_mha_bwd`)
 - **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
-- **Build command:**
+- **Build command:** `bash build_mha.sh` (or `bash build_mha.sh fwd` / `bwd`)
 - **Runtime env:**
 	- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
 	- `AITER_ASM_DIR=$(realpath ../../../hsa)`
@@ -46,12 +55,23 @@ Use this template when handing off a failure investigation to TE or AITER/CK own
 
 ## 6) Equivalence Check (TE vs Standalone)
 - **Are shape/dtype/flags exactly matched?** yes/no
+- **Layout mapping verified?** (`-iperm=1 -operm=1` for BSHD, etc.)
+- **LSE enabled?** (`-lse=1` — TE always stores LSE)
+- **v3 kernel flags matched?** (`NVTE_CK_USES_FWD_V3` → `-fwd_v3=`, etc.)
 - **Any default mismatch noticed?**
 - **Notes:**
 
 ## 7) Conclusion and Ownership
 - **Conclusion:** integration vs kernel vs unsupported-config
 - **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
+- **Specific area (if integration):**
+  - [ ] Stride computation (`generateMatrixStrides` / stride mapping to CK API)
+  - [ ] Type/mask/bias conversion (`nvte_to_ck_*`, `set_ck_mask`)
+  - [ ] Workspace allocation/layout
+  - [ ] Padding removal/addition
+  - [ ] Backend eligibility check (`is_ck_backend_supported`)
+  - [ ] Aux tensor handling (LSE, rng_state, bias in `Aux_CTX_Tensors`)
+  - [ ] Other: ___
 - **Requested next action:**
 
 ## 8) Artifacts
@@ -77,43 +97,51 @@ Use this template when handing off a failure investigation to TE or AITER/CK own
 - **Batch (B):** 4
 - **Query seq (Sq):** 4096
 - **KV seq (Skv):** 4096
-- **Num heads (H):** 32
-- **Head dim (D):** 128
+- **Num Q heads (H):** 32
+- **Num KV heads (Hg):** 8
+- **Head dim QK (Dqk):** 128
+- **Head dim V (Dv):** 128
 - **DType(s):** bf16
-- **Causal:** true
+- **QKV Layout:** `BSHD_BSHD_BSHD`
+- **Mask type:** `CAUSAL_MASK`
+- **Bias type:** `NO_BIAS`
 - **Dropout:** 0.0
-- **Mask/Bias mode:** causal mask only
-- **Windowing/Alibi/Padding:** none
-- **GQA/MQA details:** none
+- **Window size:** `(-1, 0)`
+- **GQA/MQA details:** GQA with ratio 4:1
 
 ## 4) TE Reproducer
 - **Backend intent:** CK only
-- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
-- **Key env vars:** CK backend forced; debug logging enabled
+- **Command or test entrypoint:** `pytest tests/pytorch/attention/test_attention.py::test_dot_product_attention -x -s -k "test_config"`
+- **Key env vars:** `NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN_AOTRITON=0 NVTE_LOG_CK_CONFIG=1`
 - **Observed result:** fail
 - **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`
+- **Backend actually selected (from log):** NVTE_CK
 
-## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+## 5) Standalone AITER Reproducer
 - **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
-- **Build command:** `./mha_build.sh`
+- **Build command:** `bash build_mha.sh bwd`
 - **Runtime env:**
 	- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
 	- `AITER_ASM_DIR=$(realpath ../../../hsa)`
 - **Exact standalone command(s):**
-	- `./bwd.exe <equivalent args>`
-	- `./fwd.exe <equivalent args>`
-- **Observed result:** pass (both)
+	- `./benchmark_mha_bwd -prec=bf16 -b=4 -h=32 -h_k=8 -d=128 -s=4096 -iperm=1 -operm=1 -mask=1 -mode=0 -kname=1 -v=1`
+- **Observed result:** pass
 - **First failing log line / error signature:** N/A
 
-## 6) Equivalence Check (TE vs Standalone)
+## 6) Equivalence Check
 - **Are shape/dtype/flags exactly matched?** yes
-- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
-- **Notes:** likely marshaling/normalization issue before CK call
+- **Layout mapping verified?** yes (`-iperm=1 -operm=1`)
+- **LSE enabled?** yes (implicit in bwd)
+- **v3 kernel flags matched?** yes (default v3 enabled)
+- **Any default mismatch noticed?** TE-side workspace alignment default differs from standalone path
+- **Notes:** Likely marshaling/normalization issue before CK call
 
 ## 7) Conclusion and Ownership
 - **Conclusion:** integration
 - **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
-- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path
+- **Specific area:**
+  - [x] Workspace allocation/layout
+- **Requested next action:** Inspect workspace-size and alignment mapping in TE→CK bwd path for GQA expanded gradients
 
 ## 8) Artifacts
 - **Logs attached:** `te_fail.log`, `standalone_pass.log`
diff --git a/CLAUDE.md b/CLAUDE.md
index a2c0ec041..b711be97e 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,57 +1,78 @@
 # Agent instructions for TransformerEngine (ROCm fork)
 
-## Using Docker containers
-- We generally work in Docker containers for reproducibility.
-- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
+## Docker containers
+- We work in Docker containers for reproducibility.
+- Run build/test commands **only** inside the designated container (not on host).
 - If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
-- Before debugging, record runtime context in notes/logs:
-  - container image/tag
-  - ROCm version in container
-  - GPU architecture visible in container
-  - TE commit/submodule state
+- Prefer editable installs (`pip install -e .`).
+- Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
 - If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
 
-## Big picture
-- This repo builds **one core C++/HIP library** plus optional framework bindings:
-  - core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
-  - PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
-  - JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
-- Python import flow is split:
-  - top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
-  - `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
-- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
+## Architecture
+- One core C++/HIP library + optional framework bindings:
+  - core: `transformer_engine/common` → `libtransformer_engine.so`
+  - PyTorch: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow:
+  - framework selection: `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` = `pytorch|jax|all|none`)
+  - `.so` loading: `transformer_engine/common/__init__.py` (`load_framework_extension`)
+- Build orchestration: `setup.py` + `build_tools/*.py` + CMake.
   - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.
 
-## Platform/backends
-- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
-- Fused attention backends are runtime/compile-time gated by env vars:
-  - `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
-- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
-- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.
-
-## Developer workflows you should follow
-- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
-- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
-- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
-- CI-style framework test entrypoints are shell scripts, not a single pytest command:
-  - PyTorch: `ci/pytorch.sh`
-  - JAX: `ci/jax.sh`
-  - They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
-- Lint/format workflow is repo-specific:
-  - local formatting: `qa/format.sh` (pre-commit hooks)
-  - cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`
-
-## Code conventions and change boundaries
-- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
-- Preserve dual-platform structure when modifying kernels/build logic:
-  - shared sources are often `.cu` then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
-  - never edit HIP files directly -- instead, edit the CUDA source and let the build system generate HIP variants.
-- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
-- Respect existing tooling/style:
-  - Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
-  - C/C++ style checked by cpplint and `.clang-format`
-
-## Practical pointers for AI agents
-- If import fails with missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
-- If framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
-- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.
+## Hipify convention
+The build auto-generates `*_hip.cpp` files from CUDA `.cpp` sources via `hipify_torch`. These are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit `*_hip.cpp` files directly** — edit the CUDA source instead.
+
+The following directories are **excluded** from hipify (native ROCm code — edit directly):
+- `transformer_engine/common/ck_fused_attn/` — CK kernel wrappers
+- `transformer_engine/common/amd_detail/` — AMD-specific utilities
+- `transformer_engine/common/rocshmem_api/` — ROCshmem wrappers
+
+Framework bindings (`pytorch/csrc`, `jax/csrc`) are hipified separately via `build_tools/pytorch.py` and `build_tools/jax.py`.
+
+## Fused attention backends
+Backends are gated by env vars (set to `0` to disable, unset or `1` to enable):
+
+| Env var | Controls |
+|---|---|
+| `NVTE_FUSED_ATTN` | Master toggle for all fused attention |
+| `NVTE_FUSED_ATTN_CK` | CK backend |
+| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend |
+| `NVTE_FLASH_ATTN` | Flash attention |
+
+CI backend configs (`ci/_utils.sh::configure_fused_attn_env`): `auto`, `ck`, `aotriton`, `flash`, `unfused`.
+
+### ROCm fused-attn file layout
+- **Runtime backend selection/dispatch**: `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` (hipified)
+- **CK dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` (hipified)
+- **AOTriton dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` (hipified)
+- **CK kernel wrappers** (native, not hipified):
+  - `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp`
+  - `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp`
+
+### Debug logging env vars
+- `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL={0,1,2}` — Python-level attention debug output
+- `NVTE_LOG_FUSED_ATTN_CONFIG=1` — C++ backend selection logging
+- `NVTE_LOG_CK_CONFIG=1` — CK-specific config logging
+- `NVTE_LOG_AOTRITON_CONFIG=1` — AOTriton-specific config logging
+- `CK_FUSED_ATTN_LOG_CONFIG=1` — CK kernel wrapper logging
+
+## Developer workflows
+- Always init submodules first: `git submodule update --init --recursive`.
+- Source install: `pip install . --no-build-isolation`.
+- C++ tests: `qa/L0_cppunittest/test.sh` or `ci/core.sh`.
+- Framework CI tests (shell scripts, not bare pytest):
+  - PyTorch: `ci/pytorch.sh` | JAX: `ci/jax.sh`
+  - Control via `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` (from `ci/_utils.sh`).
+- Lint/format: `qa/format.sh` (pre-commit), `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`.
+
+## Code conventions
+- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
+- Keep env-var behavior stable; tests toggle flags intentionally.
+- Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
+
+## Troubleshooting pointers
+- **Missing `.so` on import**: check path resolution in `transformer_engine/common/__init__.py`.
+- **Framework extension won't build on ROCm**: check `build_tools/utils.py::get_frameworks()`.
+- **Fused-attn regression**: reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`).
+- **CK/AITER kernel failures**: use the `ck-debugging` skill for structured triage and isolation.

From 7838ef93dcce18521e132c95de68c2abff9048be Mon Sep 17 00:00:00 2001
From: Meekail Zain <zainmeekail@gmail.com>
Date: Thu, 26 Mar 2026 13:10:33 -0500
Subject: [PATCH 3/3] Added IFU skill

---
 .claude/skills/ifu-merge/SKILL.md | 227 ++++++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 .claude/skills/ifu-merge/SKILL.md

diff --git a/.claude/skills/ifu-merge/SKILL.md b/.claude/skills/ifu-merge/SKILL.md
new file mode 100644
index 000000000..1abb72336
--- /dev/null
+++ b/.claude/skills/ifu-merge/SKILL.md
@@ -0,0 +1,227 @@
+---
+name: ifu-merge
+description: >
+  Guide for performing IFU (Internal Feature Update) merges on the TransformerEngine ROCm fork.
+  This skill walks through resolving merge conflicts between upstream NVIDIA TransformerEngine
+  and the AMD ROCm fork, then fixing build and runtime errors systematically. Use this skill
+  whenever the user mentions IFU, merge upstream, merge NVIDIA, upstream merge,
+  IFU merge, internal feature update, or is working on integrating new NVIDIA TE changes into
+  the ROCm fork — even if they don't use the term "IFU" explicitly.
+---
+
+# TransformerEngine ROCm — IFU Merge Guide
+
+An IFU (Internal Feature Update) merges upstream NVIDIA TransformerEngine commits into the AMD ROCm fork. The merge is often committed with conflict markers still in place, then conflicts are resolved in follow-up commits.
+
+This guide captures generalized strategies for resolving these merges.
+
+---
+
+## Strategy 1: Conflict Resolution Mental Model
+
+Every conflict falls into one of these categories. Recognizing the category tells you how to resolve it:
+
+| Category | Recognition | Resolution |
+|---|---|---|
+| **API Refactor** | Upstream restructured function signatures, parameter passing, or return values | Take upstream's new API structure; integrate ROCm-specific fields into the new pattern |
+| **Feature Addition** | Both sides added new parameters, attributes, docstrings, or code paths | Keep both — upstream features and ROCm features are independent and coexist |
+| **Guard Removal** | Upstream removed `#ifdef`/`#ifndef` platform guards (they don't need them) | Re-add the guards — they protect ROCm-specific code paths |
+| **Feature Support Divergence** | Upstream adds code for features not yet supported on ROCm | Guard with `#ifndef USE_ROCM`; verify current ROCm support status first — features may become supported over time |
+| **Convention Change** | Upstream changed a data format, tensor shape, or API contract (e.g., different tensor dimensions, new enum values, reordered arguments) | Adopt the new convention everywhere — then find and update every downstream consumer, including test code. The compiler won't catch these; they cause silent wrong results or cryptic runtime errors |
+| **Copyright/Whitespace** | Trivial formatting or copyright header differences | Keep ROCm (AMD) copyright headers; take upstream for whitespace |
+
+**Key principle**: When upstream refactors a pattern (e.g., bundling arguments into a tuple), always adopt the new pattern and adapt ROCm-specific fields into it. Fighting the upstream structure creates maintenance debt.
+
+**Watch for paired sites**: If upstream refactors a function signature, there are always at least TWO places to update — where values are packed into the new structure and where they are unpacked. Missing one causes silent bugs.
+
+---
+
+## Strategy 2: Understanding the Hipify Pipeline
+
+PyTorch's hipify tool automatically converts CUDA source files to HIP. Understanding what it does and doesn't convert is essential.
+
+**What hipify converts:**
+- `#include` directives (e.g., `cuda.h` -> `hip/hip_runtime.h`)
+- File extensions in generated output (`.cu` -> `.hip`, `.cpp` -> `_hip.cpp`)
+
+**What hipify does NOT convert:**
+- C++ namespace references (e.g., `c10::cuda::CUDAGuard` stays as-is)
+- Semantic differences between CUDA and HIP APIs (e.g., warp size, mask types)
+- ROCm-specific device behavior (e.g., tensor device masquerading)
+
+**What hipify preserves faithfully:**
+- Preprocessor guards (`#ifndef USE_ROCM`, `#ifdef __HIP_PLATFORM_AMD__`). This means adding guards to source `.cpp` files will propagate into the generated `_hip.cpp` output. Use this to exclude CUDA-only code paths from ROCm builds.
+
+**Rules that follow:**
+- Never edit `*_hip.cpp` or `.hip` files — they are regenerated from source files
+- Always edit the source `.cpp`/`.cu` files — hipify will process them
+- When you see a build error in a `_hip.cpp` file, fix the corresponding source file
+- Use conditional compilation to handle namespace/API differences that hipify can't resolve
+
+**Platform guard conventions by code layer:**
+
+| Layer | Guard Macro | Example |
+|---|---|---|
+| PyTorch CSRC (`.cpp` source files) | `#ifdef USE_ROCM` / `#ifndef USE_ROCM` | DeviceGuard, scale swizzling |
+| Common layer (`.cu` files that get hipified) | `#ifdef __HIP_PLATFORM_AMD__` | Warp masks, kernel dispatch |
+| Python code | `IS_HIP_EXTENSION` (from `torch.utils.cpp_extension`) | Workspace sizing, feature flags |
+
+Using the wrong guard macro for a given layer is a common source of subtle bugs — the code compiles but the guard doesn't activate correctly.
+
+---
+
+## Strategy 3: Auditing Non-Conflicting Changes
+
+**This is the single most important strategy and the one most often skipped.** Conflict markers are visible and localized. The real danger lies in changes that merge cleanly — upstream modifications that `git merge` silently accepts because the ROCm fork didn't touch those exact lines. A removed platform guard, a changed convention, a new CUDA-only code path — these slip through without any marker.
+
+**Five patterns to watch for:**
+
+1. **Guard Removal**: Upstream deletes platform guards because they don't need them. If ROCm didn't modify those exact lines, the merge silently removes the guards.
+
+2. **New CUDA-only Code**: Upstream adds new functions or code paths using CUDA-only APIs. These compile fine on CUDA but fail on ROCm.
+
+3. **New Upstream Files**: Upstream introduces entirely new files (e.g., `swizzle.cpp`) that have no ROCm guards at all. These files get hipified and compiled on ROCm, potentially referencing CUDA-only symbols. Check every new file for functions that should be guarded.
+
+4. **Refactored Utilities**: Upstream moves or rewrites utility functions that had ROCm-specific behavior. The new version may lack the ROCm-specific handling, and the fork's old copy becomes stale.
+
+5. **Convention Changes**: Upstream changes a data format, tensor shape, or API contract without any code conflict. Every downstream consumer of that convention must be updated manually — the compiler won't catch these.
+
+**How to systematically audit:**
+
+```bash
+# Get the merge parents
+git cat-file -p <merge-commit>
+
+# See ALL upstream changes (not just conflicting ones)
+git diff <rocm-parent>..<upstream-parent> --stat
+
+# Check for removed guards
+git diff <rocm-parent>..<upstream-parent> -- <file> | grep -E "^-.*(__HIP_PLATFORM_AMD__|USE_ROCM|IS_HIP_EXTENSION)"
+
+# Find entirely new files from upstream
+git diff <rocm-parent>..<upstream-parent> --diff-filter=A --name-only
+
+# Check CMakeLists.txt for source list changes
+git diff <rocm-parent>..<upstream-parent> -- "*/CMakeLists.txt"
+```
+
+**What to look for in each changed file:**
+- Removed platform guards (`USE_ROCM`, `__HIP_PLATFORM_AMD__`, `IS_HIP_EXTENSION`)
+- New CUDA API usage (e.g., `CUDAGuard`, cuBLAS/cuDNN calls) without ROCm equivalents
+- Refactored utility functions that previously had ROCm-specific behavior
+- Changes to CMakeLists.txt source file lists — files moved between `if(USE_CUDA)` and shared lists, or vice versa
+- Changes to workspace allocation, scale computation, or hardware-specific sizing
+- Convention changes in tensor shapes, data formats, or function contracts
+
+**What to look for in CMakeLists.txt specifically:**
+- Files that should be in `if(USE_CUDA)` but ended up in the shared source list (causes ROCm compilation of CUDA-only code)
+- Files that were in `if(USE_CUDA)` on `dev` but were moved during conflict resolution
+- New source files from upstream that need to be placed in the correct list
+
+---
+
+## Strategy 4: The Guard Dependency Contract
+
+Platform guards form a dependency graph. If you guard an implementation, every unguarded call site becomes a dangling reference. If you remove a guard, the implementation must work on both platforms.
+
+**The rule**: guarded implementations require guarded call sites, and unguarded call sites require unguarded implementations.
+
+When you fix an issue by adding a guard (e.g., moving a `.cu` file into `if(USE_CUDA)`), trace all references to the symbols it defines:
+1. Search for every call site of the guarded function
+2. Verify each call site is also guarded, OR has an alternative code path for ROCm
+3. Check header files — the declaration may still be visible even if the implementation is guarded, causing linker errors instead of compile errors
+4. Check pybind registrations — a Python-exposed function that calls a guarded symbol will fail at import time
+
+A common pattern: the implementation is guarded in one file, a second call site is properly guarded, but a third call site in a different file (especially a new upstream file) is not. The fix for the implementation creates the linker error at the third site.
+
+---
+
+## Strategy 5: Iterative Build-Test-Fix
+
+After conflict resolution and auditing, use an iterative loop:
+
+1. **Build** — compile the project
+2. **Categorize the error** — missing include? undeclared identifier? duplicate symbol? preprocessor guard issue? API incompatibility? undefined symbol at link time?
+3. **Find the root cause, not the symptom** — when a CUDA-only file fails to compile on ROCm due to a type mismatch, the surface fix is to change the type. But the root cause may be that the file shouldn't compile on ROCm at all. Always ask: "Is the right fix to make this code work here, or to prevent it from running here?"
+4. **Apply the appropriate fix pattern** — update the source code with the appropriate fix, guarding as needed
+5. **Trace the dependency chain** — after applying a fix, ask "what depends on what I just changed?" and verify both directions
+6. **Rebuild** and repeat until clean
+
+**After a clean build, test in order of increasing scope:**
+1. Basic module import — catches missing symbols, broken dynamic linking
+2. Core operations (GEMM, normalization) — catches API mismatches, incorrect workspace sizing
+3. Higher-level tests (attention, transformer layers) — catches integration issues
+4. Full test suites — catches edge cases
+
+Each phase catches a different class of error, and errors from earlier phases are much cheaper to fix.
+
+**Runtime errors require different investigation than build errors.** A common pattern: upstream refactors a Python function and the new version works on CUDA but uses different hardware-specific parameters (workspace sizes, alignment requirements, feature flags). These only manifest at runtime with specific configurations (dtype, dimensions, GPU architecture).
+
+**When investigating "works on dev, fails here"**: the critical question isn't "what's different?" — it's "why did it work on dev at all?" A test might pass not because the feature is supported, but because a chain of incidental conditions causes it to be skipped. Understanding the mechanism of "working" is essential to understanding when it will break.
+
+---
+
+## Strategy 6: Stale Fork-Specific Code
+
+Over successive IFUs, the ROCm fork accumulates its own implementations of functions that upstream later refactors or relocates. When upstream moves a function from file A to a new file B, the merge brings in file B but doesn't remove the fork's copy in file A. Now you have two definitions in the same namespace — a potential duplicate symbol error on CUDA, and confusion about which version is canonical.
+
+**After each IFU:**
+1. Check if the ROCm fork has files that don't exist on upstream (e.g., `util.cpp` with `#ifndef USE_ROCM`)
+2. For each fork-specific file, check if upstream added equivalent functionality in a different file
+3. If upstream's new file supersedes the fork's copy, verify the fork's copy is either removed or properly isolated
+
+---
+
+## Strategy 7: ROCm vs CUDA Feature Parity
+
+Not all CUDA features are available on ROCm. The set of supported features evolves over time as AMD adds capabilities. When resolving an IFU:
+
+1. **Check current support** before deciding how to guard new upstream code
+2. **Look at existing codebase patterns** — if similar code is already guarded elsewhere, follow the same pattern
+3. **When in doubt, guard conservatively** and add a TODO comment indicating the feature may become available
+4. **Ask the developer** if you're unsure about the current support status of a specific feature
+
+---
+
+## Workflow
+
+### Step 1: Setup
+
+```bash
+git cat-file -p <merge-commit>  # Identify parent1 (ROCm) and parent2 (upstream)
+grep -rn "^<<<<<<<\|^=======\|^>>>>>>>" . \
+  --include="*.py" --include="*.cpp" --include="*.cu" \
+  --include="*.h" --include="*.cuh"  # Count remaining conflicts
+```
+
+### Step 2: Resolve Conflict Markers
+
+Apply Strategy 1 (conflict resolution mental model) to each file. Verify zero markers remain.
+
+### Step 3: Audit Non-Conflicting Changes
+
+Apply Strategy 3 — audit ALL upstream changes, not just conflicting ones. Check new files, removed guards, convention changes, and CMakeLists.txt source list placement. This step is most often skipped and most often the source of post-merge failures.
+
+### Step 4: Build and Fix Iteratively
+
+Apply Strategy 5. Python syntax check first (`python -m py_compile`), then C++ build, then runtime tests. For every fix, apply Strategy 4 — trace the guard dependency chain to completion before rebuilding.
+
+### Step 5: Run Tests
+
+Test in order of increasing scope per Strategy 5. Investigate runtime failures by checking whether upstream changed hardware-specific parameters, utility functions, or data format conventions.
+
+### Step 6: Clean Up Stale Code
+
+Apply Strategy 6. Check fork-specific files for functions that upstream has superseded in new files.
+
+---
+
+## Hard Rules
+
+- **Never edit `*_hip.cpp` files** — they are auto-generated by hipify
+- **Always edit source `.cpp`/`.cu` files** — hipify will process them
+- **Use the correct platform guard for each layer** (see Strategy 2)
+- **Guard unsupported features** — check current ROCm support status; guard conservatively if unsure
+- **Audit all upstream changes** — not just the ones that caused conflicts
+- **Trace the guard dependency chain** — guarding an implementation requires guarding every call site (see Strategy 4)
+- **Test failures are informational, not directional** — trace backward from the failure to find the violated invariant; resist the urge to fix the test