Added initial AI Agent instructions

Micky774 · Micky774 · commit 8a8ea813678a · 2026-03-05T12:53:04.000-06:00
diff --git a/.claude/skills/ck-debugging/SKILL.md b/.claude/skills/ck-debugging/SKILL.md
@@ -0,0 +1,103 @@
+---
+name: ck-debugging
+description: Triage, investigate, and debug CK/AITER fused-attention failures in TransformerEngine, isolating each one as an integration issue or a kernel issue.
+---
+
+# CK Fused Attention Debugging Guide (TransformerEngine, ROCm)
+
+Use this playbook to quickly answer one question:
+**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**
+
+## 1) Map the integration surface first
+- Build-time CK args parsing/validation:
+  - `transformer_engine/common/CMakeLists.txt`
+  - `tools/check_aiter_mha_args_usage.py`
+- CK fused-attn kernel wrappers/entry points:
+  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
+- CK backend preprocessing and dispatch glue:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
+- Runtime backend selection / fallback path:
+  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`
+
+## 2) Gather minimum reproducibility context (before changing code)
+Capture these from logs or user report:
+- Forward vs backward failure (`fwd` / `bwd`)
+- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
+- Data type(s): fp16/bf16/fp8
+- Mask/dropout/causal/windowing/alibi/padding settings
+- GQA/MQA/group mode details if used
+- GPU architecture + ROCm version + TE commit
+- Whether fallback backend succeeds
+
+When collecting logs yourself (for example, by rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`.
+
+If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...` (then issue `run` and, after the crash, collect the backtrace with `bt`).
+
+If config info is incomplete, request it first; otherwise debugging is noisy and slow.
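+
+As a rough sketch of the logging step above, the Python helper below adds the three logging variables to a copy of the environment and renders the matching one-line shell invocation for notes. The names `LOG_ENV_VARS`, `logged_env`, and `as_shell_line` are hypothetical and not part of TE or AITER; only the variable names come from this guide.
+
+```python
+import os
+import shlex
+
+# Logging variables named in this guide; each is enabled by setting it to "1".
+LOG_ENV_VARS = (
+    "NVTE_LOG_FUSED_ATTN_CONFIG",
+    "NVTE_LOG_CK_CONFIG",
+    "CK_FUSED_ATTN_LOG_CONFIG",
+)
+
+
+def logged_env(base=None):
+    """Return a copy of the environment with full config logging enabled."""
+    env = dict(os.environ if base is None else base)
+    for name in LOG_ENV_VARS:
+        env[name] = "1"
+    return env
+
+
+def as_shell_line(test_command):
+    """Render the equivalent one-line shell invocation for handoff notes."""
+    prefix = " ".join(f"{name}=1" for name in LOG_ENV_VARS)
+    return f"{prefix} {shlex.join(test_command)}"
+```
+
+Passing `env=logged_env()` to `subprocess.run` would rerun a test with logging enabled without modifying the parent shell environment.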
+
+## 3) Reproduce in a controlled CK-only path
+Preferred path (AITER Python JIT):
+1. Start from `3rdparty/aiter/op_tests/test_mha.py` to reproduce through the same Python JIT interface used in many real flows.
+2. Add a minimal wrapper test (for example, `test_te_reproducer`) that pins only the failing TE config.
+3. Call the Python-level MHA functions directly (e.g. `mha_fwd` and `fmha_v3_fwd`).
+4. Record the exact test invocation, pinned parameters, and first failing log line.
+
+Secondary path (native executables for isolation/confirmation):
+1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
+2. Keep env explicit when running:
+   - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+   - `AITER_ASM_DIR=$(realpath 3rdparty/aiter/hsa)` (or equivalent absolute path)
+3. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
+4. Re-encode the same failing config in `fwd.exe` / `bwd.exe` and compare behavior vs Python JIT.
+5. TE always stores LSE, so pass `-lse=1` to keep the standalone run equivalent.
+6. Record full commands to include in handoff.
+
+## 4) Decision tree: integration bug vs kernel bug
+1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
+   - Likely TE integration bug.
+   - Focus on argument marshaling/normalization in:
+     - `fused_attn_ck.cpp`
+     - `ck_fused_attn_*`
+     - backend selection conditions in `fused_attn.cpp`
+
+2. **Fails in both TE and standalone `fwd.exe`/`bwd.exe`**
+   - Likely CK/AITER kernel issue (or unsupported config).
+   - Produce a minimal standalone reproducer command and hand off.
+
+3. **Passes in TE only when the fallback backend is chosen**
+   - The CK eligibility/selection guard is likely wrong.
+   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
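+
+The decision tree above can be sketched as a small function. This is only an illustration: the function name `classify`, its parameters, and the returned labels are hypothetical, and combining case 3 with cases 1 and 2 as an extra finding is an assumption rather than something this guide prescribes.
+
+```python
+from typing import List, Optional
+
+
+def classify(te_ck_ok: bool, standalone_ok: bool,
+             te_fallback_ok: Optional[bool] = None) -> List[str]:
+    """Map pass/fail observations to the decision-tree outcomes above."""
+    if te_ck_ok:
+        # Nothing to classify: the CK path in TE did not fail.
+        return ["no-ck-failure"]
+    # Cases 1 and 2: does the equivalent standalone fwd.exe/bwd.exe run pass?
+    findings = ["integration" if standalone_ok else "kernel-or-unsupported-config"]
+    # Case 3: TE passes only via the fallback backend, so check the guard.
+    if te_fallback_ok:
+        findings.append("inspect-selection-guard")
+    return findings
+```
+
+For example, `classify(False, True)` yields `["integration"]`, which corresponds to case 1.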
+
+## 5) High-value checks when the failure is integration-related
+- Verify all expected CK args are present and in the right order/type.
+- Check TE→CK conversions for:
+  - layout / strides
+  - sequence length semantics (`s_q` vs `s_kv`)
+  - grouped-query mapping
+  - mask/bias/dropout flags
+  - causal/windowing flags
+  - dtype/accumulator assumptions
+- Confirm no silent defaulting for missing fields.
+- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).
+
+## 6) Output artifact requirements (always produce)
+For each investigated failure, record:
+- TE reproducer summary (shapes, dtype, flags)
+- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
+- Classification: `integration`, `kernel`, or `unknown` (matching the options in TEMPLATE.md)
+- Owning component and next action
+
+Suggested concise handoff format:
+- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
+- **TE result:** pass/fail + key error
+- **Standalone result:** pass/fail + key error
+- **Conclusion:** integration vs kernel
+- **Owner:** TE vs AITER/CK
+
+For a more comprehensive handoff format, see [TEMPLATE.md](TEMPLATE.md).
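+
+A minimal sketch of producing the concise handoff format programmatically is shown below. The `Handoff` dataclass and its field names are hypothetical; the five bullet labels are the ones listed above.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Handoff:
+    """Hypothetical container for the five concise handoff fields."""
+    config: str
+    te_result: str
+    standalone_result: str
+    conclusion: str
+    owner: str
+
+    def render(self) -> str:
+        """Return the handoff as Markdown bullets in the order used above."""
+        rows = [
+            ("Config", f"`{self.config}`"),
+            ("TE result", self.te_result),
+            ("Standalone result", self.standalone_result),
+            ("Conclusion", self.conclusion),
+            ("Owner", self.owner),
+        ]
+        return "\n".join(f"- **{label}:** {value}" for label, value in rows)
+```
+
+Calling `render()` on a populated instance returns five bullet lines that can be pasted directly into an issue or handoff note.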
+
+## 7) Common pitfalls
+- Mismatch between TE-side defaults and standalone binary defaults.
+- Treating an unsupported config as a runtime failure instead of an eligibility failure.
+- Comparing non-equivalent configs across TE and standalone paths.
+- Missing backward-only failures (always test both directions when applicable).
diff --git a/.claude/skills/ck-debugging/TEMPLATE.md b/.claude/skills/ck-debugging/TEMPLATE.md
@@ -0,0 +1,121 @@
+# CK/AITER Fused-Attn Debug Handoff Template
+
+Use this template when handing off a failure investigation to TE or AITER/CK owners.
+
+---
+
+## 1) Summary
+- **Classification:** `integration` | `kernel` | `unknown`
+- **Direction:** `fwd` | `bwd` | `both`
+
+## 2) Environment
+- **TE commit:**
+- **AITER commit/submodule ref:**
+- **ROCm version:**
+- **GPU architecture (gfx):**
+
+## 3) Failing Configuration
+- **Batch (B):**
+- **Query seq (Sq):**
+- **KV seq (Skv):**
+- **Num heads (H):**
+- **Head dim (D):**
+- **DType(s):** fp16 / bf16 / fp8
+- **Causal:** true/false
+- **Dropout:**
+- **Mask/Bias mode:**
+- **Windowing/Alibi/Padding:**
+- **GQA/MQA details:**
+
+## 4) TE Reproducer
+- **Backend intent:** CK only / auto / fallback allowed
+- **Command or test entrypoint:**
+- **Key env vars:**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:**
+- **Runtime env:**
+  - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+- **Observed result:** pass/fail
+- **First failing log line / error signature:**
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes/no
+- **Any default mismatch noticed?**
+- **Notes:**
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration vs kernel vs unsupported-config
+- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
+- **Requested next action:**
+
+## 8) Artifacts
+- **Logs attached:**
+- **Minimal reproducer commands attached:**
+- **Patch/commit links (if any):**
+
+---
+
+# Example (Filled)
+
+## 1) Summary
+- **Classification:** `integration`
+- **Direction:** `bwd`
+
+## 2) Environment
+- **TE commit:** `abc1234`
+- **AITER commit/submodule ref:** `def5678`
+- **ROCm version:** 6.2.1
+- **GPU architecture (gfx):** gfx942
+
+## 3) Failing Configuration
+- **Batch (B):** 4
+- **Query seq (Sq):** 4096
+- **KV seq (Skv):** 4096
+- **Num heads (H):** 32
+- **Head dim (D):** 128
+- **DType(s):** bf16
+- **Causal:** true
+- **Dropout:** 0.0
+- **Mask/Bias mode:** causal mask only
+- **Windowing/Alibi/Padding:** none
+- **GQA/MQA details:** none
+
+## 4) TE Reproducer
+- **Backend intent:** CK only
+- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
+- **Key env vars:** CK backend forced; debug logging enabled
+- **Observed result:** fail
+- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`
+
+## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
+- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
+- **Build command:** `./mha_build.sh`
+- **Runtime env:**
+  - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
+  - `AITER_ASM_DIR=$(realpath ../../../hsa)`
+- **Exact standalone command(s):**
+  - `./bwd.exe <equivalent args>`
+  - `./fwd.exe <equivalent args>`
+- **Observed result:** pass (both)
+- **First failing log line / error signature:** N/A
+
+## 6) Equivalence Check (TE vs Standalone)
+- **Are shape/dtype/flags exactly matched?** yes
+- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
+- **Notes:** likely marshaling/normalization issue before CK call
+
+## 7) Conclusion and Ownership
+- **Conclusion:** integration
+- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
+- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path
+
+## 8) Artifacts
+- **Logs attached:** `te_fail.log`, `standalone_pass.log`
+- **Minimal reproducer commands attached:** yes
+- **Patch/commit links (if any):** none
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,57 @@
+# Agent instructions for TransformerEngine (ROCm fork)
+
+## Using Docker containers
+- We generally work in Docker containers for reproducibility.
+- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
+- If the container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
+- Before debugging, record runtime context in notes/logs:
+  - container image/tag
+  - ROCm version in container
+  - GPU architecture visible in container
+  - TE commit/submodule state
+- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
+
+## Big picture
+- This repo builds **one core C++/HIP library** plus optional framework bindings:
+  - core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
+  - PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow is split:
+  - top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
+  - `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
+- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
+  - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+
+## Platform/backends
+- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
+- Fused attention backends are runtime/compile-time gated by env vars:
+  - `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
+- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
+- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.
+
+## Developer workflows you should follow
+- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
+- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
+- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
+- CI-style framework test entrypoints are shell scripts, not a single pytest command:
+  - PyTorch: `ci/pytorch.sh`
+  - JAX: `ci/jax.sh`
+  - They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
+- Lint/format workflow is repo-specific:
+  - local formatting: `qa/format.sh` (pre-commit hooks)
+  - cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`
+
+## Code conventions and change boundaries
+- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
+- Preserve dual-platform structure when modifying kernels/build logic:
+  - shared sources are often `.cu` then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
+  - never edit HIP files directly -- instead, edit the CUDA source and let the build system generate HIP variants.
+- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
+- Respect existing tooling/style:
+  - Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
+  - C/C++ style checked by cpplint and `.clang-format`
+
+## Practical pointers for AI agents
+- If import fails with a missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
+- If a framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
+- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.