Commit 8a8ea81

Added initial AI Agent instructions
1 parent 78850fd commit 8a8ea81

3 files changed

+281
-0
lines changed

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
---
2+
name: ck-debugging
3+
description: Triage, investigate, debug, and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues.
4+
---
5+
6+
# CK Fused Attention Debugging Guide (TransformerEngine, ROCm)
7+
8+
Use this playbook to quickly answer one question:
9+
**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?**
10+
11+
## 1) Map the integration surface first
12+
- Build-time CK args parsing/validation:
13+
  - `transformer_engine/common/CMakeLists.txt`
14+
  - `tools/check_aiter_mha_args_usage.py`
15+
- CK fused-attn kernel wrappers/entry points:
16+
  - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*`
17+
- CK backend preprocessing and dispatch glue:
18+
  - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp`
19+
- Runtime backend selection / fallback path:
20+
  - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp`
21+
22+
## 2) Gather minimum reproducibility context (before changing code)
23+
Capture these from logs or user report:
24+
- Forward vs backward failure (`fwd` / `bwd`)
25+
- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim
26+
- Data type(s): fp16/bf16/fp8
27+
- Mask/dropout/causal/windowing/alibi/padding settings
28+
- GQA/MQA/group mode details if used
29+
- GPU architecture + ROCm version + TE commit
30+
- Whether fallback backend succeeds
31+
32+
When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`.
33+
34+
If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...`, then issue `run` inside the debugger and collect the output of `bt` once it stops.
35+
36+
If config info is incomplete, request it first; otherwise debugging is noisy and slow.
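
As a minimal sketch of the logging step above, the following Python helper builds an environment with the three config-logging variables named in this guide and reruns a test command under it. The helper names (`logging_env`, `rerun_with_logging`) are hypothetical and not part of TE; the test command is whatever failed in the original report.

```python
import os
import subprocess

# Config-logging variables named in this guide.
LOG_VARS = (
    "NVTE_LOG_FUSED_ATTN_CONFIG",
    "NVTE_LOG_CK_CONFIG",
    "CK_FUSED_ATTN_LOG_CONFIG",
)


def logging_env(base_env=None):
    """Return a copy of the environment with all config logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    for name in LOG_VARS:
        env[name] = "1"
    return env


def rerun_with_logging(test_command, log_path):
    """Run test_command (a list of argv strings); save stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            test_command, env=logging_env(), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Keeping the log in a file makes it easy to quote the first failing line in the handoff.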
37+
38+
## 3) Reproduce in controlled CK-only path
39+
Preferred path (AITER Python JIT):
40+
1. Start from `3rdparty/aiter/op_tests/test_mha.py` to reproduce through the same Python JIT interface used in many real flows.
41+
2. Add a minimal wrapper test (for example, `test_te_reproducer`) that pins only the failing TE config.
42+
3. Call the Python-level MHA functions directly (e.g. `mha_fwd` and `fmha_v3_fwd`).
43+
4. Record the exact test invocation, pinned parameters, and first failing log line.
44+
45+
Secondary path (native executables for isolation/confirmation):
46+
1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`.
47+
2. Keep env explicit when running:
48+
   - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
49+
   - `AITER_ASM_DIR=$(realpath 3rdparty/aiter/hsa)` (or equivalent absolute path)
50+
3. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping.
51+
4. Re-encode the same failing config in `fwd.exe` / `bwd.exe` and compare behavior vs Python JIT.
52+
5. TE always stores the log-sum-exp (LSE), so pass `-lse=1` to keep the standalone run equivalent.
53+
6. Record full commands to include in handoff.
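
The environment setup for the secondary path can be sketched in Python as below. The function names are hypothetical; only `LD_LIBRARY_PATH`, `AITER_ASM_DIR`, and `-lse=1` come from this guide. All other executable arguments are passed through untouched and should be taken from `fwd.exe -?` / `bwd.exe -?`.

```python
import os


def standalone_env(te_root, aiter_asm_dir, base_env=None):
    """Environment for fwd.exe / bwd.exe with the two variables this guide lists."""
    env = dict(os.environ if base_env is None else base_env)
    te_lib = os.path.join(te_root, "transformer_engine", "lib")
    previous = env.get("LD_LIBRARY_PATH", "")
    # Prepend the TE library directory; avoid a trailing ':' when nothing was set.
    env["LD_LIBRARY_PATH"] = f"{te_lib}:{previous}" if previous else te_lib
    # Absolute path, mirroring $(realpath 3rdparty/aiter/hsa).
    env["AITER_ASM_DIR"] = os.path.realpath(aiter_asm_dir)
    return env


def standalone_command(executable, config_args):
    """Build argv; -lse=1 is always added because TE always stores LSE."""
    return [executable, "-lse=1", *config_args]
```

Printing the resulting argv and the two variables gives the "full commands" needed for the handoff.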
54+
55+
## 4) Decision tree: integration bug vs kernel bug
56+
1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config**
57+
   - Likely TE integration bug.
58+
   - Focus on argument marshaling/normalization in:
59+
     - `fused_attn_ck.cpp`
60+
     - `ck_fused_attn_*`
61+
     - backend selection conditions in `fused_attn.cpp`
62+
63+
2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`**
64+
   - Likely CK/AITER kernel issue (or unsupported config).
65+
   - Produce a minimal standalone reproducer command and hand off.
66+
67+
3. **Passes in TE only when fallback backend is chosen**
68+
   - CK eligibility/selection guard likely wrong.
69+
   - Inspect backend capability checks and shape constraints in `fused_attn.cpp`.
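
The decision tree above can be written down as a small function. This is one possible reading, not an official classifier: the `Triage` type and `triage` function are hypothetical, and treating branch 3 as an extra flag alongside the integration/kernel label is an assumption about how the branches combine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triage:
    classification: str          # "integration", "kernel", or "none"
    check_selection_guard: bool  # True when branch 3 of the tree applies
    note: str


def triage(
    te_ck_passes: bool,
    standalone_passes: bool,
    te_fallback_passes: Optional[bool] = None,
) -> Triage:
    """Classify a failure from the TE (CK backend), standalone, and fallback results."""
    if te_ck_passes:
        return Triage("none", False, "No CK failure reproduced in TE.")
    # Branch 3: TE only passes when the fallback backend is chosen.
    guard = te_fallback_passes is True
    if standalone_passes:
        # Branch 1: the kernel handles the config, so suspect TE-side marshaling.
        return Triage("integration", guard, "Inspect fused_attn_ck.cpp and ck_fused_attn_*.")
    # Branch 2: the kernel itself fails, or the config is unsupported.
    return Triage("kernel", guard, "Hand off a minimal standalone reproducer.")
```

For example, `triage(False, True, te_fallback_passes=True)` yields an `integration` label with `check_selection_guard` set, which points at both the marshaling code and the eligibility checks in `fused_attn.cpp`.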
70+
71+
## 5) High-value checks when it is integration-related
72+
- Verify all expected CK args are present and in the right order/type.
73+
- Check TE→CK conversions for:
74+
  - layout / strides
75+
  - sequence length semantics (`s_q` vs `s_kv`)
76+
  - grouped-query mapping
77+
  - mask/bias/dropout flags
78+
  - causal/windowing flags
79+
  - dtype/accumulator assumptions
80+
- Confirm no silent defaulting for missing fields.
81+
- Confirm runtime-selected backend matches intent (no accidental fallback/misroute).
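
One way to make the "no silent defaulting" check mechanical is to compare the config TE logged against the config passed to the standalone binary, field by field. The sketch below assumes both have already been parsed into plain dictionaries; the field names are illustrative and `diff_configs` is a hypothetical helper.

```python
MISSING = "<missing>"


def diff_configs(te_config, standalone_config):
    """Return {field: (te_value, standalone_value)} for every field that differs.

    A field present on only one side is reported with MISSING on the other side,
    so a silently defaulted field shows up instead of being skipped.
    """
    mismatches = {}
    for field in sorted(set(te_config) | set(standalone_config)):
        te_value = te_config.get(field, MISSING)
        standalone_value = standalone_config.get(field, MISSING)
        if te_value != standalone_value:
            mismatches[field] = (te_value, standalone_value)
    return mismatches


te = {"s_q": 4096, "s_kv": 4096, "dtype": "bf16", "causal": True}
standalone = {"s_q": 4096, "s_kv": 4096, "dtype": "bf16"}
print(diff_configs(te, standalone))  # {'causal': (True, '<missing>')}
```

An empty result is a precondition for comparing TE and standalone behavior at all.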
82+
83+
## 6) Output artifact requirements (always produce)
84+
For each investigated failure, record:
85+
- TE reproducer summary (shapes, dtype, flags)
86+
- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result
87+
- Classification: `integration` or `kernel`
88+
- Owning component and next action
89+
90+
Suggested concise handoff format:
91+
- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?`
92+
- **TE result:** pass/fail + key error
93+
- **Standalone result:** pass/fail + key error
94+
- **Conclusion:** integration vs kernel
95+
- **Owner:** TE vs AITER/CK
96+
97+
For more comprehensive output formatting, see [TEMPLATE.md](TEMPLATE.md).
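
The concise handoff format can also be generated from structured data so that no field is forgotten. The `Handoff` dataclass below is a hypothetical sketch; its field names simply mirror the five bullets of the concise format.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    batch: int
    s_q: int
    s_kv: int
    heads: int
    head_dim: int
    dtype: str
    causal: bool
    dropout: float
    mask: str
    te_result: str          # pass/fail + key error
    standalone_result: str  # pass/fail + key error
    conclusion: str         # "integration" or "kernel"
    owner: str              # "TE" or "AITER/CK"

    def render(self) -> str:
        """Render the five-line concise handoff as Markdown bullets."""
        config = (
            f"B={self.batch}, Sq={self.s_q}, Skv={self.s_kv}, H={self.heads}, "
            f"D={self.head_dim}, dtype={self.dtype}, causal={self.causal}, "
            f"dropout={self.dropout}, mask={self.mask}"
        )
        return "\n".join(
            [
                f"- **Config:** `{config}`",
                f"- **TE result:** {self.te_result}",
                f"- **Standalone result:** {self.standalone_result}",
                f"- **Conclusion:** {self.conclusion}",
                f"- **Owner:** {self.owner}",
            ]
        )
```

Because every field is required by the dataclass constructor, an incomplete report fails early instead of reaching the owner with gaps.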
98+
99+
## 7) Common pitfalls
100+
- Mismatch between TE-side defaults and standalone binary defaults.
101+
- Treating an unsupported config as a runtime failure instead of an eligibility failure.
102+
- Comparing non-equivalent configs across TE and standalone paths.
103+
- Missing backward-only failures (always test both directions when applicable).
Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
# CK/AITER Fused-Attn Debug Handoff Template
2+
3+
Use this template when handing off a failure investigation to TE or AITER/CK owners.
4+
5+
---
6+
7+
## 1) Summary
8+
- **Classification:** `integration` | `kernel` | `unknown`
9+
- **Direction:** `fwd` | `bwd` | `both`
10+
11+
## 2) Environment
12+
- **TE commit:**
13+
- **AITER commit/submodule ref:**
14+
- **ROCm version:**
15+
- **GPU architecture (gfx):**
16+
17+
## 3) Failing Configuration
18+
- **Batch (B):**
19+
- **Query seq (Sq):**
20+
- **KV seq (Skv):**
21+
- **Num heads (H):**
22+
- **Head dim (D):**
23+
- **DType(s):** fp16 / bf16 / fp8
24+
- **Causal:** true/false
25+
- **Dropout:**
26+
- **Mask/Bias mode:**
27+
- **Windowing/Alibi/Padding:**
28+
- **GQA/MQA details:**
29+
30+
## 4) TE Reproducer
31+
- **Backend intent:** CK only / auto / fallback allowed
32+
- **Command or test entrypoint:**
33+
- **Key env vars:**
34+
- **Observed result:** pass/fail
35+
- **First failing log line / error signature:**
36+
37+
## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
38+
- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
39+
- **Build command:**
40+
- **Runtime env:**
41+
- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
42+
- `AITER_ASM_DIR=$(realpath ../../../hsa)`
43+
- **Exact standalone command(s):**
44+
- **Observed result:** pass/fail
45+
- **First failing log line / error signature:**
46+
47+
## 6) Equivalence Check (TE vs Standalone)
48+
- **Are shape/dtype/flags exactly matched?** yes/no
49+
- **Any default mismatch noticed?**
50+
- **Notes:**
51+
52+
## 7) Conclusion and Ownership
53+
- **Conclusion:** integration vs kernel vs unsupported-config
54+
- **Likely owner:** TE (`fused_attn_ck.cpp` / `fused_attn.cpp` / `ck_fused_attn_*`) or AITER/CK kernel team
55+
- **Requested next action:**
56+
57+
## 8) Artifacts
58+
- **Logs attached:**
59+
- **Minimal reproducer commands attached:**
60+
- **Patch/commit links (if any):**
61+
62+
---
63+
64+
# Example (Filled)
65+
66+
## 1) Summary
67+
- **Classification:** `integration`
68+
- **Direction:** `bwd`
69+
70+
## 2) Environment
71+
- **TE commit:** `abc1234`
72+
- **AITER commit/submodule ref:** `def5678`
73+
- **ROCm version:** 6.2.1
74+
- **GPU architecture (gfx):** gfx942
75+
76+
## 3) Failing Configuration
77+
- **Batch (B):** 4
78+
- **Query seq (Sq):** 4096
79+
- **KV seq (Skv):** 4096
80+
- **Num heads (H):** 32
81+
- **Head dim (D):** 128
82+
- **DType(s):** bf16
83+
- **Causal:** true
84+
- **Dropout:** 0.0
85+
- **Mask/Bias mode:** causal mask only
86+
- **Windowing/Alibi/Padding:** none
87+
- **GQA/MQA details:** none
88+
89+
## 4) TE Reproducer
90+
- **Backend intent:** CK only
91+
- **Command or test entrypoint:** `pytest tests/pytorch/fused_attn/test_fused_attn.py::test_bwd_case_x`
92+
- **Key env vars:** CK backend forced; debug logging enabled
93+
- **Observed result:** fail
94+
- **First failing log line / error signature:** `invalid argument: ck_bwd workspace size mismatch`
95+
96+
## 5) Standalone AITER Reproducer (`fwd.exe` / `bwd.exe`)
97+
- **Build location:** `3rdparty/aiter/op_tests/cpp/mha`
98+
- **Build command:** `./mha_build.sh`
99+
- **Runtime env:**
100+
- `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}`
101+
- `AITER_ASM_DIR=$(realpath ../../../hsa)`
102+
- **Exact standalone command(s):**
103+
- `./bwd.exe <equivalent args>`
104+
- `./fwd.exe <equivalent args>`
105+
- **Observed result:** pass (both)
106+
- **First failing log line / error signature:** N/A
107+
108+
## 6) Equivalence Check (TE vs Standalone)
109+
- **Are shape/dtype/flags exactly matched?** yes
110+
- **Any default mismatch noticed?** TE-side workspace/alignment default differs from standalone path
111+
- **Notes:** likely marshaling/normalization issue before CK call
112+
113+
## 7) Conclusion and Ownership
114+
- **Conclusion:** integration
115+
- **Likely owner:** TE (`fused_attn_ck.cpp` argument preparation)
116+
- **Requested next action:** inspect workspace-size and alignment mapping in TE→CK bwd path
117+
118+
## 8) Artifacts
119+
- **Logs attached:** `te_fail.log`, `standalone_pass.log`
120+
- **Minimal reproducer commands attached:** yes
121+
- **Patch/commit links (if any):** none

CLAUDE.md

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
# Agent instructions for TransformerEngine (ROCm fork)
2+
3+
## Using Docker containers
4+
- We generally work in Docker containers for reproducibility.
5+
- For live debugging/investigations, run build/test commands **only** inside the designated container (not on host).
6+
- If the container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
7+
- Before debugging, record runtime context in notes/logs:
8+
- container image/tag
9+
- ROCm version in container
10+
- GPU architecture visible in container
11+
- TE commit/submodule state
12+
- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
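
Recording the runtime context can be scripted. The sketch below is an assumption-laden illustration, not a supported tool: it assumes `git` and `rocminfo` are on `PATH` inside the container and that the ROCm version is readable from `/opt/rocm/.info/version`, which may not hold for every image. Every probe returns `None` (or an empty list) instead of raising when something is unavailable. The container image/tag is supplied by the caller because it is not reliably discoverable from inside the container.

```python
import re
import subprocess


def run_or_none(command):
    """Return stripped stdout of command, or None if it is missing, fails, or hangs."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def read_or_none(path):
    """Return the stripped contents of path, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def collect_context(image_tag=None):
    """Gather the four context items listed above into one dictionary."""
    rocminfo = run_or_none(["rocminfo"]) or ""
    return {
        "container_image": image_tag,
        # Assumed location of the ROCm version file; adjust for your image.
        "rocm_version": read_or_none("/opt/rocm/.info/version"),
        # Unique gfx targets mentioned anywhere in the rocminfo output.
        "gpu_arch": sorted(set(re.findall(r"gfx[0-9a-f]+", rocminfo))),
        "te_commit": run_or_none(["git", "rev-parse", "HEAD"]),
        "submodules": run_or_none(["git", "submodule", "status", "--recursive"]),
    }
```

Dumping this dictionary at the top of a debug log makes it obvious later whether two runs came from the same container and commit.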
13+
14+
## Big picture
15+
- This repo builds **one core C++/HIP library** plus optional framework bindings:
16+
- core: `transformer_engine/common` (CMake project producing `libtransformer_engine.so`)
17+
- PyTorch binding: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
18+
- JAX binding: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
19+
- Python import flow is split:
20+
- top-level framework selection in `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` controls `pytorch|jax|all|none`)
21+
- `.so` discovery/loading logic in `transformer_engine/common/__init__.py` (`load_framework_extension`, wheel/source/editable layouts)
22+
- Build orchestration is in `setup.py` + `build_tools/*.py`, not only in CMake.
23+
- `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
24+
25+
## Platform/backends
26+
- ROCm path is first-class in this fork (`README.rst`, `transformer_engine/common/CMakeLists.txt`).
27+
- Fused attention backends are runtime/compile-time gated by env vars:
28+
- `NVTE_FUSED_ATTN`, `NVTE_FUSED_ATTN_CK`, `NVTE_FUSED_ATTN_AOTRITON`
29+
- ROCm fused-attn implementation is in `transformer_engine/common/fused_attn_rocm/*`; CK and AOTriton integration is wired in `transformer_engine/common/CMakeLists.txt`.
30+
- Build-time validation for CK args runs from `setup.py` via `tools/check_aiter_mha_args_usage.py`.
31+
32+
## Developer workflows you should follow
33+
- Always initialize submodules before debugging build failures: `git submodule update --init --recursive` (required by CMake for 3rdparty deps).
34+
- Typical source install in this repo: `pip install . --no-build-isolation` (see `README.rst`).
35+
- C++ tests: build/run from `tests/cpp` with CMake+Ninja (`qa/L0_cppunittest/test.sh`, `ci/core.sh`).
36+
- CI-style framework test entrypoints are shell scripts, not a single pytest command:
37+
- PyTorch: `ci/pytorch.sh`
38+
- JAX: `ci/jax.sh`
39+
- They use `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` from `ci/_utils.sh`.
40+
- Lint/format workflow is repo-specific:
41+
- local formatting: `qa/format.sh` (pre-commit hooks)
42+
- cpplint+pylint flows: `qa/L0_pytorch_lint/test.sh`, `qa/L0_jax_lint/test.sh`
43+
44+
## Code conventions and change boundaries
45+
- Prefer edits in `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid changing `3rdparty/*` unless explicitly required.
46+
- Preserve dual-platform structure when modifying kernels/build logic:
47+
- shared sources are often `.cu` then hipified for ROCm (`transformer_engine/common/CMakeLists.txt`, `build_tools/pytorch.py`, `build_tools/jax.py`).
48+
- never edit HIP files directly -- instead, edit the CUDA source and let the build system generate HIP variants.
49+
- Keep environment-variable behavior stable; many tests intentionally toggle flags (examples in `ci/pytorch.sh` and `ci/jax.sh`).
50+
- Respect existing tooling/style:
51+
- Python formatted by Black (line length 100) via `.pre-commit-config.yaml`
52+
- C/C++ style checked by cpplint and `.clang-format`
53+
54+
## Practical pointers for AI agents
55+
- If import fails with a missing TE extension `.so`, inspect `transformer_engine/common/__init__.py` path resolution before changing packaging.
56+
- If a framework extension unexpectedly does not build on ROCm, check framework detection in `build_tools/utils.py::get_frameworks()` (ROCm-capable torch/jax checks).
57+
- For fused-attn regressions, reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`) like CI scripts do.
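
A backend sweep can be sketched as a table of environment overrides. The variable names come from the "Platform/backends" section; the mapping from each named config to specific values is an assumption (that setting a variable to `0` disables the corresponding backend) and should be checked against `ci/pytorch.sh` before relying on it. `backend_env` is a hypothetical helper.

```python
import os

BACKEND_VARS = ("NVTE_FUSED_ATTN", "NVTE_FUSED_ATTN_CK", "NVTE_FUSED_ATTN_AOTRITON")

# Assumed mapping: "0" disables a backend; verify against the CI scripts.
BACKEND_CONFIGS = {
    "auto": {},
    "ck": {"NVTE_FUSED_ATTN_AOTRITON": "0"},
    "aotriton": {"NVTE_FUSED_ATTN_CK": "0"},
    "unfused": {"NVTE_FUSED_ATTN": "0"},
}


def backend_env(name, base_env=None):
    """Return an environment for one backend config, clearing stale backend vars first."""
    env = dict(os.environ if base_env is None else base_env)
    for var in BACKEND_VARS:
        env.pop(var, None)
    env.update(BACKEND_CONFIGS[name])
    return env
```

Iterating over `BACKEND_CONFIGS` and passing `backend_env(name)` as the `env` argument of `subprocess.run` reruns the same test once per backend config.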
