| 1 | +--- |
| 2 | +name: ck-debugging |
| 3 | +description: Triage, investigate, debug, and isolate CK/AITER Fused Attention failures in TransformerEngine as integration vs kernel issues. |
| 4 | +--- |
| 5 | + |
| 6 | +# CK Fused Attention Debugging Guide (TransformerEngine, ROCm) |
| 7 | + |
| 8 | +Use this playbook to quickly answer one question: |
| 9 | +**Is the failure in TE↔CK integration, or in the CK/AITER kernel itself?** |
| 10 | + |
| 11 | +## 1) Map the integration surface first |
| 12 | +- Build-time CK args parsing/validation: |
| 13 | + - `transformer_engine/common/CMakeLists.txt` |
| 14 | + - `tools/check_aiter_mha_args_usage.py` |
| 15 | +- CK fused-attn kernel wrappers/entry points: |
| 16 | + - `transformer_engine/common/ck_fused_attn/ck_fused_attn_*` |
| 17 | +- CK backend preprocessing and dispatch glue: |
| 18 | + - `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` |
| 19 | +- Runtime backend selection / fallback path: |
| 20 | + - `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` |
| 21 | + |
| 22 | +## 2) Gather minimum reproducibility context (before changing code) |
| 23 | +Capture these from logs or user report: |
| 24 | +- Forward vs backward failure (`fwd` / `bwd`) |
| 25 | +- Exact shape/config: batch, seq lengths (`s_q`, `s_kv`), num heads, head dim |
| 26 | +- Data type(s): fp16/bf16/fp8 |
| 27 | +- Mask/dropout/causal/windowing/alibi/padding settings |
| 28 | +- GQA/MQA/group mode details if used |
| 29 | +- GPU architecture + ROCm version + TE commit |
| 30 | +- Whether fallback backend succeeds |
| 31 | + |
| 32 | +When self-collecting logs (for example, rerunning a failing pytest), enable full config logging in the same command: `NVTE_LOG_FUSED_ATTN_CONFIG=1 NVTE_LOG_CK_CONFIG=1 CK_FUSED_ATTN_LOG_CONFIG=1 <test command>`. |
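When reruns are scripted rather than typed by hand, the same logging switches can be set programmatically. The following is a minimal Python sketch, assuming only the three environment variables named above; the helper names `logging_env` and `run_with_logging` are hypothetical and not part of TE.

```python
import os
import subprocess

# Logging switches quoted from this guide.
LOG_VARS = (
    "NVTE_LOG_FUSED_ATTN_CONFIG",
    "NVTE_LOG_CK_CONFIG",
    "CK_FUSED_ATTN_LOG_CONFIG",
)


def logging_env(base=None):
    """Return a copy of the environment with all config logging enabled."""
    env = dict(os.environ if base is None else base)
    for name in LOG_VARS:
        env[name] = "1"
    return env


def run_with_logging(cmd):
    """Run cmd (a list of arguments) with logging enabled and capture its output."""
    return subprocess.run(cmd, env=logging_env(), capture_output=True, text=True)
```

Passing the failing test command as a list to `run_with_logging` keeps the captured stdout/stderr together with the logged config for the report.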
| 33 | + |
| 34 | +If reproducing triggers a segmentation fault, rerun under `rocgdb` to capture a usable backtrace: `rocgdb --args python -m pytest ...`, then issue `run` and, once it crashes, collect the backtrace with `bt`. |
| 35 | + |
| 36 | +If config info is incomplete, request it first; otherwise debugging is noisy and slow. |
| 37 | + |
| 38 | +## 3) Reproduce in a controlled CK-only path |
| 39 | +Preferred path (AITER Python JIT): |
| 40 | +1. Start from `3rdparty/aiter/op_tests/test_mha.py` to reproduce through the same Python JIT interface used in many real flows. |
| 41 | +2. Add a minimal wrapper test (for example, `test_te_reproducer`) that pins only the failing TE config. |
| 42 | +3. Call the Python-level MHA functions directly (e.g. `mha_fwd` and `fmha_v3_fwd`). |
| 43 | +4. Record the exact test invocation, pinned parameters, and first failing log line. |
| 44 | + |
| 45 | +Secondary path (native executables for isolation/confirmation): |
| 46 | +1. From `3rdparty/aiter/op_tests/cpp/mha`, build with `mha_build.sh`. |
| 47 | +2. Keep env explicit when running: |
| 48 | + - `LD_LIBRARY_PATH=<TE_ROOT>/transformer_engine/lib:${LD_LIBRARY_PATH}` |
| 49 | + - `AITER_ASM_DIR=$(realpath 3rdparty/aiter/hsa)` (or equivalent absolute path) |
| 50 | +3. Use `fwd.exe -?` / `bwd.exe -?` to confirm argument mapping. |
| 51 | +4. Re-encode the same failing config in `fwd.exe` / `bwd.exe` and compare behavior vs Python JIT. |
| 52 | +5. TE always stores LSE, so pass `-lse=1` to match its behavior. |
| 53 | +6. Record full commands to include in handoff. |
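The environment setup in step 2 and the `-lse=1` requirement in step 5 can be captured in a small helper so every standalone run uses the same settings. This is a sketch under stated assumptions: `standalone_env` and `standalone_cmd` are hypothetical names, and `3rdparty/aiter/hsa` is assumed to be resolved relative to the TE root. Any further executable arguments should be taken from `fwd.exe -?` / `bwd.exe -?` rather than guessed.

```python
import os
from pathlib import Path


def standalone_env(te_root, base=None):
    """Build the environment for fwd.exe / bwd.exe as described in step 2."""
    env = dict(os.environ if base is None else base)
    root = Path(te_root).resolve()
    lib_dir = str(root / "transformer_engine" / "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend the TE library directory; avoid a trailing ':' when nothing was set.
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else lib_dir
    # AITER_ASM_DIR is set to an absolute path (assumed to live under the TE root).
    env["AITER_ASM_DIR"] = str(root / "3rdparty" / "aiter" / "hsa")
    return env


def standalone_cmd(exe, extra_args=()):
    """Compose the command line; -lse=1 is always passed because TE stores LSE."""
    return [str(exe), "-lse=1", *extra_args]
```

Logging the returned environment values and command list verbatim gives the "full commands" needed for the handoff.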
| 54 | + |
| 55 | +## 4) Decision tree: integration bug vs kernel bug |
| 56 | +1. **Fails in TE, but passes in `fwd.exe`/`bwd.exe` with equivalent config** |
| 57 | + - Likely TE integration bug. |
| 58 | + - Focus on argument marshaling/normalization in: |
| 59 | + - `fused_attn_ck.cpp` |
| 60 | + - `ck_fused_attn_*` |
| 61 | + - backend selection conditions in `fused_attn.cpp` |
| 62 | + |
| 63 | +2. **Fails both in TE and standalone `fwd.exe`/`bwd.exe`** |
| 64 | + - Likely CK/AITER kernel issue (or unsupported config). |
| 65 | + - Produce a minimal standalone reproducer command and hand off. |
| 66 | + |
| 67 | +3. **Passes in TE only when fallback backend is chosen** |
| 68 | + - CK eligibility/selection guard likely wrong. |
| 69 | + - Inspect backend capability checks and shape constraints in `fused_attn.cpp`. |
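The three cases above can be sketched as a small classifier. This is one illustrative reading of the decision tree, not an official tool: the names `Verdict` and `classify` are hypothetical, and case 3 is modeled as an extra flag because it can coexist with either of the other two outcomes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    classification: str            # "none", "integration", or "kernel"
    inspect_selection_guard: bool  # True when TE passes only via the fallback backend


def classify(te_ck_ok: bool, standalone_ok: bool,
             fallback_ok: Optional[bool] = None) -> Verdict:
    """Map pass/fail observations onto the three cases of the decision tree."""
    if te_ck_ok:
        return Verdict("none", False)
    # Case 3: TE passes only with the fallback backend, so the CK guard is suspect.
    guard = fallback_ok is True
    # Case 1 vs. case 2: does the equivalent standalone run pass?
    return Verdict("integration" if standalone_ok else "kernel", guard)
```

Leaving `fallback_ok` as `None` records that the fallback backend was not tested, which is different from it having failed.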
| 70 | + |
| 71 | +## 5) High-value checks when it is integration-related |
| 72 | +- Verify all expected CK args are present and in the right order/type. |
| 73 | +- Check TE→CK conversions for: |
| 74 | + - layout / strides |
| 75 | + - sequence length semantics (`s_q` vs `s_kv`) |
| 76 | + - grouped-query mapping |
| 77 | + - mask/bias/dropout flags |
| 78 | + - causal/windowing flags |
| 79 | + - dtype/accumulator assumptions |
| 80 | +- Confirm no silent defaulting for missing fields. |
| 81 | +- Confirm runtime-selected backend matches intent (no accidental fallback/misroute). |
| 82 | + |
| 83 | +## 6) Output artifact requirements (always produce) |
| 84 | +For each investigated failure, record: |
| 85 | +- TE reproducer summary (shapes, dtype, flags) |
| 86 | +- Standalone command(s) tested (`fwd.exe`/`bwd.exe`) and result |
| 87 | +- Classification: `integration` or `kernel` |
| 88 | +- Owning component and next action |
| 89 | + |
| 90 | +Suggested concise handoff format: |
| 91 | +- **Config:** `B=?, Sq=?, Skv=?, H=?, D=?, dtype=?, causal=?, dropout=?, mask=?` |
| 92 | +- **TE result:** pass/fail + key error |
| 93 | +- **Standalone result:** pass/fail + key error |
| 94 | +- **Conclusion:** integration vs kernel |
| 95 | +- **Owner:** TE vs AITER/CK |
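To keep handoffs uniform, the format above can be rendered from structured data. The sketch below is a hypothetical helper (`Handoff` is not part of TE or AITER) that emits the five bullet lines in the same order.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    config: dict            # keys follow the Config line, e.g. B, Sq, Skv, H, D, dtype
    te_result: str          # pass/fail + key error
    standalone_result: str  # pass/fail + key error
    conclusion: str         # "integration" or "kernel"
    owner: str              # "TE" or "AITER/CK"

    def render(self) -> str:
        """Return the handoff as markdown bullets in the suggested order."""
        cfg = ", ".join(f"{k}={v}" for k, v in self.config.items())
        return "\n".join([
            f"- **Config:** `{cfg}`",
            f"- **TE result:** {self.te_result}",
            f"- **Standalone result:** {self.standalone_result}",
            f"- **Conclusion:** {self.conclusion}",
            f"- **Owner:** {self.owner}",
        ])
```

Because `config` is an ordinary dict, insertion order determines the order of the `key=value` pairs on the Config line.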
| 96 | + |
| 97 | +For a more comprehensive output format, see [TEMPLATE.md](TEMPLATE.md). |
| 98 | + |
| 99 | +## 7) Common pitfalls |
| 100 | +- Mismatch between TE-side defaults and standalone binary defaults. |
| 101 | +- Treating an unsupported config as a runtime failure instead of an eligibility failure. |
| 102 | +- Comparing non-equivalent configs across TE and standalone paths. |
| 103 | +- Missing backward-only failures (always test both directions when applicable). |
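One way to guard against the first and third pitfalls is to diff the two configs mechanically before comparing results. The following is a minimal sketch; `config_diff` and the `"<missing>"` marker are hypothetical, and it assumes both configs have already been normalized to the same key names.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def config_diff(te_cfg: dict, standalone_cfg: dict) -> dict:
    """Return {key: (te_value, standalone_value)} for keys that differ or are absent."""
    diff = {}
    for key in sorted(set(te_cfg) | set(standalone_cfg)):
        a = te_cfg.get(key, MISSING)
        b = standalone_cfg.get(key, MISSING)
        if a != b:
            diff[key] = (a, b)
    return diff
```

An empty result means the two runs are comparable; any `"<missing>"` entry points to a field that one side is silently defaulting.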