gsitaram · conde-amd · Oct 28, 2025 · Oct 28, 2025 · Oct 28, 2025 · Oct 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -52,3 +52,18 @@ Mkfile.old
 dkms.conf
 __pycache__/
 *.pyc
+
+# Profiling artifacts
+*.pftrace
+rocprof_*/
+pytorch_profiles/
+
+# Downloaded datasets
+MLExamples/PyTorch_Profiling/data/
+
+# Generated profiling traces
+MLExamples/PyTorch_Profiling/rocprofv3/single_process/
+MLExamples/TinyTransformer/*/counters/
+MLExamples/TinyTransformer/*/traces/
+MLExamples/TinyTransformer/*/github_issue_test/
+MLExamples/inference_benchmark/profiling_results/
diff --git a/MLExamples/TinyTransformer/README.md b/MLExamples/TinyTransformer/README.md
diff --git a/MLExamples/TinyTransformer/VERSION_COMPARISON.md b/MLExamples/TinyTransformer/VERSION_COMPARISON.md
@@ -0,0 +1,227 @@
+# Version 1 vs Version 2 vs Version 3 vs Version 4 Profiling Comparison
+
+## Executive Summary
+
+All four versions successfully profile with rocprofv3. The GitHub issue #1386 "no device activity" does not reproduce with ROCm 6.4.4 on RX 7900 XTX.
+
+**Key Finding**: Both version3 (Triton custom kernels) and version4 (PyTorch SDPA + Triton) achieve **4.4x speedup** over version1 baseline, with similar performance characteristics. Version2 (PyTorch fusion) provides minimal gains.
+
+## Test Configuration
+
+- **GPU**: AMD Radeon RX 7900 XTX (gfx1100)
+- **ROCm**: 6.4.4
+- **Profiler**: rocprofv3
+- **Test parameters**: batch-size 8, seq-len 128, num-steps 10
+
+## Profiling Results Comparison
+
+### Trace File Sizes (Runtime Trace)
+
+| Version | Trace Size | Result |
+|---------|-----------|---------|
+| Version 1 | 44 MB | Success - full device activity captured |
+| Version 2 | 41 MB | Success - full device activity captured |
+| Version 3 | Not tested | Kernel trace tested instead (3.0 MB) |
+| Version 4 | 9.7 MB | Success - full device activity captured |
+
+### Kernel Trace Analysis
+
+| Metric | Version 1 | Version 2 | Version 3 | Version 4 | V3/V4 vs V1 |
+|--------|-----------|-----------|-----------|-----------|-------------|
+| Total kernel dispatches | 22,284 | 22,479 | 4,727 | 5,493 | -76.3% to -78.8% |
+| Unique kernel types | 64 | 55 | 32 | 33 | -48.4% to -50.0% |
+| Total GPU time | 346.21 ms | 378.06 ms | 104.49 ms | 103.36 ms | -70.1% to -69.8% |
+
+### Top 3 Kernels by GPU Time
+
+#### Version 1 (PyTorch Baseline)
+
+1. **GEMM kernel** (Cijk_Alik_Bljk...): 30,658 us (127.74 us avg) - 240 calls
+2. **GEMM kernel** (Cijk_Ailk_Bljk...): 29,954 us (124.81 us avg) - 240 calls
+3. **GEMM kernel** (Cijk_Alik_Bljk...): 26,641 us (74.00 us avg) - 360 calls
+
+**Total top 3**: 87,253 us (25.2% of total GPU time)
+
+#### Version 2 (PyTorch Fused)
+
+1. **GEMM kernel** (Cijk_Ailk_Bljk...): 54,678 us (455.65 us avg) - 120 calls
+2. **GEMM kernel** (Cijk_Alik_Bljk...): 25,482 us (212.35 us avg) - 120 calls
+3. **bwd_kernel_fuse**: 24,814 us (206.78 us avg) - 120 calls
+
+**Total top 3**: 104,974 us (27.8% of total GPU time)
+
+#### Version 3 (Triton Custom Kernels)
+
+1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,710 us (123.79 us avg) - 240 calls
+2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,442 us (79.01 us avg) - 360 calls
+3. **flash_attention_kernel**: 15,557 us (129.64 us avg) - 120 calls
+
+**Total top 3**: 73,709 us (70.5% of total GPU time)
+
+**Note**: Version3's top 3 kernels account for 70.5% of GPU time vs 25-28% in V1/V2, showing much better kernel concentration.
+
+#### Version 4 (PyTorch SDPA + Triton)
+
+1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,641 us (123.50 us avg) - 240 calls
+2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,320 us (78.67 us avg) - 360 calls
+3. **attn_fwd** (PyTorch SDPA): 13,045 us (108.71 us avg) - 120 calls
+
+**Total top 3**: 71,006 us (68.7% of total GPU time)
+
+**Note**: Version4 uses PyTorch SDPA (`attn_fwd`) instead of custom flash attention, but achieves similar performance to version3.
+
+### Key Observations
+
+1. **Version3 and Version4 achieve similar performance through different approaches**:
+   - **Version3**: Custom Triton kernels (`flash_attention_kernel`, `rmsnorm_kernel`)
+   - **Version4**: PyTorch SDPA (`attn_fwd`) with Triton fallbacks
+   - Both: 78-76% fewer kernel dispatches than version1
+   - Both: ~50% fewer unique kernel types than version1
+   - V3 flash attention: 15,557 us (129.64 us avg)
+   - V4 SDPA attention: 13,045 us (108.71 us avg) - slightly faster!
+
+2. **Version2 fused kernels**:
+   - `bwd_kernel_fuse` (24,814 us total) - backward pass fusion
+   - `attn_fwd` (12,639 us total) - attention forward fusion
+   - These are custom fused operations not present in version1
+   - 14.1% fewer unique kernel types than version1
+   - Marginal performance impact (slightly slower)
+
+3. **Performance progression**:
+   - Version1: Many small kernels, high launch overhead
+   - Version2: Some fusion, but still many PyTorch framework kernels
+   - Version3: Aggressive fusion with custom Triton kernels
+     - 69.8% reduction in GPU time vs version1
+     - 72.4% reduction in GPU time vs version2
+     - 78.8% fewer kernel launches vs version1
+
+4. **Memory efficiency**:
+   - Version1: 434.3 MB peak memory
+   - Version2: 434.3 MB peak memory
+   - Version3: 193.8 MB peak memory (55.4% reduction)
+   - Triton kernels use significantly less memory
+
+5. **Profiler functionality**:
+   - rocprofv3 successfully captures all GPU activity on all three versions
+   - No "no device activity" issue observed
+   - GitHub issue #1386 likely fixed in ROCm 6.4.4
+
+## Performance Comparison
+
+### Throughput
+
+| Version | Samples/sec | Tokens/sec | Speedup vs V1 |
+|---------|-------------|------------|---------------|
+| Version 1 | 240.6 | 30,803 | 1.00x (baseline) |
+| Version 2 | 247.4 | 31,672 | 1.03x |
+| Version 3 | 1,054.8 | 135,014 | **4.38x** |
+| Version 4 | 1,054.5 | 134,972 | **4.38x** |
+
+Version3 and Version4 both achieve **4.38x speedup** over version1 and **4.26x speedup** over version2.
+
+### Batch Processing Time
+
+| Version | Average Batch Time | Speedup vs V1 |
+|---------|-------------------|---------------|
+| Version 1 | 33.3 ms | 1.00x (baseline) |
+| Version 2 | 32.3 ms | 1.03x |
+| Version 3 | 7.5 ms | **4.44x** |
+| Version 4 | 7.6 ms | **4.38x** |
+
+### Memory Usage
+
+| Version | Peak Memory | Reduction vs V1 |
+|---------|-------------|-----------------|
+| Version 1 | 434.3 MB | baseline |
+| Version 2 | 434.3 MB | 0% |
+| Version 3 | 193.8 MB | **55.4%** |
+| Version 4 | 193.9 MB | **55.3%** |
+
+Version3 and Version4 both use less than half the memory of version1/version2.
+
+## Fusion Impact Analysis
+
+### Version2 (PyTorch Fused)
+
+Version2 reports these fusion optimizations available:
+- QKV Fusion: Available but not active in this run
+- Flash Attention: Available but not active in this run
+- SwiGLU Fusion: Available but not active in this run
+- Torch Compile: Available but failed to activate
+
+The fused kernels observed (`bwd_kernel_fuse`, `attn_fwd`) suggest some fusion is occurring despite the "not active" status. This may be a reporting issue in the code.
+
+**Verdict**: Version2 fusion provides minimal benefit (3% speedup) and may have reporting issues.
+
+### Version3 (Triton Custom Kernels)
+
+Version3 reports active Triton optimizations:
+- RMSNorm Kernel: ACTIVE - Fused variance + normalization (1,167 us total, 4.58 us avg)
+- Flash Attention Kernel: ACTIVE - Memory-efficient attention (15,557 us total, 129.64 us avg)
+- SwiGLU Kernel: ACTIVE (not visible in top kernels, likely very fast)
+
+**Verdict**: Version3 Triton kernels deliver massive performance gains (4.38x speedup) with proper kernel fusion and optimization.
+
+### Version4 (PyTorch SDPA + Triton)
+
+Version4 uses PyTorch's Scaled Dot Product Attention (SDPA) with Triton fallbacks:
+- **attn_fwd** (PyTorch SDPA): 13,045 us total, 108.71 us avg
+  - Slightly faster than V3's custom flash attention (15,557 us)
+  - Leverages PyTorch's optimized SDPA implementation
+- Custom Triton kernels for other operations (RMSNorm, SwiGLU likely present but not in top kernels)
+- 16% more kernel dispatches than V3 (5,493 vs 4,727)
+- One additional unique kernel type (33 vs 32)
+
+**Verdict**: Version4 achieves identical performance to version3 (4.38x speedup) using PyTorch SDPA instead of custom flash attention. PyTorch SDPA is actually slightly more efficient for attention, but V4 has slightly more overhead elsewhere.
+
+## Conclusion
+
+1. **rocprofv3 works correctly** on all four versions with ROCm 6.4.4
+2. **No reproduction of GitHub issue #1386** - all versions show full device activity
+
+3. **Version3 and Version4 are equivalent winners**:
+   - Both: **4.38x faster** than version1 baseline
+   - Both: **4.26x faster** than version2
+   - Both: **~55% less memory** usage
+   - Both: **~77-79% fewer** kernel dispatches
+   - Both: **~70% reduction** in GPU time
+   - V3 uses custom flash attention, V4 uses PyTorch SDPA
+   - V4's SDPA is slightly faster (13.0 ms vs 15.6 ms) but has slightly more overhead elsewhere
+
+4. **Version2 provides minimal gains**:
+   - Only 3% faster than version1
+   - Same memory usage as version1
+   - Some fusion, but not well optimized
+   - May have reporting issues with fusion flags
+
+5. **Performance progression summary**:
+   - V1 baseline: 240.6 samples/sec, 346 ms GPU time, 434 MB memory
+   - V2 fused: 247.4 samples/sec, 378 ms GPU time, 434 MB memory (marginal improvement)
+   - V3 custom Triton: 1,054.8 samples/sec, 104 ms GPU time, 194 MB memory (massive improvement)
+   - V4 PyTorch SDPA: 1,054.5 samples/sec, 103 ms GPU time, 194 MB memory (equivalent to V3)
+
+6. **Key takeaways**:
+   - Custom Triton kernels (V3) deliver transformational performance that PyTorch-level fusion (V2) cannot match
+   - PyTorch SDPA (V4) provides a practical alternative to custom flash attention without sacrificing performance
+   - For production use, V4 may be preferable due to reliance on PyTorch's maintained SDPA implementation
+   - For maximum control and customization, V3's fully custom Triton approach is ideal
+
+## Files Generated
+
+### Version 1
+- Runtime trace: `version1_pytorch_baseline/traces/trace_*/`
+- Kernel trace: `version1_pytorch_baseline/counters/counter_20251028_164804/1f81e102abe6/9544_kernel_trace.csv` (11.6 MB)
+
+### Version 2
+- Runtime trace: `version2_pytorch_fused/traces/trace_20251028_170752/` (41 MB)
+- Runtime trace (50 steps): `version2_pytorch_fused/github_issue_test/test_20251028_172311/` (149 MB)
+- Kernel trace: `version2_pytorch_fused/counters/counter_20251028_172429/1f81e102abe6/17496_kernel_trace.csv` (10.8 MB)
+
+### Version 3
+- Kernel trace: `version3_triton/counters/counter_20251028_173451/1f81e102abe6/20129_kernel_trace.csv` (3.0 MB)
+- Much smaller trace file due to 78.8% fewer kernel dispatches
+
+### Version 4
+- Runtime trace: `version4_pytorch_sdpa/traces/trace_20251028_174853/` (9.7 MB)
+- Kernel trace: `version4_pytorch_sdpa/counters/counter_20251028_174948/1f81e102abe6/23175_kernel_trace.csv` (3.3 MB)
+- Similar trace sizes to version3