forked from amd/HPCTrainingExamples
feat(profiling): add comprehensive ROCm profiling infrastructure for inference benchmarks and TinyTransformer #1
Open: conde-amd wants to merge 41 commits into gsitaram:main from conde-amd:main
41 commits
2a19506
test(pytorch-profiling): add rocprofv3 validation script for version1
59ae9ca
docs(pytorch-profiling): document version1 rocprofv3 test results
993b853
chore(profiling): ignore profiling artifacts and traces
409ecf3
chore(profiling): exclude TinyTransformer profiling data
b497c02
docs(tinytransformer): add version comparison analysis
8b7372f
feat(profiling): add ROCProfiler automation scripts
c2f5ad8
feat(profiling): add kernel trace analysis script
a1e2fdc
feat(profiling): add rocprof-compute profiling scripts
0c14684
feat(profiling): add rocprof-sys profiling scripts
a167ad0
chore(profiling): exclude inference_benchmark profiling data
d3a01c0
docs(inference-benchmark): add profiling scripts guide
1dc7b2e
feat(profiling): add inference_benchmark profiling scripts
b9ec2a1
feat(profiling): add kernel trace analysis for inference benchmarks
afc3820
docs(inference-benchmark): document ROCm 6.x/7.x compatibility
f531e23
feat(profiling): add ROCm 7.x SQLite database analysis tools
262108f
refactor(tinytransformer): add ROCm version detection to profiling
8a34687
feat(inference-benchmark): enhance profiling with ROCm 7.x support
939b675
refactor(pytorch_microbench): rename inference_benchmark directory
d92db24
chore(pytorch_microbench): remove custom analysis scripts
9fc76c9
fix(pytorch_microbench): update profiling scripts per PR review
cf46839
docs(pytorch_microbench): rewrite README following GhostExchange format
ad8c1dc
docs(pytorch_microbench): add example outputs from RX 7900 XTX profiling
d8fb3d5
chore(TinyTransformer): remove custom analysis scripts
009a174
fix(TinyTransformer): update profiling scripts per PR review
d2846e1
docs(TinyTransformer): rewrite README following GhostExchange format
ce0440f
docs(tinytransformer): condense markdown documentation
db47525
chore(TinyTransformer): remove custom analysis scripts from version2
afbbe47
refactor(TinyTransformer): update version2 to follow GhostExchange fo…
c8810cf
chore(TinyTransformer): remove custom analysis scripts from version3
5795e40
refactor(TinyTransformer): update version3 to follow GhostExchange fo…
19aed1b
chore(TinyTransformer): remove custom analysis scripts from version4
84e7019
refactor(TinyTransformer): update version4 to follow GhostExchange fo…
5454867
docs(ml): rewrite profiling tutorials in GhostExchange style
8bd8d16
Merge branch 'sid-ghostexchange-voice-docs'
81037c8
refactor(pytorch_microbench): replace profiling scripts with shared i…
6ac3ec0
docs(pytorch_microbench): update docs for renamed profiling scripts
cc4710f
feat(pytorch_microbench): add example plots and generator script
59af571
refactor(TinyTransformer): add shared profile_common.sh and simplify …
5896e34
fix(TinyTransformer): create parent dir before writing profile summary
58866c2
docs(TinyTransformer): rewrite READMEs in GhostExchange format
77b4917
feat(TinyTransformer): add example plots and generator script
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,227 @@ | ||
| # Version 1 vs Version 2 vs Version 3 vs Version 4 Profiling Comparison | ||
|
|
||
| ## Executive Summary | ||
|
|
||
| All four versions profile successfully with rocprofv3. The "no device activity" problem reported in GitHub issue #1386 does not reproduce with ROCm 6.4.4 on an RX 7900 XTX. | ||
|
|
||
| **Key Finding**: Both version3 (Triton custom kernels) and version4 (PyTorch SDPA + Triton) achieve a **4.4x speedup** over the version1 baseline, with similar performance characteristics. Version2 (PyTorch fusion) provides minimal gains. | ||
|
|
||
| ## Test Configuration | ||
|
|
||
| - **GPU**: AMD Radeon RX 7900 XTX (gfx1100) | ||
| - **ROCm**: 6.4.4 | ||
|
Review comment: Have the examples been tested with newer ROCm versions (rocm/7.2 or later)? |
||
| - **Profiler**: rocprofv3 | ||
| - **Test parameters**: batch-size 8, seq-len 128, num-steps 10 | ||
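A kernel-trace run with the test parameters above could be launched roughly as follows. This is a sketch, not the PR's actual script: `train.py` is a hypothetical script name, the `counters` output directory is only an illustrative choice, and the `rocprofv3` options shown (`--kernel-trace`, `--output-format`, `-d`) should be checked against `rocprofv3 --help` for the installed ROCm release. The function only builds the command line; it does not execute it.

```python
import shlex


def build_rocprofv3_command(script, batch_size=8, seq_len=128, num_steps=10,
                            output_dir="counters"):
    """Return the argv list for a rocprofv3 kernel-trace run (nothing is executed)."""
    return [
        "rocprofv3",
        "--kernel-trace",           # request a per-dispatch kernel trace
        "--output-format", "csv",   # ask for CSV output
        "-d", output_dir,           # output directory (illustrative name)
        "--",                       # everything after this is the profiled application
        "python", script,
        "--batch-size", str(batch_size),
        "--seq-len", str(seq_len),
        "--num-steps", str(num_steps),
    ]


# "train.py" is a placeholder; substitute the training script of the version under test.
cmd = build_rocprofv3_command("train.py")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the profiled run on a machine where `rocprofv3` is installed.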
|
|
||
| ## Profiling Results Comparison | ||
|
|
||
| ### Trace File Sizes (Runtime Trace) | ||
|
|
||
| | Version | Trace Size | Result | | ||
| |---------|-----------|---------| | ||
| | Version 1 | 44 MB | Success - full device activity captured | | ||
| | Version 2 | 41 MB | Success - full device activity captured | | ||
| | Version 3 | Not tested | Kernel trace tested instead (3.0 MB) | | ||
| | Version 4 | 9.7 MB | Success - full device activity captured | | ||
|
|
||
| ### Kernel Trace Analysis | ||
|
|
||
| | Metric | Version 1 | Version 2 | Version 3 | Version 4 | V3/V4 vs V1 | | ||
| |--------|-----------|-----------|-----------|-----------|-------------| | ||
| | Total kernel dispatches | 22,284 | 22,479 | 4,727 | 5,493 | -78.8% / -75.4% | ||
| | Unique kernel types | 64 | 55 | 32 | 33 | -50.0% / -48.4% | ||
| | Total GPU time | 346.21 ms | 378.06 ms | 104.49 ms | 103.36 ms | -69.8% / -70.1% | ||
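The relative changes in the last column can be recomputed from the raw values in the table. The sketch below uses only the numbers reported above; the helper name `percent_change` is illustrative.

```python
def percent_change(new, baseline):
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (Version 1, Version 2, Version 3, Version 4), copied from the table above.
metrics = {
    "Total kernel dispatches": (22284, 22479, 4727, 5493),
    "Unique kernel types": (64, 55, 32, 33),
    "Total GPU time (ms)": (346.21, 378.06, 104.49, 103.36),
}

for name, (v1, v2, v3, v4) in metrics.items():
    print(f"{name}: "
          f"V2 {percent_change(v2, v1):+.1f}%, "
          f"V3 {percent_change(v3, v1):+.1f}%, "
          f"V4 {percent_change(v4, v1):+.1f}%")
```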
|
|
||
| ### Top 3 Kernels by GPU Time | ||
|
|
||
| #### Version 1 (PyTorch Baseline) | ||
|
|
||
| 1. **GEMM kernel** (Cijk_Alik_Bljk...): 30,658 us (127.74 us avg) - 240 calls | ||
| 2. **GEMM kernel** (Cijk_Ailk_Bljk...): 29,954 us (124.81 us avg) - 240 calls | ||
| 3. **GEMM kernel** (Cijk_Alik_Bljk...): 26,641 us (74.00 us avg) - 360 calls | ||
|
|
||
| **Total top 3**: 87,253 us (25.2% of total GPU time) | ||
|
|
||
| #### Version 2 (PyTorch Fused) | ||
|
|
||
| 1. **GEMM kernel** (Cijk_Ailk_Bljk...): 54,678 us (455.65 us avg) - 120 calls | ||
| 2. **GEMM kernel** (Cijk_Alik_Bljk...): 25,482 us (212.35 us avg) - 120 calls | ||
| 3. **bwd_kernel_fuse**: 24,814 us (206.78 us avg) - 120 calls | ||
|
|
||
| **Total top 3**: 104,974 us (27.8% of total GPU time) | ||
|
|
||
| #### Version 3 (Triton Custom Kernels) | ||
|
|
||
| 1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,710 us (123.79 us avg) - 240 calls | ||
| 2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,442 us (79.01 us avg) - 360 calls | ||
| 3. **flash_attention_kernel**: 15,557 us (129.64 us avg) - 120 calls | ||
|
|
||
| **Total top 3**: 73,709 us (70.5% of total GPU time) | ||
|
|
||
| **Note**: Version3's top 3 kernels account for 70.5% of GPU time vs 25-28% in V1/V2, so GPU time is concentrated in far fewer kernels. | ||
|
|
||
| #### Version 4 (PyTorch SDPA + Triton) | ||
|
|
||
| 1. **GEMM kernel** (Cijk_Alik_Bljk...): 29,641 us (123.50 us avg) - 240 calls | ||
| 2. **GEMM kernel** (Cijk_Alik_Bljk...): 28,320 us (78.67 us avg) - 360 calls | ||
| 3. **attn_fwd** (PyTorch SDPA): 13,045 us (108.71 us avg) - 120 calls | ||
|
|
||
| **Total top 3**: 71,006 us (68.7% of total GPU time) | ||
|
|
||
| **Note**: Version4 uses PyTorch SDPA (`attn_fwd`) instead of custom flash attention, but achieves similar performance to version3. | ||
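Per-kernel totals like the ones listed above can be derived from a kernel trace CSV by summing dispatch durations per kernel name. The following is a minimal sketch, assuming the CSV has `Kernel_Name`, `Start_Timestamp` and `End_Timestamp` columns with timestamps in nanoseconds; verify the column names against the actual `*_kernel_trace.csv` header. The sample rows are synthetic and are not taken from the PR's traces.

```python
import csv
import io
from collections import defaultdict


def top_kernels(csv_file, n=3):
    """Return [(name, total_us, avg_us, calls)] for the n kernels with the most GPU time."""
    totals = defaultdict(lambda: [0, 0])  # kernel name -> [total_ns, calls]
    for row in csv.DictReader(csv_file):
        duration_ns = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        entry = totals[row["Kernel_Name"]]
        entry[0] += duration_ns
        entry[1] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(name, total_ns / 1000.0, total_ns / 1000.0 / calls, calls)
            for name, (total_ns, calls) in ranked[:n]]


# Synthetic sample data standing in for a real kernel trace file.
SAMPLE = """Kernel_Name,Start_Timestamp,End_Timestamp
gemm_a,0,100000
gemm_a,200000,320000
attn,400000,450000
norm,500000,505000
"""

result = top_kernels(io.StringIO(SAMPLE), n=2)
for name, total_us, avg_us, calls in result:
    print(f"{name}: {total_us:.2f} us ({avg_us:.2f} us avg) - {calls} calls")
```

For a real trace, replace `io.StringIO(SAMPLE)` with `open(path, newline="")` on the kernel trace file.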
|
|
||
| ### Key Observations | ||
|
|
||
| 1. **Version3 and Version4 achieve similar performance through different approaches**: | ||
| - **Version3**: Custom Triton kernels (`flash_attention_kernel`, `rmsnorm_kernel`) | ||
| - **Version4**: PyTorch SDPA (`attn_fwd`) with Triton fallbacks | ||
| - Both: 75-79% fewer kernel dispatches than version1 | ||
| - Both: ~50% fewer unique kernel types than version1 | ||
| - V3 flash attention: 15,557 us (129.64 us avg) | ||
| - V4 SDPA attention: 13,045 us (108.71 us avg), about 16% less total time than V3's flash attention | ||
|
|
||
| 2. **Version2 fused kernels**: | ||
| - `bwd_kernel_fuse` (24,814 us total) - backward pass fusion | ||
| - `attn_fwd` (12,639 us total) - attention forward fusion | ||
| - These are custom fused operations not present in version1 | ||
| - 14.1% fewer unique kernel types than version1 | ||
| - Marginal performance impact: throughput is 3% higher, but total GPU kernel time is 9.2% higher (378.06 ms vs 346.21 ms) | ||
|
|
||
| 3. **Performance progression**: | ||
| - Version1: Many small kernels, high launch overhead | ||
| - Version2: Some fusion, but still many PyTorch framework kernels | ||
| - Version3: Aggressive fusion with custom Triton kernels | ||
| - 69.8% reduction in GPU time vs version1 | ||
| - 72.4% reduction in GPU time vs version2 | ||
| - 78.8% fewer kernel launches vs version1 | ||
|
|
||
| 4. **Memory efficiency**: | ||
| - Version1: 434.3 MB peak memory | ||
| - Version2: 434.3 MB peak memory | ||
| - Version3: 193.8 MB peak memory (55.4% reduction) | ||
| - Version4: 193.9 MB peak memory (55.3% reduction) | ||
| - The Triton-based versions (V3 and V4) use less than half the memory of V1/V2 | ||
|
|
||
| 5. **Profiler functionality**: | ||
| - rocprofv3 successfully captures GPU activity on all four versions | ||
| - No "no device activity" issue observed | ||
| - GitHub issue #1386 is likely fixed in ROCm 6.4.4, or does not affect this configuration | ||
|
|
||
| ## Performance Comparison | ||
|
|
||
| ### Throughput | ||
|
|
||
| | Version | Samples/sec | Tokens/sec | Speedup vs V1 | | ||
| |---------|-------------|------------|---------------| | ||
| | Version 1 | 240.6 | 30,803 | 1.00x (baseline) | | ||
| | Version 2 | 247.4 | 31,672 | 1.03x | | ||
| | Version 3 | 1,054.8 | 135,014 | **4.38x** | | ||
| | Version 4 | 1,054.5 | 134,972 | **4.38x** | | ||
|
|
||
| Version3 and Version4 both achieve **4.38x speedup** over version1 and **4.26x speedup** over version2. | ||
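The throughput columns are related by simple arithmetic: tokens/sec is samples/sec times the sequence length (128 in this test), and speedup is the ratio of samples/sec values. The sketch below recomputes both from the rounded samples/sec figures in the table, so its tokens/sec values can differ from the table by a few tokens (for example 240.6 × 128 = 30,796.8 vs the reported 30,803), presumably because the report used unrounded measurements.

```python
SEQ_LEN = 128  # tokens per sample, from the test parameters

# Samples/sec as reported in the throughput table.
samples_per_sec = {
    "Version 1": 240.6,
    "Version 2": 247.4,
    "Version 3": 1054.8,
    "Version 4": 1054.5,
}


def tokens_per_sec(sps, seq_len=SEQ_LEN):
    """Tokens processed per second for a given samples/sec rate."""
    return sps * seq_len


def speedup(sps, baseline_sps):
    """Throughput ratio relative to a baseline version."""
    return sps / baseline_sps


v1 = samples_per_sec["Version 1"]
v2 = samples_per_sec["Version 2"]
for version, sps in samples_per_sec.items():
    print(f"{version}: {tokens_per_sec(sps):,.0f} tokens/sec, "
          f"{speedup(sps, v1):.2f}x vs V1, {speedup(sps, v2):.2f}x vs V2")
```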
|
|
||
| ### Batch Processing Time | ||
|
|
||
| | Version | Average Batch Time | Speedup vs V1 | | ||
| |---------|-------------------|---------------| | ||
| | Version 1 | 33.3 ms | 1.00x (baseline) | | ||
| | Version 2 | 32.3 ms | 1.03x | | ||
| | Version 3 | 7.5 ms | **4.44x** | | ||
| | Version 4 | 7.6 ms | **4.38x** | | ||
|
|
||
| ### Memory Usage | ||
|
|
||
| | Version | Peak Memory | Reduction vs V1 | | ||
| |---------|-------------|-----------------| | ||
| | Version 1 | 434.3 MB | baseline | | ||
| | Version 2 | 434.3 MB | 0% | | ||
| | Version 3 | 193.8 MB | **55.4%** | | ||
| | Version 4 | 193.9 MB | **55.3%** | | ||
|
|
||
| Version3 and Version4 both use less than half the memory of version1/version2. | ||
|
|
||
| ## Fusion Impact Analysis | ||
|
|
||
| ### Version2 (PyTorch Fused) | ||
|
|
||
| Version2 reports these fusion optimizations available: | ||
| - QKV Fusion: Available but not active in this run | ||
| - Flash Attention: Available but not active in this run | ||
| - SwiGLU Fusion: Available but not active in this run | ||
| - Torch Compile: Available but failed to activate | ||
|
|
||
| The fused kernels observed (`bwd_kernel_fuse`, `attn_fwd`) suggest some fusion is occurring despite the "not active" status. This may be a reporting issue in the code. | ||
|
|
||
| **Verdict**: Version2 fusion provides minimal benefit (3% speedup) and may have reporting issues. | ||
|
|
||
| ### Version3 (Triton Custom Kernels) | ||
|
|
||
| Version3 reports active Triton optimizations: | ||
| - RMSNorm Kernel: ACTIVE - Fused variance + normalization (1,167 us total, 4.58 us avg) | ||
| - Flash Attention Kernel: ACTIVE - Memory-efficient attention (15,557 us total, 129.64 us avg) | ||
| - SwiGLU Kernel: ACTIVE (not visible in top kernels, likely very fast) | ||
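For readers unfamiliar with the operation behind the RMSNorm kernel, here is a plain-Python reference of what RMSNorm computes: each element is divided by the root mean square of the vector and scaled by a learned weight. This is an illustrative sketch only; it is not the Triton kernel from version3, and the `eps` default is an assumption.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)   # mean of squares (the reduction step)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # reciprocal root mean square
    return [v * inv_rms * w for v, w in zip(x, weight)]  # normalize and scale


out = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(out)
```

A fused kernel performs the reduction and the scaling in a single launch instead of several separate elementwise and reduction kernels, which is consistent with the lower dispatch counts reported above.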
|
|
||
| **Verdict**: Version3 Triton kernels deliver massive performance gains (4.38x speedup) with proper kernel fusion and optimization. | ||
|
|
||
| ### Version4 (PyTorch SDPA + Triton) | ||
|
|
||
| Version4 uses PyTorch's Scaled Dot Product Attention (SDPA) with Triton fallbacks: | ||
| - **attn_fwd** (PyTorch SDPA): 13,045 us total, 108.71 us avg | ||
| - Slightly faster than V3's custom flash attention (15,557 us) | ||
| - Leverages PyTorch's optimized SDPA implementation | ||
| - Custom Triton kernels for other operations (RMSNorm, SwiGLU likely present but not in top kernels) | ||
| - 16% more kernel dispatches than V3 (5,493 vs 4,727) | ||
| - One additional unique kernel type (33 vs 32) | ||
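As a reminder of what scaled dot-product attention computes, the sketch below is a plain-Python reference for a single head: softmax(Q Kᵀ / sqrt(d)) V, with an optional causal mask. In PyTorch this operation is exposed as `torch.nn.functional.scaled_dot_product_attention`; whether version4 calls exactly that entry point is an assumption here, and this reference is not the fused `attn_fwd` kernel.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def sdpa(q, k, v, causal=False):
    """Single-head attention on lists of vectors: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_out = sdpa(q, k, v, causal=True)
full_out = sdpa(q, k, v)
print(causal_out)
print(full_out)
```

With the causal mask, the first position can only attend to itself, so its output equals the first value vector.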
|
|
||
| **Verdict**: Version4 achieves throughput equivalent to version3 (4.38x speedup) using PyTorch SDPA instead of custom flash attention. PyTorch SDPA is slightly more efficient for attention, but V4 has slightly more overhead elsewhere. | ||
|
|
||
| ## Conclusion | ||
|
|
||
| 1. **rocprofv3 works correctly** on all four versions with ROCm 6.4.4 | ||
| 2. **No reproduction of GitHub issue #1386** - all versions show full device activity | ||
|
|
||
| 3. **Version3 and Version4 are equivalent winners**: | ||
| - Both: **4.38x faster** than version1 baseline | ||
| - Both: **4.26x faster** than version2 | ||
| - Both: **~55% less memory** usage | ||
| - Both: **~75-79% fewer** kernel dispatches | ||
| - Both: **~70% reduction** in GPU time | ||
| - V3 uses custom flash attention, V4 uses PyTorch SDPA | ||
| - V4's SDPA is slightly faster (13.0 ms vs 15.6 ms) but has slightly more overhead elsewhere | ||
|
|
||
| 4. **Version2 provides minimal gains**: | ||
| - Only 3% faster than version1 | ||
| - Same memory usage as version1 | ||
| - Some fusion, but not well optimized | ||
| - May have reporting issues with fusion flags | ||
|
|
||
| 5. **Performance progression summary**: | ||
| - V1 baseline: 240.6 samples/sec, 346 ms GPU time, 434 MB memory | ||
| - V2 fused: 247.4 samples/sec, 378 ms GPU time, 434 MB memory (marginal throughput improvement, higher GPU time) | ||
| - V3 custom Triton: 1,054.8 samples/sec, 104 ms GPU time, 194 MB memory (massive improvement) | ||
| - V4 PyTorch SDPA: 1,054.5 samples/sec, 103 ms GPU time, 194 MB memory (equivalent to V3) | ||
|
|
||
| 6. **Key takeaways**: | ||
| - Custom Triton kernels (V3) deliver transformational performance that PyTorch-level fusion (V2) cannot match | ||
| - PyTorch SDPA (V4) provides a practical alternative to custom flash attention without sacrificing performance | ||
| - For production use, V4 may be preferable due to reliance on PyTorch's maintained SDPA implementation | ||
| - For maximum control and customization, V3's fully custom Triton approach is ideal | ||
|
|
||
| ## Files Generated | ||
|
|
||
| ### Version 1 | ||
| - Runtime trace: `version1_pytorch_baseline/traces/trace_*/` | ||
| - Kernel trace: `version1_pytorch_baseline/counters/counter_20251028_164804/1f81e102abe6/9544_kernel_trace.csv` (11.6 MB) | ||
|
|
||
| ### Version 2 | ||
| - Runtime trace: `version2_pytorch_fused/traces/trace_20251028_170752/` (41 MB) | ||
| - Runtime trace (50 steps): `version2_pytorch_fused/github_issue_test/test_20251028_172311/` (149 MB) | ||
| - Kernel trace: `version2_pytorch_fused/counters/counter_20251028_172429/1f81e102abe6/17496_kernel_trace.csv` (10.8 MB) | ||
|
|
||
| ### Version 3 | ||
| - Kernel trace: `version3_triton/counters/counter_20251028_173451/1f81e102abe6/20129_kernel_trace.csv` (3.0 MB) | ||
| - Much smaller trace file due to 78.8% fewer kernel dispatches | ||
|
|
||
| ### Version 4 | ||
| - Runtime trace: `version4_pytorch_sdpa/traces/trace_20251028_174853/` (9.7 MB) | ||
| - Kernel trace: `version4_pytorch_sdpa/counters/counter_20251028_174948/1f81e102abe6/23175_kernel_trace.csv` (3.3 MB) | ||
| - Similar trace sizes to version3 | ||
Review comment: Is this file intended for a general reader? If so, could you provide a link to GitHub issue #1386?