@conde-amd

Summary

Add comprehensive profiling infrastructure for ML examples with ROCm 6.x/7.x compatibility:

  • Inference Benchmark Profiling: rocprof-compute, rocprof-sys integration with kernel trace analysis
  • ROCm 7.x Support: SQLite database analysis tools for new profiler format
  • TinyTransformer Enhancements: ROCm version detection and profiling scripts
  • Documentation: Profiling guides and compatibility notes

Profiling Tools

  • Add rocprof-compute and rocprof-sys profiling scripts for inference benchmarks
  • Implement kernel trace analysis and visualization tools
  • Add SQLite database parsers for ROCm 7.x .db format

ROCm Compatibility

  • Support both ROCm 6.x (CSV-based) and 7.x (SQLite-based) output formats
  • Automatic version detection and format handling
  • Backward-compatible profiling scripts

Sidafa Conde added 17 commits October 28, 2025 12:37

Add test script to validate rocprofv3 profiler capture on baseline PyTorch
implementation. Three-phase test: environment validation, baseline execution,
and profiler capture with Perfetto output.

Script validates:
- GPU visibility and ROCm configuration
- Baseline performance without profiler overhead
- rocprofv3 trace generation (runtime-trace + pftrace format)
- PyTorch profiler integration (optional)

Used to verify profiler instrumentation works correctly before comparing
against fused/Triton implementations where profiling may fail.
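
As an illustration of the environment-validation phase, a check of this kind can be done with a few standard PyTorch calls. This is a hedged sketch, not the script added by this commit; it only assumes a ROCm build of PyTorch.

```python
# Minimal GPU/ROCm visibility check, similar in spirit to the validation phase
# described above. Not the actual test script from this commit.
import torch

def validate_environment() -> None:
    assert torch.cuda.is_available(), "No GPU visible to PyTorch (check HIP_VISIBLE_DEVICES)"
    print(f"PyTorch version  : {torch.__version__}")
    print(f"HIP/ROCm runtime : {torch.version.hip}")        # None on CUDA builds of PyTorch
    print(f"Device           : {torch.cuda.get_device_name(0)}")

if __name__ == "__main__":
    validate_environment()
```
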
Document successful rocprofv3 profiling capture on version1 baseline.
Key findings:
- 44 MB Perfetto trace generated with full GPU kernel activity
- ROCm 6.4.4, PyTorch 2.7.1, RX 7900 XTX (gfx1100)
- Performance: 262.3 samples/sec, 33,571 tokens/sec
- Profiler overhead minimal

Establishes a baseline for comparison against version2 (GitHub issue #1386
reports "no device activity") and version3 (Triton). Version1's success
confirms the profiler environment is configured correctly; the issue is
implementation-specific.

Add patterns for ROCProfiler output artifacts:
- .pftrace files (binary trace output)
- rocprof_* directories (profiler working directories)
- pytorch_profiles/ (PyTorch profiler JSON output)
- Generated trace directories in PyTorch_Profiling examples

Prevents committing large binary profiling data to repository.

Add .gitignore patterns for TinyTransformer profiling output:
- counters/ directories (hardware counter collection)
- traces/ directories (execution traces)
- github_issue_test/ (test case artifacts)

Prevents committing timestamped profiling runs across all versions.

Document profiling comparison across TinyTransformer versions:
- Version 1: PyTorch baseline
- Version 2: PyTorch fused operations
- Version 3: Triton custom kernels
- Version 4: PyTorch SDPA + Triton

Key findings:
- Version3 and Version4 achieve 4.4x speedup over baseline
- 76-79% reduction in kernel dispatches
- Successful rocprofv3 profiling on ROCm 6.4.4 (RX 7900 XTX)
- GitHub issue #1386 does not reproduce

Add profiling workflow scripts for TinyTransformer versions 1-4:

Scripts per version:
- get_trace.sh: Runtime trace collection (pftrace format)
- get_counters.sh: Hardware counter collection
- get_hotspots.sh: Kernel hotspot analysis
- test_rocpd.sh: rocprofv3 validation with GPU activity check

Version2 includes test_github_issue.sh for investigating GitHub issue #1386.

All scripts use consistent parameters (batch-size 8, seq-len 128, num-steps 10)
and timestamped output directories.
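
For orientation, the scripts wrap a rocprofv3 invocation roughly like the sketch below. This is an illustrative Python rendering of the pattern, not the shell scripts themselves: the training entry point (`train.py`) and the `-d` output-directory flag are assumptions, while `--runtime-trace`, `--output-format pftrace`, and the workload parameters are the ones referenced elsewhere in this PR.

```python
# Illustrative sketch of the common trace-collection pattern: timestamped output
# directory plus a rocprofv3 runtime trace around the training script.
import subprocess
from datetime import datetime
from pathlib import Path

out_dir = Path("traces") / datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir.mkdir(parents=True, exist_ok=True)

cmd = [
    "rocprofv3",
    "--runtime-trace",                     # unified runtime trace collection
    "--output-format", "pftrace",          # Perfetto-compatible trace output
    "-d", str(out_dir),                    # output directory (flag assumed)
    "--",                                  # separates profiler options from the application
    "python", "train.py",                  # hypothetical training entry point
    "--batch-size", "8", "--seq-len", "128", "--num-steps", "10",
]
subprocess.run(cmd, check=True)
```
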
Add analyze_kernel_trace.py for post-processing rocprofv3 output:
- Parse kernel dispatch CSV data
- Aggregate statistics per kernel type
- Calculate total/average/min/max execution times
- Sort kernels by total GPU time
- Generate performance summaries

Deployed across all TinyTransformer versions for consistent analysis workflow.
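
The aggregation the script performs can be sketched as below. The CSV column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`, timestamps assumed in nanoseconds) are assumptions about the rocprofv3 kernel-trace layout, not taken from the script itself.

```python
# Sketch of per-kernel aggregation over a rocprofv3 kernel-trace CSV.
# Column names and time units are assumed; adjust to the actual trace header.
import csv
from collections import defaultdict

def summarize(csv_path: str, top_n: int = 10) -> None:
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0, "min_ns": float("inf"), "max_ns": 0})
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            dur = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
            s = stats[row["Kernel_Name"]]
            s["count"] += 1
            s["total_ns"] += dur
            s["min_ns"] = min(s["min_ns"], dur)
            s["max_ns"] = max(s["max_ns"], dur)

    # Sort kernels by total GPU time, descending, and print a short summary.
    for name, s in sorted(stats.items(), key=lambda kv: kv[1]["total_ns"], reverse=True)[:top_n]:
        avg_us = s["total_ns"] / s["count"] / 1e3
        print(f"{name[:60]:60s} n={s['count']:5d} total={s['total_ns'] / 1e6:9.3f} ms avg={avg_us:8.1f} us")
```
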
Add rocprof-compute profiling automation for TinyTransformer versions 1-4:
- Collect detailed GPU performance metrics
- Kernel execution timeline
- Memory transfer analysis
- Hardware counter metrics
- Occupancy statistics

Complements existing rocprofv3 scripts with rocprof-compute's detailed
analysis capabilities. Uses consistent parameters (batch-size 8, seq-len 128,
num-steps 10) and timestamped output directories.
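
For context, a rocprof-compute run of this kind pairs a profile step with an analyze step. The sketch below is an assumption about the wrapper's shape: the workload name, entry point, architecture directory, and the `profile -n ... --` syntax are illustrative, while the `analyze -p ... --dispatch` form follows the syntax suggested later in this PR's review.

```python
# Illustrative pairing of rocprof-compute profile + analyze (names and some flags assumed).
import subprocess

workload = "tinytransformer_v1"                 # hypothetical workload name
arch = "gfx90a"                                 # replace with your GPU architecture directory
app = ["python", "train.py",                    # hypothetical training entry point
       "--batch-size", "8", "--seq-len", "128", "--num-steps", "10"]

# Collect detailed GPU performance metrics for the run.
subprocess.run(["rocprof-compute", "profile", "-n", workload, "--"] + app, check=True)

# Generate a per-dispatch report from the collected workload data.
subprocess.run(["rocprof-compute", "analyze",
                "-p", f"workloads/{workload}/{arch}",
                "--dispatch", "0"], check=True)
```
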
Add rocprof-sys profiling automation for TinyTransformer versions 1-4:
- Call stack sampling
- System trace timeline
- CPU and GPU activity correlation
- Function-level performance breakdown

Generates Perfetto-compatible traces for visualization. Complements rocprofv3
(runtime traces) and rocprof-compute (detailed GPU metrics) with a system-level
profiling perspective.

Note: rocprof-sys may produce memory map dumps in some configurations
(known issue).

Add .gitignore pattern for inference_benchmark profiling output directory.
Prevents committing timestamped profiling runs.

Document ROCm profiling workflow for inference benchmarks:
- rocprofv3 counter collection and kernel traces
- rocprof-compute detailed GPU metrics
- rocprof-sys system-level profiling
- Usage examples for ResNet50 profiling

Provides reference for all profiling scripts in inference_benchmark.

Add ROCm profiling automation for inference benchmarks:
- get_counters.sh: rocprofv3 kernel traces with hardware counters
- get_trace.sh: Runtime trace collection
- get_rocprof_compute.sh: Detailed GPU metrics
- get_rocprof_sys.sh: System-level profiling with Perfetto traces

Scripts configured for ResNet50 (batch-size 64, iterations 10) as default
workload. Output to profiling_results/ with timestamped subdirectories.

Add analyze_kernel_trace.py for post-processing rocprofv3 kernel traces:
- Parse kernel dispatch CSV data
- Aggregate statistics per kernel type
- Calculate performance metrics
- Sort by total GPU time

Adapted for inference_benchmark profiling workflow with automatic
invocation from get_counters.sh.

Add comprehensive documentation for ROCm version compatibility:

- Add compatibility notice for ROCm 6.x and 7.x
- Document different output formats (CSV vs SQLite database)
- Specify version-specific analysis tools
- Include performance comparison examples showing MLIR kernel improvements
- Update requirements to include SQLite3 for ROCm 7.x
- Clarify table naming with UUID suffixes in ROCm 7.x

This documentation helps users understand the differences between
ROCm versions and ensures they use the appropriate analysis tools.

Add Python analysis scripts for ROCm 7.x profiling output format.
ROCm 7.x changed from CSV files to SQLite databases for profiling data,
requiring new analysis tooling.

Key features:
- Parse ROCm 7.x SQLite database format (*_results.db files)
- Handle UUID-suffixed table names (rocpd_kernel_dispatch_<UUID>)
- Extract kernel dispatch data with execution timestamps
- Join with kernel symbol tables for readable kernel names
- Calculate aggregate statistics: count, total/avg/min/max duration
- Display top 20 kernels by GPU time with percentage breakdown

Deployed to all project areas:
- TinyTransformer (all 4 versions: baseline, fused, triton, sdpa)
- inference_benchmark

Works alongside existing analyze_kernel_trace.py for ROCm 6.x compatibility.
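
A query of this kind can be sketched as follows. The UUID-suffixed table discovery mirrors the naming described above; the column names used in the join and aggregation (`start`, `end`, `kernel_id`, `kernel_name`) are assumptions about the rocpd schema and may need adjusting after inspecting the database with `.schema`.

```python
# Sketch: summarize kernels in a ROCm 7.x rocpd SQLite database (*_results.db).
# Table discovery follows the UUID-suffixed naming; column names are assumed.
import sqlite3
import sys

def top_kernels(db_path: str, limit: int = 20) -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()

    # Locate the UUID-suffixed dispatch and kernel-symbol tables.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'rocpd_kernel_dispatch%'")
    dispatch_tbl = cur.fetchone()[0]
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name LIKE '%kernel_symbol%'")
    symbol_tbl = cur.fetchone()[0]

    # Aggregate duration per kernel name (column names assumed).
    query = f"""
        SELECT s.kernel_name,
               COUNT(*)             AS calls,
               SUM(d.end - d.start) AS total_ns,
               AVG(d.end - d.start) AS avg_ns
        FROM {dispatch_tbl} AS d
        JOIN {symbol_tbl}   AS s ON d.kernel_id = s.id
        GROUP BY s.kernel_name
        ORDER BY total_ns DESC
        LIMIT ?
    """
    for name, calls, total_ns, avg_ns in cur.execute(query, (limit,)):
        print(f"{name[:60]:60s} {calls:6d} {total_ns / 1e6:10.3f} ms {avg_ns / 1e3:8.1f} us")
    con.close()

if __name__ == "__main__":
    top_kernels(sys.argv[1])
```
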
Enhance profiling scripts across all TinyTransformer versions with
automatic ROCm version detection and appropriate tool selection.

Changes to get_counters.sh:
- Add multi-method ROCm version detection (rocminfo, ROCM_PATH, hipcc)
- Automatically select analysis tool based on ROCm version
  - ROCm 6.x: analyze_kernel_trace.py (CSV format)
  - ROCm 7.x: analyze_rocpd_db.py (SQLite database)
- Simplify script logic and error handling
- Add descriptive comments explaining profiling purpose

Changes to get_trace.sh:
- Add ROCm version detection
- Conditional --output-format pftrace flag for ROCm 6.4+/7.x
- Enhanced comments explaining runtime trace capture
- Better output organization and error reporting

Minor enhancements to other scripts:
- Updated comments in get_hotspots.sh, get_rocprof_compute.sh, get_rocprof_sys.sh
- Consistent formatting across all versions

Applied uniformly to all four TinyTransformer implementations:
- version1_pytorch_baseline
- version2_pytorch_fused
- version3_triton
- version4_pytorch_sdpa

All scripts remain backward-compatible with ROCm 6.x.
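
The detection logic amounts to the fallback chain sketched below. The shipped scripts implement this in bash; this Python rendering is only illustrative, and the exact strings parsed from rocminfo, the `.info/version` file location, and the hipcc output format are assumptions.

```python
# Illustrative multi-method ROCm major-version detection (the shipped scripts are bash).
import os
import re
import subprocess
from pathlib import Path

def rocm_major_version() -> int | None:
    # 1) Try rocminfo output (banner format assumed).
    try:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True).stdout
        m = re.search(r"ROCm version\s*:?\s*(\d+)", out, re.IGNORECASE)
        if m:
            return int(m.group(1))
    except (OSError, subprocess.CalledProcessError):
        pass

    # 2) Try the version file under the ROCm install prefix (path assumed).
    version_file = Path(os.environ.get("ROCM_PATH", "/opt/rocm")) / ".info" / "version"
    if version_file.exists():
        return int(version_file.read_text().strip().split(".")[0])

    # 3) Fall back to hipcc --version (output format assumed).
    try:
        out = subprocess.run(["hipcc", "--version"], capture_output=True, text=True, check=True).stdout
        m = re.search(r"(\d+)\.\d+", out)
        if m:
            return int(m.group(1))
    except (OSError, subprocess.CalledProcessError):
        pass
    return None
```
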
Update inference_benchmark profiling scripts with automatic ROCm
version detection and support for both ROCm 6.x and 7.x output formats.

Changes to get_counters.sh:
- Add multi-method ROCm version detection (rocminfo, ROCM_PATH, hipcc)
- Automatically select appropriate analysis tool:
  - ROCm 6.x: analyze_kernel_trace.py for CSV output
  - ROCm 7.x: analyze_rocpd_db.py for SQLite database
- Fallback to manual SQLite query instructions if tool not found
- Improved error handling and output display

Changes to get_trace.sh:
- Add ROCm version detection
- Conditional --output-format pftrace for ROCm 6.4+/7.x
- Replace individual trace flags (--hip-trace, --hsa-trace, --marker-trace)
  with unified --runtime-trace for comprehensive timeline capture
- Enhanced comments explaining captured data
- Better handling of output file discovery (pftrace and database)

Minor improvements to get_rocprof_compute.sh and get_rocprof_sys.sh:
- Updated comments for clarity

All scripts maintain backward compatibility with ROCm 6.x while
adding full support for ROCm 7.x SQLite database format.
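
The version-to-tooling mapping applied by these scripts can be summarized as in the sketch below (the function names are illustrative, not the scripts' own; the mapping itself follows the behaviour described above).

```python
# Illustrative mapping from detected ROCm version to output format, analysis tool,
# and rocprofv3 trace flags, mirroring the script behaviour described above.
def select_analysis(rocm_major: int) -> dict:
    if rocm_major >= 7:
        return {"output": "SQLite database (*_results.db)", "tool": "analyze_rocpd_db.py"}
    return {"output": "CSV kernel trace", "tool": "analyze_kernel_trace.py"}

def trace_flags(rocm_version: tuple[int, int]) -> list[str]:
    flags = ["--runtime-trace"]              # unified runtime trace capture
    if rocm_version >= (6, 4):               # pftrace output for ROCm 6.4+ / 7.x
        flags += ["--output-format", "pftrace"]
    return flags
```
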
@gsitaram (Owner) left a comment

Hi @conde-amd, please see my comments on the inference_benchmark files. I have not looked at the new files in TinyTransformer as they are very similar to what I see in the inference_benchmark directory. Once we fix this one, we can apply the same methodology to TinyTransformer versions as well.

@gsitaram (Owner):
This file does not seem to run any profiler. Also, for performance metrics, we tend to use rocprof-compute. Only a ninja developer/performance analyst would know which hardware counters to get for a given kernel and use rocprofv3 for that. Typically, we use rocprofv3 for collecting GPU hotspots, traces of GPU activity, HIP and HSA API activity, Kokkos or RCCL tracing, etc. If we want to show collection of hardware counters using rocprofv3, then we must be clear about what counters we show.

echo ""
echo "To analyze results, use rocprof-compute analyze tools:"
echo " rocprof-compute analyze --help"
echo " rocprof-compute analyze --workload-dir $OUTPUT_DIR"

@gsitaram (Owner):
There is no option called --workload-dir in rocprof-compute. Let's change the second command to the following, which generates a detailed report of performance metrics for all hardware components in the GPU:

rocprof-compute analyze -p workloads/${WORKLOAD_NAME}/<arch> --dispatch <N> >& inference_dispatch_<N>_report.txt

It may be nice to show a couple of small sections of this generated report that are most relevant to the inference benchmark. We need to look at the report to determine what they are.

#
# Compatible with ROCm 6.x and 7.x
#
# NOTE: rocprof-sys may produce memory map dumps in some configurations

@gsitaram (Owner):
What are you referring to? I do not know of this. If it is not relevant, we may want to remove this note. If it is relevant, a link to that GitHub issue may be helpful.

echo "Starting rocprof-sys profiling for inference_benchmark..."
echo "Output directory: $OUTPUT_DIR"
echo ""
echo "NOTE: If you see excessive memory map output, this is a known issue."

@gsitaram (Owner):
Same comment as above. We need to remove this note if it is not relevant anymore.

echo "Generated files:"
ls -lh "$OUTPUT_DIR"
echo ""
echo "To analyze results, use rocprof-sys tools:"

@gsitaram (Owner):
Remove these three lines. The rocprof-sys-avail tool is not used for analysis. There is no command called rocprof-sys-analyze. What we do instead is to open the resulting .proto file in the Perfetto UI: https://ui.perfetto.dev.

@gsitaram (Owner):
In addition to informing the reader to use Perfetto UI, we can show a couple of snapshots of the obtained trace here.

@gsitaram (Owner):
I wonder if all these must be documented in separate scripts. What do you think of just using one README.md file for this directory that walks the user through all the commands shown in get_counters.sh, get_rocprof_compute.sh, get_rocprof_sys.sh, and get_trace.sh with snapshots of outputs from those commands? That would keep it simple and help us use this example as a hands-on exercise for training purposes.

Sidafa Conde added 12 commits January 14, 2026 09:34

The directory name better reflects that this example runs forward and
backward passes (micro-benchmarking) rather than pure inference.

Related to fix/inference-benchmark-pr-comments

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Users should use official rocpd tools instead:
- rocpd2csv for CSV export
- rocpd summary for kernel statistics

Related to fix/inference-benchmark-pr-comments

- Update directory references from inference_benchmark to pytorch_microbench
- get_counters.sh: remove embedded Python script, use rocpd2csv for analysis
- get_rocprof_compute.sh: fix analyze command syntax
- get_rocprof_sys.sh: update analysis to use Perfetto UI, simplify notes
- get_trace.sh: update echo messages

Related to fix/inference-benchmark-pr-comments

Consolidate documentation into single walkthrough README with:
- Feature overview of profiling scripts
- Step-by-step usage instructions for each profiling tool
- Analysis commands using official rocpd tools (rocpd2csv, rocpd summary)
- Clarify this is micro-benchmarking (forward+backward), not inference

Related to fix/inference-benchmark-pr-comments

Add real example outputs captured on Radeon RX 7900 XTX with ROCm 6.4:
- Basic benchmark output showing ~360 img/sec throughput
- get_trace.sh output showing 25MB Perfetto trace generation
- get_counters.sh output with kernel trace analysis
- Top kernels showing MIOpen convolutions dominating execution time
- Notes on hardware counter availability for consumer vs data center GPUs

Related to fix/inference-benchmark-pr-comments

Remove custom Python analysis scripts (analyze_kernel_trace.py and
analyze_rocpd_db.py) per PR review feedback. Users should use the
standard rocpd tools instead:
- rocpd2csv: Export database to CSV
- rocpd summary: Get kernel statistics
- Update header comments to reference TinyTransformer instead of
  inference_benchmark
- Add rocpd2csv and rocpd summary instructions for kernel analysis
- Add proper rocprof-compute analyze syntax with dispatch option
- Simplify rocprof-sys output to reference Perfetto UI directly
- Update memory map warning note format

Replace workshop-style documentation with concise example format:
- Add intro paragraph describing the baseline model
- Document command-line arguments
- Add sections for each ROCm profiling script
- Include usage instructions and analysis commands

Rewrite markdown files to follow concise GhostExchange format:
- IMPORTTIME_PROFILING.md: 266 -> 82 lines
- PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md: 2368 -> 200 lines
- ROCPROFV3_VERSION1_RESULTS.md: 193 -> 67 lines
- exercise_1_baseline_analysis.md: 256 -> 78 lines
- exercise_2_memory_analysis.md: 331 -> 91 lines
- exercise_3_bottleneck_identification.md: 359 -> 85 lines

Focus on essential usage examples and profiling commands.

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Use rocpd tools (rocpd2csv, rocpd summary) instead for database analysis.

…rmat

- Condense README.md from 813 lines to 172 lines
- Update profiling scripts with TinyTransformer V2 references
- Add rocpd tool instructions for ROCm 7.x database analysis
- Add analyze command syntax to get_rocprof_compute.sh
- Fix incomplete get_counters.sh script

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Use rocpd tools (rocpd2csv, rocpd summary) instead for database analysis.

Sidafa Conde added 3 commits January 14, 2026 13:14

…rmat

- Condense README.md from 810 to 178 lines
- Condense README_WORKSHOP.md from 395 to 77 lines
- Condense exercise markdown files (exercise1, exercise2, exercise3)
- Condense performance_debugging README.md and WORKSHOP_GUIDE.md
- Update profiling scripts with rocpd tool instructions
- Add ROCm 6.x/7.x compatibility notes

Per PR review, remove custom analysis scripts that duplicate
functionality available in rocpd tools. Users should use:
- rocpd2csv for CSV export
- rocpd summary for kernel statistics

…rmat

- Condense README.md from 1037 to 179 lines
- Condense exercise file from 525 to 79 lines
- Update profiling scripts with rocpd tool instructions
- Add ROCm 6.x/7.x compatibility notes
- Add data center GPU requirement note to rocprof-compute