@conde-amd

Summary

Add comprehensive profiling infrastructure for ML examples with ROCm 6.x/7.x compatibility:

  • Inference Benchmark Profiling: rocprof-compute, rocprof-sys integration with kernel trace analysis
  • ROCm 7.x Support: SQLite database analysis tools for new profiler format
  • TinyTransformer Enhancements: ROCm version detection and profiling scripts
  • Documentation: Profiling guides and compatibility notes

Profiling Tools

  • Add rocprof-compute and rocprof-sys profiling scripts for inference benchmarks
  • Implement kernel trace analysis and visualization tools
  • Add SQLite database parsers for ROCm 7.x .db format

ROCm Compatibility

  • Support both ROCm 6.x (CSV-based) and 7.x (SQLite-based) output formats
  • Automatic version detection and format handling
  • Backward-compatible profiling scripts

Sidafa Conde added 17 commits October 28, 2025 12:37

Add test script to validate rocprofv3 profiler capture on baseline PyTorch
implementation. Three-phase test: environment validation, baseline execution,
and profiler capture with Perfetto output.

Script validates:
- GPU visibility and ROCm configuration
- Baseline performance without profiler overhead
- rocprofv3 trace generation (runtime-trace + pftrace format)
- PyTorch profiler integration (optional)

Used to verify profiler instrumentation works correctly before comparing
against fused/Triton implementations where profiling may fail.
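
As an illustration of the environment-validation phase, a check of this kind can be done with a few standard PyTorch calls. This is a hedged sketch, not the script added by this commit; it only assumes a ROCm build of PyTorch.

```python
# Minimal GPU/ROCm visibility check, similar in spirit to the validation phase
# described above. Not the actual test script from this commit.
import torch

def validate_environment() -> None:
    assert torch.cuda.is_available(), "No GPU visible to PyTorch (check HIP_VISIBLE_DEVICES)"
    print(f"PyTorch version  : {torch.__version__}")
    print(f"HIP/ROCm runtime : {torch.version.hip}")        # None on CUDA builds of PyTorch
    print(f"Device           : {torch.cuda.get_device_name(0)}")

if __name__ == "__main__":
    validate_environment()
```
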
Document successful rocprofv3 profiling capture on version1 baseline.
Key findings:
- 44 MB Perfetto trace generated with full GPU kernel activity
- ROCm 6.4.4, PyTorch 2.7.1, RX 7900 XTX (gfx1100)
- Performance: 262.3 samples/sec, 33,571 tokens/sec
- Profiler overhead minimal

Establishes a baseline for comparison against version2 (GitHub issue #1386
reports "no device activity") and version3 (Triton). Version1's success
confirms the profiler environment is configured correctly; the issue is
implementation-specific.

Add patterns for ROCProfiler output artifacts:
- .pftrace files (binary trace output)
- rocprof_* directories (profiler working directories)
- pytorch_profiles/ (PyTorch profiler JSON output)
- Generated trace directories in PyTorch_Profiling examples

Prevents committing large binary profiling data to repository.

Add .gitignore patterns for TinyTransformer profiling output:
- counters/ directories (hardware counter collection)
- traces/ directories (execution traces)
- github_issue_test/ (test case artifacts)

Prevents committing timestamped profiling runs across all versions.

Document profiling comparison across TinyTransformer versions:
- Version 1: PyTorch baseline
- Version 2: PyTorch fused operations
- Version 3: Triton custom kernels
- Version 4: PyTorch SDPA + Triton

Key findings:
- Version3 and Version4 achieve 4.4x speedup over baseline
- 76-79% reduction in kernel dispatches
- Successful rocprofv3 profiling on ROCm 6.4.4 (RX 7900 XTX)
- GitHub issue #1386 does not reproduce

Add profiling workflow scripts for TinyTransformer versions 1-4:

Scripts per version:
- get_trace.sh: Runtime trace collection (pftrace format)
- get_counters.sh: Hardware counter collection
- get_hotspots.sh: Kernel hotspot analysis
- test_rocpd.sh: rocprofv3 validation with GPU activity check

Version2 includes test_github_issue.sh for investigating GitHub issue #1386.

All scripts use consistent parameters (batch-size 8, seq-len 128, num-steps 10)
and timestamped output directories.
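
For orientation, the scripts wrap a rocprofv3 invocation roughly like the sketch below. This is an illustrative Python rendering of the pattern, not the shell scripts themselves: the training entry point (`train.py`) and the `-d` output-directory flag are assumptions, while `--runtime-trace`, `--output-format pftrace`, and the workload parameters are the ones referenced elsewhere in this PR.

```python
# Illustrative sketch of the common trace-collection pattern: timestamped output
# directory plus a rocprofv3 runtime trace around the training script.
import subprocess
from datetime import datetime
from pathlib import Path

out_dir = Path("traces") / datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir.mkdir(parents=True, exist_ok=True)

cmd = [
    "rocprofv3",
    "--runtime-trace",                     # unified runtime trace collection
    "--output-format", "pftrace",          # Perfetto-compatible trace output
    "-d", str(out_dir),                    # output directory (flag assumed)
    "--",                                  # separates profiler options from the application
    "python", "train.py",                  # hypothetical training entry point
    "--batch-size", "8", "--seq-len", "128", "--num-steps", "10",
]
subprocess.run(cmd, check=True)
```
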
Add analyze_kernel_trace.py for post-processing rocprofv3 output:
- Parse kernel dispatch CSV data
- Aggregate statistics per kernel type
- Calculate total/average/min/max execution times
- Sort kernels by total GPU time
- Generate performance summaries

Deployed across all TinyTransformer versions for consistent analysis workflow.
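
The aggregation the script performs can be sketched as below. The CSV column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`, timestamps assumed in nanoseconds) are assumptions about the rocprofv3 kernel-trace layout, not taken from the script itself.

```python
# Sketch of per-kernel aggregation over a rocprofv3 kernel-trace CSV.
# Column names and time units are assumed; adjust to the actual trace header.
import csv
from collections import defaultdict

def summarize(csv_path: str, top_n: int = 10) -> None:
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0, "min_ns": float("inf"), "max_ns": 0})
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            dur = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
            s = stats[row["Kernel_Name"]]
            s["count"] += 1
            s["total_ns"] += dur
            s["min_ns"] = min(s["min_ns"], dur)
            s["max_ns"] = max(s["max_ns"], dur)

    # Sort kernels by total GPU time, descending, and print a short summary.
    for name, s in sorted(stats.items(), key=lambda kv: kv[1]["total_ns"], reverse=True)[:top_n]:
        avg_us = s["total_ns"] / s["count"] / 1e3
        print(f"{name[:60]:60s} n={s['count']:5d} total={s['total_ns'] / 1e6:9.3f} ms avg={avg_us:8.1f} us")
```
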
Add rocprof-compute profiling automation for TinyTransformer versions 1-4:
- Collect detailed GPU performance metrics
- Kernel execution timeline
- Memory transfer analysis
- Hardware counter metrics
- Occupancy statistics

Complements existing rocprofv3 scripts with rocprof-compute's detailed
analysis capabilities. Uses consistent parameters (batch-size 8, seq-len 128,
num-steps 10) and timestamped output directories.
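
For context, a rocprof-compute run of this kind pairs a profile step with an analyze step. The sketch below is an assumption about the wrapper's shape: the workload name, entry point, architecture directory, and the `profile -n ... --` syntax are illustrative, while the `analyze -p ... --dispatch` form follows the syntax suggested later in this PR's review.

```python
# Illustrative pairing of rocprof-compute profile + analyze (names and some flags assumed).
import subprocess

workload = "tinytransformer_v1"                 # hypothetical workload name
arch = "gfx90a"                                 # replace with your GPU architecture directory
app = ["python", "train.py",                    # hypothetical training entry point
       "--batch-size", "8", "--seq-len", "128", "--num-steps", "10"]

# Collect detailed GPU performance metrics for the run.
subprocess.run(["rocprof-compute", "profile", "-n", workload, "--"] + app, check=True)

# Generate a per-dispatch report from the collected workload data.
subprocess.run(["rocprof-compute", "analyze",
                "-p", f"workloads/{workload}/{arch}",
                "--dispatch", "0"], check=True)
```
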
Add rocprof-sys profiling automation for TinyTransformer versions 1-4:
- Call stack sampling
- System trace timeline
- CPU and GPU activity correlation
- Function-level performance breakdown

Generates Perfetto-compatible traces for visualization. Complements rocprofv3
(runtime traces) and rocprof-compute (detailed GPU metrics) with a system-level
profiling perspective.

Note: rocprof-sys may produce memory map dumps in some configurations
(known issue).

Add .gitignore pattern for inference_benchmark profiling output directory.
Prevents committing timestamped profiling runs.

Document ROCm profiling workflow for inference benchmarks:
- rocprofv3 counter collection and kernel traces
- rocprof-compute detailed GPU metrics
- rocprof-sys system-level profiling
- Usage examples for ResNet50 profiling

Provides reference for all profiling scripts in inference_benchmark.

Add ROCm profiling automation for inference benchmarks:
- get_counters.sh: rocprofv3 kernel traces with hardware counters
- get_trace.sh: Runtime trace collection
- get_rocprof_compute.sh: Detailed GPU metrics
- get_rocprof_sys.sh: System-level profiling with Perfetto traces

Scripts configured for ResNet50 (batch-size 64, iterations 10) as default
workload. Output to profiling_results/ with timestamped subdirectories.

Add analyze_kernel_trace.py for post-processing rocprofv3 kernel traces:
- Parse kernel dispatch CSV data
- Aggregate statistics per kernel type
- Calculate performance metrics
- Sort by total GPU time

Adapted for inference_benchmark profiling workflow with automatic
invocation from get_counters.sh.

Add comprehensive documentation for ROCm version compatibility:

- Add compatibility notice for ROCm 6.x and 7.x
- Document different output formats (CSV vs SQLite database)
- Specify version-specific analysis tools
- Include performance comparison examples showing MLIR kernel improvements
- Update requirements to include SQLite3 for ROCm 7.x
- Clarify table naming with UUID suffixes in ROCm 7.x

This documentation helps users understand the differences between
ROCm versions and ensures they use the appropriate analysis tools.

Add Python analysis scripts for ROCm 7.x profiling output format.
ROCm 7.x changed from CSV files to SQLite databases for profiling data,
requiring new analysis tooling.

Key features:
- Parse ROCm 7.x SQLite database format (*_results.db files)
- Handle UUID-suffixed table names (rocpd_kernel_dispatch_<UUID>)
- Extract kernel dispatch data with execution timestamps
- Join with kernel symbol tables for readable kernel names
- Calculate aggregate statistics: count, total/avg/min/max duration
- Display top 20 kernels by GPU time with percentage breakdown

Deployed to all project areas:
- TinyTransformer (all 4 versions: baseline, fused, triton, sdpa)
- inference_benchmark

Works alongside existing analyze_kernel_trace.py for ROCm 6.x compatibility.
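
A query of this kind can be sketched as follows. The UUID-suffixed table discovery mirrors the naming described above; the column names used in the join and aggregation (`start`, `end`, `kernel_id`, `kernel_name`) are assumptions about the rocpd schema and may need adjusting after inspecting the database with `.schema`.

```python
# Sketch: summarize kernels in a ROCm 7.x rocpd SQLite database (*_results.db).
# Table discovery follows the UUID-suffixed naming; column names are assumed.
import sqlite3
import sys

def top_kernels(db_path: str, limit: int = 20) -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()

    # Locate the UUID-suffixed dispatch and kernel-symbol tables.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'rocpd_kernel_dispatch%'")
    dispatch_tbl = cur.fetchone()[0]
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name LIKE '%kernel_symbol%'")
    symbol_tbl = cur.fetchone()[0]

    # Aggregate duration per kernel name (column names assumed).
    query = f"""
        SELECT s.kernel_name,
               COUNT(*)             AS calls,
               SUM(d.end - d.start) AS total_ns,
               AVG(d.end - d.start) AS avg_ns
        FROM {dispatch_tbl} AS d
        JOIN {symbol_tbl}   AS s ON d.kernel_id = s.id
        GROUP BY s.kernel_name
        ORDER BY total_ns DESC
        LIMIT ?
    """
    for name, calls, total_ns, avg_ns in cur.execute(query, (limit,)):
        print(f"{name[:60]:60s} {calls:6d} {total_ns / 1e6:10.3f} ms {avg_ns / 1e3:8.1f} us")
    con.close()

if __name__ == "__main__":
    top_kernels(sys.argv[1])
```
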
Enhance profiling scripts across all TinyTransformer versions with
automatic ROCm version detection and appropriate tool selection.

Changes to get_counters.sh:
- Add multi-method ROCm version detection (rocminfo, ROCM_PATH, hipcc)
- Automatically select analysis tool based on ROCm version
  - ROCm 6.x: analyze_kernel_trace.py (CSV format)
  - ROCm 7.x: analyze_rocpd_db.py (SQLite database)
- Simplify script logic and error handling
- Add descriptive comments explaining profiling purpose

Changes to get_trace.sh:
- Add ROCm version detection
- Conditional --output-format pftrace flag for ROCm 6.4+/7.x
- Enhanced comments explaining runtime trace capture
- Better output organization and error reporting

Minor enhancements to other scripts:
- Updated comments in get_hotspots.sh, get_rocprof_compute.sh, get_rocprof_sys.sh
- Consistent formatting across all versions

Applied uniformly to all four TinyTransformer implementations:
- version1_pytorch_baseline
- version2_pytorch_fused
- version3_triton
- version4_pytorch_sdpa

All scripts remain backward-compatible with ROCm 6.x.
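
The detection logic amounts to the fallback chain sketched below. The shipped scripts implement this in bash; this Python rendering is only illustrative, and the exact strings parsed from rocminfo, the `.info/version` file location, and the hipcc output format are assumptions.

```python
# Illustrative multi-method ROCm major-version detection (the shipped scripts are bash).
import os
import re
import subprocess
from pathlib import Path

def rocm_major_version() -> int | None:
    # 1) Try rocminfo output (banner format assumed).
    try:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True, check=True).stdout
        m = re.search(r"ROCm version\s*:?\s*(\d+)", out, re.IGNORECASE)
        if m:
            return int(m.group(1))
    except (OSError, subprocess.CalledProcessError):
        pass

    # 2) Try the version file under the ROCm install prefix (path assumed).
    version_file = Path(os.environ.get("ROCM_PATH", "/opt/rocm")) / ".info" / "version"
    if version_file.exists():
        return int(version_file.read_text().strip().split(".")[0])

    # 3) Fall back to hipcc --version (output format assumed).
    try:
        out = subprocess.run(["hipcc", "--version"], capture_output=True, text=True, check=True).stdout
        m = re.search(r"(\d+)\.\d+", out)
        if m:
            return int(m.group(1))
    except (OSError, subprocess.CalledProcessError):
        pass
    return None
```
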
Update inference_benchmark profiling scripts with automatic ROCm
version detection and support for both ROCm 6.x and 7.x output formats.

Changes to get_counters.sh:
- Add multi-method ROCm version detection (rocminfo, ROCM_PATH, hipcc)
- Automatically select appropriate analysis tool:
  - ROCm 6.x: analyze_kernel_trace.py for CSV output
  - ROCm 7.x: analyze_rocpd_db.py for SQLite database
- Fallback to manual SQLite query instructions if tool not found
- Improved error handling and output display

Changes to get_trace.sh:
- Add ROCm version detection
- Conditional --output-format pftrace for ROCm 6.4+/7.x
- Replace individual trace flags (--hip-trace, --hsa-trace, --marker-trace)
  with unified --runtime-trace for comprehensive timeline capture
- Enhanced comments explaining captured data
- Better handling of output file discovery (pftrace and database)

Minor improvements to get_rocprof_compute.sh and get_rocprof_sys.sh:
- Updated comments for clarity

All scripts maintain backward compatibility with ROCm 6.x while
adding full support for ROCm 7.x SQLite database format.
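
The version-to-tooling mapping applied by these scripts can be summarized as in the sketch below (the function names are illustrative, not the scripts' own; the mapping itself follows the behaviour described above).

```python
# Illustrative mapping from detected ROCm version to output format, analysis tool,
# and rocprofv3 trace flags, mirroring the script behaviour described above.
def select_analysis(rocm_major: int) -> dict:
    if rocm_major >= 7:
        return {"output": "SQLite database (*_results.db)", "tool": "analyze_rocpd_db.py"}
    return {"output": "CSV kernel trace", "tool": "analyze_kernel_trace.py"}

def trace_flags(rocm_version: tuple[int, int]) -> list[str]:
    flags = ["--runtime-trace"]              # unified runtime trace capture
    if rocm_version >= (6, 4):               # pftrace output for ROCm 6.4+ / 7.x
        flags += ["--output-format", "pftrace"]
    return flags
```
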
@gsitaram (Owner) left a comment

Hi @conde-amd, please see my comments on the inference_benchmark files. I have not looked at the new files in TinyTransformer as they are very similar to what I see in the inference_benchmark directory. Once we fix this one, we can apply the same methodology to TinyTransformer versions as well.

@gsitaram (Owner):
This file does not seem to run any profiler. Also, for performance metrics, we tend to use rocprof-compute. Only a ninja developer/performance analyst would know which hardware counters to get for a given kernel and use rocprofv3 for that. Typically, we use rocprofv3 for collecting GPU hotspots, traces of GPU activity, HIP and HSA API activity, Kokkos or RCCL tracing, etc. If we want to show collection of hardware counters using rocprofv3, then we must be clear about what counters we show.

echo ""
echo "To analyze results, use rocprof-compute analyze tools:"
echo " rocprof-compute analyze --help"
echo " rocprof-compute analyze --workload-dir $OUTPUT_DIR"

@gsitaram (Owner):
There is no option called --workload-dir in rocprof-compute. Let's change the second command to the following, which generates a detailed report of performance metrics for all hardware components in the GPU:

rocprof-compute analyze -p workloads/${WORKLOAD_NAME}/<arch> --dispatch <N> >& inference_dispatch_<N>_report.txt

It may be nice to show a couple of small sections of this generated report that are most relevant to the inference benchmark. We need to look at the report to determine what they are.

#
# Compatible with ROCm 6.x and 7.x
#
# NOTE: rocprof-sys may produce memory map dumps in some configurations

@gsitaram (Owner):
What are you referring to? I do not know of this. If it is not relevant, we may want to remove this note. If it is relevant, a link to that GitHub issue may be helpful.

echo "Starting rocprof-sys profiling for inference_benchmark..."
echo "Output directory: $OUTPUT_DIR"
echo ""
echo "NOTE: If you see excessive memory map output, this is a known issue."

@gsitaram (Owner):
Same comment as above. We need to remove this note if it is not relevant anymore.

echo "Generated files:"
ls -lh "$OUTPUT_DIR"
echo ""
echo "To analyze results, use rocprof-sys tools:"

@gsitaram (Owner):
Remove these three lines. The rocprof-sys-avail tool is not used for analysis. There is no command called rocprof-sys-analyze. What we do instead is to open the resulting .proto file in the Perfetto UI: https://ui.perfetto.dev.

@gsitaram (Owner):
In addition to informing the reader to use Perfetto UI, we can show a couple of snapshots of the obtained trace here.

@gsitaram (Owner):
I wonder if all these must be documented in separate scripts. What do you think of just using one README.md file for this directory that walks the user through all the commands shown in get_counters.sh, get_rocprof_compute.sh, get_rocprof_sys.sh, and get_trace.sh with snapshots of outputs from those commands? That would keep it simple and help us use this example as a hands-on exercise for training purposes.

Sidafa Conde added 12 commits January 14, 2026 09:34

The directory name better reflects that this example runs forward and
backward passes (micro-benchmarking) rather than pure inference.

Related to fix/inference-benchmark-pr-comments

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Users should use official rocpd tools instead:
- rocpd2csv for CSV export
- rocpd summary for kernel statistics

Related to fix/inference-benchmark-pr-comments

- Update directory references from inference_benchmark to pytorch_microbench
- get_counters.sh: remove embedded Python script, use rocpd2csv for analysis
- get_rocprof_compute.sh: fix analyze command syntax
- get_rocprof_sys.sh: update analysis to use Perfetto UI, simplify notes
- get_trace.sh: update echo messages

Related to fix/inference-benchmark-pr-comments

Consolidate documentation into single walkthrough README with:
- Feature overview of profiling scripts
- Step-by-step usage instructions for each profiling tool
- Analysis commands using official rocpd tools (rocpd2csv, rocpd summary)
- Clarify this is micro-benchmarking (forward+backward), not inference

Related to fix/inference-benchmark-pr-comments

Add real example outputs captured on Radeon RX 7900 XTX with ROCm 6.4:
- Basic benchmark output showing ~360 img/sec throughput
- get_trace.sh output showing 25MB Perfetto trace generation
- get_counters.sh output with kernel trace analysis
- Top kernels showing MIOpen convolutions dominating execution time
- Notes on hardware counter availability for consumer vs data center GPUs

Related to fix/inference-benchmark-pr-comments

Remove custom Python analysis scripts (analyze_kernel_trace.py and
analyze_rocpd_db.py) per PR review feedback. Users should use the
standard rocpd tools instead:
- rocpd2csv: Export database to CSV
- rocpd summary: Get kernel statistics
- Update header comments to reference TinyTransformer instead of
  inference_benchmark
- Add rocpd2csv and rocpd summary instructions for kernel analysis
- Add proper rocprof-compute analyze syntax with dispatch option
- Simplify rocprof-sys output to reference Perfetto UI directly
- Update memory map warning note format

Replace workshop-style documentation with concise example format:
- Add intro paragraph describing the baseline model
- Document command-line arguments
- Add sections for each ROCm profiling script
- Include usage instructions and analysis commands

Rewrite markdown files to follow concise GhostExchange format:
- IMPORTTIME_PROFILING.md: 266 -> 82 lines
- PYTORCH_BASELINE_WORKSHOP_WALKTHROUGH.md: 2368 -> 200 lines
- ROCPROFV3_VERSION1_RESULTS.md: 193 -> 67 lines
- exercise_1_baseline_analysis.md: 256 -> 78 lines
- exercise_2_memory_analysis.md: 331 -> 91 lines
- exercise_3_bottleneck_identification.md: 359 -> 85 lines

Focus on essential usage examples and profiling commands.

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Use rocpd tools (rocpd2csv, rocpd summary) instead for database analysis.

…rmat

- Condense README.md from 813 lines to 172 lines
- Update profiling scripts with TinyTransformer V2 references
- Add rocpd tool instructions for ROCm 7.x database analysis
- Add analyze command syntax to get_rocprof_compute.sh
- Fix incomplete get_counters.sh script

Remove analyze_kernel_trace.py and analyze_rocpd_db.py per PR review.
Use rocpd tools (rocpd2csv, rocpd summary) instead for database analysis.

Sidafa Conde added 3 commits January 14, 2026 13:14

…rmat

- Condense README.md from 810 to 178 lines
- Condense README_WORKSHOP.md from 395 to 77 lines
- Condense exercise markdown files (exercise1, exercise2, exercise3)
- Condense performance_debugging README.md and WORKSHOP_GUIDE.md
- Update profiling scripts with rocpd tool instructions
- Add ROCm 6.x/7.x compatibility notes

Per PR review, remove custom analysis scripts that duplicate
functionality available in rocpd tools. Users should use:
- rocpd2csv for CSV export
- rocpd summary for kernel statistics

…rmat

- Condense README.md from 1037 to 179 lines
- Condense exercise file from 525 to 79 lines
- Update profiling scripts with rocpd tool instructions
- Add ROCm 6.x/7.x compatibility notes
- Add data center GPU requirement note to rocprof-compute