Merged
@@ -1,16 +1,16 @@
---
title: Find CPU Cycle Hotspots with Arm Performix
title: Find Code Hotspots with Arm Performix

draft: true
cascade:
draft: true

minutes_to_complete: 30

who_is_this_for: Software developers and performance engineers who want to identify CPU cycle hotspots in applications running on Arm Linux systems.
who_is_this_for: Software developers and performance engineers who want to identify code hotspots in applications running on Arm Linux systems.

learning_objectives:
- Run the CPU Cycle Hotspot recipe in Arm Performix
- Run the Code Hotspots recipe in Arm Performix
- Identify which functions consume the most CPU cycles and target them for optimization

prerequisites:
Binary file not shown.
@@ -26,4 +26,4 @@ This Learning Path does not cover flame graphs in depth. To learn more, see [Bre

On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short-lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
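
The "folded format" step mentioned above can be sketched in a few lines of C++. This is only an illustration of the idea, not how `perf` or Arm Performix implement it: the `Stack` alias, `fold_stacks`, and `to_folded_text` are hypothetical names, and the samples are assumed to be already resolved to function names.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One sampled call stack, ordered from the outermost frame (for example main)
// to the innermost frame. Hypothetical type for illustration only.
using Stack = std::vector<std::string>;

// Collapse sampled stacks into folded form: "frame1;frame2;frame3" -> count.
// Identical stacks are merged and their sample counts are summed.
std::map<std::string, int> fold_stacks(const std::vector<Stack>& samples) {
    std::map<std::string, int> folded;
    for (const Stack& stack : samples) {
        std::string key;
        for (std::size_t i = 0; i < stack.size(); ++i) {
            if (i > 0) key += ';';
            key += stack[i];
        }
        ++folded[key];
    }
    return folded;
}

// Render the folded map as text, one "stack count" line per unique stack.
std::string to_folded_text(const std::map<std::string, int>& folded) {
    std::ostringstream out;
    for (const auto& entry : folded) {
        out << entry.first << ' ' << entry.second << '\n';
    }
    return out.str();
}
```

Each output line pairs a unique stack with the number of samples that hit it. A flame graph renderer can turn that count into the width of the corresponding frame.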

Arm has built a tool, Arm Performix that simplifies this workflow through the CPU Cycle hotspot recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path.
Arm has built a tool, Arm Performix, that simplifies this workflow through the Code Hotspots recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path.
@@ -6,12 +6,12 @@ weight: 4
layout: learningpathall
---

## Run CPU Cycle Hotspot Recipe
## Run Code Hotspots Recipe

As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.
As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the Code Hotspots recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.

{{% notice Note %}}
The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from `/tmp/atperf/tools/atperf-agent`, so the image would be written there rather than to your project directory. Replace the first string argument with the absolute path to your `images` folder (for example, `/home/ec2-user/Mandelbrot-Example/green.bmp`) and rebuild the application before continuing.
The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from a temporary location, so the image would be written there rather than to your project directory. To ensure the output is saved where you expect it, update the first string argument in `main.cpp` to the absolute path of the output file, for example `/home/ec2-user/Mandelbrot-Example/images/green.bmp`.
{{% /notice %}}

```cpp
@@ -23,35 +23,40 @@ using namespace std;
int main(){

Mandelbrot::Mandelbrot myplot(1920, 1080);
myplot.draw("./images/green.bmp", Mandelbrot::Mandelbrot::GREEN);
myplot.draw("/home/ec2-user/Mandelbrot-Example/images/green.bmp", Mandelbrot::Mandelbrot::GREEN);

return 0;
}
```
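
To see why the relative path in the note above depends on where the binary is launched, the following sketch resolves a path against the current working directory. It assumes a C++17 compiler with `std::filesystem`. `resolve_output_path` is a hypothetical helper and is not part of the Mandelbrot example.

```cpp
#include <filesystem>
#include <string>

// Resolve a possibly relative output path against the process's current
// working directory. A relative path such as "./images/green.bmp" therefore
// lands wherever the process was started, not where the source tree lives.
std::filesystem::path resolve_output_path(const std::string& path) {
    return std::filesystem::absolute(path).lexically_normal();
}
```

Printing `resolve_output_path("./images/green.bmp")` at the start of `main` is one way to check where the image will be written for a given launch method.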
Rebuild the application before continuing:

Open APX from the host machine. Select the **CPU Cycle Hotspot** recipe. If this is the first time running the recipe on this target machine you may need to select the install tools button.
```bash
./build.sh
```

![The Arm Performix recipe selection screen with the CPU Cycle Hotspot recipe highlighted#center](./install-tools.jpg "Selecting the CPU Cycle Hotspot recipe")
Open APX from the host machine. Select the **Code Hotspots** recipe. If this is the first time running the recipe on this target machine, you may need to select the install tools button.

![The Arm Performix recipe selection screen with the Code Hotspots recipe highlighted#center](./install-tools.jpg "Selecting the Code Hotspots recipe")

Configure the recipe to launch a new process. APX will automatically start collecting metrics when the program starts and stop when the program exits.

Provide the absolute path to the binary built in the previous step: `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot`.

Use the default sampling rate of **Normal**. If your application is short-running, consider a higher sample rate, at the cost of more data to store and process.

![The Arm Performix CPU Cycle Hotspot recipe configuration screen showing launch settings, binary path, and sampling rate fields#center](./hotspot-config.jpg "CPU Cycle Hotspot recipe configuration")
![The Arm Performix Code Hotspots recipe configuration screen showing launch settings, binary path, and sampling rate fields#center](./code-hotspots-config.png "Code Hotspots recipe configuration")

## Analyse Results

A flame graph is generated once the run completes. The default colour mode labels the hottest functions—those using CPU most frequently—in the darkest shade. In this example, the `__complex_abs__` function is present in approximately 65% of samples, and it calls the `__hypot` symbol in `libm.so`.
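
As a rough illustration of why a magnitude call can dominate the profile, here is a minimal escape-time loop. It is a sketch under assumptions, not the repository's code: `escape_iterations` and its parameters are hypothetical, and the real `Mandelbrot::getIterations` may be structured differently.

```cpp
#include <complex>

// Count how many iterations of z = z*z + c run before |z| exceeds 2.
// Returns max_iterations if the point never escapes within the cap.
int escape_iterations(double re, double im, int max_iterations) {
    const std::complex<double> c(re, im);
    std::complex<double> z(0.0, 0.0);
    int iterations = 0;
    while (iterations < max_iterations) {
        z = z * z + c;
        // The magnitude is evaluated on every pass through the loop, so a
        // slow magnitude routine is paid for once per iteration per pixel.
        if (std::abs(z) > 2.0) break;
        ++iterations;
    }
    return iterations;
}
```

With a loop of this shape, the magnitude check runs up to `max_iterations` times for each of the 1920×1080 pixels, which is consistent with it appearing in a large share of samples.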

![A flame graph showing single-threaded Mandelbrot profiling results with __complex_abs__ as the dominant hotspot#center](./single-thread-flame-graph.jpg "Single-threaded flame graph showing __complex_abs__ as the hottest function")

To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.
To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. You may need to copy the source code onto your host machine to use this feature.

![The Arm Performix flame graph view showing source code annotations mapped to the selected hot function#center](./view-with-source-code.jpg "Flame graph with source code view")

Finally, check your `images` directory for the generated bitmap fractal.
Finally, check your `images` directory for the generated bitmap fractal, `green.bmp`.

![A rendered Mandelbrot set fractal in green, generated from the single-threaded build at maximum iterations#center](./plot-1-thread-max-iterations.jpg "Mandelbrot fractal output from single-threaded build")

@@ -33,7 +33,7 @@ public:
...
```

On the remote server, reduce `MAX_ITERATIONS` in `Mandelbrot.h` to `(1<<9)` (512), update the output filename in `main.cpp` to a different path so you can compare the output with the baseline image, then rebuild:
On the remote server, reduce `MAX_ITERATIONS` in `Mandelbrot.h` to `(1<<9)` (512), update the output image filename in `main.cpp` to a different name (for example, `green-512.bmp`) so you can compare the output with the baseline image, then rebuild:

```bash
./build.sh
@@ -49,13 +49,20 @@ There is negligible difference in perceived image quality when halving `MAX_ITER

The loop in `Mandelbrot::getIterations` has no loop-carried dependencies — each iteration's result is independent of any other. This means you can parallelize the hot function across multiple threads if your CPU has multiple cores.
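
The general pattern of splitting independent work across threads can be sketched as follows. This is not the code in the `main` branch: `pixel_value`, `fill_rows`, and `render_parallel` are hypothetical names, and the per-pixel function is a stand-in that assumes each pixel can be computed without reading any other pixel.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-pixel work; stands in for the real iteration count.
int pixel_value(int x, int y) { return (x * 31 + y * 17) % 256; }

// Fill rows [row_begin, row_end). Each thread writes a disjoint range of the
// image, so no locking is needed.
void fill_rows(std::vector<int>& image, int width, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y) {
        for (int x = 0; x < width; ++x) {
            image[static_cast<std::size_t>(y) * width + x] = pixel_value(x, y);
        }
    }
}

// Split the rows into contiguous chunks, one chunk per worker thread.
std::vector<int> render_parallel(int width, int height, int num_threads) {
    std::vector<int> image(static_cast<std::size_t>(width) * height, 0);
    num_threads = std::max(1, num_threads);
    const int rows_per_thread = (height + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(height, t * rows_per_thread);
        const int end = std::min(height, begin + rows_per_thread);
        if (begin == end) break;  // more threads requested than rows available
        workers.emplace_back(fill_rows, std::ref(image), width, begin, end);
    }
    for (std::thread& worker : workers) worker.join();
    return image;
}
```

Because the chunks do not overlap, the result should match a single-threaded run for any thread count. Depending on the toolchain, linking may require `-pthread`.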

The repository contains a parallel version in the `main` branch:
The repository contains a parallel version in the `main` branch. First stash the changes you made locally, then switch to the `main` branch.

```bash
git stash
git checkout main
```

This branch parallelizes the `Mandelbrot::draw` function, which is an earlier function in the call stack that eventually calls `__hypot`. Build the example. This creates a binary `./builds/mandelbrot-parallel` that takes a single numeric command-line argument to set the number of threads.
This branch parallelizes the `Mandelbrot::draw` function, which is an earlier function in the call stack that eventually calls `__hypot`. Before building, update the `myplot.draw()` call in `main.cpp` to use an absolute output path:

```cpp
myplot.draw("/home/ec2-user/Mandelbrot-Example/images/Green-Parallel-512.bmp", Mandelbrot::Mandelbrot::GREEN);
```

Build the example. This creates a binary `./builds/mandelbrot-parallel` that takes a single numeric command-line argument to set the number of threads.

```bash
./build.sh
@@ -84,6 +91,6 @@ The build script uses the `-O0` flag, which disables all compiler optimizations.

In this Learning Path, you reduced the runtime of the Mandelbrot example by focusing on the hottest code paths—cutting execution time from around 1 minute to ~1 second through targeted optimization and parallelization. While this example is relatively simple and the optimizations are more obvious, the same principle applies to real-world workloads: optimize what matters most first, based on measurement.

The CPU Cycle Hotspot recipe is designed to quickly identify an application's most CPU-time-dominant functions, giving you a clear, evidence-based starting point for performance work. By surfacing where execution time is actually spent, it ensures your optimizations target the parts of the code most likely to deliver the largest gains.
The Code Hotspots recipe is designed to quickly identify an application's most CPU-time-dominant functions, giving you a clear, evidence-based starting point for performance work. By surfacing where execution time is actually spent, it ensures your optimizations target the parts of the code most likely to deliver the largest gains.

This is often one of the first profiling steps to run when assessing an application's performance — especially to determine which functions dominate runtime and should be prioritized. Once hotspots are identified, you can follow up with deeper function-specific analysis, such as memory investigations or top-down studies, and build microbenchmarks around hot functions to explore lower-level bottlenecks and uncover additional optimization opportunities.