Merged
@@ -7,15 +7,15 @@ cascade:

minutes_to_complete: 30

who_is_this_for: Cloud Engineers looking to optimize their workload running on a Linux-based Arm system.
who_is_this_for: Software developers and performance engineers who want to identify CPU cycle hotspots in applications running on Arm Linux systems.

learning_objectives:
- Run the CPU Cycle Hotspot recipe in Arm Performix
- Identify which functions in your program use the most CPU cycles, so you can target the best candidates for optimization.
- Identify which functions consume the most CPU cycles and target them for optimization

prerequisites:
- Access to Arm Performix
- Basic understand on C++
- Basic understanding of C++

author: Kieran Hejmadi

@@ -12,16 +12,18 @@ A flame graph is a visualization built from many sampled call stacks that shows

## Example Flame Graph

Take a look at the example flame graph below.
The flame graph below shows a typical profiling result.

![example](./flame-graph-example.jpg)
![A flame graph showing function call stacks. The x-axis shows relative sample frequency and the y-axis shows call stack depth, with the widest blocks representing functions that consumed the most CPU time#center](./flame-graph-example.jpg "Example flame graph")

The x axis represents the relative number of samples attributed to code paths ordered alphabetically, **not** a timeline. A wider block means that function appeared in more samples and therefore consumed more of the measured resource, typically CPU time. The y axis represents call stack depth. Frames at the bottom are closer to the root of execution such as a thread entry point, and frames above them are functions called by the frames below. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts. This merging step is what makes flame graphs compact and readable. Reliable stack walking matters, and frame pointers are a common mechanism used to reconstruct the function call hierarchy consistently. When frame pointers are present, it is easier to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
The x-axis is not a timeline. It represents the relative number of samples attributed to each code path, with code paths ordered alphabetically. A wider block means that function appeared in more samples and therefore consumed more CPU time. The y-axis represents call stack depth. Frames at the bottom are closer to the root of execution, such as a thread entry point, and frames above them are functions called by those below.

This learning path is not meant as a detailed explanation of flame graphs, if you would like to learn more please read [this blog](https://www.brendangregg.com/flamegraphs.html) by the original creator, Brendan Gregg.
Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts, which is what makes flame graphs compact and readable. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Reliable stack walking matters: frame pointers are a common mechanism that lets the profiler unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
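
The aggregation step can be sketched in a few lines of C++. This is a minimal illustration, not how `perf` or Arm Performix implement it: the `Stack` alias and the `foldStack` and `aggregate` helpers are hypothetical names introduced here, and the frame names in the usage note are made up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One sampled call stack, ordered root first, for example {"main", "draw"}.
// (Hypothetical type for illustration only.)
using Stack = std::vector<std::string>;

// Join the frames with ';' so that identical stacks produce identical keys.
std::string foldStack(const Stack& stack) {
    std::string key;
    bool first = true;
    for (const std::string& frame : stack) {
        if (!first) key += ';';
        key += frame;
        first = false;
    }
    return key;
}

// Group identical stacks and sum how many times each one was sampled.
// std::map keeps the folded keys sorted, similar in spirit to the
// alphabetical ordering on a flame graph's x-axis.
std::map<std::string, int> aggregate(const std::vector<Stack>& samples) {
    std::map<std::string, int> counts;
    for (const Stack& sample : samples) {
        assert(!sample.empty());  // every sample has at least a root frame
        ++counts[foldStack(sample)];
    }
    return counts;
}
```

Given four samples where `{"main", "draw", "iterate"}` appears twice and `{"main", "draw"}` and `{"main", "save"}` appear once each, `aggregate` returns three entries, and the entry for `main;draw;iterate` has a count of 2. The count of each entry is what determines the width of the corresponding block.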

This Learning Path does not cover flame graphs in depth. To learn more, see [Brendan Gregg's flame graph reference](https://www.brendangregg.com/flamegraphs.html).

## Tooling options

On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
On Linux, flame graphs are commonly generated from samples collected with `perf`. `perf` periodically interrupts the running program and records a stack trace. The collected stacks are then converted into a folded format and rendered as the graph. Sampling frequency is important: if it is too low you may miss short-lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
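
To see why a low sampling frequency can miss short-lived hotspots, it helps to estimate how many samples a piece of code can expect to receive. The sketch below assumes samples arrive at a fixed rate; `expectedSamples` is a hypothetical helper, and the rates in the usage note are illustrative values, not `perf` or Arm Performix defaults.

```cpp
#include <cassert>

// Expected number of samples that land in a region of code, assuming a
// fixed sampling rate and that the region is on-CPU for `secondsOnCpu`
// in total over the whole run.
double expectedSamples(double samplesPerSecond, double secondsOnCpu) {
    assert(samplesPerSecond > 0.0 && secondsOnCpu >= 0.0);
    return samplesPerSecond * secondsOnCpu;
}
```

For a function that is on the CPU for 5 ms in total, a rate of 100 samples per second gives an expectation of about 0.5 samples, so the function may not appear in the flame graph at all. At 10,000 samples per second the expectation rises to about 50, at the cost of more interrupts and more data.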

Arm has also developed a tool that simplifies this workflow through the CPU Cycle hotspot recipe in Arm Performix, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution we will use in this learning path.
Arm has built a tool, Arm Performix, that simplifies this workflow through its CPU Cycle Hotspot recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path.
@@ -8,37 +8,40 @@ layout: learningpathall

## Setup

This learning path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.
This Learning Path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.

A fractal is a pattern that shows detail at many scales, often with self-similar structure. Fractals are usually generated by repeatedly applying a simple mathematical rule. In the Mandelbrot set, each pixel corresponds to a complex number, which is iterated through a basic recurrence. How quickly the value “escapes” (or whether it stays bounded) determines the pixel’s color and produces the familiar Mandelbrot image.

You dont need to understand the Mandelbrot algorithm in detail to follow this learning path—we’ll use it primarily as a convenient, compute-heavy workload for profiling. If you'd like to learn more, please refer to the [Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set) page for more information.
You don't need to understand the Mandelbrot algorithm in detail to follow this Learning Path — it's used here as a convenient, compute-heavy workload for profiling. To learn more, see the [Mandelbrot set article on Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set).
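
For readers who want a concrete picture of the recurrence, the following is a minimal escape-time sketch. It is not the code from the Mandelbrot-Example repository: `escapeIterations` is a hypothetical name, and the escape radius of 2 is the conventional choice rather than something stated in this Learning Path.

```cpp
#include <cassert>
#include <complex>

// Count how many iterations of z = z*z + c run before |z| exceeds 2.
// Returns maxIterations if the point does not escape within the budget,
// which is treated as "inside the set" for colouring purposes.
int escapeIterations(std::complex<double> c, int maxIterations) {
    assert(maxIterations > 0);
    std::complex<double> z(0.0, 0.0);
    for (int i = 0; i < maxIterations; ++i) {
        if (std::abs(z) > 2.0) {
            return i;  // escaped: the iteration count drives the pixel value
        }
        z = z * z + c;
    }
    return maxIterations;
}
```

With a budget of 100 iterations, the point `c = 0` never escapes and returns 100, while `c = 1` escapes after 3 iterations. Running a loop like this once per pixel of a 1920×1080 image is what makes the workload compute-heavy, and a magnitude check such as `std::abs` on a complex value is the kind of call that can dominate a profile.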


## Connect to Target

Please refer to the [installation guide](https://learn.arm.com/install-guides/atp) if it is your first time setting up Arm Performix. In this learning path, I will be connecting to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores. From the host machine, test the connection to the remote server by navigating to `'Targets`->`Test Connection`. You should see the successul connection below.
See the [Arm Performix installation guide](https://learn.arm.com/install-guides/atp) if this is your first time setting up Arm Performix. This Learning Path uses an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores as the remote target server. From the host machine, test the connection to the remote server by navigating to **Targets** > **Test Connection**. You should see the successful connection screen below.

![successful-connection](./successful-connection.jpg).
![The Arm Performix Targets panel showing a successful connection test result for a remote Arm server#center](./successful-connection.jpg "Successful connection to remote target")

## Build Application on Remote Server

Next, connect to the remote server, for example using SSH or VisualStudio Code, and clone the Mandelbrot repository. This is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning. Create a new directory where you will store and build this example. Next, run the commands below.
Connect to the remote server using SSH or Visual Studio Code. Install git and the C++ compiler. On dnf-based systems such as Amazon Linux 2023 or RHEL, run:

```bash
git clone https://github.com/arm-university/Mandelbrot-Example.git
cd Mandelbrot-Example && mkdir images builds
git checkout single-thread
sudo dnf update && sudo dnf install git gcc g++
```

Install a C++ compiler, for example using your operating system's package manager.
Clone the Mandelbrot repository, check out the single-threaded branch, and create the output directories. The repository is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning.

```bash
sudo dnf update && sudo dnf install g++ gcc
git clone https://github.com/arm-university/Mandelbrot-Example.git
cd Mandelbrot-Example
git checkout single-thread
mkdir images builds
```

Build the application.
Build the application:

```bash
./build.sh
```

This creates the binary `./builds/mandelbrot`.
@@ -8,9 +8,11 @@ layout: learningpathall

## Run CPU Cycle Hotspot Recipe

As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of our fractal. To identify performance bottlenecks, we’ll run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isnt obvious ahead of time which functions will dominate runtime.
As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.

**Please Note**: You will need to replace the first string argument in the `myplot.draw()` function with the absolute path to the image folder and rebuild the application. If not, the image will be written to the `/tmp/atperf/tools/atperf-agent` directory from where the binary is run. As the name suggests, this folder is periodically deleted.
{{% notice Note %}}
The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from `/tmp/atperf/tools/atperf-agent`, so the image would be written there rather than to your project directory. Replace the first string argument with the absolute path to the output file in your `images` folder (for example, `/home/ec2-user/Mandelbrot-Example/images/green.bmp`) and rebuild the application before continuing.
{{% /notice %}}

```cpp
#include "Mandelbrot.h"
@@ -27,30 +29,29 @@ int main(){
}
```

Open up APX from the host machine. Click on the `CPU Cycle Hotspot` recipe. If this is the first time running the recipe on this target machine you may need to click the install tools button.
Open APX from the host machine. Select the **CPU Cycle Hotspot** recipe. If this is the first time running the recipe on this target machine you may need to select the install tools button.

![install-tools](./install-tools.jpg)
![The Arm Performix recipe selection screen with the CPU Cycle Hotspot recipe highlighted#center](./install-tools.jpg "Selecting the CPU Cycle Hotspot recipe")

Next we will configure the recipe. We will choose to launch a new process, APX will automatically start collecting metric when the program starts and stop when the program exits.
Configure the recipe to launch a new process. APX will automatically start collecting metrics when the program starts and stop when the program exits.

Provide an absolute path to the recently built binary, `mandelbrot`.
Provide the absolute path to the binary built in the previous step: `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot`.

Finally, we will use the default sampling rate of `Normal`. If your application is a short running program, you may want to consider a higher sample rate, this will be at the tradeoff of more data to store and process.
Use the default sampling rate of **Normal**. If your application is short-running, consider a higher sample rate, at the cost of more data to store and process.

![config](./hotspot-config.jpg)
![The Arm Performix CPU Cycle Hotspot recipe configuration screen showing launch settings, binary path, and sampling rate fields#center](./hotspot-config.jpg "CPU Cycle Hotspot recipe configuration")

## Analyse Results

A flame graph should be generated. The default colour mode is to label the 'hottest function', those which are sampled and utilizing CPU most frequently, in the darkest shade. Here we can see that the `__complex_abs__` function is being called during ~65% of samples. This is then calling the `__hypot` symbol in `libm.so`.
A flame graph is generated once the run completes. The default colour mode labels the hottest functions, those sampled on the CPU most frequently, in the darkest shade. In this example, the `__complex_abs__` function is present in approximately 65% of samples, and it calls the `__hypot` symbol in `libm.so`.

![single-thread-flameg](./single-thread-flame-graph.jpg)
![A flame graph showing single-threaded Mandelbrot profiling results with __complex_abs__ as the dominant hotspot#center](./single-thread-flame-graph.jpg "Single-threaded flame graph showing __complex_abs__ as the hottest function")
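
A figure such as "present in approximately 65% of samples" is an inclusive share: it counts every sample whose stack contains the function, whether or not the function was at the top of the stack. The sketch below shows one way to compute it from folded stacks. The helper names `containsFrame` and `inclusivePercent` are hypothetical, and this is not a description of how APX calculates its numbers.

```cpp
#include <cassert>
#include <map>
#include <string>

// Return true if `function` appears as a whole frame in a ';'-separated
// folded stack such as "main;draw;iterate". Substrings do not count.
bool containsFrame(const std::string& folded, const std::string& function) {
    std::size_t start = 0;
    while (start <= folded.size()) {
        std::size_t end = folded.find(';', start);
        if (end == std::string::npos) end = folded.size();
        if (folded.compare(start, end - start, function) == 0) return true;
        start = end + 1;
    }
    return false;
}

// Percentage of all samples whose stack includes `function` at any depth.
double inclusivePercent(const std::map<std::string, int>& counts,
                        const std::string& function) {
    long total = 0;
    long hits = 0;
    for (const auto& entry : counts) {
        total += entry.second;
        if (containsFrame(entry.first, function)) hits += entry.second;
    }
    assert(total > 0);  // at least one sample is required
    return 100.0 * hits / total;
}
```

As an illustration with made-up counts, suppose 100 samples fold into `main;iterate;__complex_abs__;__hypot` (65 samples), `main;iterate` (25 samples), and `main;save` (10 samples). The inclusive share is then 65% for both `__complex_abs__` and `__hypot`, and 100% for `main`. The `iterate` and `save` frame names and the counts are invented for this example and are not taken from the capture shown above.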

To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.

To understand deeper, we can map the the lines of source code to the functions. To do this right clight on a specific function and select 'View Source Code'. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.
![The Arm Performix flame graph view showing source code annotations mapped to the selected hot function#center](./view-with-source-code.jpg "Flame graph with source code view")

![view-src-code](./view-with-source-code.jpg)
Finally, check your `images` directory for the generated bitmap fractal.

Finally, looking in our images directory we can see the bitmap fractal.

![mandelbrot](./plot-1-thread-MAX_ITERATIONS.jpg)
![A rendered Mandelbrot set fractal in green, generated from the single-threaded build at maximum iterations#center](./plot-1-thread-max-iterations.jpg "Mandelbrot fractal output from single-threaded build")
