Commit e3abb83

Merge pull request #3008 from pareenaverma/content_review
First tech review of CPU Cycle Hotspots LP
2 parents 4c1146c + c40c35f commit e3abb83

5 files changed

Lines changed: 70 additions & 63 deletions

content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/_index.md

Lines changed: 3 additions & 3 deletions
@@ -7,15 +7,15 @@ cascade:

 minutes_to_complete: 30

-who_is_this_for: Cloud Engineers looking to optimize their workload running on a Linux-based Arm system.
+who_is_this_for: Software developers and performance engineers who want to identify CPU cycle hotspots in applications running on Arm Linux systems.

 learning_objectives:
 - Run the CPU Cycle Hotspot recipe in Arm Performix
-- Identify which functions in your program use the most CPU cycles, so you can target the best candidates for optimization.
+- Identify which functions consume the most CPU cycles and target them for optimization

 prerequisites:
 - Access to Arm Performix
-- Basic understand on C++
+- Basic understanding of C++

 author: Kieran Hejmadi

content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md

Lines changed: 8 additions & 6 deletions
@@ -12,16 +12,18 @@ A flame graph is a visualization built from many sampled call stacks that shows

 ## Example Flame Graph

-Take a look at the example flame graph below.
+The flame graph below shows a typical profiling result.

-![example](./flame-graph-example.jpg)
+![A flame graph showing function call stacks. The x-axis shows relative sample frequency and the y-axis shows call stack depth, with the widest blocks representing functions that consumed the most CPU time#center](./flame-graph-example.jpg "Example flame graph")

-The x axis represents the relative number of samples attributed to code paths ordered alphabetically, **not** a timeline. A wider block means that function appeared in more samples and therefore consumed more of the measured resource, typically CPU time. The y axis represents call stack depth. Frames at the bottom are closer to the root of execution such as a thread entry point, and frames above them are functions called by the frames below. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts. This merging step is what makes flame graphs compact and readable. Reliable stack walking matters, and frame pointers are a common mechanism used to reconstruct the function call hierarchy consistently. When frame pointers are present, it is easier to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
+The x-axis represents the relative number of samples attributed to code paths, ordered alphabetically, not a timeline. A wider block means that function appeared in more samples and therefore consumed more CPU time. The y-axis represents call stack depth. Frames at the bottom are closer to the root of execution, such as a thread entry point, and frames above them are functions called by those below.

-This learning path is not meant as a detailed explanation of flame graphs, if you would like to learn more please read [this blog](https://www.brendangregg.com/flamegraphs.html) by the original creator, Brendan Gregg.
+Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts, which is what makes flame graphs compact and readable. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Reliable stack walking depends on frame pointers being present; they allow the profiler to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
+
+This Learning Path does not cover flame graphs in depth. To learn more, see [Brendan Gregg's flame graph reference](https://www.brendangregg.com/flamegraphs.html).

 ## Tooling options

-On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
+On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short-lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.

-Arm has also developed a tool that simplifies this workflow through the CPU Cycle hotspot recipe in Arm Performix, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution we will use in this learning path.
+Arm has built a tool, Arm Performix, that simplifies this workflow through the CPU Cycle hotspot recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path.
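
Note on the aggregation step in how-to-1.md: the text says samples are grouped by identical stack and their counts summed. A minimal sketch of that idea is shown below. It is an illustration only, not how `perf` or Arm Performix implement it. The names `Stack`, `fold`, and `aggregate` are hypothetical, and the `;` separator is simply a common convention for folded stacks.

```cpp
#include <map>
#include <string>
#include <vector>

// One sampled call stack, root frame first (hypothetical representation).
using Stack = std::vector<std::string>;

// Join the frames with ';' to form a single folded-stack key.
std::string fold(const Stack& stack) {
    std::string key;
    for (const std::string& frame : stack) {
        if (!key.empty()) key += ';';
        key += frame;
    }
    return key;
}

// Group identical stacks and sum how often each one was sampled.
// std::map keeps its keys sorted, similar to the alphabetical x-axis ordering.
std::map<std::string, long> aggregate(const std::vector<Stack>& samples) {
    std::map<std::string, long> counts;
    for (const Stack& sample : samples) {
        ++counts[fold(sample)];
    }
    return counts;
}
```

Passing three samples, two of which share the stack `main`, `draw`, `iterate`, would yield two entries, with a count of 2 for the shared stack. A larger count corresponds to a wider block in the rendered graph.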

content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md

Lines changed: 15 additions & 12 deletions
@@ -8,37 +8,40 @@ layout: learningpathall

 ## Setup

-This learning path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.
+This Learning Path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.

 A fractal is a pattern that shows detail at many scales, often with self-similar structure. Fractals are usually generated by repeatedly applying a simple mathematical rule. In the Mandelbrot set, each pixel corresponds to a complex number, which is iterated through a basic recurrence. How quickly the value “escapes” (or whether it stays bounded) determines the pixel’s color and produces the familiar Mandelbrot image.

-You dont need to understand the Mandelbrot algorithm in detail to follow this learning path—we’ll use it primarily as a convenient, compute-heavy workload for profiling. If you'd like to learn more, please refer to the [Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set) page for more information.
+You don't need to understand the Mandelbrot algorithm in detail to follow this Learning Path — it's used here as a convenient, compute-heavy workload for profiling. To learn more, see the [Mandelbrot set article on Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set).


 ## Connect to Target

-Please refer to the [installation guide](https://learn.arm.com/install-guides/atp) if it is your first time setting up Arm Performix. In this learning path, I will be connecting to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores. From the host machine, test the connection to the remote server by navigating to `'Targets`->`Test Connection`. You should see the successul connection below.
+See the [Arm Performix installation guide](https://learn.arm.com/install-guides/atp) if this is your first time setting up Arm Performix. In this Learning Path you will connect to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores, your remote target server. From the host machine, test the connection to the remote server by navigating to **Targets** > **Test Connection**. You should see the successful connection screen below.

-![successful-connection](./successful-connection.jpg).
+![The Arm Performix Targets panel showing a successful connection test result for a remote Arm server#center](./successful-connection.jpg "Successful connection to remote target")

 ## Build Application on Remote Server

-Next, connect to the remote server, for example using SSH or VisualStudio Code, and clone the Mandelbrot repository. This is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning. Create a new directory where you will store and build this example. Next, run the commands below.
+Connect to the remote server using SSH or Visual Studio Code. Install git and the C++ compiler. On dnf-based systems such as Amazon Linux 2023 or RHEL, run:

 ```bash
-git clone https://github.com/arm-university/Mandelbrot-Example.git
-cd Mandelbrot-Example && mkdir images builds
-git checkout single-thread
+sudo dnf update && sudo dnf install git gcc g++
 ```

-Install a C++ compiler, for example using your operating system's package manager.
+Clone the Mandelbrot repository, check out the single-threaded branch, and create the output directories. The repository is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning.

 ```bash
-sudo dnf update && sudo dnf install g++ gcc
+git clone https://github.com/arm-university/Mandelbrot-Example.git
+cd Mandelbrot-Example
+git checkout single-thread
+mkdir images builds
 ```

-Build the application.
+Build the application:

 ```bash
 ./build.sh
-```
+```
+
+This creates the binary `./builds/mandelbrot`.
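
Note on the Setup text in how-to-2.md: the recurrence it describes, where each pixel's complex number is iterated until the value escapes or the iteration limit is reached, can be sketched as follows. This is a minimal illustration and is not taken from the Mandelbrot-Example repository. The function name `escape_iterations` and its signature are assumptions, and the escape radius of 2 is the conventional choice for the Mandelbrot set.

```cpp
#include <complex>

// Hypothetical helper, not the repository's implementation.
// Iterates z = z*z + c starting from z = 0 and returns how many iterations
// were applied before |z| exceeded 2, or max_iterations if z stayed bounded.
int escape_iterations(std::complex<double> c, int max_iterations) {
    std::complex<double> z(0.0, 0.0);
    for (int i = 0; i < max_iterations; ++i) {
        if (std::abs(z) > 2.0) {
            return i;  // escaped: the count decides the pixel value
        }
        z = z * z + c;
    }
    return max_iterations;  // treated as inside the set
}
```

A renderer would call a function like this once per pixel and map the returned count to a colour. The per-iteration magnitude check is the kind of inner-loop work a sampling profiler tends to surface.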

content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md

Lines changed: 16 additions & 15 deletions
@@ -8,9 +8,11 @@ layout: learningpathall

 ## Run CPU Cycle Hotspot Recipe

-As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of our fractal. To identify performance bottlenecks, we’ll run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isnt obvious ahead of time which functions will dominate runtime.
+As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.

-**Please Note**: You will need to replace the first string argument in the `myplot.draw()` function with the absolute path to the image folder and rebuild the application. If not, the image will be written to the `/tmp/atperf/tools/atperf-agent` directory from where the binary is run. As the name suggests, this folder is periodically deleted.
+{{% notice Note %}}
+The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from `/tmp/atperf/tools/atperf-agent`, so the image would be written there rather than to your project directory. Replace the first string argument with the absolute path to the file in your `images` folder (for example, `/home/ec2-user/Mandelbrot-Example/images/green.bmp`) and rebuild the application before continuing.
+{{% /notice %}}

 ```cpp
 #include "Mandelbrot.h"
@@ -27,30 +29,29 @@ int main(){
 }
 ```

-Open up APX from the host machine. Click on the `CPU Cycle Hotspot` recipe. If this is the first time running the recipe on this target machine you may need to click the install tools button.
+Open APX from the host machine. Select the **CPU Cycle Hotspot** recipe. If this is the first time running the recipe on this target machine you may need to select the install tools button.

-![install-tools](./install-tools.jpg)
+![The Arm Performix recipe selection screen with the CPU Cycle Hotspot recipe highlighted#center](./install-tools.jpg "Selecting the CPU Cycle Hotspot recipe")

-Next we will configure the recipe. We will choose to launch a new process, APX will automatically start collecting metric when the program starts and stop when the program exits.
+Configure the recipe to launch a new process. APX will automatically start collecting metrics when the program starts and stop when the program exits.

-Provide an absolute path to the recently built binary, `mandelbrot`.
+Provide the absolute path to the binary built in the previous step: `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot`.

-Finally, we will use the default sampling rate of `Normal`. If your application is a short running program, you may want to consider a higher sample rate, this will be at the tradeoff of more data to store and process.
+Use the default sampling rate of **Normal**. If your application is short-running, consider a higher sample rate, at the cost of more data to store and process.

-![config](./hotspot-config.jpg)
+![The Arm Performix CPU Cycle Hotspot recipe configuration screen showing launch settings, binary path, and sampling rate fields#center](./hotspot-config.jpg "CPU Cycle Hotspot recipe configuration")

 ## Analyse Results

-A flame graph should be generated. The default colour mode is to label the 'hottest function', those which are sampled and utilizing CPU most frequently, in the darkest shade. Here we can see that the `__complex_abs__` function is being called during ~65% of samples. This is then calling the `__hypot` symbol in `libm.so`.
+A flame graph is generated once the run completes. The default colour mode labels the hottest functions, those using CPU most frequently, in the darkest shade. In this example, the `__complex_abs__` function is present in approximately 65% of samples, and it calls the `__hypot` symbol in `libm.so`.

-![single-thread-flameg](./single-thread-flame-graph.jpg)
+![A flame graph showing single-threaded Mandelbrot profiling results with __complex_abs__ as the dominant hotspot#center](./single-thread-flame-graph.jpg "Single-threaded flame graph showing __complex_abs__ as the hottest function")

+To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.

-To understand deeper, we can map the the lines of source code to the functions. To do this right clight on a specific function and select 'View Source Code'. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.
+![The Arm Performix flame graph view showing source code annotations mapped to the selected hot function#center](./view-with-source-code.jpg "Flame graph with source code view")

-![view-src-code](./view-with-source-code.jpg)
+Finally, check your `images` directory for the generated bitmap fractal.

-Finally, looking in our images directory we can see the bitmap fractal.
-
-![mandelbrot](./plot-1-thread-MAX_ITERATIONS.jpg)
+![A rendered Mandelbrot set fractal in green, generated from the single-threaded build at maximum iterations#center](./plot-1-thread-max-iterations.jpg "Mandelbrot fractal output from single-threaded build")
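
Note on the Analyse Results text in how-to-3.md: the reported hot path, `__complex_abs__` calling `__hypot` in `libm.so`, is consistent with code that takes the magnitude of a `std::complex<double>`. The sketch below assumes that the workload uses `std::abs` on a complex value, which this diff does not confirm. The helper names `magnitude` and `magnitude_hypot` are hypothetical. It only shows that the two formulations compute the same quantity.

```cpp
#include <cmath>
#include <complex>

// Magnitude of a complex number via std::abs (assumed to be what the workload calls).
double magnitude(std::complex<double> z) {
    return std::abs(z);
}

// The same quantity written explicitly with std::hypot,
// which computes sqrt(re*re + im*im) while guarding against overflow.
double magnitude_hypot(std::complex<double> z) {
    return std::hypot(z.real(), z.imag());
}
```

Because a magnitude check like this typically runs once per iteration for every pixel, it is plausible for it to dominate the sampled CPU time in a single-threaded build.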
