content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md
8 additions & 6 deletions
@@ -12,16 +12,18 @@ A flame graph is a visualization built from many sampled call stacks that shows
## Example Flame Graph
-
Take a look at the example flame graph below.
+
The flame graph below shows a typical profiling result.
-

+

-
The x axis represents the relative number of samples attributed to code paths ordered alphabetically, **not** a timeline. A wider block means that function appeared in more samples and therefore consumed more of the measured resource, typically CPU time. The y axis represents call stack depth. Frames at the bottom are closer to the root of execution such as a thread entry point, and frames above them are functions called by the frames below. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts. This merging step is what makes flame graphs compact and readable. Reliable stack walking matters, and frame pointers are a common mechanism used to reconstruct the function call hierarchy consistently. When frame pointers are present, it is easier to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
+
The x-axis represents the relative number of samples attributed to code paths, ordered alphabetically, not a timeline. A wider block means that function appeared in more samples and therefore consumed more CPU time. The y-axis represents call stack depth. Frames at the bottom are closer to the root of execution, such as a thread entry point, and frames above them are functions called by those below.
-
This learning path is not meant as a detailed explanation of flame graphs, if you would like to learn more please read [this blog](https://www.brendangregg.com/flamegraphs.html) by the original creator, Brendan Gregg.
+
Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts, which is what makes flame graphs compact and readable. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Reliable stack walking commonly relies on frame pointers; when they are present, the profiler can unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks.
+
This Learning Path does not cover flame graphs in depth. To learn more, see [Brendan Gregg's flame graph reference](https://www.brendangregg.com/flamegraphs.html).
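The aggregation step described above, grouping identical stacks and summing their counts, can be sketched in a few lines. This is a minimal illustration rather than what `perf` or Arm Performix actually run; the names `Stack`, `fold`, and `aggregate`, and the `;`-separated key format, are assumptions made for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One sampled call stack, ordered root first
// (hypothetical example: {"main", "draw", "escape"}).
using Stack = std::vector<std::string>;

// Join a stack into a single "folded" key such as "main;draw;escape".
std::string fold(const Stack& stack) {
    std::string key;
    for (std::size_t i = 0; i < stack.size(); ++i) {
        if (i != 0) key += ';';
        key += stack[i];
    }
    return key;
}

// Group identical stacks and sum their sample counts.
std::map<std::string, int> aggregate(const std::vector<Stack>& samples) {
    std::map<std::string, int> counts;
    for (const Stack& s : samples) {
        ++counts[fold(s)];
    }
    return counts;
}
```

Each entry in the returned map corresponds to one merged path in the graph, and its count determines how wide that path is drawn relative to the others.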
## Tooling options
-
On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss shortlived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
+
On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short-lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.
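The effect of sampling frequency on short-lived hotspots can be estimated with simple arithmetic: the expected number of samples landing in a region is roughly the sampling frequency multiplied by the time that region spends on the CPU. The helper below is a hypothetical sketch, and the frequencies used with it are illustrative rather than defaults of `perf` or Arm Performix.

```cpp
// Expected number of samples that land in a region of code that is on-CPU
// for `seconds` when the profiler samples at `frequency_hz`.
// Illustrative helper only; not part of any profiler API.
double expected_samples(double frequency_hz, double seconds) {
    return frequency_hz * seconds;
}
```

For example, a function that runs for 5 ms in total yields about 0.5 expected samples at 100 Hz, so it can easily be missed, and about 5 samples at 1000 Hz.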
-
Arm has also developed a tool that simplifies this workflow through the CPU Cycle hotspot recipe in Arm Performix, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution we will use in this learning path.
+
Arm has built a tool, Arm Performix, that simplifies this workflow through the CPU Cycle hotspot recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path.
content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md
15 additions & 12 deletions
@@ -8,37 +8,40 @@ layout: learningpathall
## Setup
-
This learning path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.
+
This Learning Path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime.
A fractal is a pattern that shows detail at many scales, often with self-similar structure. Fractals are usually generated by repeatedly applying a simple mathematical rule. In the Mandelbrot set, each pixel corresponds to a complex number, which is iterated through a basic recurrence. How quickly the value “escapes” (or whether it stays bounded) determines the pixel’s color and produces the familiar Mandelbrot image.
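The recurrence described above can be sketched as follows. This is a minimal illustration of the escape-time idea, not the code from the Mandelbrot repository; the function names, the escape radius of 2, and the grayscale mapping are assumptions made for the example.

```cpp
#include <complex>

// Count how many iterations of z = z*z + c it takes for |z| to exceed 2.
// Returns max_iter if the point stays bounded for the whole iteration budget.
int escape_iterations(std::complex<double> c, int max_iter) {
    std::complex<double> z(0.0, 0.0);
    for (int i = 0; i < max_iter; ++i) {
        if (std::abs(z) > 2.0) return i;  // escaped: stop early
        z = z * z + c;
    }
    return max_iter;
}

// Map an iteration count to an 8-bit pixel value.
// Points that never escaped are drawn black (0).
unsigned char to_pixel(int iterations, int max_iter) {
    if (iterations >= max_iter) return 0;
    return static_cast<unsigned char>(255 * iterations / max_iter);
}
```

Calling `escape_iterations` once per pixel, with `c` derived from the pixel's coordinates, gives the kind of tight, compute-heavy loop that makes this workload convenient for profiling.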
-
You don’t need to understand the Mandelbrot algorithm in detail to follow this learning path—we’ll use it primarily as a convenient, compute-heavy workload for profiling. If you'd like to learn more, please refer to the [Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set) page for more information.
+
You don't need to understand the Mandelbrot algorithm in detail to follow this Learning Path — it's used here as a convenient, compute-heavy workload for profiling. To learn more, see the [Mandelbrot set article on Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set).
## Connect to Target
-
Please refer to the [installation guide](https://learn.arm.com/install-guides/atp) if it is your first time setting up Arm Performix. In this learning path, I will be connecting to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores. From the host machine, test the connection to the remote server by navigating to `'Targets`->`Test Connection`. You should see the successul connection below.
+
See the [Arm Performix installation guide](https://learn.arm.com/install-guides/atp) if this is your first time setting up Arm Performix. In this Learning Path, the remote target server is an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores. From the host machine, test the connection to the remote server by navigating to **Targets** > **Test Connection**. You should see the successful connection screen below.

## Build Application on Remote Server
-
Next, connect to the remote server, for example using SSH or VisualStudio Code, and clone the Mandelbrot repository. This is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning. Create a new directory where you will store and build this example. Next, run the commands below.
+
Connect to the remote server using SSH or Visual Studio Code. Install git and the C++ compiler. On dnf-based systems such as Amazon Linux 2023 or RHEL, run:
Install a C++ compiler, for example using your operating system's package manager.
+
Clone the Mandelbrot repository, check out the single-threaded branch, and create the output directories. The repository is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning.
content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md
16 additions & 15 deletions
@@ -8,9 +8,11 @@ layout: learningpathall
## Run CPU Cycle Hotspot Recipe
-
As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of our fractal. To identify performance bottlenecks, we’ll run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn’t obvious ahead of time which functions will dominate runtime.
+
As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Cycle Hotspot recipe in Arm Performix (APX). APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.
-
**Please Note**: You will need to replace the first string argument in the `myplot.draw()` function with the absolute path to the image folder and rebuild the application. If not, the image will be written to the `/tmp/atperf/tools/atperf-agent` directory from where the binary is run. As the name suggests, this folder is periodically deleted.
+
{{% notice Note %}}
+
The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from `/tmp/atperf/tools/atperf-agent`, so the image would be written there rather than to your project directory. Replace the first string argument with an absolute path inside your `images` folder (for example, `/home/ec2-user/Mandelbrot-Example/images/green.bmp`) and rebuild the application before continuing.
+
{{% /notice %}}
```cpp
#include"Mandelbrot.h"
@@ -27,30 +29,29 @@ int main(){
}
```
-
Open up APX from the host machine. Click on the `CPU Cycle Hotspot` recipe. If this is the first time running the recipe on this target machine you may need to click the install tools button.
+
Open APX from the host machine. Select the **CPU Cycle Hotspot** recipe. If this is the first time running the recipe on this target machine you may need to select the install tools button.
-

+

-
Next we will configure the recipe. We will choose to launch a new process, APX will automatically start collecting metric when the program starts and stop when the program exits.
+
Configure the recipe to launch a new process. APX will automatically start collecting metrics when the program starts and stop when the program exits.
-
Provide an absolute path to the recently built binary, `mandelbrot`.
+
Provide the absolute path to the binary built in the previous step: `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot`.
-
Finally, we will use the default sampling rate of `Normal`. If your application is a shortrunning program, you may want to consider a higher sample rate, this will be at the tradeoff of more data to store and process.
+
Use the default sampling rate of **Normal**. If your application is short-running, consider a higher sample rate, at the cost of more data to store and process.
-

+

## Analyse Results
-
A flame graph should be generated. The default colour mode is to label the 'hottest function', those which are sampled and utilizing CPU most frequently, in the darkest shade. Here we can see that the `__complex_abs__` function is being called during ~65% of samples. This is then calling the `__hypot` symbol in `libm.so`.
+
A flame graph is generated once the run completes. The default colour mode labels the hottest functions—those using CPU most frequently—in the darkest shade. In this example, the `__complex_abs__` function is present in approximately 65% of samples, and it calls the `__hypot` symbol in `libm.so`.

+
To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.
-
To understand deeper, we can map the the lines of source code to the functions. To do this right clight on a specific function and select 'View Source Code'. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine.
+

-

+
Finally, check your `images` directory for the generated bitmap fractal.
-
Finally, looking in our images directory we can see the bitmap fractal.
-

+
