---
title: Set up the target environment and compile the application
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

To analyze performance bottlenecks, you need an environment and a sample application to profile. In this section, you configure an Arm Performix connection and build a Mandelbrot set generator.

A Mandelbrot set generator is a classic computer science application used to test computational performance. It calculates a famous mathematical fractal by performing intense, repeated mathematical operations (often floating-point) for every pixel in a large image. Because the math for each pixel is independent of the others, it is a highly parallelizable workload that is perfect for demonstrating CPU optimizations like vectorization and loop unrolling.
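The per-pixel work can be sketched as follows. This is an illustrative scalar version; the function name, escape radius, and iteration limit are assumptions, not code from the sample repository:

```cpp
#include <complex>

// Count iterations until the point escapes |z| > 2, up to maxIterations.
// Points inside the Mandelbrot set never escape, so they hit the limit.
int countIterations(double re, double im, int maxIterations) {
    const std::complex<double> c(re, im);
    std::complex<double> z(0.0, 0.0);
    int i = 0;
    while (i < maxIterations && std::norm(z) <= 4.0) {  // norm(z) is |z|^2
        z = z * z + c;  // the repeated floating-point work profiled later
        ++i;
    }
    return i;
}
```

The image loop calls a function like this once per pixel, mapping pixel coordinates to `re` and `im`, which is why every pixel can be computed independently and in parallel.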

## Before you begin

Make sure Arm Performix is installed on your host machine. The host machine is your local computer where the Arm Performix GUI runs, and it can be a Windows, macOS, or Linux machine. The target machine is the Linux server where your application is compiled and where the application runs.

If you do not have Arm Performix installed, see the [Arm Performix install guide](/install-guides/atp/).

From the host machine, open the Arm Performix application and navigate to the **Targets** tab. Set up an SSH connection to the target that runs the workload, and test the connection. For the examples in this guide, you connect to an Arm Neoverse-based server.
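If you connect to the target frequently, an entry in `~/.ssh/config` on the host can simplify both command-line access and the **Targets** tab setup. The host alias, address, user, and key path below are placeholders for your own values:

```
Host performix-target
    HostName 203.0.113.10       # replace with your target's IP or hostname
    User ubuntu                 # replace with your login on the target
    IdentityFile ~/.ssh/id_ed25519
```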

The Arm Performix collection agent requires Python and `binutils` to run on the target machine.

Connect to your target machine using SSH and install these required OS packages.

For Ubuntu and other Debian-based distributions, run the following command:

```bash
sudo apt-get install python3 python3-venv binutils
```
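You can verify that the interpreter the collection agent depends on is available before moving on (this checks Python only; `binutils` provides tools such as `objdump`):

```bash
# Confirm Python 3 is on the PATH of the target machine
python3 --version
```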

## Build the sample application on the target machine

Download the sample application, which is a Mandelbrot set generator provided under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file). Create a directory to store and build the example, then run the following commands:

```bash
cd $HOME
git clone https://github.com/arm-university/Mandelbrot-Example.git
cd Mandelbrot-Example && mkdir images builds
```

Install a C++ compiler using your operating system's package manager. For Ubuntu and other Debian-based distributions, run the following command:

```bash
sudo apt install build-essential
```

Run the provided setup script to build the application:

```bash
./build.sh
```

When the build completes, a binary named `mandelbrot-parallel` is created in the `./builds` directory.

The application requires one argument: the number of threads to use. Run this new executable with 4 threads:

```bash
./builds/mandelbrot-parallel 4
```

The application generates a bitmap image file in your `./images` directory that looks similar to the following fractal:

![Mandelbrot set fractal generated by the sample application#center](./green-parallel-512.webp "Mandelbrot Set")

## What you've accomplished and what's next

In this section:
- You set up the target machine and established an SSH connection.
- You built the Mandelbrot sample application.

Next, you will use the CPU Microarchitecture recipe to identify performance bottlenecks in the application.
---
title: Identify application bottlenecks with the CPU Microarchitecture recipe
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run CPU Microarchitecture analysis
## Run the CPU Microarchitecture recipe

To identify performance bottlenecks, run the CPU Microarchitecture recipe in Arm Performix. Arm Performix uses microarchitectural sampling to show which instruction pipeline stages dominate program latency, and then highlights ways to improve those bottlenecks.


Start by reviewing the code in `main.cpp`. The program generates a 1920×1080 bitmap image of the fractal.

```cpp
#include "Mandelbrot.h"
#include <iostream>

using namespace std;

int main(int argc, char* argv[]) {

    const int NUM_THREADS = std::stoi(argv[1]);
    std::cout << "Number of Threads = " << NUM_THREADS << std::endl;

    Mandelbrot::Mandelbrot myplot(1920, 1080, NUM_THREADS);
    myplot.draw("/home/ec2-user/Mandelbrot-final/Mandelbrot-Example/images/Green-Parallel-512.bmp", Mandelbrot::Mandelbrot::GREEN);

    return 0;
}
```
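Note that this `main` assumes a thread-count argument is present; running the binary with no arguments makes `std::stoi(argv[1])` read past the argument array. A defensive variant (illustrative, not part of the sample source) checks `argc` first:

```cpp
#include <iostream>
#include <string>

// Parse the thread-count argument, falling back to a default when missing.
int parseThreadCount(int argc, char* argv[], int fallback) {
    if (argc < 2) {
        std::cerr << "No thread count given, defaulting to " << fallback << "\n";
        return fallback;
    }
    return std::stoi(argv[1]);
}
```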

When Arm Performix launches the executable on the target machine, it does so from a temporary agent directory, `/tmp/atperf/tools/atperf-agent`. If your code uses a relative path to save the image, the image is written to that temporary folder and might be deleted.

To prevent this, edit the `myplot.draw()` line in `main.cpp` to use the absolute path to your project's image folder (for example, `/home/ubuntu/Mandelbrot-Example/images/Green-Parallel-512.bmp`), and then rebuild the application.

In the Arm Performix application on your host machine, select the **CPU Microarchitecture** recipe.

![Arm Performix CPU Microarchitecture configuration screen#center](./cpu-uarch-config.webp "CPU Microarchitecture Configuration")

Select the target you configured in the setup section. If this is your first run on this target, you might need to select **Install Tools** to copy the collection tools to the target. After the tools are installed, the target shows as ready.

Next, for the **Workload type**, select **Launch a new process**.

Enter the absolute path to your executable in the **Workload** field. For example, `/home/ubuntu/Mandelbrot-Example/builds/mandelbrot-parallel`. Make sure to add the number of threads argument.

{{% notice Note %}}
Use the full path to your executable because the **Workload** field does not currently support shell-style path expansion.
{{% /notice %}}

Before starting the analysis, you can customize the configuration. For instance, you can set a time limit for the workload or choose specific metrics to investigate. You can also adjust the sampling rate (High, Normal, or Low) to balance collection overhead against sampling granularity. Because this Mandelbrot example is a native C++ application, you can ignore the **Collect managed code stacks** toggle, which is used for Java or .NET workloads.

When your configuration is ready, select **Run Recipe** to launch the workload and collect the performance data.

## View the run results

Arm Performix generates a high-level instruction pipeline view, highlighting where the most time is spent.

![Arm Performix high-level instruction pipeline results#center](cpu-uarch-results.webp "Instruction Pipeline View")

In this breakdown, Backend Stalls dominate the samples. Within that category, work is split between Load Operations and integer and floating-point operations.
There is no measured SIMD activity, even though this workload is highly parallelizable.

The **Insights** panel highlights ALU contention as a likely improvement opportunity:

![Arm Performix insights panel highlighting ALU contention#center](cpu-uarch-insights.webp "Insights Panel")

To inspect executed instruction types in more detail, use the Instruction Mix recipe in the next step.

## What you've accomplished and what's next

In this section:
- You ran the CPU Microarchitecture recipe on the Mandelbrot application.
- You identified that the application spends most of its time in Backend Stalls without using SIMD operations.

Next, you will run the Instruction Mix recipe to confirm where optimization opportunities exist and implement vectorization.
---
title: Analyze SIMD utilization with the Instruction Mix recipe
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run the Instruction Mix recipe

The previous CPU Microarchitecture analysis showed that the sample application used no single instruction, multiple data (SIMD) operations, which points to an optimization opportunity. Run the Instruction Mix recipe to learn more. The Instruction Mix launch panel is similar to CPU Microarchitecture, but it does not include options to choose metrics. Again, enter the full path to the workload.

Select **Dynamic** for the **Analysis Mode**.

![Arm Performix Instruction Mix configuration screen#center](instruction-mix-config.webp "Instruction Mix Configuration")

The results below confirm a high number of integer and floating-point operations, with no SIMD operations. The **Insights** panel suggests vectorization as a path forward, lists possible root causes, and links to related Learning Paths.

![Arm Performix Instruction Mix results showing high integer and floating point operations#center](instruction-mix-results.webp "Instruction Mix Results")

## Vectorize the application

To address the lack of SIMD operations, you can vectorize the application's most intensive functions. For the Mandelbrot application, `Mandelbrot::draw` and its inner `Mandelbrot::getIterations` function consume most of the runtime. A vectorized version is available in the [instruction-mix branch](https://github.com/arm-education/Mandelbrot-Example/tree/instruction-mix). This branch uses Neon operations, which run on any Neoverse system. Your system might also support alternatives such as SVE or SVE2.
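To see why this loop vectorizes well, consider a portable scalar sketch (not the branch's actual Neon code; the function name and four-wide grouping are illustrative) that advances four independent pixels in lockstep, which is exactly the pattern SIMD lanes exploit:

```cpp
#include <array>
#include <complex>

// Advance four independent Mandelbrot points in lockstep, mirroring how
// four SIMD lanes would iterate together in the vectorized branch.
std::array<int, 4> countIterations4(const std::array<std::complex<double>, 4>& c,
                                    int maxIterations) {
    std::array<std::complex<double>, 4> z{};  // all lanes start at 0
    std::array<int, 4> iters{};
    for (int i = 0; i < maxIterations; ++i) {
        bool anyActive = false;
        for (int lane = 0; lane < 4; ++lane) {
            // A lane stays active only while it has not escaped |z| > 2.
            if (iters[lane] == i && std::norm(z[lane]) <= 4.0) {
                z[lane] = z[lane] * z[lane] + c[lane];
                ++iters[lane];
                anyActive = true;
            }
        }
        if (!anyActive) break;  // every lane has escaped
    }
    return iters;
}
```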

Connect to your target machine using SSH and navigate to your project directory. Because you modified `main.cpp` earlier, you must stash your changes before switching to the `instruction-mix` branch. Then, rebuild the application:

```bash
cd $HOME/Mandelbrot-Example
git stash
git checkout instruction-mix
./build.sh
```

After you rebuild the application and run the Instruction Mix recipe again, integer and floating-point operations are greatly reduced and replaced by a smaller set of SIMD instructions.

![Arm Performix Instruction Mix results after vectorization showing increased SIMD operations#center](instruction-mix-simd-results.webp "SIMD Instruction Mix Results")

## Assess the performance improvements

Because you are running multiple experiments, give each run a meaningful nickname to keep results organized.
![Arm Performix run renaming interface#center](rename-run.webp "Rename Run")

Use the **Compare** feature at the top right of an entry in the **Runs** view to select another run of the same recipe for comparison.

![Arm Performix compare view selection box#center](compare-with-box.webp "Compare Runs")

This selection box lets you choose any run of the same recipe type. The ⇄ arrows swap which run is treated as the baseline and which is current.

After you select two runs, Arm Performix overlays them so you can review category changes in one view.

![Arm Performix comparison showing differences in instruction mix#center](instruction-mix-diff-results.webp "Instruction Mix Comparison")

Compared to the baseline, floating-point operations, branch operations, and some integer operations have been traded for loads, stores, and SIMD operations.
Execution time also improves significantly, making this run nearly four times faster.

```bash
user 0m8.331s
sys 0m0.016s
```

## Compare the CPU Microarchitecture results

The CPU Microarchitecture recipe also supports a **Compare** view that shows percentage-point changes in each stage and instruction type.
![Arm Performix CPU Microarchitecture comparison showing changes in each stage#center](cpu-uarch-simd-results-diff.webp "CPU Microarchitecture Difference View")

You can now see that Load and Store operations account for about 70% of execution time. **Insights** offers several explanations because multiple issues can contribute to the root cause.
```
POSSIBLE CAUSES
- Instruction dependencies that create pipeline bubbles
```

## Apply compiler optimizations for loop unrolling

To address the new load and store bottlenecks, add optimization flags to the compiler to enable more aggressive loop unrolling. Edit the `build.sh` script to include these flags in the `CXXFLAGS` array:
```bash
# build.sh
CXXFLAGS=(
    # ...
)
```
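As a concrete illustration of what such flags can look like (the exact set in the `instruction-mix` branch may differ; `-O3` and `-funroll-loops` are common GCC/Clang choices, not taken from the repository):

```bash
# Example only: flags commonly used on GCC/Clang to encourage unrolling
CXXFLAGS=(
    -O3              # full optimizer, including auto-vectorization
    -funroll-loops   # unroll loops with computable trip counts
)
```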

After saving the file, run `./build.sh` to compile the application with the new flags.

Runtime improves again, with an additional 11x speedup over the SIMD build that used the default compiler flags.


```bash { command_line="root@localhost | 2-6" }
sys 0m0.014s
```

Another CPU Microarchitecture measurement shows that Load and Store bottlenecks are almost eliminated. SIMD floating-point operations now dominate execution, which indicates the application is better tuned to feed floating-point execution units.
![Arm Performix insights showing high SIMD utilization#center](high-simd-utilization.webp "High SIMD Utilization")

The program still generates the same output, and runtime drops from 31 s to less than 1 s, a 43x speedup.

![Arm Performix results highlighting total performance improvement#center](performance-improvement.webp "Performance Improvement Summary")

## What you've accomplished and what's next

In this section:
- You used the Instruction Mix recipe to confirm a lack of SIMD operations.
- You vectorized the sample application and verified the shift toward SIMD execution.
- You applied compiler loop unrolling to relieve backend load/store bottlenecks, achieving over 40x speedup.

You are now ready to analyze and optimize your own native C/C++ applications on Arm Neoverse using Arm Performix. Review the next steps to continue your learning journey.