---
title: Setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

This Learning Path uses the Arm Performix CPU Microarchitecture and Instruction Mix recipes to analyze the performance of a sample application.

## Before you begin

If this is your first time using Arm Performix, follow the [installation guide](/install-guides/atp/) to install the tool. From the host machine, open the **Targets** tab, set up an SSH connection to the target that runs the workload, and test the connection. The examples in this Learning Path connect to an Arm Neoverse V1 workstation.

Install required OS packages on the target. For Debian-based distributions, run:
```bash
sudo apt-get install python3 python3-venv binutils

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need python on the target for Arm Performix to run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I know we need binutils for objdump for identifying stack frames for Code Hotspots.

And I know I saw some recipes of Performix complain if venv was unavailable, but I can't remember which recipes or why it's needed. But I believe it was one of Code Hotspots, Instruction Mix, or Cpu Microarchitecture. I don't mind removing venv if this isn't an expected dependency, but I know it's been a point of friction for some users.

```

## Build sample application on remote server

Connect to your target machine and download the sample application for this Learning Path, a Mandelbrot set generator.
The code is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file). Create a directory where you want to store and build the example, then run:

```bash
git clone https://github.com/arm-university/Mandelbrot-Example.git
cd Mandelbrot-Example && mkdir images builds
```

Install a C++ compiler by using your operating system's package manager.

```bash
sudo apt install build-essential
```

Build the application:

```bash
./build.sh
```

The binary in the `./builds/` directory generates an image similar to the fractal below.

![Green-Parallel-512.bmp](./Green-Parallel-512.bmp)
---
title: Find Bottlenecks with CPU Microarchitecture
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run CPU Microarchitecture analysis

As shown in the `main.cpp` listing below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Microarchitecture recipe in Arm Performix (APX). APX uses microarchitectural sampling to show which instruction pipeline stages dominate program latency, then highlights ways to improve those bottlenecks.


{{% notice Specify the example output file %}}
Replace the first string argument in `myplot.draw()` with the absolute path to your image folder, then rebuild the application. Otherwise, the image is written to `/tmp/atperf/tools/atperf-agent`, which is periodically deleted.
{{% /notice %}}

```cpp
#include "Mandelbrot.h"
#include <iostream>

using namespace std;

int main() {
    Mandelbrot::Mandelbrot myplot(1920, 1080);
    myplot.draw("/path/to/images/green.bmp", Mandelbrot::Mandelbrot::GREEN);

    return 0;
}
```

On your host machine, open Arm Performix and select the **CPU Microarchitecture** recipe.

![config](./cpu-uarch-config.jpg)

Select the target you configured in the setup phase. If this is your first run on this target, you likely need to select **Install Tools** to copy collection tools to the target. Next, select the **Workload type**. You can sample the whole system or attach to an existing process, but in this exercise you launch a new process.

{{% notice Common Gotcha %}}
Use the full path to your executable because the **Workload** field does not currently support shell-style path expansion.
{{% /notice %}}

You can set a time limit for the workload and customize metrics if you already know what to investigate.

The **Collect managed code stacks** toggle matters for Java/JVM or .NET workloads.

You can also select High, Normal, or Low sampling rates to trade off collection overhead and sampling granularity.

Select **Run Recipe** to launch the workload and collect performance data.

## View Run Results

Arm Performix generates a high-level instruction pipeline view, highlighting where most time is spent.

![cpu-uarch-results.jpg](cpu-uarch-results.jpg)

In this breakdown, Backend Stalls dominate samples. Within that category, work is split between Load Operations and integer and floating-point operations.
There is no measured SIMD activity, even though this workload is highly parallelizable.

The **Insights** panel highlights ALU contention as a likely improvement opportunity.

![cpu-uarch-insights.jpg](cpu-uarch-insights.jpg)

To inspect executed instruction types in more detail, use the Instruction Mix recipe in the next step.
---
title: Understand Instruction Mix
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run Instruction Mix

The previous CPU Microarchitecture analysis showed that the sample application used no single instruction, multiple data (SIMD) operations, which points to an optimization opportunity. Run the Instruction Mix recipe to learn more. The Instruction Mix launch panel is similar to CPU Microarchitecture, but it does not include options to choose metrics. Again, enter the full path to the workload. This Mandelbrot example is native C++ code, not Java or .NET, so you do not need to collect managed code stacks.

![instruction-mix-config.jpg](instruction-mix-config.jpg)


The results below confirm a high number of integer and floating-point operations, with no SIMD operations. The **Insights** panel suggests vectorization as a path forward, lists possible root causes, and links to related Learning Paths.
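To see why the compiler emits no SIMD here, consider the shape of a typical per-pixel escape-time loop. The sketch below is illustrative, not the repository's exact code (the function name and signature are hypothetical): the data-dependent `break` means each pixel runs for a different number of iterations, a pattern compilers rarely auto-vectorize.

```cpp
// Illustrative scalar escape-time loop: iterate z = z*z + c until |z| > 2
// or the iteration budget runs out. The early exit depends on the data,
// which is why this loop usually stays scalar without manual vectorization.
int escapeIterations(double re, double im, int maxIter) {
    double zr = 0.0, zi = 0.0;
    int i = 0;
    while (i < maxIter) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0) break;   // data-dependent exit blocks auto-vectorization
        zi = 2.0 * zr * zi + im;      // imaginary part of z*z + c
        zr = zr2 - zi2 + re;          // real part of z*z + c
        ++i;
    }
    return i;
}
```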

![instruction-mix-results.jpg](instruction-mix-results.jpg)

## Vectorize

The CPU Hotspots recipe in [Find CPU cycle hotspots with Arm Performix](/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/) helps you identify which functions consume the most time. In this example, `Mandelbrot::draw` and its inner function `Mandelbrot::getIterations` dominate runtime. A vectorized version is available in the [instruction-mix branch](https://github.com/arm-education/Mandelbrot-Example/tree/instruction-mix). This branch uses Neon operations for Neoverse N1, while your platform might support alternatives such as SVE or SVE2.
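As a portable illustration of the idea behind the vectorized branch (this is not the branch's actual Neon code, and the names are hypothetical), the sketch below keeps four pixels in flight per loop iteration and uses a per-lane active mask, which is exactly the work a SIMD unit performs across its lanes in one instruction:

```cpp
#include <array>

// Illustrative 4-lane escape-time loop: all lanes advance together and a
// mask records which lanes have already escaped, mirroring how a Neon
// implementation processes four pixels per iteration.
std::array<int, 4> escapeIterations4(const std::array<double, 4>& re,
                                     const std::array<double, 4>& im,
                                     int maxIter) {
    std::array<double, 4> zr{}, zi{};
    std::array<int, 4> iters{};
    std::array<bool, 4> active{true, true, true, true};
    for (int i = 0; i < maxIter; ++i) {
        bool anyActive = false;
        for (int lane = 0; lane < 4; ++lane) {  // a SIMD unit does these lanes at once
            if (!active[lane]) continue;
            double zr2 = zr[lane] * zr[lane], zi2 = zi[lane] * zi[lane];
            if (zr2 + zi2 > 4.0) { active[lane] = false; continue; }
            zi[lane] = 2.0 * zr[lane] * zi[lane] + im[lane];
            zr[lane] = zr2 - zi2 + re[lane];
            ++iters[lane];
            anyActive = true;
        }
        if (!anyActive) break;  // stop once every lane has escaped
    }
    return iters;
}
```

The cost of this style is that finished lanes ride along until the whole vector escapes, but on a wide pipeline the throughput gain usually dominates.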

After you rebuild the application and run Instruction Mix again, integer and floating-point operations are greatly reduced and replaced by a smaller set of SIMD instructions.

![instruction-mix-simd-results.jpg](instruction-mix-simd-results.jpg)

## Assess improvements

Because you are running multiple experiments, give each run a meaningful nickname to keep results organized.
![rename-run.jpg](rename-run.jpg)

Use the **Compare** feature at the top right of an entry in the **Runs** view to select another run of the same recipe for comparison.

![compare-with-box.jpg](compare-with-box.jpg)

This selection box lets you choose any run of the same recipe type. The ⇄ arrows swap which run is treated as the baseline and which is current.

After you select two runs, Arm Performix overlays them so you can review category changes in one view.

![instruction-mix-diff-results.jpg](instruction-mix-diff-results.jpg)
Compared to the baseline, floating-point operations, branch operations, and some integer operations have been traded for loads, stores, and SIMD operations.
Execution time also improves significantly, making this run nearly four times faster.

```bash { command_line="root@localhost | 2-6" }
time builds/mandelbrot-parallel-no-simd 1
Number of Threads = 1

real 0m31.326s
user 0m31.279s
sys 0m0.011s
```

```bash { command_line="root@localhost | 2-6" }
time builds/mandelbrot-parallel 1
Number of Threads = 1

real 0m8.362s
user 0m8.331s
sys 0m0.016s
```

## CPU Microarchitecture results comparison

The CPU Microarchitecture recipe also supports a **Compare** view that shows percentage-point changes in each stage and instruction type.
![cpu-uarch-simd-results-diff.jpg](cpu-uarch-simd-results-diff.jpg)

You can now see that Load and Store operations account for about 70% of execution time. **Insights** offers several explanations because multiple issues can contribute to the root cause.
```
The CPU spends a larger share of cycles stalled in the backend, meaning execution or memory resources cannot complete work fast enough. This is a cycle-based measure (percentage of stalled cycles).

POSSIBLE CAUSES

- Slow memory access, for example, L2 cache misses or Dynamic Random-Access Memory (DRAM) misses
- Contention in execution pipelines, for example, the Arithmetic Logic Unit (ALU) or load/store units
- Poor data locality
- Excessive branching
- Instruction dependencies that create pipeline bubbles
```
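To make the "poor data locality" cause concrete, here is a small illustration that is not taken from the sample application: both functions compute the same sum over an image-sized buffer, but the column-major version strides a full row between accesses, so on large buffers it tends to miss in cache and shows up as backend stalls in CPU Microarchitecture results.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal touches consecutive addresses (unit stride, cache
// friendly); column-major traversal jumps a full row per access, which is
// a classic source of memory-bound backend stalls on large images.
double sumRowMajor(const std::vector<double>& img, std::size_t w, std::size_t h) {
    double s = 0.0;
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            s += img[y * w + x];  // unit-stride access
    return s;
}

double sumColMajor(const std::vector<double>& img, std::size_t w, std::size_t h) {
    double s = 0.0;
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
            s += img[y * w + x];  // stride of w elements per access
    return s;
}
```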

Next, add optimization flags to the compiler to enable more aggressive loop unrolling.
```bash
# build.sh
CXXFLAGS=(
--std=c++11
-O3
-mcpu=neoverse-n1+crc+crypto
-ffast-math
-funroll-loops
-flto
-DNDEBUG
)
```

Runtime improves again, with an additional 11x speedup over the SIMD build that uses default compiler flags.


```bash { command_line="root@localhost | 2-6" }
time ./builds/mandelbrot-parallel 1
Number of Threads = 1

real 0m0.743s
user 0m0.724s
sys 0m0.014s
```

Another CPU Microarchitecture measurement shows that Load and Store bottlenecks are almost eliminated. SIMD floating-point operations now dominate execution, which indicates the application is better tuned to feed floating-point execution units.
![high-simd-utilization.jpg](high-simd-utilization.jpg)

The program still generates the same output, and runtime drops from 31 s to less than 1 s, roughly a 42x speedup.

![performance-improvement.jpg](performance-improvement.jpg)
@@ -0,0 +1,57 @@
---
title: Tune application performance with Arm Performix CPU Microarchitecture analysis

minutes_to_complete: 60

who_is_this_for: This introductory Learning Path is for software developers who want to learn performance analysis methodologies for Linux applications on Arm Neoverse.

learning_objectives:
- Understand sampling and counting for performance analysis
- Learn commonly used hardware metrics
- Analyze a sample application by using Arm Performix
- Make an application code change and see improved performance

prerequisites:
- An Arm Neoverse N1 or higher computer running Linux. A bare-metal or cloud bare-metal instance is best because it exposes more counters.

author:
- Brendan Long
- Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- Arm Performix
- C++
- Runbook

operatingsystems:
- Linux

further_reading:
- resource:
title: "Find CPU Cycle Hotspots with Arm Performix"
link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/
type: documentation
- resource:
title: "Port Code to Arm Scalable Vector Extension (SVE)"
link: /learning-paths/servers-and-cloud-computing/sve/
type: documentation
- resource:
title: "Arm Neoverse N1: Core Performance Analysis Methodology"
link: https://armkeil.blob.core.windows.net/developer/Files/pdf/white-paper/neoverse-n1-core-performance-v2.pdf
type: documentation
- resource:
title: "Arm Neoverse N1 PMU Guide"
link: https://developer.arm.com/documentation/PJDOC-466751330-547673/r4p1/
type: documentation

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---