Add learning path for CPU Microarchitecture analysis with Arm Performix #2961
Changes from all commits
894045b
5e4d86b
7c141fc
035f33d
7acade6
05bce1a
a23890e
8b54ec6
f57ba69
1cfe5e7
7713faa
1e82cb7
4f9a377
a89f1a0
3464fc8
3baedf9
c1d694e
858d0ff
f130597
2f5204c
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| --- | ||
| title: Setup | ||
| weight: 2 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| This Learning Path uses the Arm Performix CPU Microarchitecture and Instruction Mix recipes to analyze performance in a sample application. | ||
|
|
||
| ## Before you begin | ||
|
|
||
| Use the Arm Performix [installation guide](/install-guides/atp/) to install the tool if this is your first run. From the host machine, open the **Targets** tab, set up an SSH connection to the target that runs the workload, and test the connection. The examples in this Learning Path connect to an Arm Neoverse V1 workstation. | ||
|
|
||
| Install required OS packages on the target. For Debian-based distributions, run: | ||
| ```bash | ||
| sudo apt-get install python3 python3-venv binutils | ||
|
Do we really need python on the target for Arm Performix to run?
Contributor
Author
Hmm, I know we need binutils for objdump for identifying stack frames for Code Hotspots. And I know I saw some recipes of Performix complain if |
||
| ``` | ||
|
|
||
| ## Build sample application on remote server | ||
|
|
||
| Connect to your target machine and download the sample application for this Learning Path, a Mandelbrot set generator. | ||
| The code is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file). Create a directory where you want to store and build the example, then run: | ||
|
|
||
| ```bash | ||
| git clone https://github.com/arm-university/Mandelbrot-Example.git | ||
| cd Mandelbrot-Example && mkdir images builds | ||
| ``` | ||
|
|
||
| Install a C++ compiler by using your operating system's package manager. | ||
|
|
||
| ```bash | ||
| sudo apt install build-essential | ||
| ``` | ||
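Before building, it can help to confirm that the tools installed by the packages above are on the target's `PATH`. The following is a minimal sketch, not part of Arm Performix: the `missing_tools` helper is hypothetical, and it assumes `python3`, `objdump` (from `binutils`), and `g++` (from `build-essential`) are the commands of interest.

```bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands assumed to come from the packages named in the setup steps.
missing="$(missing_tools python3 objdump g++)"
if [ -n "$missing" ]; then
  printf 'Missing on this target:\n%s\n' "$missing"
else
  echo "All required tools found."
fi
```

If the script lists a missing command, rerun the matching package installation step before you continue.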
|
|
||
| Build the application: | ||
|
|
||
| ```bash | ||
| ./build.sh | ||
| ``` | ||
|
|
||
| The binary in the `./builds/` directory generates an image similar to the fractal below. | ||
|
|
||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| --- | ||
| title: Find Bottlenecks with CPU Microarchitecture | ||
| weight: 3 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## Run CPU Microarchitecture analysis | ||
|
|
||
| As shown in the `main.cpp` listing below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Microarchitecture recipe in Arm Performix (APX). APX uses microarchitectural sampling to show which instruction pipeline stages dominate program latency, then highlights ways to improve those bottlenecks. | ||
|
|
||
|
|
||
| {{% notice Specify the example output file %}} | ||
| Replace the first string argument in `myplot.draw()` with the absolute path to your image folder, then rebuild the application. Otherwise, the image is written to `/tmp/atperf/tools/atperf-agent`, which is periodically deleted. | ||
| {{% /notice %}} | ||
|
|
||
| ```cpp | ||
| #include "Mandelbrot.h" | ||
| #include <iostream> | ||
|
|
||
| using namespace std; | ||
|
|
||
| int main(){ | ||
|
|
||
| Mandelbrot::Mandelbrot myplot(1920, 1080); | ||
| myplot.draw("/path/to/images/green.bmp", Mandelbrot::Mandelbrot::GREEN); | ||
|
|
||
| return 0; | ||
| } | ||
| ``` | ||
|
|
||
| On your host machine, open Arm Performix and select the **CPU Microarchitecture** recipe. | ||
|
|
||
|  | ||
|
|
||
| Select the target you configured in the setup phase. If this is your first run on this target, you likely need to select **Install Tools** to copy collection tools to the target. Next, select the **Workload type**. You can sample the whole system or attach to an existing process, but in this exercise you launch a new process. | ||
|
|
||
| {{% notice Common Gotcha %}} | ||
| Use the full path to your executable because the **Workload** field does not currently support shell-style path expansion. | ||
| {{% /notice %}} | ||
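Because the **Workload** field needs a full path, one option is to resolve the path on the target and paste the result. The following is a minimal sketch under that assumption; the `abs_path` helper is hypothetical and not an Arm Performix feature.

```bash
# Hypothetical helper: print the absolute path of a file given a relative one.
# Only the containing directory has to exist; symlinks in it are resolved.
abs_path() {
  local dir base
  dir="$(cd "$(dirname -- "$1")" && pwd -P)" || return 1
  base="$(basename -- "$1")"
  printf '%s/%s\n' "${dir%/}" "$base"
}

# When a file is passed as an argument, print the path to paste into the field.
if [ "$#" -gt 0 ]; then
  abs_path "$1"
fi
```

Saved under a name of your choice (for example `abs-path.sh`, a hypothetical filename), it could be run from the repository root as `bash abs-path.sh builds/<your-binary>`.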
|
|
||
| You can set a time limit for the workload and customize metrics if you already know what to investigate. | ||
|
|
||
| The **Collect managed code stacks** toggle applies to Java/JVM or .NET workloads. This Mandelbrot example is native C++, so you can leave it off. | ||
|
|
||
| You can also select High, Normal, or Low sampling rates to trade off collection overhead against sampling granularity. | ||
|
|
||
| Select **Run Recipe** to launch the workload and collect performance data. | ||
|
|
||
| ## View run results | ||
|
|
||
| Performix generates a high-level instruction pipeline view, highlighting where most time is spent. | ||
|
|
||
|  | ||
|
|
||
| In this breakdown, Backend Stalls dominate samples. Within that category, work is split between Load Operations and integer and floating-point operations. | ||
| There is no measured SIMD activity, even though this workload is highly parallelizable. | ||
|
|
||
| The **Insights** panel highlights ALU contention as a likely improvement opportunity. | ||
|
|
||
|  | ||
|
|
||
| To inspect executed instruction types in more detail, use the Instruction Mix recipe in the next step. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| --- | ||
| title: Understand Instruction Mix | ||
| weight: 4 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## Run Instruction Mix | ||
|
|
||
| The previous CPU Microarchitecture analysis showed that the sample application used no single instruction, multiple data (SIMD) operations, which points to an optimization opportunity. Run the Instruction Mix recipe to learn more. The Instruction Mix launch panel is similar to CPU Microarchitecture, but it does not include options to choose metrics. Again, enter the full path to the workload. This Mandelbrot example is native C++ code, not Java or .NET, so you do not need to collect managed code stacks. | ||
|
|
||
|  | ||
|
|
||
|
|
||
| The results below confirm a high number of integer and floating-point operations, with no SIMD operations. The **Insights** panel suggests vectorization as a path forward, lists possible root causes, and links to related Learning Paths. | ||
|
Make sure that this is true with the current way that insights work
Contributor
Author
It is. I find the link to the learning path on using SVE is a good highlight of how effective Insights can be when they're working well.
||
|
|
||
|  | ||
|
|
||
| ## Vectorize | ||
|
|
||
| The CPU Hotspots recipe in [Find CPU cycle hotspots with Arm Performix](/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/) helps you identify which functions consume the most time. In this example, `Mandelbrot::draw` and its inner function `Mandelbrot::getIterations` dominate runtime. A vectorized version is available in the [instruction-mix branch](https://github.com/arm-education/Mandelbrot-Example/tree/instruction-mix). This branch uses Neon operations for Neoverse N1, while your platform might support alternatives such as SVE or SVE2. | ||
|
|
||
| After you rebuild the application and run Instruction Mix again, integer and floating-point operations are greatly reduced and replaced by a smaller set of SIMD instructions. | ||
|
|
||
|  | ||
|
|
||
| ## Assess improvements | ||
|
|
||
| Because you are running multiple experiments, give each run a meaningful nickname to keep results organized. | ||
|  | ||
|
|
||
| Use the **Compare** feature at the top right of an entry in the **Runs** view to select another run of the same recipe for comparison. | ||
|
|
||
|  | ||
|
|
||
| This selection box lets you choose any run of the same recipe type. The ⇄ arrows swap which run is treated as the baseline and which is current. | ||
|
|
||
| After you select two runs, Performix overlays them so you can review category changes in one view. | ||
|
|
||
|  | ||
| Compared to the baseline, floating-point operations, branch operations, and some integer operations have been traded for loads, stores, and SIMD operations. | ||
| Execution time also improves significantly, making this run nearly four times faster. | ||
|
|
||
| ```bash { command_line="root@localhost | 2-6" } | ||
| time builds/mandelbrot-parallel-no-simd 1 | ||
| Number of Threads = 1 | ||
|
|
||
| real 0m31.326s | ||
| user 0m31.279s | ||
| sys 0m0.011s | ||
| ``` | ||
|
|
||
| ```bash { command_line="root@localhost | 2-6" } | ||
| time builds/mandelbrot-parallel 1 | ||
| Number of Threads = 1 | ||
|
|
||
| real 0m8.362s | ||
| user 0m8.331s | ||
| sys 0m0.016s | ||
| ``` | ||
|
|
||
| ## CPU Microarchitecture results comparison | ||
|
|
||
| The CPU Microarchitecture recipe also supports a **Compare** view that shows percentage-point changes in each stage and instruction type. | ||
|  | ||
|
|
||
| You can now see that Load and Store operations account for about 70% of execution time. **Insights** offers several explanations because multiple issues can contribute to the root cause. | ||
| ``` | ||
| The CPU spends a larger share of cycles stalled in the backend, meaning execution or memory resources cannot complete work fast enough. This is a cycle-based measure (percentage of stalled cycles). | ||
|
|
||
| POSSIBLE CAUSES | ||
|
|
||
| - Slow memory access, for example, L2 cache misses or Dynamic Random-Access Memory (DRAM) misses | ||
| - Contention in execution pipelines, for example, the Arithmetic Logic Unit (ALU) or load/store units | ||
| - Poor data locality | ||
| - Excessive branching | ||
| - Instruction dependencies that create pipeline bubbles | ||
| ``` | ||
|
|
||
| Next, add optimization flags to the compiler to enable more aggressive loop unrolling. | ||
| ```bash | ||
| # build.sh | ||
| CXXFLAGS=( | ||
| --std=c++11 | ||
| -O3 | ||
| -mcpu=neoverse-n1+crc+crypto | ||
| -ffast-math | ||
| -funroll-loops | ||
| -flto | ||
| -DNDEBUG | ||
| ) | ||
| ``` | ||
|
|
||
| Runtime improves again, with an additional 11x speedup over the SIMD build that uses default compiler flags. | ||
|
|
||
|
|
||
| ```bash { command_line="root@localhost | 2-6" } | ||
| time ./builds/mandelbrot-parallel 1 | ||
| Number of Threads = 1 | ||
|
|
||
| real 0m0.743s | ||
| user 0m0.724s | ||
| sys 0m0.014s | ||
| ``` | ||
|
|
||
| Another CPU Microarchitecture measurement shows that Load and Store bottlenecks are almost eliminated. SIMD floating-point operations now dominate execution, which indicates the application is better tuned to feed floating-point execution units. | ||
|  | ||
|
|
||
| The program still generates the same output, and runtime drops from 31 s to less than 1 s, a speedup of about 42x. | ||
|
|
||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| --- | ||
| title: Tune application performance with Arm Performix CPU Microarchitecture analysis | ||
|
|
||
| minutes_to_complete: 60 | ||
|
|
||
| who_is_this_for: This introductory Learning Path is for software developers who want to learn performance analysis methodologies for Linux applications on Arm Neoverse. | ||
|
|
||
| learning_objectives: | ||
| - Understand sampling and counting for performance analysis | ||
| - Learn commonly used hardware metrics | ||
| - Analyze a sample application by using Arm Performix | ||
| - Make an application code change and see improved performance | ||
|
|
||
| prerequisites: | ||
| - An Arm Neoverse N1 or higher computer running Linux. A bare-metal or cloud bare-metal instance is best because it exposes more counters. | ||
|
|
||
| author: | ||
| - Brendan Long | ||
| - Kieran Hejmadi | ||
|
|
||
| ### Tags | ||
| skilllevels: Introductory | ||
| subjects: Performance and Architecture | ||
| armips: | ||
| - Neoverse | ||
| tools_software_languages: | ||
| - Arm Performix | ||
| - C++ | ||
| - Runbook | ||
|
|
||
| operatingsystems: | ||
| - Linux | ||
|
|
||
| further_reading: | ||
| - resource: | ||
| title: "Find CPU Cycle Hotspots with Arm Performix" | ||
| link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/ | ||
| type: documentation | ||
| - resource: | ||
| title: "Port Code to Arm Scalable Vector Extension (SVE)" | ||
| link: /learning-paths/servers-and-cloud-computing/sve/ | ||
| type: documentation | ||
| - resource: | ||
| title: "Arm Neoverse N1: Core Performance Analysis Methodology" | ||
| link: https://armkeil.blob.core.windows.net/developer/Files/pdf/white-paper/neoverse-n1-core-performance-v2.pdf | ||
| type: documentation | ||
| - resource: | ||
| title: "Arm Neoverse N1 PMU Guide" | ||
| link: https://developer.arm.com/documentation/PJDOC-466751330-547673/r4p1/ | ||
| type: documentation | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| # ================================================================================ | ||
| weight: 1 # _index.md always has weight of 1 to order correctly | ||
| layout: "learningpathall" # All files under learning paths have this same wrapper | ||
| learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. | ||
| --- |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| --- | ||
| # ================================================================================ | ||
| # FIXED, DO NOT MODIFY THIS FILE | ||
| # ================================================================================ | ||
| weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. | ||
| title: "Next Steps" # Always the same, html page title. | ||
| layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. | ||
| --- |
Arm Performix
Thanks, I found a couple other instances of 'Performix' without the 'Arm'. Fixing in #3002