diff --git a/assets/contributors.csv b/assets/contributors.csv index dedf06545e..96eacdaeb6 100644 --- a/assets/contributors.csv +++ b/assets/contributors.csv @@ -113,5 +113,6 @@ Steve Suzuki,Arm,,,, Qixiang Xu,Arm,,,, Phalani Paladugu,Arm,phalani-paladugu,phalani-paladugu,, Richard Burton,Arm,Burton2000,,, +Brendan Long,Arm,bccbrendan,https://www.linkedin.com/in/brendan-long-5817924/,, Asier Arranz,NVIDIA,,asierarranz,,asierarranz.com -Prince Agyeman,Arm,,,, \ No newline at end of file +Prince Agyeman,Arm,,,, diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/1-setup.md b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/1-setup.md new file mode 100644 index 0000000000..aedafa7636 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/1-setup.md @@ -0,0 +1,44 @@ +--- +title: Setup +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +This Learning Path uses the Arm Performix CPU Microarchitecture and Instruction Mix recipes to analyze performance in a sample application. + +## Before you begin + +Use the Performix [installation guide](/install-guides/atp/) to install the tool if this is your first run. From the host machine, open the **Targets** tab, set up an SSH connection to the target that runs the workload, and test the connection. In this Learning Path's examples, I'll connect to an Arm Neoverse V1 workstation. + +Install required OS packages on the target. For Debian-based distributions, run: +```bash +sudo apt-get install python3 python3-venv binutils +``` + +## Build sample application on remote server + +Connect to your target machine and download the sample application for this Learning Path, a Mandelbrot set generator. +The code is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file). 
Create a directory where you want to store and build the example, then run: + +```bash +git clone https://github.com/arm-university/Mandelbrot-Example.git +cd Mandelbrot-Example && mkdir images builds +``` + +Install a C++ compiler by using your operating system's package manager. + +```bash +sudo apt install build-essential +``` + +Build the application: + +```bash +./build.sh +``` + +The binary in the `./builds/` directory generates an image similar to the fractal below. + +![Green-Parallel-512.bmp](./Green-Parallel-512.bmp) \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/2-run-cpu-uarch.md b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/2-run-cpu-uarch.md new file mode 100644 index 0000000000..0940dfec3e --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/2-run-cpu-uarch.md @@ -0,0 +1,64 @@ +--- +title: Find Bottlenecks with CPU Microarchitecture +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run CPU Microarchitecture analysis + +As shown in the `main.cpp` listing below, the program generates a 1920×1080 bitmap image of the fractal. To identify performance bottlenecks, run the CPU Microarchitecture recipe in Arm Performix (APX). APX uses microarchitectural sampling to show which instruction pipeline stages dominate program latency, then highlights ways to improve those bottlenecks. + + +{{% notice Specify the example output file %}} +Replace the first string argument in `myplot.draw()` with the absolute path to your image folder, then rebuild the application. Otherwise, the image is written to `/tmp/atperf/tools/atperf-agent`, which is periodically deleted. 
+{{% /notice %}} + +```cpp +#include "Mandelbrot.h" +#include + +using namespace std; + +int main(){ + + Mandelbrot::Mandelbrot myplot(1920, 1080); + myplot.draw("/path/to/images/green.bmp", Mandelbrot::Mandelbrot::GREEN); + + return 0; +} +``` + +On your host machine, open Arm Performix and select the **CPU Microarchitecture** recipe. + +![config](./cpu-uarch-config.jpg) + +Select the target you configured in the setup phase. If this is your first run on this target, you likely need to select **Install Tools** to copy collection tools to the target. Next, select the **Workload type**. You can sample the whole system or attach to an existing process, but in this exercise you launch a new process. + +{{% notice Common Gotcha%}} +Use the full path to your executable because the **Workload** field does not currently support shell-style path expansion. +{{% /notice %}} + +You can set a time limit for the workload and customize metrics if you already know what to investigate. + +The **Collect managed code stacks** toggle matters for Java/JVM or .NET workloads. + +You can also select High, Normal, or Low sampling rates to trade off collection overhead and sampling granularity. + +Select **Run Recipe** to launch the workload and collect performance data. + +## View Run Results + +Performix generates a high-level instruction pipeline view, highlighting where most time is spent. + +![cpu-uarch-results.jpg](cpu-uarch-results.jpg) + +In this breakdown, Backend Stalls dominate samples. Within that category, work is split between Load Operations and integer and floating-point operations. +There is no measured SIMD activity, even though this workload is highly parallelizable. + +The **Insights** panel highlights ALU contention as a likely improvement opportunity. + +![cpu-uarch-insights.png](cpu-uarch-insights.png) + +To inspect executed instruction types in more detail, use the Instruction Mix recipe in the next step. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/3-instruction-mix.md b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/3-instruction-mix.md new file mode 100644 index 0000000000..3865221c36 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/3-instruction-mix.md @@ -0,0 +1,112 @@ +--- +title: Understand Instruction Mix +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run Instruction Mix + +The previous CPU Microarchitecture analysis showed that the sample application used no single instruction, multiple data (SIMD) operations, which points to an optimization opportunity. Run the Instruction Mix recipe to learn more. The Instruction Mix launch panel is similar to CPU Microarchitecture, but it does not include options to choose metrics. Again, enter the full path to the workload. This Mandelbrot example is native C++ code, not Java or .NET, so you do not need to collect managed code stacks. + +![instruction-mix-config.jpg](instruction-mix-config.jpg) + + +The results below confirm a high number of integer and floating-point operations, with no SIMD operations. The **Insights** panel suggests vectorization as a path forward, lists possible root causes, and links to related Learning Paths. + +![instruction-mix-results.jpg](instruction-mix-results.jpg) + +## Vectorize + +The CPU Hotspots recipe in [Find CPU cycle hotspots with Arm Performix](/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/) helps you identify which functions consume the most time. In this example, `Mandelbrot::draw` and its inner function `Mandelbrot::getIterations` dominate runtime. A vectorized version is available in the [instruction-mix branch](https://github.com/arm-education/Mandelbrot-Example/tree/instruction-mix). 
This branch uses Neon operations for Neoverse N1, while your platform might support alternatives such as SVE or SVE2. + +After you rebuild the application and run Instruction Mix again, integer and floating-point operations are greatly reduced and replaced by a smaller set of SIMD instructions. + +![instruction-mix-simd-results.jpg](instruction-mix-simd-results.jpg) + +## Assess improvements + +Because you are running multiple experiments, give each run a meaningful nickname to keep results organized. +![rename-run.jpg](rename-run.jpg) + +Use the **Compare** feature at the top right of an entry in the **Runs** view to select another run of the same recipe for comparison. + +![compare-with-box.jpg](compare-with-box.jpg) + +This selection box lets you choose any run of the same recipe type. The ⇄ arrows swap which run is treated as the baseline and which is current. + +After you select two runs, Performix overlays them so you can review category changes in one view. + +![instruction-mix-diff-results.jpg](instruction-mix-diff-results.jpg) +Compared to the baseline, floating-point operations, branch operations, and some integer operations have been traded for loads, stores, and SIMD operations. +Execution time also improves significantly: the single-threaded run drops from 31.3 s to 8.4 s, making it about 3.7 times faster. + +```bash { command_line="root@localhost | 2-6" } +time builds/mandelbrot-parallel-no-simd 1 +Number of Threads = 1 + +real 0m31.326s +user 0m31.279s +sys 0m0.011s +``` + +```bash { command_line="root@localhost | 2-6" } +time builds/mandelbrot-parallel 1 +Number of Threads = 1 + +real 0m8.362s +user 0m8.331s +sys 0m0.016s +``` + +## CPU Microarchitecture results comparison + +The CPU Microarchitecture recipe also supports a **Compare** view that shows percentage-point changes in each stage and instruction type. 
+![cpu-uarch-simd-results-diff.jpg](cpu-uarch-simd-results-diff.jpg) + +You can now see that Load and Store operations account for about 70% of execution time. **Insights** offers several explanations because multiple issues can contribute to the root cause. +``` +The CPU spends a larger share of cycles stalled in the backend, meaning execution or memory resources cannot complete work fast enough. This is a cycle-based measure (percentage of stalled cycles). + +POSSIBLE CAUSES + +- Slow memory access, for example, L2 cache misses or Dynamic Random-Access Memory (DRAM) misses +- Contention in execution pipelines, for example, the Arithmetic Logic Unit (ALU) or load/store units +- Poor data locality +- Excessive branching +- Instruction dependencies that create pipeline bubbles +``` + +Next, add compiler optimization flags in `build.sh`, including `-funroll-loops` to enable more aggressive loop unrolling. +```bash + # build.sh + CXXFLAGS=( + --std=c++11 + -O3 + -mcpu=neoverse-n1+crc+crypto + -ffast-math + -funroll-loops + -flto + -DNDEBUG + ) +``` + +Runtime improves again, with an additional speedup of about 11x over the SIMD build that uses default compiler flags. + + +```bash { command_line="root@localhost | 2-6" } +time ./builds/mandelbrot-parallel 1 +Number of Threads = 1 + +real 0m0.743s +user 0m0.724s +sys 0m0.014s +``` + +Another CPU Microarchitecture measurement shows that Load and Store bottlenecks are almost eliminated. SIMD floating-point operations now dominate execution, which indicates the application is better tuned to feed floating-point execution units. +![high-simd-utilization.jpg](high-simd-utilization.jpg) + +The program still generates the same output, and runtime drops from 31.3 s to less than 1 s, a speedup of about 42x. 
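+
+As a quick sanity check on the speedup figures, the ratios below are computed from the `real` times reported by the three `time` runs in this section: 31.326 s without SIMD, 8.362 s with Neon, and 0.743 s with Neon plus the extra compiler flags. The symbols `t_scalar`, `t_simd`, and `t_opt` are labels introduced here for illustration only, and the results are rounded.
+
+```latex
+\begin{aligned}
+\frac{t_{\text{scalar}}}{t_{\text{simd}}} &= \frac{31.326}{8.362} \approx 3.75 \\
+\frac{t_{\text{simd}}}{t_{\text{opt}}} &= \frac{8.362}{0.743} \approx 11.25 \\
+\frac{t_{\text{scalar}}}{t_{\text{opt}}} &= \frac{31.326}{0.743} \approx 42.2
+\end{aligned}
+```
+
+The two stage-by-stage ratios multiply to the overall ratio (3.75 × 11.25 ≈ 42.2). These are single measurements on one machine, so expect some run-to-run variation on your own target.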
+ +![performance-improvement.jpg](performance-improvement.jpg) \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512-simd.bmp b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512-simd.bmp new file mode 100644 index 0000000000..68b164c3f4 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512-simd.bmp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512.bmp b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512.bmp new file mode 100644 index 0000000000..d9b41ddb62 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/Green-Parallel-512.bmp differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_index.md b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_index.md new file mode 100644 index 0000000000..9e04ffede9 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_index.md @@ -0,0 +1,57 @@ +--- +title: Tune application performance with Arm Performix CPU Microarchitecture analysis + +minutes_to_complete: 60 + +who_is_this_for: This introductory Learning Path is for software developers who want to learn performance analysis methodologies for Linux applications on Arm Neoverse. + +learning_objectives: + - Understand sampling and counting for performance analysis + - Learn commonly used hardware metrics + - Analyze a sample application by using Arm Performix + - Make an application code change and see improved performance + +prerequisites: + - An Arm Neoverse N1 or higher computer running Linux. A bare-metal or cloud bare-metal instance is best because it exposes more counters. 
+ +author: +- Brendan Long +- Kieran Hejmadi + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse +tools_software_languages: + - Arm Performix + - C++ + - Runbook + +operatingsystems: + - Linux + +further_reading: + - resource: + title: "Find CPU Cycle Hotspots with Arm Performix" + link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/ + type: documentation + - resource: + title: "Port Code to Arm Scalable Vector Extension (SVE)" + link: /learning-paths/servers-and-cloud-computing/sve/ + type: documentation + - resource: + title: "Arm Neoverse N1: Core Performance Analysis Methodology" + link: https://armkeil.blob.core.windows.net/developer/Files/pdf/white-paper/neoverse-n1-core-performance-v2.pdf + type: documentation + - resource: + title: "Arm Neoverse N1 PMU Guide" + link: https://developer.arm.com/documentation/PJDOC-466751330-547673/r4p1/ + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/compare-with-box.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/compare-with-box.jpg new file mode 100644 index 0000000000..a4477e9a36 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/compare-with-box.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-config.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-config.jpg new file mode 100644 index 0000000000..945e5b6af8 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-config.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-insights.png b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-insights.png new file mode 100644 index 0000000000..b7de71ca22 Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-insights.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-results.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-results.jpg new file mode 100644 index 0000000000..c859bf5994 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-results.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-simd-results-diff.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-simd-results-diff.jpg new file mode 100644 index 0000000000..22aea8edb6 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/cpu-uarch-simd-results-diff.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/high-simd-utilization.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/high-simd-utilization.jpg new file mode 100644 index 0000000000..eab1fcc97f Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/high-simd-utilization.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-config.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-config.jpg new file mode 100644 index 0000000000..bb0594e897 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-config.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-diff-results.jpg 
b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-diff-results.jpg new file mode 100644 index 0000000000..b09fff2788 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-diff-results.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-results.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-results.jpg new file mode 100644 index 0000000000..f953182d42 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-results.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-simd-results.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-simd-results.jpg new file mode 100644 index 0000000000..9d07814569 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/instruction-mix-simd-results.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/performance-improvement.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/performance-improvement.jpg new file mode 100644 index 0000000000..8de65361cb Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/performance-improvement.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/rename-run.jpg b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/rename-run.jpg new file mode 100644 index 0000000000..f69cadbfbc Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/performix-microarchitecture/rename-run.jpg differ