From c40c35fddc66001d56b68874dd9d9a4bbda9f8ed Mon Sep 17 00:00:00 2001 From: pareenaverma Date: Wed, 18 Mar 2026 15:46:33 -0400 Subject: [PATCH] First tech review of CPU Cycle Hotspots LP --- .../cpu_hotspot_performix/_index.md | 6 +- .../cpu_hotspot_performix/how-to-1.md | 14 +++-- .../cpu_hotspot_performix/how-to-2.md | 27 +++++---- .../cpu_hotspot_performix/how-to-3.md | 31 ++++++----- .../cpu_hotspot_performix/how-to-4.md | 55 ++++++++++--------- 5 files changed, 70 insertions(+), 63 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/_index.md b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/_index.md index 0f859e9710..b4fc50980a 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/_index.md @@ -7,15 +7,15 @@ cascade: minutes_to_complete: 30 -who_is_this_for: Cloud Engineers looking to optimize their workload running on a Linux-based Arm system. +who_is_this_for: Software developers and performance engineers who want to identify CPU cycle hotspots in applications running on Arm Linux systems. learning_objectives: - Run the CPU Cycle Hotspot recipe in Arm Performix - - Identify which functions in your program use the most CPU cycles, so you can target the best candidates for optimization. 
+ - Identify which functions consume the most CPU cycles and target them for optimization prerequisites: - Access to Arm Performix - - Basic understand on C++ + - Basic understanding of C++ author: Kieran Hejmadi diff --git a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md index 86779b2816..7fb18e685f 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md +++ b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-1.md @@ -12,16 +12,18 @@ A flame graph is a visualization built from many sampled call stacks that shows ## Example Flame Graph -Take a look at the example flame graph below. +The flame graph below shows a typical profiling result. -![example](./flame-graph-example.jpg) +![A flame graph showing function call stacks. The x-axis shows relative sample frequency and the y-axis shows call stack depth, with the widest blocks representing functions that consumed the most CPU time#center](./flame-graph-example.jpg "Example flame graph") -The x axis represents the relative number of samples attributed to code paths ordered alphabetically, **not** a timeline. A wider block means that function appeared in more samples and therefore consumed more of the measured resource, typically CPU time. The y axis represents call stack depth. Frames at the bottom are closer to the root of execution such as a thread entry point, and frames above them are functions called by the frames below. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts. This merging step is what makes flame graphs compact and readable. 
Reliable stack walking matters, and frame pointers are a common mechanism used to reconstruct the function call hierarchy consistently. When frame pointers are present, it is easier to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks. +The x-axis represents the relative number of samples attributed to code paths, ordered alphabetically, not a timeline. A wider block means that function appeared in more samples and therefore consumed more CPU time. The y-axis represents call stack depth. Frames at the bottom are closer to the root of execution, such as a thread entry point, and frames above them are functions called by those below. -This learning path is not meant as a detailed explanation of flame graphs, if you would like to learn more please read [this blog](https://www.brendangregg.com/flamegraphs.html) by the original creator, Brendan Gregg. +Each sample captures a snapshot of the current call stack. Many samples are then aggregated by grouping identical stacks and summing their counts, which is what makes flame graphs compact and readable. A common workflow is to start with the widest blocks, then move upward through the stack to understand which callees dominate that hot path. Reliable stack walking depends on frame pointers being present; they allow the profiler to unwind through nested calls and produce accurate stacks that merge cleanly into stable blocks. + +This Learning Path does not cover flame graphs in depth. To learn more, see [Brendan Gregg's flame graph reference](https://www.brendangregg.com/flamegraphs.html). ## Tooling options -On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. 
If the frequency is too low you may miss short lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`. +On Linux, flame graphs are commonly generated from samples collected with `perf`. perf periodically interrupts the running program and records a stack trace, then the collected stacks are converted into a folded format and rendered as the graph. Sampling frequency is important. If the frequency is too low you may miss short-lived hotspots, and if it is too high you may introduce overhead or skew the results. To make the output informative, compile with debug symbols and preserve frame pointers so stacks resolve to meaningful function names and unwind reliably. A typical build uses `-g` and `-fno-omit-frame-pointer`.

-Arm has also developed a tool that simplifies this workflow through the CPU Cycle hotspot recipe in Arm Performix, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution we will use in this learning path. \ No newline at end of file
+Arm has built a tool, Arm Performix, that simplifies this workflow through the CPU Cycle Hotspot recipe, making it easier to configure collection, run captures, and explore the resulting call hierarchies without manually stitching together the individual steps. This is the tooling solution you will use in this Learning Path. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md index 1e35444afb..61c36ae5da 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-2.md @@ -8,37 +8,40 @@ layout: learningpathall ## Setup -This learning path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime. +This Learning Path uses a hands-on worked example to make sampling-based profiling and flame graphs practical. You’ll build a C++11 program that generates a fractal bitmap by computing the Mandelbrot set, then mapping each pixel’s iteration count to a pixel value. You’ll have the full source code, so you can rebuild the program, profile it, and connect what you see in the flame graph back to the exact functions and loops responsible for the runtime. A fractal is a pattern that shows detail at many scales, often with self-similar structure. Fractals are usually generated by repeatedly applying a simple mathematical rule. In the Mandelbrot set, each pixel corresponds to a complex number, which is iterated through a basic recurrence. How quickly the value “escapes” (or whether it stays bounded) determines the pixel’s color and produces the familiar Mandelbrot image. -You don’t need to understand the Mandelbrot algorithm in detail to follow this learning path—we’ll use it primarily as a convenient, compute-heavy workload for profiling. 
If you'd like to learn more, please refer to the [Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set) page for more information. +You don't need to understand the Mandelbrot algorithm in detail to follow this Learning Path — it's used here as a convenient, compute-heavy workload for profiling. To learn more, see the [Mandelbrot set article on Wikipedia](https://en.wikipedia.org/wiki/Mandelbrot_set). ## Connect to Target -Please refer to the [installation guide](https://learn.arm.com/install-guides/atp) if it is your first time setting up Arm Performix. In this learning path, I will be connecting to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores. From the host machine, test the connection to the remote server by navigating to `'Targets`->`Test Connection`. You should see the successul connection below. +See the [Arm Performix installation guide](https://learn.arm.com/install-guides/atp) if this is your first time setting up Arm Performix. In this Learning Path you will connect to an AWS Graviton3 metal instance (`m7g.metal`) with 64 Neoverse V1 cores, your remote target server. From the host machine, test the connection to the remote server by navigating to **Targets** > **Test Connection**. You should see the successful connection screen below. -![successful-connection](./successful-connection.jpg). +![The Arm Performix Targets panel showing a successful connection test result for a remote Arm server#center](./successful-connection.jpg "Successful connection to remote target") ## Build Application on Remote Server -Next, connect to the remote server, for example using SSH or VisualStudio Code, and clone the Mandelbrot repository. This is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning. Create a new directory where you will store and build this example. Next, run the commands below. 
+Connect to the remote server using SSH or Visual Studio Code. Install git and the C++ compiler. On dnf-based systems such as Amazon Linux 2023 or RHEL, run:

```bash
-git clone https://github.com/arm-university/Mandelbrot-Example.git
-cd Mandelbrot-Example && mkdir images builds
-git checkout single-thread
+sudo dnf update && sudo dnf install git gcc gcc-c++
```

-Install a C++ compiler, for example using your operating system's package manager. +Clone the Mandelbrot repository, check out the single-threaded branch, and create the output directories. The repository is available under the [Arm Education License](https://github.com/arm-university/Mandelbrot-Example?tab=License-1-ov-file) for teaching and learning.

```bash
-sudo dnf update && sudo dnf install g++ gcc
+git clone https://github.com/arm-university/Mandelbrot-Example.git
+cd Mandelbrot-Example
+git checkout single-thread
+mkdir images builds
```

-Build the application.
+Build the application:

```bash
./build.sh
-``` \ No newline at end of file
+```
+
+This creates the binary `./builds/mandelbrot`. \ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md
index 022d6c5c51..25702f8bcd 100644
--- a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md
+++ b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-3.md
@@ -8,9 +8,11 @@ layout: learningpathall

## Run CPU Cycle Hotspot Recipe

-As shown in the `main.cpp` file below, the program generates a 1920×1080 bitmap image of our fractal. To identify performance bottlenecks, we’ll run the CPU Cycle Hotspot recipe in Arm Performix (APX). 
APX uses sampling to estimate where the CPU spends most of its time, allowing it to highlight the hottest functions—especially useful in larger applications where it isn't obvious ahead of time which functions will dominate runtime.

-**Please Note**: You will need to replace the first string argument in the `myplot.draw()` function with the absolute path to the image folder and rebuild the application. If not, the image will be written to the `/tmp/atperf/tools/atperf-agent` directory from where the binary is run. As the name suggests, this folder is periodically deleted.
+{{% notice Note %}}
+The `myplot.draw()` call uses a relative path (`./images/green.bmp`). When APX launches the binary, it runs it from `/tmp/atperf/tools/atperf-agent`, so the image would be written there rather than to your project directory. Replace the first string argument with the absolute path to your `images` folder (for example, `/home/ec2-user/Mandelbrot-Example/images/green.bmp`) and rebuild the application before continuing.
+{{% /notice %}}

```cpp
#include "Mandelbrot.h"
@@ -27,30 +29,29 @@ int main(){
}
```

-Open up APX from the host machine. Click on the `CPU Cycle Hotspot` recipe. If this is the first time running the recipe on this target machine you may need to click the install tools button.
+Open APX from the host machine. Select the **CPU Cycle Hotspot** recipe. If this is the first time running the recipe on this target machine you may need to select the install tools button. 
-![install-tools](./install-tools.jpg) +![The Arm Performix recipe selection screen with the CPU Cycle Hotspot recipe highlighted#center](./install-tools.jpg "Selecting the CPU Cycle Hotspot recipe") -Next we will configure the recipe. We will choose to launch a new process, APX will automatically start collecting metric when the program starts and stop when the program exits. +Configure the recipe to launch a new process. APX will automatically start collecting metrics when the program starts and stop when the program exits. -Provide an absolute path to the recently built binary, `mandelbrot`. +Provide the absolute path to the binary built in the previous step: `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot`. -Finally, we will use the default sampling rate of `Normal`. If your application is a short running program, you may want to consider a higher sample rate, this will be at the tradeoff of more data to store and process. +Use the default sampling rate of **Normal**. If your application is short-running, consider a higher sample rate, at the cost of more data to store and process. -![config](./hotspot-config.jpg) +![The Arm Performix CPU Cycle Hotspot recipe configuration screen showing launch settings, binary path, and sampling rate fields#center](./hotspot-config.jpg "CPU Cycle Hotspot recipe configuration") ## Analyse Results -A flame graph should be generated. The default colour mode is to label the 'hottest function', those which are sampled and utilizing CPU most frequently, in the darkest shade. Here we can see that the `__complex_abs__` function is being called during ~65% of samples. This is then calling the `__hypot` symbol in `libm.so`. +A flame graph is generated once the run completes. The default colour mode labels the hottest functions—those using CPU most frequently—in the darkest shade. In this example, the `__complex_abs__` function is present in approximately 65% of samples, and it calls the `__hypot` symbol in `libm.so`. 
-![single-thread-flameg](./single-thread-flame-graph.jpg) +![A flame graph showing single-threaded Mandelbrot profiling results with __complex_abs__ as the dominant hotspot#center](./single-thread-flame-graph.jpg "Single-threaded flame graph showing __complex_abs__ as the hottest function") +To investigate further, you can map source code lines to the functions in the flame graph. Right-click on a specific function and select **View Source Code**. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine. -To understand deeper, we can map the the lines of source code to the functions. To do this right clight on a specific function and select 'View Source Code'. At the time of writing (ATP Engine 0.44.0), you may need to copy the source code onto your host machine. +![The Arm Performix flame graph view showing source code annotations mapped to the selected hot function#center](./view-with-source-code.jpg "Flame graph with source code view") -![view-src-code](./view-with-source-code.jpg) +Finally, check your `images` directory for the generated bitmap fractal. -Finally, looking in our images directory we can see the bitmap fractal. 
- -![mandelbrot](./plot-1-thread-MAX_ITERATIONS.jpg) +![A rendered Mandelbrot set fractal in green, generated from the single-threaded build at maximum iterations#center](./plot-1-thread-max-iterations.jpg "Mandelbrot fractal output from single-threaded build") diff --git a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-4.md index 3ea0dcaf94..92edd8cd96 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-4.md +++ b/content/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/how-to-4.md @@ -6,9 +6,11 @@ weight: 5 layout: learningpathall --- -Now we can leverage the insights surfaced by Arm Performix to focus the optimizations around the hottest functions. Looking at the source code, we understand the the hypotenuse function, `__hypot`, is being invoked by the `Mandelbrot::getIterations` function to calculate the absolute value of a complex number. You may consider trying to used an optimized version of `libm` +## Optimize hot functions -Looking at the `Mandelbrot::getIterations` function, there are some obvious ways to optimize. +Use the insights from Arm Performix to focus optimizations on the hottest functions. The flame graph from the previous step shows that `__hypot` is invoked by `Mandelbrot::getIterations` to calculate the absolute value of a complex number. One option worth exploring is replacing the `libm` implementation with an optimized alternative, but first consider the more targeted changes below. + +Looking at the `Mandelbrot::getIterations` function, there are two clear optimization opportunities. 
```cpp while (iterations < MAX_ITERATIONS){ @@ -20,69 +22,68 @@ Looking at the `Mandelbrot::getIterations` function, there are some obvious way } ``` -### Optimization 1 - Limiting Loop Boundary - +### Optimization 1 - Limiting loop boundary -We can see that the number of iterations is of the absolute value is limited by the loop boundary, MAX_ITERATIONS. Our first optimization could be to reduce MAX_ITERATIONS. This is defined as 1024 a static const integer in the `Mandelbrot.h` header. We could half this to 512 and assess the perceived image quality on our fractal. +The iteration count is bounded by `MAX_ITERATIONS`, defined as 1024, a `static const` integer in the `Mandelbrot.h` header. Halving this to 512 reduces the maximum work per pixel but you will need to verify that the change in image quality is acceptable. ```cpp public: -... + ... static const int MAX_ITERATIONS = (1<<10); -... + ... ``` -On the remote server, reduce `MAX_ITERATIONS` in `Mandelbrot.h` and rename the output file string in `main.cpp` to something else and rebuild the binary with the following command. +On the remote server, reduce `MAX_ITERATIONS` in `Mandelbrot.h` to `(1<<9)` (512), update the output filename in `main.cpp` to a different path so you can compare the output with the baseline image, then rebuild: ```bash ./build.sh ``` -Next, click on the refresh icon in the top right to rerun the recipe. Next we select the comparison mode to view differences in the run. Navigating to the 'Run Details' tab, we observe a reducion in run duration from 1m 0s to 0m 32s, almost proportional to the reduction in `MAX_ITERATIONS`. However, we need to see if the tradeoff between image quality and runtime was worth it. +Select the refresh icon in the top right to rerun the recipe, then switch to comparison mode to view differences between runs. Under the **Run Details** tab, the run duration drops from 1m 0s to 0m 32s — almost proportional to the reduction in `MAX_ITERATIONS`. 
The trade-off to verify is whether the image quality is still acceptable. -Looking at the change in image quality, there is neglible difference in perceived image quality when halfing MAX_ITERATIONS. +There is negligible difference in perceived image quality when halving `MAX_ITERATIONS`. -![comparison](./comparison.jpg) +![Side-by-side comparison of Mandelbrot fractal output at MAX_ITERATIONS 1024 and 512, showing no visible quality difference#center](./comparison.jpg "Image quality comparison: 1024 vs 512 iterations") -### Optimization 2 - Parallelising Hot Function +### Optimization 2 - Parallelising the hot function -Fortunately, our loop does not contain any loop-carried dependencies, where the result of an iterations depends on a future or previous iteration. As such we can parallelize our hot function to fun on multiple threads if our CPU has multiple cores. +The loop in `Mandelbrot::getIterations` has no loop-carried dependencies — each iteration's result is independent of any other. This means you can parallelize the hot function across multiple threads if your CPU has multiple cores. -The repository contains a parallel version in the main branch. +The repository contains a parallel version in the `main` branch: ```bash git checkout main ``` -This branch parallelized the `Mandelbrot::draw` function, which is earlier function in the stack that eventually calls the `__hypot` function. - -Build the example, this creates a binary `./builds/mandelbrot-parallel` which takes in a numerical command line arguments to set the number of threads. +This branch parallelizes the `Mandelbrot::draw` function, which is an earlier function in the call stack that eventually calls `__hypot`. Build the example. This creates a binary `./builds/mandelbrot-parallel` that takes a single numeric command-line argument to set the number of threads. ```bash ./build.sh ``` -Rerun the recipe with the new binary from Arm Performix running on the host. 
+Update the binary path in APX to `/home/ec2-user/Mandelbrot-Example/builds/mandelbrot-parallel` and pass the desired thread count as an argument, then rerun the recipe from the host.

-To assess the change, we can compare with a previous run. Looking under the `Run Details` tab, we can see the execution time has reduced further from 0m 32s to 7s with 32 threads. +To compare with a previous run, switch to comparison mode. Under the **Run Details** tab, execution time drops further from 0m 32s to 7s with 32 threads.

-![exec-change](./comparison-time.jpg) +![The Arm Performix Run Details tab comparing execution time between the baseline and parallelized builds, showing a reduction from 32s to 7s#center](./comparison-time.jpg "Execution time comparison: single-threaded vs parallelized build")

-The percentage point of samples has not changed significantly, but we see with 64 threads the % of sampling landing on the `Mandelbrot::draw` function has reduced by 7%. This suggests that if we want to further improve the execution time, further optimizations on the `Mandelbrot::draw` function will yield the greatest benefit. +The proportion of samples has not changed significantly overall, but with 64 threads the percentage of samples landing on `Mandelbrot::draw` has reduced by 7%. To further improve execution time, you can continue optimizing `Mandelbrot::draw`.

-![flame-graph-comparison](./flame-graph-comparison.jpg). +![Side-by-side flame graph comparison between single-threaded and parallel Mandelbrot builds, showing reduced dominance of the Mandelbrot::draw function#center](./flame-graph-comparison.jpg "Flame graph comparison: single-threaded vs parallelized build")

-**Please Note:** The total run duration is the runtime for both the tooling setup and data analysis, not the runtime of the application. Using a command line tool such as `time` we observe the application duration is now ~ 1s. Resulting in almost a 100x improvement in runtime! 
+{{% notice Note %}} +The total run duration shown in APX includes tooling setup and data analysis time, not just application execution time. To measure only the application, use the `time` command: the application now runs in approximately 1 second — close to a 100x improvement over the original single-threaded baseline. +{{% /notice %}} ### (Optional Challenge) Additional optimizations -You may have noticed our build script uses the `-O0` flag, which ensures the compiler does not add any additional optimizations. You can experiment with additional optimization levels, loop boundary sizes and threads. Please see our learning path introducing [basic compiler flags](https://learn.arm.com/learning-paths/servers-and-cloud-computing/cplusplus_compilers_flags/) for more information. Additionally, you may wish to look at vectorized libraries that could replace the hypotenuse function in `libm`, such as the [Arm Performance Libraries](https://developer.arm.com/documentation/101004/2601/Arm-Performance-Libraries-Math-Functions/Arm-Performance-Libraries-Vector-Math-Functions--Accuracy-Table). +The build script uses the `-O0` flag, which disables all compiler optimizations. Try experimenting with higher optimization levels, different loop boundary sizes, and thread counts. See the Learning Path [Get started with compiler optimization flags](/learning-paths/servers-and-cloud-computing/cplusplus_compilers_flags/) for guidance. You may also want to explore vectorized math libraries that could replace the `libm` hypotenuse function, such as the [Arm Performance Libraries vector math functions](https://developer.arm.com/documentation/101004/2601/Arm-Performance-Libraries-Math-Functions/Arm-Performance-Libraries-Vector-Math-Functions--Accuracy-Table). ## Summary -In this learning path, we reduced the runtime of the Mandelbrot example by focusing on the hottest code paths—cutting execution time from around 1 minute to ~1 second through targeted optimization and parallelization. 
While this example is relatively simple and the optimizations are more obvious, the same principle applies to real-world workloads: optimize what matters most first, based on measurement. +In this Learning Path, you reduced the runtime of the Mandelbrot example by focusing on the hottest code paths—cutting execution time from around 1 minute to ~1 second through targeted optimization and parallelization. While this example is relatively simple and the optimizations are more obvious, the same principle applies to real-world workloads: optimize what matters most first, based on measurement. -The cpu_hotspot recipe is designed to quickly identify an application’s hottest (most CPU-time-dominant) functions, giving you a clear, evidence-based starting point for performance work. By surfacing where execution time is actually being spent, it helps ensure any optimizations are targeted at the parts of the code most likely to deliver the largest performance gains, rather than relying on guesswork. +The CPU Cycle Hotspot recipe is designed to quickly identify an application's most CPU-time-dominant functions, giving you a clear, evidence-based starting point for performance work. By surfacing where execution time is actually spent, it ensures your optimizations target the parts of the code most likely to deliver the largest gains. -This is often one of the first profiling steps you’ll run when assessing an application’s performance characteristics—especially to determine which functions dominate runtime and should be prioritized. Once hotspots are identified, you can follow up with deeper, function-specific analysis, such as memory investigations or top-down studies, and even build microbenchmarks around hot functions to explore lower-level bottlenecks and uncover additional optimization opportunities. 
\ No newline at end of file +This is often one of the first profiling steps to run when assessing an application's performance — especially to determine which functions dominate runtime and should be prioritized. Once hotspots are identified, you can follow up with deeper function-specific analysis, such as memory investigations or top-down studies, and build microbenchmarks around hot functions to explore lower-level bottlenecks and uncover additional optimization opportunities. \ No newline at end of file