GPU Performance Analyzer
The GPU Performance Analyzer exposes GPU and CPU activities, events, and performance metrics in a chronological timeline plot to accurately visualize, identify and address performance bottlenecks in the generated CUDA® code.
These numbers are representative. The actual values depend on your hardware setup. This profiling was done in MATLAB® R2023a using a host machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA® TITAN XP GPU, and a Jetson™ AGX Xavier development kit.
The profiling timeline shows the complete trace of all events that have a runtime higher than the threshold value. A snippet of the profiling trace is shown.
You can use the mouse wheel (or an equivalent touch pad option) to zoom into and out of the timeline. Alternatively, you can use the timeline summary at the top of the panel to zoom and navigate the timeline plot.
The tooltip on each event indicates the start time, end time, and duration of the selected event on the CPU and the GPU. It also indicates the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.
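As a rough illustration of the tooltip quantities, the following Python sketch derives a duration and a launch-to-execution delay from three timestamps. The class, its field names, and the sample values are hypothetical. They are not the analyzer's data format.

```python
from dataclasses import dataclass


@dataclass
class KernelEvent:
    # All times are in milliseconds. Names and values are illustrative only.
    name: str
    cpu_launch_start: float  # when the CPU issued the kernel launch
    gpu_start: float         # when the kernel began executing on the GPU
    gpu_end: float           # when the kernel finished executing on the GPU

    @property
    def gpu_duration(self) -> float:
        # Duration of the kernel on the GPU.
        return self.gpu_end - self.gpu_start

    @property
    def launch_delay(self) -> float:
        # Time elapsed between the CPU launch and the start of GPU execution.
        return self.gpu_start - self.cpu_launch_start


event = KernelEvent("example_kernel", cpu_launch_start=10.0, gpu_start=10.5, gpu_end=12.5)
print(f"{event.name}: duration={event.gpu_duration} ms, launch delay={event.launch_delay} ms")
```

A large launch delay relative to the kernel duration can suggest that the GPU was still busy with earlier work, or that launch overhead dominates for very short kernels.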
By default, the View Mode of the GPU Performance Analyzer window is set to Entry-Point Function and the Profiling Timeline shows only the last execution of the generated code. To view all the iterations, set the View Mode to Full Application.
On the Functions and Loops rows, you can navigate between caller and callee functions/loops using the up/down arrows that appear on the right side of the event bar.
The event statistics panel shows additional information for the selected event. For example, selecting feature_matching_kernel2 shows the statistics for that kernel.
The insights panel gives a pie chart overview of the GPU and CPU activities. The pie chart changes according to the zoom level of the profiling timeline. A snippet of the insights panel is shown. Within the region selected on the timeline, it shows that the GPU utilization is 76%.
This section lists the GPU events called from the CPU. Each event in the call tree lists its execution time as a percentage of the execution time of its caller function. This metric can help you identify performance bottlenecks in the generated code. You can also navigate to specific events on the profiling timeline by clicking the corresponding events in the call tree.
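To make the "percentage of caller" metric concrete, here is a minimal Python sketch that walks a nested call tree and reports each event's time relative to its direct caller. The tree layout, the event names, and the timings are assumptions made for illustration. The sketch does not describe how the analyzer stores or computes its call tree.

```python
def percent_of_caller(node, caller_time=None, depth=0):
    """Return (depth, name, percent) rows for a nested call tree.

    Each node is a dict with 'name', 'time_ms' and optional 'children'.
    The root has no caller, so it is reported as 100%.
    """
    if caller_time is None:
        pct = 100.0
    elif caller_time == 0:
        pct = 0.0  # avoid dividing by zero for a zero-length caller
    else:
        pct = 100.0 * node["time_ms"] / caller_time
    rows = [(depth, node["name"], pct)]
    for child in node.get("children", []):
        rows.extend(percent_of_caller(child, node["time_ms"], depth + 1))
    return rows


# Hypothetical call tree with made-up timings.
tree = {
    "name": "entry_point",
    "time_ms": 20.0,
    "children": [
        {
            "name": "helper_loop",
            "time_ms": 10.0,
            "children": [{"name": "example_kernel", "time_ms": 4.0}],
        },
        {"name": "memcpy_h2d", "time_ms": 5.0},
    ],
}

rows = percent_of_caller(tree)
for depth, name, pct in rows:
    print(f"{'  ' * depth}{name}: {pct:.1f}% of caller")
```

Because each percentage is relative to the direct caller, a nested event with a high percentage is only a bottleneck if its callers also account for a large share of the run.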
Use Open Report to open a GPU profiling report (gpuProfiler.mldatx). By default, the gpuPerformanceAnalyzer function generates files in a folder whose path contains target and fcn_name.
target can be:
- exe for CUDA executables
- lib for CUDA libraries
- dll for CUDA dynamic libraries
fcn_name is the name of the MATLAB entry-point function.
If gpuPerformanceAnalyzer generates the same type of output for the same code, it removes the files from the previous build. To preserve files from a previous build, copy them to a different location before starting another build.
This section provides filtering options for the report.
View Mode - Use this option to view profiling results for the entire application (including initialization and termination) or for the entry-point (design) function only (without initialization and termination).
Event Threshold - Skip events shorter than the given threshold.
Memory Allocation/Free - Show events related to GPU device memory allocation and deallocation on the CPU activities bar.
Memory Transfers - Show host-to-device and device-to-host memory transfers.
Kernels - Show CPU kernel launches and GPU kernel activities.
Others - Show other GPU-related events, such as synchronization and waiting for the GPU.
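The combined effect of the threshold and the category filters can be sketched in a few lines of Python. The event records, category labels, and the function name `filter_events` are hypothetical. The sketch also assumes that an event exactly at the threshold is kept, which the text above does not specify.

```python
# Hypothetical event records. The categories mirror the filter names above.
EVENTS = [
    {"name": "cudaMalloc", "category": "memory_alloc", "duration_ms": 0.30},
    {"name": "memcpy_h2d", "category": "memory_transfer", "duration_ms": 1.20},
    {"name": "example_kernel", "category": "kernel", "duration_ms": 4.00},
    {"name": "tiny_kernel", "category": "kernel", "duration_ms": 0.002},
    {"name": "cudaDeviceSynchronize", "category": "other", "duration_ms": 0.80},
]

ALL_CATEGORIES = frozenset({"memory_alloc", "memory_transfer", "kernel", "other"})


def filter_events(events, threshold_ms=0.0, categories=ALL_CATEGORIES):
    """Keep events in the selected categories that are not shorter than the threshold."""
    return [
        e
        for e in events
        if e["category"] in categories and e["duration_ms"] >= threshold_ms
    ]


# Show only kernels and memory transfers, skipping events shorter than 0.005 ms.
visible = filter_events(EVENTS, threshold_ms=0.005, categories={"kernel", "memory_transfer"})
print([e["name"] for e in visible])
```

Raising the threshold hides very short events such as `tiny_kernel` here, which keeps a dense timeline readable at the cost of omitting them from view.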
GPU Performance Analyzer has been tested with Nsight Systems 2022.5.1. Other versions of Nsight Systems may require administrator privileges on Windows® platforms.
On the Functions and Loops rows, you can navigate between caller and callee functions/loops using the up/down arrows that appear on the right side of the event bar. For short events, it may not be possible to navigate back to the calling function/loop by using the up/down arrows. In such cases, use the call tree to navigate to the caller or callee functions/loops.
GPU Performance Analyzer displays the row header even if the row does not contain any events.
At low zoom levels, GPU Performance Analyzer represents a densely populated area of short events separated by short distances as a single event. At higher levels of zoom, GPU Performance Analyzer displays the actual events. However, if the event duration is extremely short, it may not be possible to render this event on the timeline plot, even at high zoom levels.
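One plausible way to picture this behavior is interval coalescing: events whose gaps are smaller than what the current zoom level can resolve are drawn as one bar. The Python sketch below is an assumption made for illustration only. The analyzer's actual rendering algorithm is not described here, and `coalesce` and `min_gap` are hypothetical names.

```python
def coalesce(events, min_gap):
    """Merge (start, end) intervals that are separated by less than min_gap.

    min_gap stands in for the smallest gap the timeline can resolve at the
    current zoom level, in the same time units as the events.
    """
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] < min_gap:
            # Gap too small to draw: extend the previous bar instead.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical events in microseconds: three short, closely spaced events and one distant event.
events = [(0, 10), (15, 20), (25, 30), (500, 550)]

low_zoom = coalesce(events, min_gap=100)   # coarse view
high_zoom = coalesce(events, min_gap=1)    # fine view
print(low_zoom)
print(high_zoom)
```

In this sketch the coarse view shows two bars while the fine view shows all four events, which mirrors the difference between low and high zoom levels described above.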
GPU Performance Analyzer uses a single row to represent all the GPU events. If there are multiple CUDA streams, the GPU Activities row may contain overlapping events and the occupancy calculation in the Insights panel may be inaccurate. For example, deep learning libraries such as cuDNN may use multiple CUDA streams.
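To see why overlapping events can distort a single-row calculation, the following Python sketch compares two ways of measuring GPU busy time: summing event durations, and taking the union of the intervals. This is a sketch under stated assumptions. The intervals are made up, and neither function is claimed to be the analyzer's occupancy computation.

```python
def busy_time_sum(intervals):
    """Add up the durations, counting overlapping time once per event."""
    return sum(end - start for start, end in intervals)


def busy_time_union(intervals):
    """Measure the time during which at least one event is running."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current run: extend it instead of counting twice.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical kernels on two CUDA streams, in microseconds, within a 100 us window.
window = 100
gpu_events = [(0, 50), (30, 80)]  # the second kernel overlaps the first by 20 us

print(f"sum of durations: {100 * busy_time_sum(gpu_events) / window:.0f}% busy")
print(f"union of intervals: {100 * busy_time_union(gpu_events) / window:.0f}% busy")
```

With these sample values, summing durations reports the GPU as fully busy, while the union shows that it was idle for part of the window. When multiple streams are active, treat utilization figures with corresponding caution.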
In the Entry-Point Function mode, a GPU event triggered after the end of the entry-point function may not be displayed properly.
- Analyze Performance of the Generated CUDA Code
- GPU Profiling on NVIDIA Jetson Platforms
- Analyze Performance of Code Generated for Deep Learning Networks