GPU Performance Analyzer

Analyze GPU profiling data and identify optimizations

Since R2023a

Description

The GPU Performance Analyzer displays information about GPU and CPU activities, events, and performance metrics from the generated CUDA® code. Use the GPU Performance Analyzer to find performance bottlenecks in a design.

The GPU Performance Analyzer displays the profiling data in chronological order in the Profiling Timeline pane. The timeline contains four rows:

The Functions row, which shows function calls in the generated code
The Loops row, which shows loops that execute on the CPU
The CPU Overhead row, which shows memory transfers and GPU kernel launches
The GPU Activities row, which shows memory transfers and GPU kernel execution

The Diagnostics pane reports potential performance bottlenecks, such as long CPU loops, repetitive memory transfers, or GPU kernels with few threads, and it suggests ways to address them. The analyzer also contains:

A Call Tree pane that shows the function call hierarchy
A Profiling Summary pane that contains overview statistics for the generated code
An Event Statistics pane that displays detailed statistics for the selected event
A Code pane that you can use to trace events and generated code to the source MATLAB® code

If the entry-point function that you generate code from contains a deep learning network, you can use the Open Deep Learning Dashboard button to open a dashboard that contains the statistics for the network. For more information, see Analyzing Network Performance Using the Deep Learning Dashboard.

The GPU Performance Analyzer saves the profiling data in a gpuProfiler.mldatx file, which it creates in the html subfolder of the code generation folder. You can reopen the profiling data from a profiling session by opening the MLDATX file.
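
As a rough sketch, the following script looks for saved profiling data from earlier sessions. It assumes that the code generation folder is under a codegen folder in the current folder. Whether the open function dispatches an MLDATX file to the analyzer is an assumption, so that call is left commented out.

```matlab
% Search the code generation output (assumed to be under ./codegen) for
% profiling data saved by earlier sessions.
hits = dir(fullfile("codegen","**","html","gpuProfiler.mldatx"));

if isempty(hits)
    disp("No gpuProfiler.mldatx file found under the codegen folder.");
else
    % Use the most recently modified file if there are several.
    [~,newest] = max([hits.datenum]);
    profilingFile = fullfile(hits(newest).folder,hits(newest).name);
    disp("Profiling data: " + profilingFile);
    % Assumption: opening the file from MATLAB reopens the analyzer.
    % open(profilingFile);
end
```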

Open the GPU Performance Analyzer

MATLAB command prompt:
- Use the gpuPerformanceAnalyzer function.
- Use the codegen function with both the -gpuprofile and -test options.
- Use the gpuprofile function with the viewer option.
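
The second and third options might look like the following sketch. The test script name fog_rectification_test is hypothetical, the MEX function name assumes the default _mex suffix, and the on keyword for gpuprofile is an assumption modeled on the MATLAB profile function. Check the reference pages for codegen and gpuprofile before relying on the exact syntax.

```matlab
% Input and configuration shared by both options.
inputImage = imread("foggyInput.png");
cfg = coder.gpuConfig("mex");

% Option: generate code, run a test script against the MEX function, and
% open the analyzer in one step. fog_rectification_test is a hypothetical
% script that calls fog_rectification_init.
codegen -config cfg -gpuprofile -test fog_rectification_test fog_rectification_init -args {inputImage}

% Option: profile an existing GPU MEX function, then open the viewer.
% The "on" keyword is an assumption; "viewer" is the documented option.
gpuprofile on
out = fog_rectification_init_mex(inputImage);
gpuprofile viewer
```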

Examples

Profile Generated Code and View Data

Generate GPU code, profile it, and view the profiling data in the GPU Performance Analyzer.

Use the openExample function to obtain the example files.

openExample("gpucoder/GPUExecutionProfilingOfTheGeneratedCodeExample");

The fog_rectification_init function takes a foggy input image and returns a defogged image. Read the image data from the foggyInput.png file and assign it to a variable, inputImage.

inputImage = imread("foggyInput.png");
inputs = {inputImage};

Create a GPU code configuration object by using the coder.gpuConfig function. Use the mex input argument to generate a MEX function.

cfg = coder.gpuConfig("mex");

Use the gpuPerformanceAnalyzer function to generate code for the function, profile the code, and view the data in the GPU Performance Analyzer.

designFileName = "fog_rectification_init";
gpuPerformanceAnalyzer(designFileName,inputs,Config=cfg);

### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

GPU Performance Analyzer showing the profiling data for fog_rectification_init

Change Zoom Level of Timeline

Zoom in to a section of the profiling timeline and examine an event.

To zoom in on a section of the timeline, use one of these methods:

Hold Shift and click and drag over the section of the timeline you want to display.
Use the mouse wheel or an alternative touchpad option.
In the bar at the top of the Profiling Timeline pane, click and drag the sliders to change the region shown in the timeline.

After you zoom in, the Profiling Summary pane shows the summary for the region that is visible in the timeline. This image shows the GPU Performance Analyzer when zoomed in on a loop, fog_rectification_init_loop_2.

GPU Performance Analyzer zoomed in to fog_rectification_init_loop_2. The Profiling Summary pane shows the GPU utilization is 66% and the CPU overhead is 7%

To zoom out, use the mouse wheel or the sliders at the top of the timeline.

Examine Call Hierarchy

Examine the call hierarchy of the generated code using the Call Tree pane.

In this example, the GPU Performance Analyzer shows the profiling data from the fog_rectification_init entry-point function. In the Call Tree pane, select the fog_rectification_init function. To display the child events, expand the fog_rectification_init node.

Call Tree pane showing the events called by fog_rectification_init.

The call tree lists the execution time of each event as a percentage of the execution time of its parent event. The loop fog_rectification_init_loop_2 runs for the longest time compared to the other child events.

To examine the events called by the loop, expand the fog_rectification_init_loop_2 node. The tree shows that most of the loop execution is waiting for the GPU.

Call Tree pane showing the events under fog_rectification_init_loop_2. The tree shows WaitForGPU takes 84.39% of the execution time.
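
Because each percentage is relative to the direct parent, an event's share of the whole entry-point function is the product of the fractions along its path. The following sketch works through that arithmetic with hypothetical durations, which are not taken from the profile shown above.

```matlab
% Hypothetical durations in milliseconds; not taken from a real profile.
entryPointMs = 40.0;   % entry-point function
loopMs       = 25.0;   % loop called by the entry-point function
waitMs       = 21.0;   % WaitForGPU event inside the loop

% Each call tree entry is a percentage of its direct parent.
loopPctOfParent = 100 * loopMs / entryPointMs;
waitPctOfParent = 100 * waitMs / loopMs;

% Share of the whole entry point: multiply the fractions along the path.
waitPctOfEntry = (loopPctOfParent / 100) * waitPctOfParent;

fprintf("loop: %.2f%% of parent\nwait: %.2f%% of parent, %.2f%% of entry point\n", ...
    loopPctOfParent, waitPctOfParent, waitPctOfEntry);
```

With these numbers, the wait event is 84% of the loop but 52.5% of the entry-point function, so a large percentage deep in the tree does not always mean a large share of the total run time.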

Trace Event to Code

Trace a GPU kernel event to the section of the generated code that calls the kernel, and trace the generated code to the source MATLAB code.

Open the profiling data for fog_rectification_init in the GPU Performance Analyzer. In the Profiling Timeline pane, in the GPU Activities row, select the leftmost kernel event, fog_rectification_init_kernel01.

Profiling Timeline with fog_rectification_init_kernel01 selected

To view the call to fog_rectification_init_kernel01 in the generated code, in the timeline, click the code button. Alternatively, in the toolstrip, in the Event Actions section, select Go To Code. The Code pane shows the trace in the generated code.

Code pane showing the call to the kernel in fog_rectification_init.cu

To view the definition of the kernel, hold Ctrl and click the kernel name.

Code pane showing the definition for fog_rectification_init_kernel01

Point to line 307, which is part of the kernel. Select the code on line 307 to view traced MATLAB code.

Code pane highlighting the traced MATLAB code. The traced MATLAB code calls the zeros function.

Line 307 of the generated code traces to line 9 of fog_rectification_init.m, which creates a matrix of zeros. To view the other code traces, select the list at the top of the Code pane.

Code pane listing the lines of MATLAB code that trace to line 307 of fog_rectification_init.cu

From the list, select line 43 of fog_rectification_init.m. The kernel also traces to MATLAB code that multiplies each element of the array restoreOut by 255.

Code pane highlighting the trace from line 307 of the CUDA code to line 43 of the MATLAB code

To open the traced code in the MATLAB editor, click the line number 43 in the fog_rectification_init.m file.

Filter Profiling Data

Use the Filters section of the toolstrip to filter the events in the analyzer.

Show entire profiling session — Use this option to view the profiling results for the entire application, including initialization and termination.
Show single run — Use this option to view the profiling results for a single iteration of the generated code. By default, the GPU Performance Analyzer shows the results from the last run of the generated code.
Use the options under Filter Events to filter individual events:
- Threshold (ms) — Skip events shorter than the given threshold.
- Memory Allocation/Free — Show GPU device memory allocation and deallocation related events on the CPU activities bar.
- Memory Transfer — Show memory transfers to the host or device.
- Kernel — Show CPU kernel launches and GPU kernel activities.
- Other Event — Show other GPU related events such as synchronization and waiting for GPU.
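
To illustrate what the threshold filter does, the following sketch applies the same rule to a small table of events. The event names and durations are hypothetical, and the table is not a format that the analyzer exports.

```matlab
% Hypothetical event list; names and durations are illustrative only.
events = table( ...
    ["kernel_a"; "memcpy_HtoD"; "kernel_b"; "cudaMalloc"], ...
    [0.850; 0.020; 0.004; 0.310], ...
    VariableNames=["Name","DurationMs"]);

thresholdMs = 0.010;   % skip events shorter than this threshold

% Keep only the events that last at least as long as the threshold.
shown = events(events.DurationMs >= thresholdMs,:);
disp(shown);
```

Here only kernel_b, at 0.004 ms, falls below the threshold and is hidden.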

Limitations

On the Functions and Loops rows, you can navigate between caller and callee functions and loops using the up and down arrows on the right side of the event bar. For short events, it might not be possible to navigate back to the calling function or loop by using the up and down arrows. In such cases, use the call tree to navigate back to the caller function or loop.
At low zoom levels, GPU Performance Analyzer represents a densely populated area of short events separated by short distances as a single event. At higher levels of zoom, GPU Performance Analyzer displays the individual events. However, if the event duration is extremely short, it may not be possible to render this event on the timeline plot, even at high zoom levels.
GPU Performance Analyzer displays all the GPU events in a single row. If the generated code uses multiple CUDA streams, the GPU Activities row may contain overlapping events and the calculations in the Profiling Summary pane may be inaccurate. For example, deep learning libraries such as cuDNN may use multiple CUDA streams.
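
The following sketch shows one way that overlapping events can distort a busy-time figure. It is not the analyzer's actual calculation, which is not described here, and the intervals are hypothetical. Adding durations counts the overlap twice, while taking the union of the intervals does not.

```matlab
% Hypothetical kernel intervals [start, end] in ms from two CUDA streams.
intervals = [0 4;    % stream 1
             2 6;    % stream 2, overlaps the first kernel
             8 9];   % stream 1
windowMs = 10;       % length of the profiled region

% Naive busy time: add the durations, which counts the overlap twice.
naiveBusyMs = sum(intervals(:,2) - intervals(:,1));

% Busy time from the union of the intervals.
sorted = sortrows(intervals);
unionBusyMs = 0;
curStart = sorted(1,1);
curEnd = sorted(1,2);
for k = 2:size(sorted,1)
    if sorted(k,1) <= curEnd
        curEnd = max(curEnd,sorted(k,2));   % extend the merged interval
    else
        unionBusyMs = unionBusyMs + (curEnd - curStart);
        curStart = sorted(k,1);
        curEnd = sorted(k,2);
    end
end
unionBusyMs = unionBusyMs + (curEnd - curStart);

fprintf("naive: %.0f%% busy, union: %.0f%% busy\n", ...
    100*naiveBusyMs/windowMs, 100*unionBusyMs/windowMs);
```

For these intervals the naive sum reports 90% busy while the union reports 70%, which is the kind of discrepancy to keep in mind when reading the summary for multistream code.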

Version History

Introduced in R2023a

R2026a: Access additional information about CPU loops

In the Diagnostics pane, diagnostics for CPU loops contain an Additional info section that explains why GPU Coder™ did not generate the loop as a kernel. In the Additional info section, click Expand info to learn more about the loop.

R2025a: Visualize performance of deep neural networks

The GPU Performance Analyzer collects performance data for inferences made by the generated code in a deep learning dashboard. Reveal the deep learning inference functions using the Show Predict Functions button. To open the dashboard, select an inference function, and, in the toolstrip, click Open Deep Learning Dashboard.

R2025a: Diagnose performance bottlenecks

The Diagnostics pane lists performance bottlenecks that the GPU Performance Analyzer detects. The pane displays their severity, cause, and a link to the relevant section of the generated code.

R2025a: Updated timeline, summary, and event statistics

The GPU Performance Analyzer has an updated interface for the timeline, summary statistics, and event statistics.

The Profiling Timeline tab now contains:
- A Key Bindings button that displays the keyboard shortcuts for the GPU Performance Analyzer
- A Legend button that displays the meanings of the colors in the profiling timeline
The Profiling Timeline tab displays the run number and the name of the entry-point function for each MEX function event.
The Profiling Summary pane contains an overview of the CPU and GPU activities in the generated code.
The Event Statistics pane contains performance data for the selected event.

The Profiling Summary and Event Statistics panes replace the Insights pane. In previous releases, the Insights pane contained an overview of the CPU and GPU activities in the generated code and performance data for the selected event.

R2024a: Profile CUDA MEX functions

You can profile CUDA MEX functions using the gpuPerformanceAnalyzer or gpuprofile functions. In previous releases, you profiled static libraries and dynamic libraries by passing a coder.EmbeddedCodeConfig object to the gpuPerformanceAnalyzer function or the codegen command. To profile GPU MEX functions, pass a coder.MexCodeConfig object to the gpuPerformanceAnalyzer function. Alternatively, start a profiling session using the gpuprofile function and call a GPU MEX function.

R2024a: Profile GPU code using the `codegen` command

If you have a test file that calls a MATLAB function, you can generate GPU code, profile it, and open the GPU Performance Analyzer in one step by running the codegen command with both the -gpuprofile and -test options.

R2023b: Enable GPU profiling using the `codegen` command

Generate code using the codegen command with the -gpuprofile option to create a software-in-the-loop (SIL) executable with GPU profiling instrumentation. The GPU Performance Analyzer collects profiling metrics when you run the executable. After terminating SIL execution, MATLAB provides a link to open the GPU Performance Analyzer and generates a gpuProfiler.mldatx file with the profiling data for the executable.
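
A minimal sketch of this workflow follows. It assumes that Embedded Coder® is installed, so that coder.gpuConfig returns a configuration object with a VerificationMode property, and that the SIL interface follows the usual name pattern of the function name with a _sil suffix. Both are assumptions, not details stated in this release note.

```matlab
% Assumes Embedded Coder is available so that the configuration object
% has a VerificationMode property.
cfg = coder.gpuConfig("lib");
cfg.VerificationMode = 'SIL';

inputImage = imread("foggyInput.png");

% Build the SIL executable with GPU profiling instrumentation.
codegen -config cfg -gpuprofile fog_rectification_init -args {inputImage}

% Run the SIL interface (name assumed to follow the <function>_sil pattern).
out = fog_rectification_init_sil(inputImage);

% Terminate SIL execution. MATLAB then provides the link to the analyzer.
clear fog_rectification_init_sil
```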

R2023b: Use the NVIDIA CUPTI library

To collect profiling metrics from the generated code, the GPU Performance Analyzer now uses an installation of the NVIDIA® CUPTI library. In R2023a, you installed the NVIDIA Nsight Systems tools and libraries to collect profiling metrics using the analyzer. Installing the NVIDIA Nsight Systems libraries is no longer required.

R2023b: Profile NVIDIA Jetson input/output functions

You can profile generated code from input/output functions from the MATLAB Coder™ Support Package for NVIDIA Jetson™ and NVIDIA DRIVE® Platforms using the GPU Performance Analyzer. For example, the GPU Performance Analyzer now collects profiling metrics from the camera object functions.

R2023b: Trace events to code

View generated code and identify bottlenecks in the MATLAB source code by using the Code pane. You can view the generated code for events and use bidirectional tracing to trace between MATLAB code and generated GPU code.