GPU Performance Analyzer

Analyze GPU profiling data and identify optimizations

Since R2023a

Description

The GPU Performance Analyzer displays information about GPU and CPU activities, events, and performance metrics from the generated CUDA® code. Use the GPU Performance Analyzer to find performance bottlenecks in a design.

The GPU Performance Analyzer displays the profiling data in chronological order in the Profiling Timeline pane. The timeline contains four rows:

  • The Functions row, which shows function calls in the generated code

  • The Loops row, which shows loops that execute on the CPU

  • The CPU Overhead row, which shows memory transfers and GPU kernel launches

  • The GPU Activities row, which shows memory transfers and GPU kernel execution

The Diagnostics pane reports potential performance bottlenecks, such as long CPU loops, repetitive memory transfers, or GPU kernels with few threads, and it suggests ways to address them. The analyzer also contains:

  • A Call Tree pane that shows the function call hierarchy

  • A Profiling Summary pane that contains overview statistics for the generated code

  • An Event Statistics pane that displays detailed statistics for the selected event

  • A Code pane that you can use to trace events and generated code to the source MATLAB® code

If the entry-point function that you generate code from contains a deep learning network, you can use the Open Deep Learning Dashboard button to open a dashboard that contains the statistics for the network. For more information, see Analyzing Network Performance Using the Deep Learning Dashboard.

The GPU Performance Analyzer saves the profiling data in a gpuProfiler.mldatx file, which it creates in the html subfolder of the code generation folder. To reopen the data from a profiling session, open the MLDATX file.
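As a minimal sketch of how you might locate saved profiling data, the following code searches the current folder tree for gpuProfiler.mldatx files. It assumes that the code generation folder is somewhere under the current folder, which might not hold for your setup.

```matlab
% Assumption: the code generation folder lies under the current folder.
% The "**" wildcard makes dir search all subfolders recursively.
hits = dir(fullfile(pwd,"**","gpuProfiler.mldatx"));

% List each profiling data file with its last-modified date.
for k = 1:numel(hits)
    fprintf("%s (%s)\n",fullfile(hits(k).folder,hits(k).name),hits(k).date);
end
```

If the search finds more than one file, the date can help you identify the profiling session that you want to reopen.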

Open the GPU Performance Analyzer

Examples

Generate GPU code, profile it, and view the profiling data in the GPU Performance Analyzer.

Use the openExample function to obtain the example files.

openExample("gpucoder/GPUExecutionProfilingOfTheGeneratedCodeExample");

The fog_rectification_init function takes a foggy input image and returns a defogged image. Read the image data from the foggyInput.png file and assign it to a variable, inputImage.

inputImage = imread("foggyInput.png");
inputs = {inputImage};

Create a GPU code configuration object by using the coder.gpuConfig function. Use the mex input argument to generate a MEX function.

cfg = coder.gpuConfig("mex");

Use the gpuPerformanceAnalyzer function to generate code for the function, profile the code, and view the data in the GPU Performance Analyzer.

designFileName = "fog_rectification_init";
gpuPerformanceAnalyzer(designFileName,inputs,Config=cfg);
### Starting GPU code generation
Code generation successful: View report

### GPU code generation finished
### Starting application profiling
### Application profiling finished
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data

GPU Performance Analyzer showing the profiling data for fog_rectification_init
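The call above uses only the Config name-value argument. As a hedged sketch, the following variation also sets the number of profiled runs. The NumIterations argument name is an assumption here, so check it against the gpuPerformanceAnalyzer function reference for your release. Running this code requires GPU Coder, a supported NVIDIA GPU, and the example files.

```matlab
% Sketch only: requires GPU Coder, a supported GPU, and the example files.
inputImage = imread("foggyInput.png");
cfg = coder.gpuConfig("mex");

% Assumption: NumIterations sets how many times the generated MEX function
% runs during profiling. Verify the argument name in the function reference.
gpuPerformanceAnalyzer("fog_rectification_init",{inputImage}, ...
    Config=cfg,NumIterations=5);
```

Profiling more than one run can be useful because the first run of the generated code typically includes one-time initialization.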

Zoom in to a section of the profiling timeline and examine an event.

To zoom in on a section of the timeline, use one of these methods:

  • Hold Shift and click and drag over the section of the timeline you want to display.

  • Use the mouse wheel or an alternative touchpad option.

  • In the bar at the top of the Profiling Timeline pane, click and drag the sliders to change the region shown in the timeline.

After you zoom in, the Profiling Summary pane shows statistics for only the region that is visible in the timeline. This image shows the GPU Performance Analyzer when zoomed in on a loop, fog_rectification_init_loop_2.

GPU Performance Analyzer zoomed in to fog_rectification_init_loop_2. The Profiling Summary pane shows the GPU utilization is 66% and the CPU overhead is 7%

To zoom out, use the mouse wheel or the sliders at the top of the timeline.

Examine the call hierarchy of the generated code using the Call Tree pane.

In this example, the GPU Performance Analyzer shows the profiling data from the fog_rectification_init entry-point function. In the Call Tree pane, select the fog_rectification_init function. To display the child events, expand the fog_rectification_init node.

Call Tree pane showing the events called by fog_rectification_init.

The call tree lists the execution time of each event as a percentage of the execution time of its parent event. Of the child events of fog_rectification_init, the loop fog_rectification_init_loop_2 has the longest execution time.

To examine the events called by the loop, expand the fog_rectification_init_loop_2 node. The tree shows that the loop spends most of its execution time waiting for the GPU.

Call Tree pane showing the events under fog_rectification_init_loop_2. The tree shows WaitForGPU takes 84.39% of the execution time.
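Because each percentage is relative to the parent event, the percentages at different levels of the tree do not share a common base. This minimal sketch shows the arithmetic with hypothetical durations. The event labels and times are illustrative and are not taken from this profiling session.

```matlab
% Hypothetical durations in milliseconds (illustrative values only).
entryPointTime = 25.0;                  % entry-point function
loopTime = 20.0;                        % a loop called by the entry point
childNames = ["WaitForGPU" "kernelLaunch" "memcpy"];
childTimes = [16.0 3.0 1.0];            % events called by the loop

% Share of each child event relative to its parent, the loop.
pctOfLoop = 100*childTimes/loopTime;

% Share of the loop relative to its own parent, the entry point.
pctLoopOfEntry = 100*loopTime/entryPointTime;

% Share of each child relative to the entry point: multiply down the tree.
pctOfEntry = pctOfLoop*pctLoopOfEntry/100;

for k = 1:numel(childNames)
    fprintf("%-14s %6.2f%% of loop, %6.2f%% of entry point\n", ...
        childNames(k),pctOfLoop(k),pctOfEntry(k));
end
```

In this sketch, an event that takes 80% of the loop takes 64% of the entry-point function, because the loop itself accounts for 80% of the entry-point time.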

Trace a GPU kernel event to the section of the generated code that calls the kernel, and trace the generated code to the source MATLAB code.

Open the profiling data for fog_rectification_init in the GPU Performance Analyzer. In the Profiling Timeline pane, in the GPU Activities row, select the leftmost kernel event, fog_rectification_init_kernel01.

Profiling Timeline with fog_rectification_init_kernel01 selected

To view the call to fog_rectification_init_kernel01 in the generated code, in the timeline, click the code button. Alternatively, in the toolstrip, in the Event Actions section, select Go To Code. The Code pane shows the trace in the generated code.

Code pane showing the call to the kernel in fog_rectification_init.cu

To view the definition of the kernel, hold Ctrl and click the kernel name.

Code pane showing the definition for fog_rectification_init_kernel01

Point to line 307, which is part of the kernel. To view the traced MATLAB code, select the code on line 307.

Code pane highlighting the traced MATLAB code. The traced MATLAB code calls the zeros function.

Line 307 of the generated code traces to line 9 of fog_rectification_init.m, which creates a matrix of zeros. To view the other code traces, select the list at the top of the Code pane.

Code pane listing the lines of MATLAB code that trace to line 307 of fog_rectification_init.cu

From the list, select line 43 of fog_rectification_init.m. The kernel also traces to MATLAB code that multiplies each element of the array restoreOut by 255.

Code pane highlighting the trace from line 307 of the CUDA code to line 43 of the MATLAB code

To open the traced code in the MATLAB Editor, click the line number 43 in the fog_rectification_init.m file.

Use the Filters section of the toolstrip to filter the events in the analyzer.

  • Show entire profiling session — Use this option to view the profiling results for the entire application, including initialization and termination.

  • Show single run — Use this option to view the profiling results for a single iteration of the generated code. By default, the GPU Performance Analyzer shows the results from the last run of the generated code.

  • Use the options under Filter Events to filter individual events:

    • Threshold (ms) — Skip events shorter than the given threshold.

    • Memory Allocation/Free — Show events related to GPU device memory allocation and deallocation on the CPU activities bar.

    • Memory Transfer — Show memory transfers to the host or device.

    • Kernel — Show CPU kernel launches and GPU kernel activities.

    • Other Event — Show other GPU-related events, such as synchronization and waiting for the GPU.
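To illustrate what a duration threshold does, this minimal sketch filters a hypothetical list of events. The event names, the durations, and the table layout are assumptions for illustration. They do not represent the data format that the analyzer uses.

```matlab
% Hypothetical event list (names and durations are illustrative only).
events = table( ...
    ["kernelA"; "memcpyHtoD"; "kernelB"; "cudaFree"], ...
    [0.850; 0.002; 0.040; 0.001], ...
    VariableNames=["Name" "DurationMs"]);

thresholdMs = 0.005;    % same unit as the Threshold (ms) option

% Skip events shorter than the threshold and keep the rest.
shown = events(events.DurationMs >= thresholdMs,:);
disp(shown)
```

With a threshold of 0.005 ms, only kernelA and kernelB remain in this sketch. Raising the threshold hides more short events, which can make a dense timeline easier to read.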

Limitations

  • On the Functions and Loops rows, you can navigate between caller and callee functions and loops using the up and down arrows on the right side of the event bar. For short events, it might not be possible to navigate back to the calling function or loop by using the up and down arrows. In such cases, use the call tree to navigate back to the caller function or loop.

  • At low zoom levels, the GPU Performance Analyzer represents a densely populated area of short, closely spaced events as a single event. At higher zoom levels, the GPU Performance Analyzer displays the individual events. However, if the duration of an event is extremely short, it might not be possible to render the event on the timeline plot, even at high zoom levels.

  • The GPU Performance Analyzer displays all the GPU events in a single row. When the application uses multiple CUDA streams, the GPU Activities row might contain overlapping events, and the statistics in the Profiling Summary pane might be inaccurate. For example, deep learning libraries such as cuDNN might use multiple CUDA streams.
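This minimal sketch illustrates why overlapping events complicate a single-row summary. It uses hypothetical event times and compares a naive sum of durations with the merged busy time. It does not describe how the GPU Performance Analyzer computes its statistics.

```matlab
% Hypothetical GPU events as [start end] times in ms on two streams.
intervals = [0 4; 2 6; 8 10];
windowMs = 10;                       % length of the observed region

% Naive busy time: summing durations counts the 2-4 ms overlap twice.
naiveBusy = sum(intervals(:,2) - intervals(:,1));

% Merged busy time: the union of the intervals counts overlaps once.
intervals = sortrows(intervals);
mergedBusy = 0;
curStart = intervals(1,1);
curEnd = intervals(1,2);
for k = 2:size(intervals,1)
    if intervals(k,1) <= curEnd
        % Event overlaps the current busy span, so extend the span.
        curEnd = max(curEnd,intervals(k,2));
    else
        % Gap found: close the current span and start a new one.
        mergedBusy = mergedBusy + (curEnd - curStart);
        curStart = intervals(k,1);
        curEnd = intervals(k,2);
    end
end
mergedBusy = mergedBusy + (curEnd - curStart);

fprintf("naive: %.0f%%, merged: %.0f%%\n", ...
    100*naiveBusy/windowMs,100*mergedBusy/windowMs);
```

In this sketch, the naive sum reports 100% utilization, while the merged intervals show that the GPU is busy for 80% of the window.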

Version History

Introduced in R2023a
