GPU Execution Profiling of the Generated Code

This example shows you how to generate an execution profiling report for the generated CUDA® code by using the gpucoder.profile function.

The GPU Coder profiler runs a software-in-the-loop (SIL) execution that produces execution-time metrics for the tasks and kernels in the generated code. This example generates an execution profiling report for the Fog Rectification example from GPU Coder. For more information, see Fog Rectification.

Third-Party Prerequisites

  • CUDA-enabled NVIDIA® GPU.

  • NVIDIA CUDA toolkit and driver.

  • NVIDIA Nsight™ Systems. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.

  • Environment variables for the compilers and libraries. For setting up the environment variables, see Setting Up the Prerequisite Products.

  • The profiling workflow of this example depends on NVIDIA profiling tools that access GPU performance counters. Starting with CUDA toolkit v10.1, NVIDIA restricts access to performance counters to admin users only. To make GPU performance counters available to all users, see the instructions provided in Permission issue with Performance Counters (NVIDIA).

Verify GPU Environment

To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);

Fog Rectification Algorithm

To improve the foggy input image, the algorithm performs fog removal and then contrast enhancement. The diagram shows the steps of both these operations.

This example takes a foggy RGB image as input. To perform fog removal, the algorithm estimates the dark channel of the image, calculates the airlight map based on the dark channel, and refines the airlight map by using filters. The restoration stage creates a defogged image by subtracting the refined airlight map from the input image.

Then, the Contrast Enhancement stage assesses the range of intensity values in the image and uses contrast stretching to expand the range of values and make features stand out more clearly.
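
The two stages can be traced by hand on a tiny image. The sketch below is illustrative only: the 2-by-2 image is made-up data, the filtering step that refines the airlight map is skipped, and the contrast stretch is a plain min-max stretch rather than the percentile-based lookup table that fog_rectification.m uses.

```matlab
% Toy 2-by-2 RGB image with values in [0, 1] (hypothetical data).
img = cat(3, [0.9 0.8; 0.7 0.6], [0.5 0.9; 0.8 0.4], [0.7 0.6; 0.9 0.5]);

% Dark channel: per-pixel minimum across the three color planes.
darkChannel = min(img, [], 3);

% Airlight estimate. The filtering (refinement) step is skipped here;
% the 0.9 and 0.6 scale factors are the ones used in fog_rectification.m.
airlight = 0.6 * min(darkChannel, 0.9 * darkChannel);

% Restoration: subtract the airlight from every plane and rescale.
restored = (img - airlight) ./ (1 - airlight);

% Simplified contrast stretch: map the observed range onto [0, 255].
lo = min(restored(:));
hi = max(restored(:));
stretched = uint8(255 * (restored - lo) / (hi - lo));
```

After the stretch, the darkest restored value maps to 0 and the brightest to 255, which is the "expand the range of values" effect described above.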

[Figure: diagram of the fog rectification algorithm (FogRectificationAlgorithm.png)]

type fog_rectification.m
function [out] = fog_rectification(input) %#codegen

%   Copyright 2017-2019 The MathWorks, Inc.

coder.gpu.kernelfun;

% restoreOut is used to store the output of restoration
restoreOut = zeros(size(input),'double');

% Changing the precision level of input image to double
input = double(input)./255;

%% Dark channel Estimation from input
darkChannel = min(input,[],3);

% diff_im is used as input and output variable for anisotropic diffusion
diff_im = 0.9*darkChannel;
num_iter = 3;

% 2D convolution mask for Anisotropic diffusion
hN = [0.0625 0.1250 0.0625; 0.1250 0.2500 0.1250; 0.0625 0.1250 0.0625];
hN = double(hN);

%% Refine dark channel using Anisotropic diffusion.
for t = 1:num_iter
    diff_im = conv2(diff_im,hN,'same');
end

%% Reduction with min
diff_im = min(darkChannel,diff_im);

diff_im = 0.6*diff_im ;

%% Parallel element-wise math to compute
%  Restoration with inverse Koschmieder's law
factor = 1.0./(1.0-(diff_im));
restoreOut(:,:,1) = (input(:,:,1)-diff_im).*factor;
restoreOut(:,:,2) = (input(:,:,2)-diff_im).*factor;
restoreOut(:,:,3) = (input(:,:,3)-diff_im).*factor;
restoreOut = uint8(255.*restoreOut);
restoreOut = uint8(restoreOut);

%%
% Stretching performs the histogram stretching of the image.
% im is the input color image and p is cdf limit.
% out is the contrast stretched image and cdf is the cumulative prob.
% density function and T is the stretching function.

p = 5;
% RGB to grayscale conversion
im_gray = im2gray(restoreOut);
[row,col] = size(im_gray);

% histogram calculation
[count,~] = imhist(im_gray);
prob = count'/(row*col);

% cumulative Sum calculation
cdf = cumsum(prob(:));

% finding less than particular probability
i1 = length(find(cdf <= (p/100)));
i2 = 255-length(find(cdf >= 1-(p/100)));

o1 = floor(255*.10);
o2 = floor(255*.90);

t1 = (o1/i1)*[0:i1];
t2 = (((o2-o1)/(i2-i1))*[i1+1:i2])-(((o2-o1)/(i2-i1))*i1)+o1;
t3 = (((255-o2)/(255-i2))*[i2+1:255])-(((255-o2)/(255-i2))*i2)+o2;

T = (floor([t1 t2 t3]));

restoreOut(restoreOut == 0) = 1;

u1 = (restoreOut(:,:,1));
u2 = (restoreOut(:,:,2));
u3 = (restoreOut(:,:,3));

% Replacing the value from look up table
out1 = T(u1);
out2 = T(u2);
out3 = T(u3);

out = zeros([size(out1),3], 'uint8');
out(:,:,1) = uint8(out1);
out(:,:,2) = uint8(out2);
out(:,:,3) = uint8(out3);
return

Generate Execution Profiling Report

To generate an execution profiling report, create a code configuration object with a dynamic library ('dll') build type. Because the gpucoder.profile function accepts only an Embedded Coder™ configuration object, enable the option to create a coder.EmbeddedCodeConfig configuration object.

cfg = coder.gpuConfig('dll','ecoder',true);
cfg.GpuConfig.MallocMode = 'discrete';

Run gpucoder.profile with the default threshold value of zero seconds. If the generated code makes many CUDA API or kernel calls, each call is likely to account for only a small proportion of the total time. In such cases, set a low (non-zero) threshold value to generate a meaningful profiling report. Avoid setting the number of executions (NumCalls) to a very low value (less than 5), because so few executions do not give an accurate representation of a typical execution profile.

inputImage = imread('foggyInput.png');
inputs  = {inputImage};
designFileName = 'fog_rectification';

gpucoder.profile(designFileName, inputs, ...
    'CodegenConfig', cfg, 'Threshold', 0, 'NumCalls', 10);
Code generation successful: View report

### Starting SIL execution for 'fog_rectification'
    To terminate execution: clear fog_rectification_sil
    Execution profiling data is available for viewing. Open Simulation Data Inspector.
    Execution profiling report available after termination.
 
### Host application produced the following standard error (stderr) messages:
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Collecting data...

 
### Stopping SIL execution for 'fog_rectification'
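
As a sketch of the low-threshold case mentioned earlier, the call below reuses the same name-value arguments with a non-zero threshold. The value 0.001 is a placeholder chosen for illustration, not a recommended setting. The sketch assumes that fog_rectification.m and foggyInput.png are on the MATLAB path and that the GPU Coder prerequisites listed above are installed.

```matlab
% Same configuration as in the main example.
cfg = coder.gpuConfig('dll','ecoder',true);
cfg.GpuConfig.MallocMode = 'discrete';
inputs = {imread('foggyInput.png')};

% Illustrative non-zero threshold (hypothetical value) with 10 executions.
gpucoder.profile('fog_rectification', inputs, ...
    'CodegenConfig', cfg, 'Threshold', 0.001, 'NumCalls', 10);
```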

Code Execution Profiling Report for the fog_rectification Function

The code execution profiling report provides metrics based on data collected from a SIL execution. Execution times are calculated from data recorded by instrumentation probes added to the SIL or PIL test harness or inside the code generated for each component. For more information, see View Execution Times (Embedded Coder).

These numbers are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2022b on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA TITAN XP GPU.

Summary

This section gives information about the creation of the report.

Profiled Sections of Code

This section contains information about profiled code sections. The report contains time measurements for:

  • The entry_point_fn_initialize function, for example, fog_rectification_initialize.

  • The entry-point function, for example, fog_rectification.

  • The entry_point_fn_terminate function, for example, fog_rectification_terminate.

For each profiled code section, the report provides these columns and controls:

  • The Section column lists the names of the functions from which code is generated.

  • Maximum Execution Time is the longest time between the start and end of the code section.

  • Average Execution Time is the average time between the start and end of the code section.

  • Maximum Self Time is the maximum execution time, excluding time in child sections.

  • Average Self Time is the average execution time, excluding time in child sections.

  • Calls indicates the number of calls to the code section.

  • To view execution-time metrics for a code section in the Command Window, on the corresponding row, click the icon (icon_view_code_sect_obj.png).

  • To display measured execution times, click the Simulation Data Inspector icon. You can use the Simulation Data Inspector to manage and compare plots from various executions.

  • To display the execution-time distribution, click the icon (code_exec_profiling_report_icon_frequency_distribution.png).
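
The column definitions above can be made concrete with a small calculation. This is a sketch on made-up timings, not output from the profiler, and it assumes that self time for a call is the execution time of that call minus the time spent in its child sections.

```matlab
% Hypothetical per-call timings, in milliseconds, for one code section.
execTimes  = [4.0 3.5 3.8 4.2];   % time between start and end of the section
childTimes = [1.0 0.5 0.9 1.1];   % time spent in child sections

% Self time excludes the time spent in child sections.
selfTimes = execTimes - childTimes;

metrics = struct( ...
    'MaxExecutionTime', max(execTimes), ...
    'AvgExecutionTime', mean(execTimes), ...
    'MaxSelfTime',      max(selfTimes), ...
    'AvgSelfTime',      mean(selfTimes), ...
    'Calls',            numel(execTimes));
disp(metrics)
```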

By default, the report displays time in milliseconds (10^-3 seconds). You can specify the time unit and numeric display format. For example, to display time in microseconds (10^-6 seconds), use the report (Embedded Coder) command:

executionProfile=getCoderExecutionProfile('fog_rectification');
report(executionProfile, ...
    'Units', 'Seconds', ...
    'ScaleFactor', '1e-06', ...
    'NumericFormat', '%0.3f')
ans = 
'/local-ssd/lnarasim/MATLAB/ExampleManager/lnarasim.Bdoc22b.j1984243/gpucoder-ex87489778/codegen/dll/fog_rectification/html/orphaned/ExecutionProfiling_f31bfb52dfefde93.html'
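
To see what these settings do to a single value, the sketch below works through the conversion by hand. It assumes that the displayed number is the time in seconds divided by the scale factor, which is consistent with the microsecond example above but is not taken from the report implementation. The sample duration is hypothetical.

```matlab
timeInSeconds = 1.379e-3;   % hypothetical measured time (1.379 ms)
scaleFactor   = 1e-6;       % same value as the 'ScaleFactor' argument
numericFormat = '%0.3f';    % same value as the 'NumericFormat' argument

displayed = timeInSeconds / scaleFactor;         % time in microseconds
formatted = sprintf(numericFormat, displayed);   % three decimal places
disp(formatted)
```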

The report displays time in seconds only if the timer is calibrated, that is, the number of timer ticks per second is known. On a Windows® machine, the software determines this value for a SIL simulation. On a Linux® machine, you must manually calibrate the timer. For example, if your processor speed is 3.5 GHz, specify the number of timer ticks per second:

executionProfile.TimerTicksPerSecond = 3.5e9;
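
The calibration value is what turns raw timer ticks into seconds. A minimal sketch of that arithmetic, using a made-up tick count:

```matlab
timerTicksPerSecond = 3.5e9;   % 3.5 GHz processor, as in the text
elapsedTicks        = 7e6;     % hypothetical raw tick count for one section

elapsedSeconds      = elapsedTicks / timerTicksPerSecond;
elapsedMilliseconds = 1e3 * elapsedSeconds;
fprintf('%g ticks = %g ms\n', elapsedTicks, elapsedMilliseconds);
```

Without a known ticks-per-second value, this division cannot be done, which is why the report can show seconds only when the timer is calibrated.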

Execution Times in Percentages

This section provides function execution times as percentages of caller function and total execution times, which can help you to identify performance bottlenecks in generated code.
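
The two percentages are simple ratios. The sketch below uses made-up timings to show how a function that looks significant relative to its caller can still be a small share of the total run.

```matlab
% Hypothetical execution times, in milliseconds.
totalTime  = 20.0;   % total execution time of the profiled run
callerTime = 8.0;    % execution time of the caller function
calleeTime = 2.0;    % execution time of the function of interest

pctOfCaller = 100 * calleeTime / callerTime;
pctOfTotal  = 100 * calleeTime / totalTime;
fprintf('%.1f%% of caller, %.1f%% of total\n', pctOfCaller, pctOfTotal);
```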

GPU Profiling Trace for fog_rectification

Section 4 shows the complete trace of GPU calls that have a runtime higher than the threshold value. A snippet of the profiling trace is shown.
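
The effect of the threshold on the trace can be sketched as a simple filter. The call names and durations below are made up for illustration and the filter is not the profiler's own code. It only mirrors the rule that calls with a run time higher than the threshold are shown.

```matlab
% Hypothetical GPU calls and their durations, in seconds.
callNames = {'cudaMalloc', 'smallKernel', 'cudaMemcpy', 'cudaFree'};
durations = [2.0e-4,       5.0e-6,        8.0e-5,       1.4e-3];

threshold = 1.0e-5;               % illustrative threshold, in seconds

% Keep only the calls whose run time is higher than the threshold.
keep       = durations > threshold;
shownCalls = callNames(keep);
disp(shownCalls)
```

With a threshold of zero, every call with a measurable run time would remain in the trace.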

GPU Profiling Summary for fog_rectification

Section 5 of the report summarizes the GPU calls that are shown in section 4. For example, cudaFree is called 15 times per run of fog_rectification, and the average time taken by those 15 calls, computed over 9 runs of fog_rectification, is 1.3790 milliseconds. The summary is sorted in descending order of time taken, so the GPU calls that take the most time appear first.
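
The structure of this summary can be sketched as follows: total the time of each GPU call within a run, average those totals across runs, and sort in descending order. The numbers and the kernel name are hypothetical. Only the aggregation pattern is taken from the description above.

```matlab
% Columns are GPU calls; rows are runs. Each entry is the total time (ms)
% spent in all invocations of that call during one run (hypothetical data).
callNames    = {'myKernel', 'cudaFree', 'cudaMalloc'};
perRunTotals = [0.2 1.4 0.9;
                0.3 1.3 1.0;
                0.1 1.5 0.8];

% Average the per-run totals across runs, then sort in descending order.
avgPerRun          = mean(perRunTotals, 1);
[sortedAvg, order] = sort(avgPerRun, 'descend');
sortedNames        = callNames(order);
disp(sortedNames)
```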

Definitions

This section provides descriptions of some metrics.