Measure and Improve GPU Performance
Getting Started with GPU Benchmarking
You can use various benchmark tests in MATLAB® to measure the performance of your GPU:
gpuBenchin MATLAB Central File Exchange to do various tests, including both memory and compute intensive tasks in both single and double precision. Compare the performance of a display card with a compute card. For more information, see https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench.
paralleldemo_gpu_benchscript in Measuring GPU Performance to obtain information on your PCI bus speed, GPU memory read/write and peak calculation performances for double precision matrix calculations.
Improve Performance Using Single Precision Calculations
You can improve the performance of your GPU by doing your calculations in single precision instead of double precision. In CPU computations, on the other hand, you do not get this improvement when switching from double to single precision. The reason is that most GPU cards are designed for graphic display, demanding high single precision performance.
Typical examples of calculations suitable for single-precision computation on the GPU include image processing and machine learning, see e.g. https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/d/Deep_Learning_in_Cloud_Whitepaper.pdf. However, other types of calculations, such as linear algebra problems, typically require double precision processing.
You can get a performance improvement of up to a factor of 50 for single compared
to double precision calculations, depending on the GPU card and total number of
cores. High end compute cards typically show a smaller improvement. You can
determine the performance improvement of your particular GPU by using
gpuBench, see https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench.
For a comprehensive performance overview of NVIDIA® GPU cards, see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units. You can calculate the performance improvement factor between single precision and double precision as follows:
Find the GPU on the wiki page above.
Get the stated single and double precision performance values from the table. If there is no double precision GFLOPS value, assume the ratio is 24‐32x slower for double precision.
Divide the stated single precision GFLOPS value by the double precision GFLOPS value.
If you have a mobile graphics card in your laptop, you can use this card for GPU computing. However, the laptop GPU is likely to be much less powerful than the desktop machine equivalent and so performance is reduced.
Basic Workflow for Improving Performance
The purpose of GPU computing in MATLAB is to speed up your applications. This topic discusses fundamental
concepts and practices that can help you achieve better performance on the GPU, such
as the configuration of the GPU hardware and best practices within your code. It
discusses the trade-off between implementation difficulty and performance, and
describes the criteria you might use to choose between using gpuArray functions,
arrayfun, MEX-files, or CUDA kernels. Finally, it describes
how to accurately measure performance on the GPU.
When converting MATLAB code to run on the GPU, it is best to start with MATLAB code that already performs well. While the GPU and CPU have different performance characteristics, the general guidelines for writing good MATLAB code also help you write good MATLAB code for the GPU. The first step is almost always to profile your CPU code. The lines of code that the profiler shows taking the most time on the CPU will likely be ones that you must concentrate on when you code for the GPU.
It is easiest to start converting your code using MATLAB built-in functions that support gpuArray data. These functions take gpuArray inputs, perform calculations on the GPU, and return gpuArray outputs. A list of the MATLAB functions that support gpuArray data is found in Run MATLAB Functions on a GPU. In general, these functions support the same arguments and data types as standard MATLAB functions that are calculated on the CPU.
If all the functions that you want to use are supported on the GPU, running code
on the GPU may be as simple as calling
gpuArray to transfer input
data to the GPU, and calling
gather to retrieve the output data
from the GPU when finished. In many cases, you might need to vectorize your code,
replacing looped scalar operations with MATLAB matrix and vector operations. While
vectorizing is generally a good practice on the CPU, it is usually critical for
achieving high performance on the GPU. For more information, see Vectorize for Improved GPU Performance.
Advanced Tools for Improving Performance
It is possible that even after converting inputs to gpuArrays and vectorizing your
code, there are operations in your algorithm that are either not built-in functions,
or that are not fast enough to meet your application’s requirements. In such
situations you have three main options: use
precompile element-wise parts of your application, make use of GPU library
functions, or write a custom CUDA kernel.
If you have a purely element-wise function, you can improve its performance by
calling it with
function on the GPU turns an element-wise MATLAB function into a custom CUDA kernel,
thus reducing the overhead of performing the operation. Often, there is a subset of
your application that can be used with
arrayfun even if the
entire application cannot be. The example Improve Performance of Element-wise MATLAB® Functions on the GPU using ARRAYFUN shows the basic concepts of
this approach; and the example Using GPU ARRAYFUN for Monte-Carlo Simulations shows how this can be done
in simulations for a finance application.
MATLAB provides an extensive library of GPU-enabled functions in Parallel Computing Toolbox™, Image Processing Toolbox™, Signal Processing Toolbox™, and other products. However, there are many libraries of additional functions that do not have direct built-in analogs in MATLAB’s GPU support. Examples include the NVIDIA Performance Primitives library and the CURAND library, which are included in the CUDA toolkit that ships with MATLAB. If you need to call a function in one of these libraries, you can do so using the GPU MEX interface. This interface allows you to extract the pointers to the device data from MATLAB gpuArrays so that you can pass these pointers to GPU functions. You can convert the returned values into gpuArrays for return to MATLAB. For more information see Run MEX-Functions Containing CUDA Code.
Finally, you have the option of writing a custom CUDA kernel for the operation that you need. Such kernels can be directly integrated into MATLAB using the CUDAKernel object.
The example Illustrating Three Approaches to GPU Computing: The Mandelbrot Set shows how to implement a
simple calculation using three of the approaches mentioned in this section. This
example begins with MATLAB code that is easily converted to run on the GPU, rewrites
the code to use
arrayfun for element-wise operations, and finally
shows how to integrate a custom CUDA kernel for the same operation.
Alternately, you can write a CUDA kernel as part of a MEX-file and call it using the CUDA Runtime API inside the MEX-file. Either of these approaches might let you work with low-level features of the GPU, such as shared memory and texture memory, that are not directly available in MATLAB code. For more details, see the example Accessing Advanced CUDA Features Using MEX.
Best Practices for Improving Performance
In general you can achieve the best performance when your GPU is dedicated to computing. It is usually not practical to use the same GPU device for both computations and graphics, because of the amount of memory taken up for problems of reasonable size and the constant use of the device by the system for graphics. If possible, obtain a separate device for graphics. Details of configuring your device for compute or graphics depend on the operating system and driver version.
On Windows® systems, a GPU device can be in one of two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC) mode. For best performance, any devices used for computing should be in TCC mode. Consult NVIDIA documentation for more details.
NVIDIA’s highest-performance compute devices, the Tesla line, support error correcting codes (ECC) when reading and writing GPU memory. The purpose of ECC is to correct for occasional bit-errors that occur normally when reading or writing dynamic memory. One technique to improve performance is to turn off ECC to increase the achievable memory bandwidth. While the hardware can be configured this way, MathWorks does not recommend this practice. The potential loss of accuracy due to silent errors can be more harmful than the performance benefit.
MATLAB Coding Practices
This topic describes general techniques that help you achieve better performance on the GPU. Some of these tips apply when writing MATLAB code for the CPU as well.
Data in MATLAB arrays is stored in column-major order. Therefore, it is beneficial to operate along the first or column dimension of your array. If one dimension of your data is significantly longer than others, you might achieve better performance if you make that the first dimension. Similarly, if you frequently operate along a particular dimension, it is usually best to have it as the first dimension. In some cases, if consecutive operations target different dimensions of an array, it might be beneficial to transpose or permute the array between these operations.
GPUs achieve high performance by calculating many results in parallel. Thus, matrix and higher-dimensional array operations typically perform much better than operations on vectors or scalars. You can achieve better performance by rewriting your loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more details, see Using Vectorization.
By default, all operations in MATLAB are performed in double-precision floating-point arithmetic. However, most operations support a variety of data types, including integer and single-precision floating-point. Today’s GPUs and CPUs typically have much higher throughput when performing single-precision operations, and single-precision floating-point data occupies less memory. If your application’s accuracy requirements allow the use of single-precision floating-point, it can greatly improve the performance of your MATLAB code.
The GPU sits at the end of a data transfer mechanism known as the PCI bus.
While this bus is an efficient, high-bandwidth way to transfer data from the PC
host memory to various extension cards, it is still much slower than the overall
bandwidth to the global memory of the GPU device or of the CPU (for more
details, see the example Measuring GPU Performance). In addition, transfers
from the GPU device to MATLAB host memory cause MATLAB to wait for all pending
operations on the device to complete before executing any other statements. This
can significantly hurt the performance of your application. In general, you
should limit the number of times you transfer data between the MATLAB workspace
and the GPU. If you can transfer data to the GPU once at the start of your
application, perform all the calculations you can on the GPU, and then transfer
the results back into MATLAB at the end, that generally results in the best
performance. Similarly, when possible it helps to create arrays directly on the
GPU, using either the
'gpuArray' or the
'like' option for functions such as
Z = zeros(N,'like',g)
for existing gpuArray
Measure Performance on the GPU
The best way to measure performance on the GPU is to use
gputimeit. This function takes as
input a function handle with no input arguments, and returns the measured execution
time of that function. It takes care of such benchmarking considerations as
repeating the timed operation to get better resolution, executing the function
before measurement to avoid initialization overhead, and subtracting out the
overhead of the timing function. Also,
gputimeit ensures that all
operations on the GPU have completed before the final timing.
For example, consider measuring the time taken to compute the
lu factorization of a random matrix
N. You can do this by defining a
function that does the
lu factorization and passing the function
A = rand(N,'gpuArray'); fh = @() lu(A); gputimeit(fh,2); % 2nd arg indicates number of outputs
You can also measure performance with
toc. However, to get accurate
timing on the GPU, you must wait for operations to complete before calling
toc. There are two ways to do this. You can call
gather on the final GPU output
toc: this forces all computations to complete
before the time measurement is taken. Alternately, you can use the
wait function with a
gpuDevice object as its input. For
example, if you wanted to measure the time taken to compute the
lu factorization of matrix
wait, you can do it as follows:
gd = gpuDevice(); tic(); [l,u] = lu(A); wait(gd); tLU = toc();
You can also use the MATLAB profiler to show how computation time is distributed
in your GPU code. Note, that to accomplish timing measurements, the profiler runs
each line of code independently, so it cannot account for overlapping (asynchronous)
execution such as might occur during normal operation. For timing whole algorithms,
you should use
gputimeit, as described above. Also, the profile might not
yield correct results for user-defined MEX functions if they run
Vectorize for Improved GPU Performance
This example shows you how to improve performance by running a function on the GPU instead of the CPU, and by vectorizing the calculations.
Consider a function that performs fast convolution on the columns of a matrix. Fast convolution, which is a common operation in signal processing applications, transforms each column of data from the time domain to the frequency domain, multiplies it by the transform of a filter vector, transforms back to the time domain, and stores the result in an output matrix.
function y = fastConvolution(data,filter) [m,n] = size(data); % Zero-pad filter to the column length of data, and transform filter_f = fft(filter,m); % Create an array of zeros of the same size and class as data y = zeros(m,n,'like',data); % Transform each column of data for ix = 1:n af = fft(data(:,ix)); y(:,ix) = ifft(af .* filter_f); end end
Execute this function in the CPU on data of a particular size, and measure the
execution time using the MATLAB
timeit function. The
timeit function takes care
of common benchmarking considerations, such as accounting for startup and
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution(a,b); % Calculate output ctime = timeit(@()fastConvolution(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]);
On a sample machine, this code displays the output:
Execution time on CPU = 0.019335
Now execute this function on the GPU. You can do this easily by changing the input
data to be gpuArrays rather than normal MATLAB arrays. The
syntax used when creating the output inside the function ensures that
y will be a gpuArray if
data is a
ga = gpuArray(a); % Move array to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution(ga,gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution(ga,gb)); % Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
On the same machine, this code displays the output:
Execution time on CPU = 0.019335 Execution time on GPU = 0.027235 Maximum absolute error = 1.1374e-14
Unfortunately, the GPU is slower than the CPU for this problem. The reason is that
for-loop is executing the FFT, multiplication, and inverse
FFT operations on individual columns of length 4096. The best way to increase the
performance is to vectorize the code, so that a single MATLAB function call performs
more calculation. The FFT and IFFT operations are easy to vectorize:
fft(A) computes the FFT of each column of a matrix
A. You can perform a multiply of the filter with every column
in a matrix at once using the MATLAB binary scalar expansion function
bsxfun. The vectorized function looks like this:
function y = fastConvolution_v2(data,filter) m = size(data,1); % Zero-pad filter to the length of data, and transform filter_f = fft(filter,m); % Transform each column of the input af = fft(data); % Multiply each column by filter and compute inverse transform y = ifft(bsxfun(@times,af,filter_f)); end
Perform the same experiment using the vectorized function:
a = complex(randn(4096,100),randn(4096,100)); % Data input b = randn(16,1); % Filter input c = fastConvolution_v2(a,b); % Calculate output ctime = timeit(@()fastConvolution_v2(a,b)); % Measure CPU time disp(['Execution time on CPU = ',num2str(ctime)]); ga = gpuArray(a); % Move data to GPU gb = gpuArray(b); % Move filter to GPU gc = fastConvolution_v2(ga, gb); % Calculate on GPU gtime = gputimeit(@()fastConvolution_v2(ga,gb));% Measure GPU time gerr = max(max(abs(gather(gc)-c))); % Calculate error disp(['Execution time on GPU = ',num2str(gtime)]); disp(['Maximum absolute error = ',num2str(gerr)]);
Execution time on CPU = 0.010393 Execution time on GPU = 0.0020537 Maximum absolute error = 1.1374e-14
In conclusion, vectorizing the code helps both the CPU and GPU versions to run faster. However, vectorization helps the GPU version much more than the CPU. The improved CPU version is nearly twice as fast as the original; the improved GPU version is 13 times faster than the original. The GPU code went from being 40% slower than the CPU in the original version, to about five times faster in the revised version.
If you only have one GPU in your machine, then it is likely that your graphics card is also acting as your display card. In this case, your GPU is probably subject to timeout imposed by the operating system (OS). You can examine this for your GPU as follows:
KernelExecutionTimeout = 1, then your GPU is subject to timeout imposed by the OS, ensuring that the OS is always able to print updates to the screen. If your GPU calculation takes too much time, then the operation is killed. In this case, you must restart MATLAB to resume GPU calculations successfully.