Measure and Improve GPU Performance
Measure GPU Performance
Measure Code Performance on a GPU
An important measure of the performance of your code is how long it takes to
run. The best way to time code running on a GPU is to use the
gputimeit function, which runs a function multiple times to average out
variation and compensate for overhead. The gputimeit function also ensures
that all operations on the GPU are complete before recording the time.
For example, measure the time that the lu function takes to compute the LU
factorization of a random N-by-N matrix A. To perform this measurement,
create a function handle to the lu function and pass the function handle to
gputimeit.
    N = 1000;
    A = rand(N,"gpuArray");
    f = @() lu(A);
    numOutputs = 2;
    gputimeit(f,numOutputs)
You can also time your code using tic and toc. However, to get accurate
timing information for code running on a GPU, you must wait for operations to
complete before calling tic and toc. To do this, you can use the
wait function with a gpuDevice object as its input.
For example, measure the time taken to compute the LU factorization of matrix
A using tic, toc, and wait.
    D = gpuDevice;
    wait(D)
    tic
    [L,U] = lu(A);
    wait(D)
    toc
You can view how long each part of your code takes using the MATLAB® Profiler. For more information about profiling your code, see
profile and Profile Your Code to Improve Performance. The Profiler is useful for identifying
performance bottlenecks in your code but cannot accurately time GPU code because it
does not account for overlapping execution, which is common when you use a GPU.
Use this table to help you decide which timing method to use.
|Timing Method|Suitable Tasks|Limitations|
|---|---|---|
|gputimeit|Timing individual functions||
|tic and toc|Timing multiple lines of code or entire workflows||
|MATLAB Profiler|Finding performance bottlenecks|The Profiler runs each line of code independently and does not account for overlapping execution, which is common when you use a GPU. You cannot use the Profiler as a way to accurately time GPU code.|
Benchmark tests are useful for identifying the strengths and weaknesses of a GPU and for comparing the performance of different GPUs. Measure the performance of your GPU by using these benchmark tests:
Run the Measure GPU Performance example to obtain detailed information about your GPU, including PCI bus speed, GPU memory read/write, and peak calculation performance for double-precision matrix calculations.
Run gpuBench to test memory- and computation-intensive tasks in single and double precision.
You can download gpuBench from the Add-On Explorer or from the MATLAB Central File Exchange. For more information, see https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench.
Improve GPU Performance
The purpose of GPU computing in MATLAB is to speed up your code. You can achieve better performance on the GPU by implementing best practices for writing code and configuring your GPU hardware. Various methods to improve performance are discussed below, starting with the most straightforward to implement.
Use this table to help you decide which methods to use.
|Performance Improvement Method|When Should I Use This Method?|Limitations|
|---|---|---|
|Use GPU Arrays – pass GPU arrays to supported functions to run your code on the GPU||Your functions must support gpuArray input.|
|Profile and Improve Your MATLAB Code – profile your code to identify bottlenecks||The Profiler cannot be used to accurately time code running on the GPU, as described in the Measure Code Performance on a GPU section.|
|Vectorize Calculations – replace for-loops with matrix and vector operations|When running code that operates on vectors or matrices inside a for-loop|For more information, see Using Vectorization.|
|Perform Calculations in Single Precision – reduce computation by using lower precision data|When smaller ranges of values and lower accuracy are acceptable|Some types of calculation, such as linear algebra problems, might require double-precision processing.|
|Improve Performance of Element-Wise Functions||For information about supported functions and additional limitations, see arrayfun.|
|Improve Performance of Operations on Small Matrices|When using a function that performs independent matrix operations on a large number of small matrices|Not all built-in MATLAB functions are supported. For information about supported functions and additional limitations, see pagefun.|
|Write MEX File Containing CUDA Code – access additional libraries of GPU functions|When you want access to NVIDIA® libraries or advanced CUDA features|Requires code written using the CUDA C++ framework.|
|Configure Your Hardware for GPU Performance – make the best use of your hardware|||
Use GPU Arrays
If all the functions that your code uses are supported on the GPU, the only
necessary modification is to transfer the input data to the GPU by calling
gpuArray. For a list of
MATLAB functions that support
gpuArray input, see
Run MATLAB Functions on a GPU.
A gpuArray object stores data in GPU memory. Because most
numeric functions in MATLAB and in many other toolboxes support gpuArray
objects, you can usually run your code on a GPU by making minimal changes. These
functions take gpuArray inputs, perform calculations on the
GPU, and return gpuArray outputs. In general, these functions
support the same arguments and data types as standard MATLAB functions that run
on the CPU.
To reduce overhead, limit the number of times you transfer data between
the host memory and the GPU. Create arrays directly on the GPU where
possible. For more information, see Create GPU Arrays Directly. Similarly, only transfer data from
the GPU back to the host memory using
gather if the data needs to be displayed,
saved, or used in code that does not support gpuArray objects.
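The following is a minimal sketch of this pattern, not a definitive recipe. It assumes a supported GPU and Parallel Computing Toolbox are available. The variable names and the array size are illustrative only.

```matlab
% Create the data directly in GPU memory rather than copying it from the host.
N = 1000;                       % illustrative size
A = rand(N,"gpuArray");
B = ones(N,"gpuArray");

% Supported functions accept gpuArray inputs and return gpuArray outputs,
% so intermediate results stay on the GPU.
C = A.*B + 1;
colMeans = mean(C,1);

% Transfer only the final result back to host memory.
colMeansHost = gather(colMeans);
```

Only one transfer from the GPU occurs here, and it moves a 1-by-N row vector rather than any of the N-by-N intermediate arrays.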
Profile and Improve Your MATLAB Code
When converting MATLAB code to run on a GPU, it is best to start with MATLAB
code that already performs well. Many of the guidelines for writing code that
runs well on a CPU will also improve the performance of code that runs on a GPU.
You can profile your CPU code using the MATLAB Profiler. The lines of code that take the most time on the CPU
will likely be ones that you should improve or consider moving onto the GPU
using gpuArray objects. For more information about profiling
your code, see Profile Your Code to Improve Performance.
Because the MATLAB Profiler runs each line of code independently, it does not account
for overlapping execution, which is common when you use a GPU. To time whole
workflows, use tic and toc or gputimeit, as described in the Measure Code Performance on a GPU
section.
Vectorize Calculations
Vector, matrix, and higher-dimensional operations typically perform much better than scalar operations on a GPU because GPUs achieve high performance by calculating many results in parallel. You can achieve better performance by rewriting loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For information on vectorization, see Using Vectorization and Improve Performance Using a GPU and Vectorized Calculations. The Improve Performance Using a GPU and Vectorized Calculations example shows the increase in performance achieved by vectorizing a function executing on the CPU and on the GPU.
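As an illustration, this sketch computes the same result with a scalar loop and with a single vectorized expression. The polynomial and the array length are arbitrary choices made for the example, and the code assumes a supported GPU is available.

```matlab
x = rand(1,1000,"gpuArray");    % small on purpose: the loop version is slow

% Loop-based version: one scalar GPU operation per iteration.
yLoop = zeros(size(x),"gpuArray");
for k = 1:numel(x)
    yLoop(k) = x(k)^2 + 3*x(k);
end

% Vectorized version: the whole array is processed in parallel.
yVec = x.^2 + 3*x;

% Largest difference between the two results, gathered to the host.
maxDiff = gather(max(abs(yLoop - yVec)));
```

The two versions compute the same values up to floating-point rounding, but the vectorized form issues far fewer GPU operations.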
Perform Calculations in Single Precision
You can improve the performance of code running on your GPU by calculating in single precision instead of double precision. CPU computations do not provide this improvement when switching from double to single precision because most GPU cards are designed for graphic display, which demands a high single-precision performance. For more information on converting data to single precision and performing arithmetic operations on single-precision data, see Floating-Point Numbers.
Typical examples of workflows suitable for single-precision computation on the GPU include image processing and machine learning. However, other types of calculation, such as linear algebra problems, typically require double-precision processing. The Deep Learning Toolbox™ performs many operations in single precision by default. For more information, see Deep Learning Precision (Deep Learning Toolbox).
The exact performance improvement depends on the GPU card and total number of cores. High-end compute cards typically show a smaller improvement. For a comprehensive performance overview of NVIDIA GPU cards, including single- and double-precision processing power, see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units.
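One way to check the effect on your own hardware is to time the same operation in both precisions with gputimeit. This is a sketch under the assumption that matrix multiplication is representative of your workload. The matrix size is arbitrary and the measured ratio varies between GPUs.

```matlab
N = 2000;                       % illustrative size
Ad = rand(N,"gpuArray");        % double precision by default
As = single(Ad);                % single-precision copy of the same data

% Time the same operation in each precision.
tDouble = gputimeit(@() Ad*Ad);
tSingle = gputimeit(@() As*As);

fprintf("double: %.4f s, single: %.4f s\n",tDouble,tSingle)
```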
Improve Performance of Element-Wise Functions
If you have an element-wise function, you can often improve its performance by
calling it with arrayfun. The
arrayfun function on the GPU turns an element-wise MATLAB
function into a custom CUDA kernel, which reduces the overhead of performing the
operation. You can often use
arrayfun with a subset of your
code even if
arrayfun does not support your entire code. The
performance of a wide variety of element-wise functions can be improved using
arrayfun, including functions performing many
element-wise operations within looping or branching code, and nested functions
where the nested function accesses variables declared in its parent
function.
The Improve Performance of Element-Wise MATLAB Functions on the GPU Using arrayfun example shows a basic
use of arrayfun. The Using GPU arrayfun for Monte-Carlo Simulations example shows
arrayfun used to improve the performance of a function
executing element-wise operations within a loop. The Stencil Operations on a GPU example shows
arrayfun used to call a nested function that accesses
variables declared in a parent function.
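As a rough sketch of the basic case, this code applies a hypothetical element-wise function to a gpuArray directly and through arrayfun, then times both. The function f and its coefficients are invented for illustration. All values are passed as explicit inputs so that the anonymous function does not rely on workspace variables. A supported GPU is assumed.

```matlab
x = rand(1e6,1,"gpuArray");

% Hypothetical element-wise function made of several operations.
f = @(x,a,b) a.*x.^2 + b.*sin(x) + exp(-x);

% Direct call: each operation runs as a separate GPU operation.
yRef = f(x,2,0.5);

% arrayfun call: scalar inputs are expanded to match x.
yFused = arrayfun(f,x,2,0.5);

% Compare execution times of the two approaches.
tRef = gputimeit(@() f(x,2,0.5));
tFused = gputimeit(@() arrayfun(f,x,2,0.5));
fprintf("direct: %.5f s, arrayfun: %.5f s\n",tRef,tFused)
```

Whether arrayfun is faster depends on the function and the data size, so measure both versions on your own code before committing to one.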
Improve Performance of Operations on Small Matrices
If you have a function that performs independent matrix operations on a large
number of small matrices, you can improve its performance by calling it with
pagefun. You can use
pagefun to perform matrix operations in parallel on the
GPU instead of looping over the matrices. The Improve Performance of Small Matrix Problems on the GPU using pagefun example shows how to improve
performance using pagefun when operating on many small matrices.
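A minimal sketch of the idea, assuming the small matrices are stored as the pages of a 3-D gpuArray. The page size and page count are arbitrary, and matrix multiplication stands in for whichever supported operation you need.

```matlab
% 1000 independent 4-by-4 matrices stored as pages of 3-D arrays.
A = rand(4,4,1000,"gpuArray");
B = rand(4,4,1000,"gpuArray");

% Loop version: one small matrix multiplication per iteration.
Cloop = zeros(4,4,1000,"gpuArray");
for k = 1:size(A,3)
    Cloop(:,:,k) = A(:,:,k)*B(:,:,k);
end

% pagefun version: multiply all pages in a single call.
C = pagefun(@mtimes,A,B);

% Largest difference between the two results, gathered to the host.
maxDiff = gather(max(abs(C - Cloop),[],"all"));
```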
Write MEX File Containing CUDA Code
While MATLAB provides an extensive library of GPU-enabled functions, you can
access libraries of additional functions that do not have analogs in MATLAB. Examples include NVIDIA libraries such as the NVIDIA Performance Primitives (NPP), cuRAND, and cuFFT libraries. You can
compile MEX files that you write in the CUDA C++ framework using the
mexcuda function. You can
execute the compiled MEX files in MATLAB and call functions from NVIDIA libraries. For an example that shows how to write and run MEX
functions that take
gpuArray input and return
gpuArray output, see Run MEX-Functions Containing CUDA Code.
Configure Your Hardware for GPU Performance
Because many computations require large quantities of memory and most systems use the GPU constantly for graphics, using the same GPU for computations and graphics is usually impractical.
On Windows® systems, a GPU device has two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC) mode. To attain the best performance for your code, set the devices that you use for computing to TCC mode. For more information about switching modes and which GPU devices support TCC mode, consult the NVIDIA documentation.
To reduce the likelihood of running out of memory on the GPU, do not use one
GPU on multiple instances of MATLAB. To see which GPU devices are available and selected, use the