Measure and Improve GPU Performance
Measure GPU Performance
Measure Code Performance on a GPU
An important measure of the performance of your code is how long it takes to run. The best way to time code running on a GPU is to use the gputimeit function, which runs a function multiple times to average out variation and compensate for overhead. The gputimeit function also ensures that all operations on the GPU are complete before recording the time.
For example, measure the time that the lu function takes to compute the LU factorization of a random N-by-N matrix A. To perform this measurement, create a function handle to the lu function and pass the function handle to gputimeit.

N = 1000;
A = rand(N,"gpuArray");
f = @() lu(A);
numOutputs = 2;
gputimeit(f,numOutputs)
You can also time your code using tic and toc. However, to get accurate timing information for code running on a GPU, you must wait for operations to complete before calling toc. To do this, use the wait function with a gpuDevice object as its input. For example, measure the time taken to compute the LU factorization of matrix A.

D = gpuDevice;
wait(D)
tic
[L,U] = lu(A);
wait(D)
toc
You can view how long each part of your code takes using the MATLAB® Profiler. For more information about profiling your code, see profile and Profile Your Code to Improve Performance.

Use this table to help you decide which timing method to use.

Timing Method | When Should I Use This Method?
gputimeit | Timing individual functions
tic and toc | Timing multiple lines of code or entire workflows
MATLAB Profiler | Finding performance bottlenecks

The Profiler runs each line of code independently and does not account for overlapping execution, which is common when you use a GPU. The Profiler is therefore useful for identifying performance bottlenecks in your code, but you cannot use it to accurately time GPU code.
Benchmark tests are useful for identifying the strengths and weaknesses of a GPU and for comparing the performance of different GPUs. Measure the performance of your GPU by using these benchmark tests:

Run the Measure GPU Performance example to obtain detailed information about your GPU, including PCI bus speed, GPU memory read/write speed, and peak calculation performance for double-precision matrix calculations.

Use gpuBench to test memory- and computation-intensive tasks in single and double precision. gpuBench can be downloaded from the Add-On Explorer or from the MATLAB Central File Exchange. For more information, see https://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench.
Improve GPU Performance
The purpose of GPU computing in MATLAB is to speed up your code. You can achieve better performance on the GPU by implementing best practices for writing code and configuring your GPU hardware. Various methods to improve performance are discussed below, starting with the most straightforward to implement.
Use this table to help you decide which methods to use.
Performance Improvement Method | When Should I Use This Method? | Considerations
Use GPU Arrays – pass GPU arrays to supported functions to run your code on the GPU | When all of the functions your code uses are supported on the GPU | Your functions must support gpuArray input. For a list of supported functions, see Run MATLAB Functions on a GPU.
Profile and Improve Your MATLAB Code – profile your code to identify bottlenecks | When you want to identify the slowest parts of your code to improve or move onto the GPU | The Profiler cannot be used to accurately time code running on the GPU, as described in the Measure Code Performance on a GPU section.
Vectorize Calculations – replace for-loops with matrix and vector operations | When running code that operates on vectors or matrices inside a for-loop | For more information, see Using Vectorization.
Perform Calculations in Single Precision – reduce computation by using lower precision data | When smaller ranges of values and lower accuracy are acceptable | Some types of calculation, such as linear algebra problems, might require double-precision processing.
Improve Performance of Element-Wise Functions – turn element-wise functions into custom CUDA kernels using arrayfun | When applying element-wise operations, including within looping or branching code | Not all built-in MATLAB functions are supported. For information about supported functions and additional limitations, see arrayfun.
Improve Performance of Operations on Small Matrices – perform matrix operations in parallel using pagefun | When using a function that performs independent matrix operations on a large number of small matrices | Not all built-in MATLAB functions are supported. For information about supported functions and additional limitations, see pagefun.
Write MEX File Containing CUDA Code – access additional libraries of GPU functions | When you want access to NVIDIA® libraries or advanced CUDA features | Requires code written using the CUDA C++ framework.
Configure Your Hardware for GPU Performance – make the best use of your hardware | When you can configure the GPU devices on your system | On Windows® systems, set devices used for computing to the TCC driver model where supported.
Use GPU Arrays
If all the functions that your code uses are supported on the GPU, the only necessary modification is to transfer the input data to the GPU by calling gpuArray. For a list of MATLAB functions that support gpuArray input, see Run MATLAB Functions on a GPU.

A gpuArray object stores data in GPU memory. Because most numeric functions in MATLAB and in many other toolboxes support gpuArray objects, you can usually run your code on a GPU by making minimal changes. These functions take gpuArray inputs, perform calculations on the GPU, and return gpuArray outputs. In general, these functions support the same arguments and data types as standard MATLAB functions that run on the CPU.

To reduce overhead, limit the number of times you transfer data between the host memory and the GPU. Create arrays directly on the GPU where possible. For more information, see Create GPU Arrays Directly. Similarly, only transfer data from the GPU back to the host memory using gather if the data needs to be displayed, saved, or used in code that does not support gpuArray objects.
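As a minimal sketch of this workflow (the matrix size and operations here are illustrative):

```matlab
% Create the data directly on the GPU to avoid a host-to-GPU transfer.
A = rand(1000,"gpuArray");
B = rand(1000,"gpuArray");

% Supported functions take gpuArray inputs, run on the GPU,
% and return gpuArray outputs.
C = A*B + sin(A);

% Transfer the result back to host memory only when you need it there,
% for example to display or save it.
result = gather(C);
```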
Profile and Improve Your MATLAB Code
When converting MATLAB code to run on a GPU, it is best to start with MATLAB code that already performs well. Many of the guidelines for writing code that runs well on a CPU also improve the performance of code that runs on a GPU. You can profile your CPU code using the MATLAB Profiler. The lines of code that take the most time on the CPU are likely the ones that you should improve or consider moving onto the GPU using gpuArray objects. For more information about profiling your code, see Profile Your Code to Improve Performance.

Because the MATLAB Profiler runs each line of code independently, it does not account for overlapping execution, which is common when you use a GPU. To time whole algorithms, use gputimeit as described in the Measure Code Performance on a GPU section.
Vectorize Calculations

Vector, matrix, and higher-dimensional operations typically perform much better than scalar operations on a GPU because GPUs achieve high performance by calculating many results in parallel. You can achieve better performance by rewriting loops to use higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more information, see Using Vectorization and Improve Performance Using a GPU and Vectorized Calculations. A plot in the Improve Performance Using a GPU and Vectorized Calculations example shows the increase in performance achieved by vectorizing a function executing on the CPU and on the GPU.
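For example, a loop that scales each column of a matrix can be replaced with a single vectorized operation (the data and scale factors here are illustrative):

```matlab
A = rand(1000,"gpuArray");
s = rand(1,1000,"gpuArray");   % one scale factor per column

% Loop-based version: each iteration launches separate GPU operations.
B = zeros(1000,"gpuArray");
for k = 1:1000
    B(:,k) = s(k)*A(:,k);
end

% Vectorized version: a single element-wise operation, with implicit
% expansion of the row vector s across the rows of A.
C = s.*A;
```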
Perform Calculations in Single Precision
You can improve the performance of code running on your GPU by calculating in single precision instead of double precision. CPU computations do not show the same improvement when switching from double to single precision, because most GPU cards are designed for graphic display, which demands high single-precision performance. For more information on converting data to single precision and performing arithmetic operations on single-precision data, see Floating-Point Numbers.
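For example, you can create single-precision data directly on the GPU (the array size here is illustrative):

```matlab
% Create a single-precision random array directly on the GPU.
A = rand(1000,"single","gpuArray");

% Operations on single-precision inputs are performed in single precision.
B = A.^2 + A;
underlyingType(B)   % 'single'
```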
Typical examples of workflows suitable for single-precision computation on the GPU include image processing and machine learning. However, other types of calculation, such as linear algebra problems, typically require double-precision processing. The Deep Learning Toolbox™ performs many operations in single precision by default. For more information, see Deep Learning Precision (Deep Learning Toolbox).
The exact performance improvement depends on the GPU card and total number of cores. High-end compute cards typically show a smaller improvement. For a comprehensive performance overview of NVIDIA GPU cards, including single- and double-precision processing power, see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units.
Improve Performance of Element-Wise Functions
If you have an element-wise function, you can often improve its performance by calling it with arrayfun. The arrayfun function on the GPU turns an element-wise MATLAB function into a custom CUDA kernel, which reduces the overhead of performing the operation. You can often use arrayfun with a subset of your code even if arrayfun does not support your entire code. The performance of a wide variety of element-wise functions can be improved using arrayfun, including functions performing many element-wise operations within looping or branching code, and nested functions where the nested function accesses variables declared in its parent function.
The Improve Performance of Element-Wise MATLAB Functions on the GPU Using arrayfun example shows a basic application of
arrayfun. The Using GPU arrayfun for Monte-Carlo Simulations example shows
arrayfun used to improve the performance of a function executing element-wise operations within a loop. The Stencil Operations on a GPU example shows
arrayfun used to call a nested function that accesses variables declared in a parent function.
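As a minimal sketch, several element-wise operations can be combined into one arrayfun call, so they compile into a single kernel instead of launching one kernel per operation (the function and data here are illustrative):

```matlab
A = rand(1000,"gpuArray");

% arrayfun turns this element-wise function, which clips each value to
% the range [0.25, 0.75] and then squares it, into one custom CUDA kernel.
B = arrayfun(@(x) min(max(x,0.25),0.75).^2, A);
```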
Improve Performance of Operations on Small Matrices
If you have a function that performs independent matrix operations on a large number of small matrices, you can improve its performance by calling it with
pagefun. You can use
pagefun to perform matrix operations in parallel on the GPU instead of looping over the matrices. The Improve Performance of Small Matrix Problems on the GPU Using pagefun example shows how to improve performance using
pagefun when operating on many small matrices.
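For example, you can multiply many small matrices in one call instead of looping over them (the sizes here are illustrative):

```matlab
% 1000 pairs of 3-by-3 matrices, stored along the third dimension.
A = rand(3,3,1000,"gpuArray");
B = rand(3,3,1000,"gpuArray");

% Multiply all pairs in parallel on the GPU.
C = pagefun(@mtimes, A, B);   % C is 3-by-3-by-1000
```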
Write MEX File Containing CUDA Code
While MATLAB provides an extensive library of GPU-enabled functions, you can access libraries of additional functions that do not have analogs in MATLAB. Examples include NVIDIA libraries such as the NVIDIA Performance Primitives (NPP) and cuRAND libraries. You can compile MEX files that you write in the CUDA C++ framework using the
mexcuda function. You can execute the compiled MEX files in MATLAB and call functions from NVIDIA libraries. For an example that shows how to write and run MEX functions that take
gpuArray input and return
gpuArray output, see Run MEX Functions Containing CUDA Code.
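The compile-and-call pattern looks like this sketch, where mexGPUExample.cu is a hypothetical CUDA C++ source file in the current folder:

```matlab
% Compile a CUDA C++ source file into a MEX function.
mexcuda mexGPUExample.cu

% Call the resulting MEX function with a gpuArray input; it can return
% a gpuArray output.
A = rand(1000,"gpuArray");
B = mexGPUExample(A);
```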
Configure Your Hardware for GPU Performance
Because many computations require large quantities of memory and most systems use the GPU constantly for graphics, using the same GPU for computations and graphics is usually impractical.
On Windows® systems, a GPU device can operate in one of two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC). To attain the best performance for your code, set the devices that you use for computing to use the TCC model. To see which model a GPU device is using, inspect the DriverModel property of the object returned by the gpuDevice function. For more information about switching models and which GPU devices support the TCC model, consult the NVIDIA documentation.
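For example, on a Windows system:

```matlab
% Inspect the driver model of the currently selected GPU device.
D = gpuDevice;
D.DriverModel   % "WDDM" or "TCC"
```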
To reduce the likelihood of running out of memory on the GPU, do not use one GPU from multiple instances of MATLAB. To see which GPU devices are available and selected, use the gpuDevice and gpuDeviceTable functions.