Kernel Creation from MATLAB Code

MATLAB code structures and patterns that create CUDA® GPU kernels

GPU Coder™ generates and executes optimized CUDA kernels for specific algorithm structures and patterns in your MATLAB® code. The generated code calls optimized NVIDIA® CUDA libraries, including cuFFT, cuSOLVER, cuBLAS, cuDNN, and TensorRT. You can integrate the generated code into your project as source code, static libraries, or dynamic libraries, and compile the code for desktops, servers, and GPUs embedded on NVIDIA Jetson, DRIVE, and other platforms. You can also use GPU Coder to incorporate handwritten CUDA code into your algorithms and into the generated code.

Apps

GPU Coder

GPU Coder	Generate CUDA code from MATLAB code
GPU Environment Check	Verify and set up GPU code generation environment

Functions

Code Generation

`coder.gpuConfig`	Create GPU code generation configuration
`codegen`	Generate C or C++ code from MATLAB code
`gpucoder`	Open GPU Coder app
`coder.checkGpuInstall`	Verify GPU code generation environment
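
As a rough illustration of how these functions fit together, the following sketch creates a GPU configuration object and generates a CUDA MEX function. The entry-point function `myAdd` and its input sizes are hypothetical and assumed to exist on the MATLAB path; they are not part of this page.

```matlab
% Assumption: a hypothetical entry-point function is saved as myAdd.m:
%
%   function c = myAdd(a, b) %#codegen
%       c = a + b;
%   end

% Create a configuration object for CUDA MEX generation.
cfg = coder.gpuConfig('mex');

% Generate code for two 1-by-1024 single-precision inputs.
codegen -config cfg -args {ones(1,1024,'single'), ones(1,1024,'single')} myAdd
```

If code generation succeeds, `codegen` produces a MEX function named `myAdd_mex` that you can call with inputs of the same size and type.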

GPU Kernel Pragmas

`coder.gpu.kernel`	Pragma that maps `for`-loops to GPU kernels
`coder.gpu.kernelfun`	Pragma that maps a function to GPU kernels
`coder.gpu.nokernel`	Pragma to disable kernel creation for loops
`coder.ceval`	Call C/C++ function from generated code
`coder.gpu.iterations`	Pragma that provides information to the code generator for making parallelization decisions on variable bound loops
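
A minimal sketch of a kernel pragma in use, assuming a hypothetical entry-point function named `scaleAndSum`. Placing `coder.gpu.kernelfun` in the function body asks the code generator to look for kernel opportunities in that function; whether a given loop becomes a kernel still depends on the code generator's analysis.

```matlab
function out = scaleAndSum(a, b) %#codegen
    % Hypothetical example function, not taken from this page.
    % Request kernel creation for the computations in this function.
    coder.gpu.kernelfun;

    % Element-wise loop with independent iterations, which is a
    % natural candidate for a GPU kernel.
    out = zeros(size(a), 'like', a);
    for i = 1:numel(a)
        out(i) = 2*a(i) + b(i);
    end
end
```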

GPU Memory Pragmas

`coder.gpu.constantMemory`	Pragma that maps a variable to the constant memory on GPU
`coder.gpu.persistentMemory`	Pragma to allocate a variable as persistent memory on the GPU
`cudaMemoryManager`	Query memory usage by shared GPU memory manager for MEX functions (Since R2024a)

GPU Atomic Operations

`gpucoder.atomicAdd`	Atomically add value and variable in global or shared memory (Since R2021b)
`gpucoder.atomicAnd`	Atomically perform bit-wise AND between value and variable in global or shared memory (Since R2021b)
`gpucoder.atomicCAS`	Atomically compare and swap value of variable in global or shared memory (Since R2021b)
`gpucoder.atomicDec`	Atomically decrement variable in global or shared memory within upper bound (Since R2021b)
`gpucoder.atomicExch`	Atomically exchange variable in global or shared memory with value (Since R2021b)
`gpucoder.atomicInc`	Atomically increment variable in global or shared memory within upper bound (Since R2021b)
`gpucoder.atomicMax`	Atomically find the maximum between value and variable in global or shared memory (Since R2021b)
`gpucoder.atomicMin`	Atomically find the minimum between value and variable in global or shared memory (Since R2021b)
`gpucoder.atomicOr`	Atomically perform bit-wise OR between value and variable in global or shared memory (Since R2021b)
`gpucoder.atomicSub`	Atomically subtract value from variable in global or shared memory (Since R2021b)
`gpucoder.atomicXor`	Atomically perform bit-wise XOR between value and variable in global or shared memory (Since R2021b)
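
Atomic operations matter when several loop iterations can update the same memory location. The sketch below counts occurrences of bin indices with `gpucoder.atomicAdd`. The function name `binCounts`, the fixed bin count of 256, and the assumption that every element of `idx` lies in the range 1 to 256 are all illustrative choices.

```matlab
function counts = binCounts(idx) %#codegen
    % Hypothetical example function, not taken from this page.
    % Assumption: idx contains integer-valued bin indices in 1..256.
    coder.gpu.kernelfun;

    counts = zeros(1, 256, 'uint32');
    for i = 1:numel(idx)
        k = idx(i);
        % Different iterations may hit the same bin, so the update
        % is expressed as an atomic add instead of counts(k) + 1.
        counts(k) = gpucoder.atomicAdd(counts(k), uint32(1));
    end
end
```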

Programming for Code Generation

`half`	Construct half-precision numeric object
`stencilfun`	Generate CUDA code for stencil functions (Since R2022b)
`selectdata`	Select slices of arrays and generate CUDA code (Since R2025a)
`gpucoder.matrixMatrixKernel`	Optimized GPU implementation of functions containing matrix-matrix operations
`gpucoder.batchedMatrixMultiply`	Optimized GPU implementation of batched matrix multiply operation
`gpucoder.stridedMatrixMultiply`	Optimized GPU implementation of strided and batched matrix multiply operation
`gpucoder.batchedMatrixMultiplyAdd`	Optimized GPU implementation of batched matrix multiply with add operation
`gpucoder.stridedMatrixMultiplyAdd`	Optimized GPU implementation of strided, batched matrix multiply with add operation
`gpucoder.sort`	Optimized GPU implementation of the MATLAB sort function
`gpucoder.ctranspose`	Optimized GPU implementation of the MATLAB complex conjugate transpose (`ctranspose`) function
`gpucoder.transpose`	Optimized GPU implementation of the MATLAB transpose function
`gpucoder.reduce`	Optimized GPU implementation for reduction operations
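
For example, a sum reduction might be written with `gpucoder.reduce` as in the sketch below. The function names `sumAll` and `addPair` are hypothetical; the second argument is a handle to a binary function that combines two elements.

```matlab
function s = sumAll(A) %#codegen
    % Hypothetical example function, not taken from this page.
    % Reduce all elements of A using the binary function addPair.
    s = gpucoder.reduce(A, @addPair);
end

function c = addPair(a, b)
    % Binary operation applied repeatedly during the reduction.
    c = a + b;
end
```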

Code Configuration Settings

GPU Code

Generate GPU Code	Control GPU code generation
GPU device ID	CUDA device selection
Minimum compute capability	Minimum compute capability for code generation
Custom compute capability	Virtual GPU architecture
Malloc mode	GPU memory allocation
Stack limit	Stack limit per GPU thread
Maximum blocks per kernel	Maximum number of blocks created during a kernel launch
Safe build	Error checking in the generated code
Kernel name prefix	Custom kernel name prefixes
Compiler flags	Pass additional flags to GPU compiler
Enable cuBLAS	Replace math function calls with `cuBLAS` library calls
Enable cuSOLVER	Replace math function calls with `cuSOLVER` library calls
Enable cuFFT	Replace `fft` function calls with `cuFFT` library calls
Enable GPU memory manager	Use GPU memory manager
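
These settings can also be set programmatically through the `GpuConfig` property of a configuration object. The sketch below uses property names as they are commonly documented for `coder.GpuCodeConfig`; treat them as assumptions and confirm them against the documentation for your release. The prefix value `'demo_'` is a hypothetical choice.

```matlab
% Create a configuration object for CUDA MEX generation.
cfg = coder.gpuConfig('mex');

% Library replacement settings (assumed property names).
cfg.GpuConfig.EnableCUBLAS   = true;   % Enable cuBLAS
cfg.GpuConfig.EnableCUSOLVER = true;   % Enable cuSOLVER
cfg.GpuConfig.EnableCUFFT    = true;   % Enable cuFFT

% Other GPU code settings (assumed property names).
cfg.GpuConfig.EnableMemoryManager = true;   % Enable GPU memory manager
cfg.GpuConfig.SafeBuild           = true;   % Safe build
cfg.GpuConfig.KernelNamePrefix    = 'demo_'; % Kernel name prefix (hypothetical value)
```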

Objects

Code configuration

`coder.GpuCodeConfig`	Configuration parameters for CUDA code generation from MATLAB code
`coder.MexCodeConfig`	Configuration parameters for MEX function generation from MATLAB code
`coder.CodeConfig`	Configuration parameters for C/C++ code generation from MATLAB code
`coder.EmbeddedCodeConfig`	Configuration parameters for C/C++ code generation from MATLAB code with Embedded Coder
`coder.gpuEnvConfig`	Configuration object for checking the GPU code generation environment
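
As a sketch of how the environment configuration object is typically used, the following creates a `coder.gpuEnvConfig` object for the host machine and passes it to `coder.checkGpuInstall`. The choice of checks is illustrative; the available properties can vary by release.

```matlab
% Create an environment configuration object for the development host.
envCfg = coder.gpuEnvConfig('host');

% Select which checks to run (illustrative choices).
envCfg.BasicCodegen  = 1;   % check that basic code generation works
envCfg.BasicCodeexec = 1;   % check that generated code can execute
envCfg.Quiet         = 1;   % suppress command-line output

% Run the checks and inspect the returned results.
results = coder.checkGpuInstall(envCfg);
disp(results);
```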

Topics

Configure GPU Code Generation
Configure the code generator using configuration objects or the GPU Coder app.
Kernels from Element-Wise Loops
Create kernels from MATLAB functions containing scalarized, element-wise math operations.
Generate GPU Kernels for Reduction Operations
Create kernels from MATLAB functions containing reduction operations.
Kernels from Library Calls
Target GPU optimized math libraries such as cuBLAS, cuSOLVER, and cuFFT.
- Generate GPU Code That Uses the NVIDIA cuBLAS Library
- cuSOLVER Example
- FFT Example
Support for GPU Arrays
Generate CUDA code that uses GPU arrays.
Use Dynamically Allocated C++ Arrays in Generated Function Interfaces
Understand and use dynamically allocated arrays from the generated CUDA C++ function interfaces.
Call Custom CUDA Kernels from the Generated Code
Integrate custom CUDA kernels with MATLAB code intended for code generation.
Call Custom CUDA Device Functions from Generated Code
Integrate custom GPU device functions with MATLAB code intended for code generation.
Design Patterns
Create kernels for MATLAB functions containing computational design patterns.
Reduce GPU Memory Allocations By Using GPU Memory Manager
Avoid repetitive memory allocations by creating and reusing memory pools for generated CUDA applications.
What Is Half Precision?
Introduction to the half-precision data type in MATLAB and Simulink®.
Half Precision Code Generation Support
C/C++ and GPU code generation support for functions that support half-precision inputs.

Featured Examples

Build a Map from Lidar Data Using SLAM on GPU

Perform 3-D simultaneous localization and mapping (SLAM) using generated code on an NVIDIA GPU.

Simulate Diffraction Patterns Using CUDA FFT Libraries

Use GPU Coder™ to leverage the CUDA® Fast Fourier Transform library (cuFFT) to compute a two-dimensional FFT on an NVIDIA® GPU. The two-dimensional Fourier transform is used in optics to calculate far-field diffraction patterns. When a monochromatic light source passes through a small aperture, such as in Young's double-slit experiment, you can observe these diffraction patterns. This example also shows how to pass GPU inputs to an entry-point function when generating CUDA MEX, source code, static libraries, dynamic libraries, and executables. Passing GPU inputs improves the performance of the generated code by minimizing the number of cudaMemcpy calls.

QR Decomposition on NVIDIA GPU Using cuSOLVER Libraries

Create a standalone CUDA® executable that leverages the NVIDIA® cuSOLVER library. The example uses a curve fitting application that mimics automatic lane tracking on a road.

Stencil Processing on GPU

Generate CUDA® kernels for stencil-type operations by implementing John H. Conway's "Game of Life".

Benchmark Solving a Linear System by Using GPU Coder

Benchmark solving a linear system by generating CUDA® code. Use matrix left division, also known as mldivide or the backslash operator (\), to solve the system of linear equations A*x = b for x (that is, compute x = A\b).

Generate GPU Code for Fog Rectification Algorithm

Generate a CUDA® MEX function for a fog rectification algorithm. The entry-point function takes a foggy image as input and produces a defogged image.

Generate GPU Code That Computes Stereo Disparity

Generate a CUDA® MEX function that computes the stereo disparity of two images. You generate code from MATLAB® functions that compute the stereo disparity. The generated CUDA MEX functions compute the stereo disparity using integer or half-precision values.

Accelerating Radar Signal Processing Using GPU

Compare the performance of a conventional radar signal processing chain implemented in interpreted MATLAB with the same chain running on a graphics processing unit (GPU).

(Radar Toolbox)

Since R2024a