GPU Memory Allocation and Minimization
Discrete and Managed Modes
GPU Coder™ provides you access to two different memory allocation (malloc) modes available in the CUDA® programming model, cudaMalloc and cudaMallocManaged. The cudaMalloc API is applicable to the traditionally separate CPU and GPU global memories. The cudaMallocManaged API is applicable to Unified Memory.
From a programmer's point of view, a traditional computer architecture requires that data be allocated in and shared between the CPU and GPU memory spaces. The need for applications to manage data transfers between these two memory spaces adds complexity. Unified memory creates a pool of managed memory that is shared between the CPU and the GPU. The managed memory is accessible to both the CPU and the GPU through a single pointer. Unified memory attempts to optimize memory performance by migrating data to the device that needs it, while hiding the migration details from the program. Though unified memory simplifies the programming model, it requires device-sync calls when data written on the GPU is accessed on the CPU. GPU Coder inserts these synchronization calls. According to NVIDIA®, unified memory can provide significant performance benefits when using CUDA 8.0, or when targeting embedded hardware such as the NVIDIA Tegra®.
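The distinction between the two modes is easiest to see in plain CUDA. The following minimal hand-written sketch (an illustration of the underlying CUDA APIs, not code produced by GPU Coder) processes the same buffer both ways: the discrete path uses cudaMalloc with explicit cudaMemcpy transfers, while the managed path uses cudaMallocManaged and a cudaDeviceSynchronize call before the CPU reads results written by the GPU. That synchronize call corresponds to the device-sync calls that GPU Coder inserts automatically in unified mode.

// Hand-written sketch contrasting discrete and managed allocation (not GPU Coder output).
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= 2.0f;
    }
}

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(float);

    // Discrete mode: separate CPU and GPU buffers, explicit transfers.
    float hostBuf[256];
    for (int i = 0; i < n; ++i) {
        hostBuf[i] = 1.0f;
    }
    float *devBuf;
    cudaMalloc((void **)&devBuf, bytes);
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    scale<<<1, n>>>(devBuf, n);
    cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes
    cudaFree(devBuf);

    // Managed (unified) mode: one pointer visible to both CPU and GPU.
    float *managedBuf;
    cudaMallocManaged((void **)&managedBuf, bytes);
    for (int i = 0; i < n; ++i) {
        managedBuf[i] = 1.0f;        // CPU writes through the shared pointer
    }
    scale<<<1, n>>>(managedBuf, n);
    cudaDeviceSynchronize();         // required before the CPU reads GPU results
    float checksum = managedBuf[0] + hostBuf[0];
    cudaFree(managedBuf);
    return checksum > 0.0f ? 0 : 1;
}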
To change the memory allocation mode in the GPU Coder app, use the Malloc Mode drop-down box under More Settings -> GPU Coder. When using the command-line interface, use the MallocMode build configuration property and set it to either 'discrete' or 'unified'.
Memory Minimization
GPU Coder analyzes the data dependency between CPU and GPU partitions and performs optimizations to minimize the number of cudaMemcpy function calls in the generated code. The analysis also determines the minimum set of locations where data must be copied between the CPU and the GPU by using cudaMemcpy.
For example, the function
foo has sections of code that process data
sequentially on the CPU and in parallel on the GPU.
function [out] = foo(input1,input2)
    …
    % CPU work
    input1 = …
    input2 = …
    tmp1 = …
    tmp2 = …
    …
    % GPU work
    kernel1(gpuInput1, gpuTmp1);
    kernel2(gpuInput2, gpuTmp1, gpuTmp2);
    kernel3(gpuTmp1, gpuTmp2, gpuOut);
    …
    % CPU work
    … = out
end
An unoptimized CUDA implementation can potentially have multiple cudaMemcpy function calls to transfer the inputs gpuInput1, gpuInput2 and the temporaries gpuTmp1, gpuTmp2 between kernel calls. Because gpuTmp1 and gpuTmp2 are not used outside the GPU, they can be stored within the GPU memory, resulting in fewer cudaMemcpy function calls. These optimizations improve the overall performance of the generated code. The optimized implementation is:
gpuInput1 = input1;
gpuInput2 = input2;
kernel1<<< >>>(gpuInput1, gpuTmp1);
kernel2<<< >>>(gpuInput2, gpuTmp1, gpuTmp2);
kernel3<<< >>>(gpuTmp1, gpuTmp2, gpuOut);
out = gpuOut;
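For comparison, the following hand-written CUDA sketch (a hypothetical illustration with placeholder kernel bodies, not actual generated code) shows the unoptimized pattern described above, in which the temporaries gpuTmp1 and gpuTmp2 make redundant round trips through CPU memory between the kernel calls:

// Hypothetical unoptimized version of foo with redundant cudaMemcpy calls.
#include <cuda_runtime.h>

__global__ void kernel1(const float *in1, float *tmp1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp1[i] = 2.0f * in1[i];
}
__global__ void kernel2(const float *in2, const float *tmp1, float *tmp2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp2[i] = in2[i] + tmp1[i];
}
__global__ void kernel3(const float *tmp1, const float *tmp2, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp1[i] - tmp2[i];
}

int main()
{
    const int n = 1024, blk = 256, grd = (n + blk - 1) / blk;
    const size_t bytes = n * sizeof(float);
    float *input1 = new float[n], *input2 = new float[n], *out = new float[n];
    float *tmp1 = new float[n], *tmp2 = new float[n];   // host copies of the temporaries
    for (int i = 0; i < n; ++i) { input1[i] = 1.0f; input2[i] = 2.0f; }

    float *gpuInput1, *gpuInput2, *gpuTmp1, *gpuTmp2, *gpuOut;
    cudaMalloc((void **)&gpuInput1, bytes);
    cudaMalloc((void **)&gpuInput2, bytes);
    cudaMalloc((void **)&gpuTmp1, bytes);
    cudaMalloc((void **)&gpuTmp2, bytes);
    cudaMalloc((void **)&gpuOut, bytes);

    cudaMemcpy(gpuInput1, input1, bytes, cudaMemcpyHostToDevice);
    kernel1<<<grd, blk>>>(gpuInput1, gpuTmp1, n);
    cudaMemcpy(tmp1, gpuTmp1, bytes, cudaMemcpyDeviceToHost);   // redundant: no CPU code reads tmp1

    cudaMemcpy(gpuInput2, input2, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(gpuTmp1, tmp1, bytes, cudaMemcpyHostToDevice);   // redundant re-upload
    kernel2<<<grd, blk>>>(gpuInput2, gpuTmp1, gpuTmp2, n);
    cudaMemcpy(tmp2, gpuTmp2, bytes, cudaMemcpyDeviceToHost);   // redundant round trip

    cudaMemcpy(gpuTmp1, tmp1, bytes, cudaMemcpyHostToDevice);   // redundant re-upload
    cudaMemcpy(gpuTmp2, tmp2, bytes, cudaMemcpyHostToDevice);   // redundant re-upload
    kernel3<<<grd, blk>>>(gpuTmp1, gpuTmp2, gpuOut, n);
    cudaMemcpy(out, gpuOut, bytes, cudaMemcpyDeviceToHost);     // only this copy is required

    cudaFree(gpuInput1); cudaFree(gpuInput2); cudaFree(gpuTmp1);
    cudaFree(gpuTmp2);   cudaFree(gpuOut);
    delete[] input1; delete[] input2; delete[] out; delete[] tmp1; delete[] tmp2;
    return 0;
}

In the optimized pattern, the device-to-host and host-to-device copies of gpuTmp1 and gpuTmp2 disappear, leaving only the transfers of the inputs and the final output.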
To eliminate redundant
cudaMemcpy calls, GPU Coder analyzes all uses and definitions of a given variable and uses status flags to
perform minimization. An example of the original code and the corresponding optimized generated code is shown below.
Original Code:

A(:) = …
…
for i = 1:N
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (somecondition)
        gC = kernel3(gA, gB);
    end
    …
end
…
… = C;

Optimized Generated Code:

A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
    if (A_isDirtyOnCpu)
        gA = A;
        A_isDirtyOnCpu = false;
    end
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (somecondition)
        gC = kernel3(gA, gB);
        C_isDirtyOnGpu = true;
    end
    …
end
…
if (C_isDirtyOnGpu)
    C = gC;
    C_isDirtyOnGpu = false;
end
… = C;
The A_isDirtyOnCpu flag tells the GPU Coder memory optimization about routines where the given variable is declared and used either on the CPU or on the GPU.
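In the generated CUDA code, these status flags ultimately guard cudaMemcpy calls. The following hand-written C++/CUDA sketch (a hypothetical illustration, not actual GPU Coder output) shows the same idea: a host-side flag records whether the freshest copy of A lives on the CPU, so the host-to-device copy runs only on the iterations where the CPU data is newer.

// Hypothetical sketch of the dirty-flag pattern guarding cudaMemcpy calls.
#include <cuda_runtime.h>

__global__ void kernel1(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}

__global__ void kernel2(const float *b, float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + 1.0f;
}

int main()
{
    const int n = 1024, blk = 256, grd = (n + blk - 1) / blk;
    const size_t bytes = n * sizeof(float);
    float A[1024];
    float *gA, *gB;
    cudaMalloc((void **)&gA, bytes);
    cudaMalloc((void **)&gB, bytes);

    for (int i = 0; i < n; ++i) A[i] = (float)i;   // CPU writes A
    bool A_isDirtyOnCpu = true;                    // the CPU copy is now the fresh one

    for (int iter = 0; iter < 10; ++iter) {
        if (A_isDirtyOnCpu) {                      // copy only when the CPU data is newer
            cudaMemcpy(gA, A, bytes, cudaMemcpyHostToDevice);
            A_isDirtyOnCpu = false;
        }
        kernel1<<<grd, blk>>>(gA, gB, n);
        kernel2<<<grd, blk>>>(gB, gA, n);
        // A is updated only on the GPU here, so no copy back inside the loop;
        // generated code would guard the copy below with a matching flag.
    }

    cudaMemcpy(A, gA, bytes, cudaMemcpyDeviceToHost);  // single device-to-host copy when the CPU reads A
    cudaFree(gA);
    cudaFree(gB);
    return A[0] >= 0.0f ? 0 : 1;
}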