Batched matrix multiplication with CUDA
Hi,
I saw that MATLAB R2020a adds new features to GPU Coder, especially gpucoder.stridedMatrixMultiply. However, I don't understand how the batch is defined there. If you look at the generated CUDA code shown in the example, the batch size is set to 1 (cf. NVIDIA documentation). Also, the variables A, B and C are expected to be 2D, with the dimensions of the matrices to be processed.
How do I use the function correctly? I have a 3D array in MATLAB that holds many small matrices: A(:,:,1), A(:,:,2) and so on. The same applies to B. I would like to process them all at once with a CUDA function, that is, compute A(:,:,1)*B(:,:,1), A(:,:,2)*B(:,:,2), etc. How can I achieve that with the new GPU Coder functionality, and how do I call it from MATLAB?
Peter
Answers (1)
Erik Meade
on 5 May 2020
Edited: Erik Meade
on 5 May 2020
Hi Peter,
gpucoder.stridedMatrixMultiply works exactly as you want. You can directly pass A and B to gpucoder.stridedMatrixMultiply and it will compute them in the way you want.
Here is a small example. Say you have a function called stridedMultiply:
function c = stridedMultiply(a, b)
    c = gpucoder.stridedMatrixMultiply(a, b);
end
Then we can generate code for it and verify that the answer is correct with the following code:
% 3D-vector inputs
a = rand(5,4,100);
b = rand(4,5,100);
% Generate Code
codegen -config coder.gpuConfig('mex') -args {a, b} stridedMultiply
% Verify correctness
c_mex = stridedMultiply_mex(a, b);
c = zeros(size(c_mex));
for i = 1:100
    c(:,:,i) = a(:,:,i) * b(:,:,i);
end
% Check MATLAB answer vs. stridedMatrixMultiply generated code
tolerance = 1e-8;
assert(all(abs(c(:) - c_mex(:)) < tolerance));
If we look at the generated code, we will see that the batch size has been properly set to 100:
cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N, 5,
5, 4, (double *)gpu_alpha1, (double *)&(*gpu_a)[0], 5, 20, (double *)
&(*gpu_b)[0], 4, 20, (double *)gpu_beta1, (double *)&(*gpu_c)[0], 5, 25, 100);
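To make the arguments of that call easier to read, here is a minimal CPU sketch in C of what a strided batched GEMM computes for column-major data with no transposes. It is only an illustration of the semantics, not the cuBLAS implementation and not code produced by GPU Coder; the function name dgemm_strided_batched_ref is made up for this example.

```c
#include <assert.h>

/* Hypothetical CPU reference (not cuBLAS itself): for each batch p,
 * C_p = alpha * A_p * B_p + beta * C_p, with column-major storage and
 * no transposes. A_p starts at A + p*strideA, and likewise for B and C. */
static void dgemm_strided_batched_ref(int m, int n, int k, double alpha,
                                      const double *A, int lda, long long strideA,
                                      const double *B, int ldb, long long strideB,
                                      double beta,
                                      double *C, int ldc, long long strideC,
                                      int batchCount)
{
    /* Leading dimensions must cover the rows of each matrix. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int p = 0; p < batchCount; ++p) {
        const double *Ap = A + p * strideA;   /* page p of A, m-by-k */
        const double *Bp = B + p * strideB;   /* page p of B, k-by-n */
        double *Cp = C + p * strideC;         /* page p of C, m-by-n */

        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                double acc = 0.0;
                for (int l = 0; l < k; ++l) {
                    acc += Ap[i + (long long)l * lda] * Bp[l + (long long)j * ldb];
                }
                long long idx = i + (long long)j * ldc;
                /* When beta is zero, C is treated as write-only. */
                Cp[idx] = (beta == 0.0) ? alpha * acc
                                        : alpha * acc + beta * Cp[idx];
            }
        }
    }
}
```

With the values from the generated call (m = 5, n = 5, k = 4, leading dimensions 5, 4, 5, strides 20, 20, 25 and batch count 100), each step of the outer loop corresponds to one a(:,:,i) * b(:,:,i) from the MATLAB verification loop. The snippet does not show the values behind gpu_alpha1 and gpu_beta1; alpha = 1 and beta = 0 would be the assumption for a plain product.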
Regarding the example on the doc page you cited: since both input matrices in that example are 2D, there is only one batch to compute, so the parameter is set to 1. I understand the confusion, since gpucoder.stridedMatrixMultiply is mostly intended to be used with 3D inputs. To clarify, gpucoder.stridedMatrixMultiply multiplies along the first two dimensions only. We will look into updating that example.
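As a rough illustration of where the numbers in the generated call come from, the sketch below derives the GEMM arguments from the MATLAB array sizes, assuming contiguous column-major storage. The struct and function names are hypothetical, and this is not GPU Coder's actual logic; it only reproduces the values seen above for a 5-by-4-by-100 and a 4-by-5-by-100 input.

```c
#include <assert.h>

/* Hypothetical container for the arguments of a strided batched GEMM. */
typedef struct {
    int m, n, k;                           /* C is m-by-n, A is m-by-k, B is k-by-n */
    int lda, ldb, ldc;                     /* leading dimensions (rows per column)  */
    long long strideA, strideB, strideC;   /* elements between consecutive pages    */
    int batchCount;                        /* number of pages                       */
} BatchedGemmParams;

/* a is m-by-k-by-pages and b is k-by-n-by-pages, stored contiguously in
 * column-major order. A 2D input is modelled as pages == 1. */
static BatchedGemmParams params_from_sizes(int m, int k, int n, int pages)
{
    assert(m > 0 && k > 0 && n > 0 && pages > 0);

    BatchedGemmParams p;
    p.m = m;
    p.n = n;
    p.k = k;
    p.lda = m;                         /* rows of each A page */
    p.ldb = k;                         /* rows of each B page */
    p.ldc = m;                         /* rows of each C page */
    p.strideA = (long long)m * k;      /* one full A page */
    p.strideB = (long long)k * n;      /* one full B page */
    p.strideC = (long long)m * n;      /* one full C page */
    p.batchCount = pages;
    return p;
}
```

Calling params_from_sizes(5, 4, 5, 100) gives leading dimensions 5, 4, 5, strides 20, 20, 25 and a batch count of 100, matching the call above. Calling it with pages set to 1 gives a batch count of 1, which is the situation in the 2D doc example.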
I hope that answers your question!