GPUCoder does not generate parallelized code
I am working on optimizing a program that currently runs on the CPU. To do this, I am using GPU Coder to generate GPU code from my CPU code. However, the generated code does not give a significant speedup.
To narrow this down, I built the simplest function I could and compiled it with GPU Coder:
function [out] = simple_function(vector)
out = sqrt(sqrt(sqrt(vector)));
end
I then call this function with very large input vectors, so there is a real amount of computation to do. However, when I profile the function with gpucoder.profile(...) and open the result in the NVIDIA Visual Profiler, it reports that the code is not well optimized. In particular, it shows that parallelized computation accounts for 0% of the run time.
This is a very easy function to parallelize. Is there any way to make GPU Coder parallelize more?
Joss Knight on 29 Apr 2022
This looks about right to me, because your kernel is too simple and you're transferring data to and from the CPU on every call. Try recompiling with gpuArray input and output (if you have Parallel Computing Toolbox) to remove the data transfer, or else write some code that requires the GPU to launch multiple kernels. Do some reductions perhaps?
sz = size(x);
for i = 1:100
    % Reduce to a scalar on each pass, then use it to build the next input
    y = sum(sqrt(sqrt(sqrt(abs(x)))),"all");
    x = y*randn(sz,"like",x);
end
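For the first suggestion (gpuArray input and output), a minimal sketch of the build step might look like the following. It assumes simple_function.m from the question is on the path and that Parallel Computing Toolbox is installed. The vector length of 1e7 and the single precision type are arbitrary choices for illustration, not values from the question.

```matlab
% Sketch only: build a MEX function whose input is declared as a gpuArray,
% so the vector is not copied from host memory on every call.
cfg = coder.gpuConfig('mex');

% Assumed input: a 1-by-1e7 single row vector (illustrative size and type).
% 'Gpu',true marks the argument as a GPU input type for code generation.
inType = coder.typeof(zeros(1, 1e7, 'single'), 'Gpu', true);

codegen -config cfg -args {inType} simple_function

% Create the data directly on the GPU and call the generated MEX.
x = rand(1, 1e7, 'single', 'gpuArray');
y = simple_function_mex(x);

% Check where the result lives; if it is a gpuArray, no copy back to the host was made.
disp(class(y))
```

Profiling this build instead of the original one should show whether the host-to-device copies were what dominated the run time.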