Speed up gain by GPU parallel processing
Hi,
I ran the following code with both GPU and CPU parallel processing. The speedup I got on the GPU was significantly lower than what I got with CPU parallel processing. Does anyone know why MATLAB's built-in functions such as "find" gain so much speed on a GPU, while the code below does not get a comparable speedup?
I appreciate any help you can provide in this direction.
N = 5000000;
Data = rand(N,3);
% V and F are the vertices and faces of a point cloud geometry,
% respectively. They are connected to each other. We cannot recreate them
% using arbitrary random numbers.
%V = an array of the size (4000000,3);
%F = an array of the size (7000000,3);
SearchWindowSize = 0.02;
Data = gpuArray(Data);
V = gpuArray(V);
F = gpuArray(F);
for i=1:N
    C = Data(i,:);
    IDsInWindow = find((abs(V(:,2)-C(2))<SearchWindowSize)&(abs(V(:,3)-C(3))<SearchWindowSize)&(V(:,1)>=C(1)));
    [a1,b1]=ismember(F(:, 1),IDsInWindow);
    [a2,b2]=ismember(F(:, 2),IDsInWindow);
    [a3,b3]=ismember(F(:, 3),IDsInWindow);
    aT=a1+a2+a3;
    f1 = find(aT>0);
    F_in = F(f1,:);
    if(isempty(F_in)==0)
        inter_mat = [];
        for j = 1:size(F_in, 1)
            F_V = V(F_in(j, :), :);
            if((C(2)<=max(F_V(:,2)))&(C(2)>=min(F_V(:,2)))&(C(3)<=max(F_V(:,3)))&(C(3)>=min(F_V(:,3))))
                u = ((F_V(2,3) - F_V(3,3))*(C(2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(C(3) - F_V(3,3))) / ((F_V(2,3) - F_V(3,3))*(F_V(1,2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(F_V(1,3) - F_V(3,3)));
                v = ((F_V(3,3) - F_V(1,3))*(C(2) - F_V(3,2)) + (F_V(1,2) - F_V(3,2))*(C(3) - F_V(3,3))) / ((F_V(2,3) - F_V(3,3))*(F_V(1,2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(F_V(1,3) - F_V(3,3)));
                w = 1 - u - v;
                in = u >= 0 && v >= 0 && w >= 0 && u <= 1 && v <= 1 && w <= 1;
                if in
                    inter_mat(j) = 1;
                end
            end
        end
        inter_Count = sum(inter_mat);
        if mod(inter_Count, 2) == 1
            IDs_All(i,1) = 1;
        else
            IDs_All(i,1) = 0;
        end
    end
end
Accepted Answer
Walter Roberson
on 22 Aug 2023
You appear to be misunderstanding how a GPU works.
When you declare a GPU variable, MATLAB does not start recording all of the following instructions without executing them, and then, upon reaching the end of the loop, analyze those instructions, decompose them into GPU primitives, and send the data and the primitives to the GPU to execute. In other words, declaring a GPU variable does not accelerate everything in MATLAB.
Instead, MATLAB keeps running on the CPU. When it encounters an expression involving a GPU variable, it more or less adds the operation to a work queue and returns something that is effectively a promise that the results will be available later. The facility that constructs the work queue is smart enough to combine operations, so if it saw X*5+1 where X is on the GPU, it might first create a work entry for multiplying X by 5 on the GPU, but then it would see the + 1 and would extend the work entry so that the whole of X*5+1 is done on the GPU.
It can continue to combine those work entries (and the corresponding promises that results will be available later) until it reaches some internal (undocumented) limit of complexity, or until the code specifically requests to wait for results or to pull back results, or until the code requests to store the result into a portion of an already-defined variable that was not created as a promise.
Once the calculation gets sufficiently complex, or the code specifically asks for the results, or the result has to be stored into a non-GPU array, the work is dispatched to the GPU to be executed. If the reason for dispatch was that the calculation reached the complexity limit, then while the GPU is executing, MATLAB can keep running on the CPU, building up more work entries. Potentially there could be several work entries each ready to be dispatched to the GPU, and MATLAB on the CPU can continue executing until the code asks for the result (gather or wait) or wants to store the result into a non-GPU output variable. (It would not surprise me if there were also some kind of timeout consideration: if the code has not built up more work for the GPU within a particular time frame, it could be time to ask the GPU to execute what is already queued.)
Execution on the GPU is not inherently confined by the boundaries of a script or function. If X and Y are GPU arrays, then a function that calculates Z=X+Y and returns Z can return the "promise" that the calculation will be available later rather than having to finish the calculation before returning.
Eventually the code asks to gather() or wait(), or to store into a non-GPU variable, or to display the output, and the results are waited for and transferred back from the GPU.
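One practical consequence of the queueing described above is that tic/toc around a GPU expression can stop the clock before the GPU has finished. The following is only a minimal sketch, assuming Parallel Computing Toolbox and a supported GPU; the matrix size and the variable names (tQueued, tFinished, tKernel) are made up for illustration, and how much of the work is actually deferred depends on the MATLAB release and the operation.

```matlab
gpu = gpuDevice();             % handle to the currently selected GPU
X = rand(4000, 'gpuArray');    % create the test matrix directly on the GPU
wait(gpu);                     % make sure the creation itself has finished

tic
Y = X * X;                     % may return before the GPU is done
tQueued = toc;                 % time until control came back to the CPU
wait(gpu);                     % block until all queued GPU work is complete
tFinished = toc;               % time until the result actually exists

% gputimeit handles the synchronization itself and repeats the call
tKernel = gputimeit(@() X * X);

fprintf('returned after %.4f s, finished after %.4f s, gputimeit %.4f s\n', ...
    tQueued, tFinished, tKernel);
```

If tQueued comes out much smaller than tFinished on your machine, that gap is the work that was still pending when control returned to the CPU.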
Notice that in this scheme only calculations on GPU objects are potentially sped up: calculations not involving GPU objects are still executed on the CPU. Notice also that the data and the instructions about what is to be done have to be sent to the GPU, and the results have to be transferred back to the CPU. It might also be necessary to wait for the GPU to finish the calculation (but if the CPU was busy working on other things, the GPU might already have finished by the time the CPU asks for the results, so there is not necessarily a wait for the GPU to finish).
Because of the time to transfer data to the GPU, and the time to instruct the GPU on what to do, it is not always faster to use the GPU. The CPU might not be as fast as the GPU, but it takes a bunch of CPU instructions to send the data and information about what is to be calculated (and to figure out a good way to calculate the requests efficiently), and sometimes it is just easier / faster for the CPU to do the work itself since it already knows what is to be done.
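To make the overhead argument concrete, here is a rough sketch that computes the same sum twice: once by pulling elements back from the GPU one at a time, and once with a single reduction and a single transfer. It assumes Parallel Computing Toolbox and a supported GPU; the array size, the count of 1000 elements, and the variable names are arbitrary choices for illustration, and the actual timings will vary by hardware.

```matlab
X = rand(1e6, 1, 'gpuArray');   % data already resident on the GPU
nTake = 1000;                   % number of elements to sum (arbitrary)

% Element-by-element: every iteration issues a tiny GPU indexing
% operation and copies one scalar back to the CPU.
tic
totalLoop = 0;
for k = 1:nTake
    totalLoop = totalLoop + gather(X(k));
end
tLoop = toc;

% Vectorized: one reduction on the GPU, one scalar transferred back.
tic
totalVec = gather(sum(X(1:nTake)));
tVec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
```

Both versions produce the same total up to rounding; the difference is how many times the CPU has to instruct the GPU and wait for a transfer.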
More Answers (1)
Matt J
on 22 Aug 2023
Edited: Matt J
on 22 Aug 2023
The operations that benefit from GPU acceleration are vectorized matrix operations and commands.
There appears to be little, if any, vectorization in your current code.
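As one possible illustration of what vectorizing could look like here, the inner loop over j in the question can be replaced by a single barycentric test applied to all candidate faces at once. This is only a sketch under stated assumptions: V, F_in and C below are tiny made-up stand-ins for the real mesh, the bounding-box pre-check from the original loop is dropped (the barycentric test alone decides containment), and degenerate faces with a zero denominator get no special handling.

```matlab
% Hypothetical toy data: a unit square in the y-z plane split into two faces
V    = [0 0 0; 0 1 0; 0 0 1; 0 1 1];   % vertices (x, y, z)
F_in = [1 2 3; 2 3 4];                 % candidate faces (vertex indices)
C    = [-1 0.25 0.25];                 % query point

% y and z coordinates of the three corners of every face, as column vectors
y1 = V(F_in(:,1),2);  z1 = V(F_in(:,1),3);
y2 = V(F_in(:,2),2);  z2 = V(F_in(:,2),3);
y3 = V(F_in(:,3),2);  z3 = V(F_in(:,3),3);

% Same barycentric formulas as in the question, evaluated element-wise
den = (z2 - z3).*(y1 - y3) + (y3 - y2).*(z1 - z3);
u = ((z2 - z3).*(C(2) - y3) + (y3 - y2).*(C(3) - z3)) ./ den;
v = ((z3 - z1).*(C(2) - y3) + (y1 - y3).*(C(3) - z3)) ./ den;
w = 1 - u - v;

% u, v, w sum to 1, so all three being non-negative implies each is <= 1
inside = (u >= 0) & (v >= 0) & (w >= 0);
inter_Count = nnz(inside);             % number of faces containing the projection
isInside = mod(inter_Count, 2) == 1;   % odd crossing count means "inside"
```

The same lines should also run unchanged if V and F_in are gpuArrays, in which case each statement is one large GPU operation instead of many scalar ones. The outer loop over i would still need similar treatment to see a real benefit.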
2 Comments
Walter Roberson
on 22 Aug 2023
Vectorize.
Unfortunately this Answers facility does not have access to a GPU, and unfortunately I would have to boot one of my systems into an old operating system and old MATLAB version to get GPU access.
The GPU is not always faster: transferring data back and forth between the CPU and the GPU slows it down a lot.
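AA = rand(1000000,1);

% 1) CPU, vectorized comparison
tic
Flag = AA > 0.1;
toc

% 2) CPU, element-by-element loop
Flag = false(size(AA,1),1);   % preallocate so the loop does not grow Flag
tic
for i = 1:size(AA,1)
    if AA(i,:) > 0.1
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
toc

% 3) GPU, vectorized comparison, including both transfers
gpu = gpuDevice();
tic
AA_G = gpuArray(AA);          % copy the data to the GPU
FlagG = AA_G > 0.1;           % one vectorized operation on the GPU
Flag = gather(FlagG);         % copy the result back to the CPU
toc

% 4) GPU, element-by-element loop: every iteration is a tiny GPU
%    operation whose scalar result must come back to the CPU for the "if"
Flag = false(size(AA_G,1),1);
tic
for i = 1:size(AA_G,1)
    if AA_G(i,:) > 0.1
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
wait(gpu);                    % make sure all GPU work has finished
toc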