Speed up gain by GPU parallel processing

Hi,
I ran the following code with both GPU and CPU parallel processing. The speedup I got on the GPU was significantly lower than what I could get with CPU parallel processing. Does anyone know why MATLAB's built-in functions such as "find" gain an amazing speedup on GPUs, while I can't get a comparable speedup when I run the code below on the GPU?
I appreciate any help you can provide in this direction.
N = 5000000;
Data = rand(N,3);
% V and F are the vertices and faces of a point cloud geometry,
% respectively. They are connected to each other. We cannot recreate them
% using arbitrary random numbers.
%V = an array of the size (4000000,3);
%F = an array of the size (7000000,3);
SearchWindowSize = 0.02;
Data = gpuArray(Data);
V = gpuArray(V);
F = gpuArray(F);
for i=1:N
    C = Data(i,:);
    IDsInWindow = find((abs(V(:,2)-C(2))<SearchWindowSize)&(abs(V(:,3)-C(3))<SearchWindowSize)&(V(:,1)>=C(1)));
    [a1,b1]=ismember(F(:, 1),IDsInWindow);
    [a2,b2]=ismember(F(:, 2),IDsInWindow);
    [a3,b3]=ismember(F(:, 3),IDsInWindow);
    aT=a1+a2+a3;
    f1 = find(aT>0);
    F_in = F(f1,:);
    if(isempty(F_in)==0)
        inter_mat = [];
        for j = 1:size(F_in, 1)
            F_V = V(F_in(j, :), :);
            if((C(2)<=max(F_V(:,2)))&(C(2)>=min(F_V(:,2)))&(C(3)<=max(F_V(:,3)))&(C(3)>=min(F_V(:,3))))
                u = ((F_V(2,3) - F_V(3,3))*(C(2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(C(3) - F_V(3,3))) / ((F_V(2,3) - F_V(3,3))*(F_V(1,2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(F_V(1,3) - F_V(3,3)));
                v = ((F_V(3,3) - F_V(1,3))*(C(2) - F_V(3,2)) + (F_V(1,2) - F_V(3,2))*(C(3) - F_V(3,3))) / ((F_V(2,3) - F_V(3,3))*(F_V(1,2) - F_V(3,2)) + (F_V(3,2) - F_V(2,2))*(F_V(1,3) - F_V(3,3)));
                w = 1 - u - v;
                in = u >= 0 && v >= 0 && w >= 0 && u <= 1 && v <= 1 && w <= 1;
                if in
                    inter_mat(j) = 1;
                end
            end
        end
        inter_Count = sum(inter_mat);
        if mod(inter_Count, 2) == 1
            IDs_All(i,1) = 1;
        else
            IDs_All(i,1) = 0;
        end
    end
end
  2 Comments
Walter Roberson on 21 Aug 2023
Have you considered using rangesearch with a KDTree ?
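For readers unfamiliar with the suggestion, here is a minimal sketch of what a KD-tree range search could look like for the window lookup step only (it does not do the ray casting). It assumes the Statistics and Machine Learning Toolbox is installed; the variables Vyz and Cyz and their values are made-up toy data, not the poster's V and Data. The Chebyshev distance is used because the original window test is a box in y and z; note that rangesearch includes points exactly at the radius, whereas the original code uses a strict "<", and the V(:,1)>=C(1) filter would still have to be applied afterwards.

```matlab
% Toy (y, z) coordinates standing in for V(:,2:3) -- hypothetical values.
Vyz = [0 0; 1 1; 0.01 0.01; 0.5 0.5];
% Toy query points standing in for Data(:,2:3) -- hypothetical values.
Cyz = [0 0; 0.5 0.49];
SearchWindowSize = 0.02;

% Build the tree once; Chebyshev distance makes the search region a square box.
Mdl = KDTreeSearcher(Vyz, 'Distance', 'chebychev');

% One cell per query point, each holding the row indices of Vyz in the window.
idx = rangesearch(Mdl, Cyz, SearchWindowSize);

for q = 1:numel(idx)
    fprintf('Query %d: %d vertices in window\n', q, numel(idx{q}));
end
```

The point of this design is that the tree is built once and reused for every query, instead of scanning all of V for each of the N query points.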
Memo Remo on 21 Aug 2023
Edited: Memo Remo on 21 Aug 2023
Dear Walter,
Thanks for the reply.
What this code does is different from what the "find" or "rangesearch" commands can do. I mentioned "find" only as an example of the degree of speedup I have in mind. My code searches among 5 million vertices and finds those that are located inside an enclosed point cloud geometry (the matrices V and F come from this geometry). To do this, it uses the ray casting algorithm, which works by counting the number of times a ray originating from the query vertex intersects the surfaces of the point cloud geometry.
The algorithm consists of a series of simple mathematical operations and some built-in functions that are all compatible with MATLAB's GPU parallel computing. I am wondering why I am not getting a considerable speedup here.


Accepted Answer

Walter Roberson on 22 Aug 2023
You appear to be misunderstanding how a GPU works.
When you declare a GPU variable, MATLAB does not start recording all of the following instructions without executing them and then, upon reaching the end of the loop, analyze them, decompose them into GPU primitives, and send the data and the primitives to the GPU for execution. In other words, declaring a GPU variable does not accelerate everything in MATLAB.
Instead, MATLAB keeps flowing on the CPU. When it encounters an expression involving a GPU variable, it more or less adds the operation to a work queue and returns something that is effectively a promise that the results will be available later. The facility that is constructing the work queue is smart enough to be able to combine operations, so if it saw X*5+1 where X is on the GPU, it might first create a work entry for multiplying X by 5 on the GPU, but then it would see the + 1 and would extend the work entry to be able to do the X*5+1 on the GPU.
It can continue to combine those work entries (and corresponding promises that results will be available later) until it reaches some internal (undocumented) limit of complexity -- or until the code specifically requests to wait for results or the code specifically requests to pull back results... Or until the code requests to store the result into a portion of an already-defined variable that was not created as a promise.
Once the calculation gets sufficiently complex, or the code specifically asks for the results, or the result has to be stored into a non-GPU array, then the work is dispatched to the GPU to be executed. If the reason for dispatch was that the calculation reached the complexity limit, then while the GPU is executing, MATLAB can keep flowing on the CPU, building up more work entries. Potentially there could be a bunch of different work entries each ready to be dispatched to the GPU, and MATLAB on the CPU can continue executing until the code asks for the result (gather or wait) or the code wants to store the result into a non-GPU output variable. (It would not surprise me if there was also some kind of timeout consideration -- that if the code has not built up more work for the GPU in a particular time frame, then it could be time to ask the GPU to execute what is already queued.)
Execution on the GPU is not inherently confined by the boundaries of a script or function. If X and Y are GPU arrays, then a function that calculates Z=X+Y and returns Z can return the "promise" that the result will be available later rather than having to finish the calculation before returning.
Eventually the code asks to gather() or wait() or to store into a non-GPU variable, or to display the output, and the results are waited for and transferred back from the GPU.
Notice in this scheme that only calculations on GPU objects are potentially sped up: calculations not involving GPU objects are still being executed on the CPU. And notice that in this scheme the data and the instructions about what is to be done have to be sent to the GPU, and the results have to be transferred back to the CPU; it might also be necessary to wait for the GPU to finish the calculation (but if the CPU was busy working on other things, the GPU might already have finished by the time the CPU asks for the results, so there is not necessarily a wait for the GPU to finish.)
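One practical consequence of the queueing described above is that tic/toc around a GPU call can stop the clock before the GPU has actually finished. The following is only a sketch to illustrate that, assuming Parallel Computing Toolbox, a supported GPU, and a release that has canUseGPU (R2019b or later); the matrix size is an arbitrary choice, and the actual numbers will depend on the hardware.

```matlab
tQueued = NaN;     % stays NaN when no GPU is available
tFinished = NaN;

if canUseGPU
    dev = gpuDevice;
    X = rand(4000, 'gpuArray');   % arbitrary size, created directly on the GPU
    wait(dev);                    % make sure X is ready before timing

    tic
    Y = X * X;                    % the call may return before the GPU is done
    tQueued = toc;
    wait(dev);                    % let that work finish before the next timing

    tic
    Z = X * X;
    wait(dev);                    % block until all queued GPU work has finished
    tFinished = toc;

    fprintf('Without wait: %.4f s, with wait: %.4f s\n', tQueued, tFinished);
else
    disp('No supported GPU found; nothing to time.');
end
```

If the first number comes out much smaller than the second, that difference is work that was still pending on the GPU when toc was called.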
Because of the time to transfer data to the GPU, and the time to instruct the GPU on what to do, it is not always faster to use the GPU. The CPU might not be as fast as the GPU, but it takes a bunch of CPU instructions to send the data and information about what is to be calculated (and to figure out a good way to calculate the requests efficiently), and sometimes it is just easier / faster for the CPU to do the work itself since it already knows what is to be done.
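To see how much of the total time goes into moving data rather than computing, one could time the pieces separately. This is a rough sketch, not a benchmark: it assumes Parallel Computing Toolbox for gpuArray, gather and gputimeit, uses canUseGPU (R2019b or later) to skip the GPU part when no device is present, and the array size is an arbitrary choice.

```matlab
N = 1e6;                 % arbitrary problem size
A = rand(N, 1);

% CPU baseline for a simple vectorized comparison.
tCPU = timeit(@() A > 0.1);
fprintf('CPU compare:               %.6f s\n', tCPU);

if canUseGPU
    A_G = gpuArray(A);   % data already resident on the GPU

    % Computation only: input is on the GPU and the result stays there.
    tCompute = gputimeit(@() A_G > 0.1);

    % Full round trip: host -> GPU, compare, GPU -> host.
    tRoundTrip = gputimeit(@() gather(gpuArray(A) > 0.1));

    fprintf('GPU compare only:          %.6f s\n', tCompute);
    fprintf('GPU incl. both transfers:  %.6f s\n', tRoundTrip);
else
    disp('No supported GPU found; skipping the GPU timings.');
end
```

For an operation this cheap, the round-trip figure is the one to compare against the CPU time, since that is what the program actually pays when the data starts and ends on the host.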
  1 Comment
Memo Remo on 22 Aug 2023
Thank you, Walter.
Your answer has all the information I was looking for. I appreciate your help!


More Answers (1)

Matt J on 22 Aug 2023
Edited: Matt J on 22 Aug 2023
The operations that benefit from GPU acceleration are vectorized matrix operations and commands, e.g.,
There appears to be little if any vectorization in your current code.
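As an illustration of what vectorizing could mean for the code in the question, here is a sketch that replaces the inner "for j" loop with array operations: the barycentric coordinates of the query point are computed for all candidate faces at once. The formulas are copied from the question; the toy values of V, F_in and C are made up, and the names Y, Z, den, hit, interCount and isInside are introduced only for this example. The bounding-box pre-check from the question is omitted because a point that passes the barycentric test is necessarily inside the triangle's bounding box. It runs on the CPU as written; element-wise operations like these are also supported for gpuArray inputs.

```matlab
% Toy stand-ins for the poster's data (hypothetical values).
V    = [0 0 0; 0 1 0; 0 0 1; 0 1 1];   % vertices (x, y, z)
F_in = [1 2 3; 2 4 3];                 % candidate faces, one row of vertex IDs each
C    = [-1 0.25 0.25];                 % query point

% y and z coordinates of the three corners of every face (m-by-3 each).
Y = [V(F_in(:,1),2), V(F_in(:,2),2), V(F_in(:,3),2)];
Z = [V(F_in(:,1),3), V(F_in(:,2),3), V(F_in(:,3),3)];

% Common denominator of the barycentric coordinates, one value per face.
den = (Z(:,2) - Z(:,3)).*(Y(:,1) - Y(:,3)) + (Y(:,3) - Y(:,2)).*(Z(:,1) - Z(:,3));

u = ((Z(:,2) - Z(:,3)).*(C(2) - Y(:,3)) + (Y(:,3) - Y(:,2)).*(C(3) - Z(:,3))) ./ den;
v = ((Z(:,3) - Z(:,1)).*(C(2) - Y(:,3)) + (Y(:,1) - Y(:,3)).*(C(3) - Z(:,3))) ./ den;
w = 1 - u - v;

% A face is hit when the projected point lies inside the projected triangle.
hit = u >= 0 & v >= 0 & w >= 0 & u <= 1 & v <= 1 & w <= 1;

interCount = nnz(hit);               % number of intersected faces
isInside   = mod(interCount, 2) == 1; % odd count means "inside"
```

With this in place, the only remaining loop is the outer one over query points, and each iteration issues a handful of large array operations instead of many tiny scalar ones.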
  2 Comments
Memo Remo on 22 Aug 2023
Edited: Memo Remo on 22 Aug 2023
Dear Matt,
I appreciate your attention to my question.
Let me simplify my question. Consider the code below, in which we go through all rows of a tall vector of length 1M to find the ones that are greater than 0.1.
tic
AA= rand(1000000,1);
for i=1:size(AA,1)
    i
    if(AA(i,:)>0.1)
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
toc
Now, if I want to speed up this process with GPU parallel computing, is it sufficient to modify the code as follows? Or do I need to change the structure of my code so that it is compatible with GPU parallel processing? Is there any way to gain a significant speedup from GPU parallel computing by applying minimal modifications to this code? (If there is, then I can use the same method to speed up my initial code.) I apologize for asking such a naive question.
tic
AA= rand(1000000,1);
AA_G = gpuArray(AA);
for i=1:size(AA_G,1)
    i
    if(AA_G(i,:)>0.1)
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
toc
Walter Roberson on 22 Aug 2023
Vectorize.
Unfortunately this Answers facility does not have access to a GPU, and unfortunately I would have to boot one of my systems into an old operating system and old MATLAB version to get GPU access.
The GPU is not always faster: transferring data back and forth with the CPU slows it down a lot.
AA= rand(1000000,1);
tic
Flag = AA > 0.1;
toc
Elapsed time is 0.012533 seconds.
clear Flag
tic
for i=1:size(AA,1)
    if(AA(i,:)>0.1)
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
toc
Elapsed time is 0.096996 seconds.
gpu = gpuDevice();
Error using gpuDevice
Unable to find a supported GPU device. For more information on GPU support, see GPU Computing Requirements.
tic
AA_G = gpuArray(AA);
FlagG = AA_G > 0.1;
Flag = gather(FlagG);
toc
clear Flag
tic
for i = 1 : size(AA_G,1)
    if(AA_G(i,:)>0.1)
        Flag(i) = true;
    else
        Flag(i) = false;
    end
end
wait(gpu);
toc
