How do MATLAB workers work?

Tobias Brambier on 10 May 2022
Commented: Edric Ellis on 23 May 2022
So I have been working on optimizing/parallelizing an existing piece of code as part of a project in my studies, and while doing so I have encountered a rather strange problem, at least to me.
The 'problem' occurs when I let the code run in a normal for-loop versus a parfor-loop. According to tic/toc, the parfor-loop's runtime using a single worker is about half the runtime it takes using a standard for-loop.
My problem with this is that, to my understanding, standalone MATLAB is single-threaded. But using a single worker that also uses only one thread is still faster, way faster even, and this is exactly where my understanding of the situation leaves me confused.
And yes, I have checked: there is actually only a single worker using a single thread.
On top of that, I need an explanation for this in the final paper for this project.
I really hope to find an explanation, or even just a hint, for this (to me) strange behaviour.
  4 Comments
Tobias Brambier on 11 May 2022
Since I don't know how to quote, or if it is even possible:
Question from Edric Ellis: "What happens if you run parfor with no parallel pool?"
Answer: Running the parfor loop without a parallel pool results in a runtime close to what I get when running a standard for-loop: a bit faster, but still nowhere near the times I got using a parpool of size 1.
Recommendation from Walter Roberson: "Also try experimenting with setting numcompthread to 1."
Answer: I assume you mean the NumThreads option in the Preferences menu; if so, then yes, I tried that too. That is actually what I meant when I wrote: "And yes, I have checked, there is actually only a single worker using a single thread." I now realise I could have clarified that with a bit more detail.
Assumption from Edric Ellis: "Providing you're using parpool("local") (and not "threads")..."
Answer: Yes, since the local cluster was the default one, I used that for most of the testing. Using threads instead results in runtimes that are very similar to, but still slower than, the ones using local.
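In case it helps to reproduce this, the two pool setups I compared were created roughly like this (just a sketch, not the full test script; the pool size is illustrative):
delete(gcp('nocreate'));          % make sure no pool is left over from earlier runs
pool = parpool('local', 1);       % process-based pool with a single worker
% ... run the parfor timing here ...
delete(pool);
pool = parpool('threads');        % thread-based pool for comparison
% ... run the parfor timing here ...
delete(pool);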
Question from Edric Ellis: "Is your for loop in a script or a function?"
Answer: It is currently used in a script. Since I am unsure what exactly you mean, I just tried using the function keyword, which resulted in runtimes even slower than just running it in a script.
Question from Edric Ellis: "Are you able to narrow this down to a simple reproduction that you can share?"
Answer: I will ask my supervisors if I am allowed to simply share the piece of example code that is causing the confusion, although I don't know how helpful that would be. I can also try to reduce the code to an even simpler version to share and see if the behaviour stays the same.
Thanks so far for your recommendations! What I also forgot to mention is that the code is simply used to assemble stiffness matrices for use in a finite element method program called "DAEdalon" written by the supervising professor.
Walter Roberson on 11 May 2022
https://www.mathworks.com/help/matlab/ref/maxnumcompthreads.html
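A minimal sketch of using it to pin the client session to one computational thread (maxNumCompThreads returns the previous setting, so it can be restored afterwards):
oldThreads = maxNumCompThreads(1);   % limit the client to a single computational thread
% ... run the for/parfor timing experiment here ...
maxNumCompThreads(oldThreads);       % restore the original thread count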


Answers (1)

Tobias Brambier on 19 May 2022
Edited: Tobias Brambier on 19 May 2022
OK, so after I found out that the previously implemented waitbar created some huge overhead, here is a simplified version of the code that still causes the same confusion:
%this is only used for generating the usually externally provided input data
%this number (sz) is the one to increase to change the workload!
sz = 120;
for i = 1 : sz
    for j = 1 : sz
        x( ( i - 1 ) * sz + j, : ) = [ j - 1, i - 1 ];
    end
end
for j = 1 : sz - 1
    for i = 1 : sz
        inc = ( j - 1 ) * sz + i;
        conn( inc, : ) = [ inc, inc + sz ];
    end
end
for j = 1 : sz
    for i = 1 : sz - 1
        inc = ( j - 1 ) * sz + i;
        conn( end + 1, : ) = [ inc, inc + 1 ];
    end
end
numnp = size(x, 1);
ndm = size(x, 2);
ndf = ndm;
numelem = size(conn, 1);
k_parallel = sparse(numnp*ndf, numnp*ndf);
%this is the important part
parpool( 1 )
tic
parfor e = 1:numelem
    k_elem_gs = sparse(numnp*ndf, numnp*ndf);
    I = conn(e,1);
    J = conn(e,2);
    t = x(J,:) - x(I,:);
    L = norm(t);
    t = t/L;
    B = [-t, t]/L;
    k_elem = B'*B*L;
    for i = 1:2
        I = conn(e, i);
        for j = 1:2
            J = conn(e, j);
            k_elem_gs(ndf*(I-1)+1:I*ndf, ndf*(J-1)+1:J*ndf) = ...
                k_elem_gs(ndf*(I-1)+1:I*ndf, ndf*(J-1)+1:J*ndf) ...
                + k_elem(ndf*(i-1)+1:i*ndf, ndf*(j-1)+1:j*ndf);
        end
    end
    k_parallel = k_parallel + k_elem_gs;
end
toc
  1 Comment
Edric Ellis on 23 May 2022
I don't have a definitive answer, but here's what I think is going on. The serial MATLAB profiler reveals that the expensive pieces of the computation are the increments to k_elem_gs, and the update of k_parallel. In particular, I think the updates to k_parallel get more expensive as the number of non-zero elements increases. This is significant, because a parfor loop reorders these additions. Even when using a single worker, parfor divides up the work of the loop into "subranges", which execute separately. Each of these subranges will start the addition of k_parallel from scratch - i.e. starting from "cheap" additions. So whereas the client for loop does this:
k = k + k + k + k + k + k + k;
the single-worker parfor does something more like this:
k = {k + k + k} + {k + k} + {k + k};
(where each k on the right-hand side is different of course). There are still the same number of additions, but not all of them are expensive additions.
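A rough way to see this effect in isolation (not the original assembly code; the sizes and the split into four chunks are made up) is to accumulate many small sparse matrices once strictly left to right and once in subrange-style chunks that are only combined at the end:
n = 5e4;                          % hypothetical matrix dimension
m = 2000;                         % number of small terms to accumulate
terms = cell(1, m);
for e = 1:m
    % each term has 10 random non-zeros, mimicking one element contribution
    terms{e} = sparse(randi(n, 10, 1), randi(n, 10, 1), rand(10, 1), n, n);
end
% Client-style accumulation: every addition involves the full running total.
tic
kSerial = sparse(n, n);
for e = 1:m
    kSerial = kSerial + terms{e};
end
toc
% parfor-style accumulation: sum four "subranges" separately, then combine.
tic
chunks = reshape(1:m, [], 4);
partial = cell(1, 4);
for c = 1:4
    p = sparse(n, n);
    for e = chunks(:, c).'
        p = p + terms{e};
    end
    partial{c} = p;
end
kChunked = partial{1} + partial{2} + partial{3} + partial{4};
toc
Both versions perform the same number of additions, but in the chunked version most of them happen while the running totals still have few non-zeros, which matches the timing difference described above.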

