Why does performance of functions saturate with number of cores using parfeval but not with parfor?

I am developing an application that MUST take advantage of parallelization and should ideally offer real-time updates after each iteration, which makes parfeval preferable. I believe the algorithm I have developed is highly parallelizable (see attached for the performance of 'WT_Ex_2_b' as a function of the number of cores used with parfeval). From 1 to 8 cores the speedup factor agrees with the theoretical expectation (Amdahl's Law with p=0.95); however, the performance of my application saturates at 8 cores. This led me to create a dummy function (see attached script) to compare the performance of parfor and parfeval as a function of the number of cores. I found that the parfor version behaves much as theory predicts (Amdahl's Law, also with p=0.95), whereas the parfeval version still saturates, even for the dummy function: the speedup factor improves with core count up to 12 cores, and then no further improvement is observed. I have attached the script in case you want to reproduce this behavior on your end.
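For reference, the theoretical curve mentioned above can be tabulated with a short sketch. It assumes the standard form of Amdahl's Law, speedup = 1/((1-p) + p/n); the worker counts in `n` are illustrative choices, not the ones from the attached script.

```matlab
% Amdahl's Law: ideal speedup on n workers when a fraction p of the work is parallel.
amdahl = @(p, n) 1 ./ ((1 - p) + p ./ n);

p = 0.95;                 % parallel fraction quoted in the question
n = [1 2 4 8 12 24];      % illustrative worker counts (assumption)
speedup = amdahl(p, n);

for k = 1:numel(n)
    fprintf('%2d workers -> ideal speedup %.2f\n', n(k), speedup(k));
end
```

With p=0.95 the ideal speedup can never exceed 1/(1-p) = 20, so the theoretical curve flattens gradually rather than stopping abruptly at a particular core count.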
Is there a fundamental limitation on the number of cores that parfeval can leverage? Or is there an obvious mistake in the way I am using parfeval? Why does the performance of the dummy algorithm suddenly saturate at 12 cores? Do you have any recommendations on how to use parfeval so that it performs as well as parfor?
I would like to emphasize that I have already developed my application to use parfeval, so converting to parfor would be time-consuming and prevent me from utilizing the update-after-iteration feature of parfeval.
Thank you for your help on this critical matter.
  4 Comments
Rik on 30 Jun 2020
I'm not sure what people could do with it, but I think I would redact that license number. I'm on mobile now, so it's a pain to edit it away for you.


Accepted Answer

Edric Ellis on 1 Jul 2020
The main difference between parfor and parfeval is that with parfeval you are responsible for scheduling the work on the workers. parfor has the advantage that it knows how many loop iterations there are, so it schedules a fixed number of chunks of work per worker (see the documentation for parforOptions, where the chunks are referred to as "sub-ranges"). In your case, parfeval incurs more overhead because each parfeval request is sent to a worker on its own, whereas parfor groups iterations together. Grouping is generally more efficient when the duration of each request is similar to the overhead of making a single remote request.
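As a rough illustration of that grouping, the sketch below bundles many short tasks into a few parfeval requests. It is not taken from the attached script: `shortTask`, `numTasks` and `chunkSize` are hypothetical stand-ins, and a suitable chunk size would have to be found by measurement.

```matlab
% Sketch: group many short tasks into a few parfeval requests.
% shortTask, numTasks and chunkSize are hypothetical stand-ins.
shortTask = @(k) k.^2;          % placeholder for one cheap iteration
numTasks  = 1000;
chunkSize = 50;                 % tasks per request; tune against the request overhead

pool   = gcp();                 % current parallel pool
starts = 1:chunkSize:numTasks;  % first task index of each chunk

% Preallocate the array of futures, then submit one request per chunk.
futures(1:numel(starts)) = parallel.FevalFuture;
for c = 1:numel(starts)
    idx = starts(c):min(starts(c) + chunkSize - 1, numTasks);
    futures(c) = parfeval(pool, @(ii) arrayfun(shortTask, ii), 1, idx);
end

% Collect chunks in completion order, updating the client after each one.
results = zeros(1, numTasks);
for c = 1:numel(starts)
    [done, chunkOut] = fetchNext(futures);
    idx = starts(done):min(starts(done) + chunkSize - 1, numTasks);
    results(idx) = chunkOut;
    fprintf('Chunk %d of %d complete\n', c, numel(starts));
end
```

The trade-off is granularity: larger chunks amortize the per-request overhead, but client updates then arrive once per chunk instead of once per task.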
So parfeval doesn't have a fundamental limitation in this regard, but if your requests are too short, you might need to amalgamate them to match parfor performance. Another option might be to use parfor together with a DataQueue (parallel.pool.DataQueue), which would let you perform updates at the client after each parfor iteration completes.
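A minimal sketch of the parfor plus DataQueue approach might look like the following. The loop body and the iteration count `N` are placeholders, not the original application.

```matlab
% Sketch: per-iteration progress updates from a parfor loop via a DataQueue.
N = 20;                                   % illustrative iteration count (assumption)
q = parallel.pool.DataQueue;

% The callback runs on the client each time a worker sends a message.
afterEach(q, @(k) fprintf('Iteration %d of %d complete\n', k, N));

out = zeros(1, N);
parfor k = 1:N
    out(k) = sum(rand(1, 1e5));           % placeholder for the real work
    send(q, k);                           % notify the client that iteration k is done
end
```

Messages arrive in completion order, which is generally not the loop index order, so the callback should not assume that iterations finish sequentially.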
  8 Comments
Joseph Smalley on 13 Jul 2020
Edric, after taking a break and returning to this problem, I find that the code you last recommended above still behaves in the non-physical way described in my previous comment. The disp() line within my parfor loop (code below) checks a property of both the W_temp and W objects. All values of "Src Intensity" should be equal to ~1. However, once the number of iterations exceeds a multiple of the number of workers in my pool (6), the property shows "step-like" behavior: a worker sees an updated object rather than the original object (as it was before the parfor loop began). The output below shows Src Intensity jumping to 2 at the 7th completed iteration and to 3 at the 13th.
N=7
Iteration #3 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #2 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #7 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 7 complete. Src Intensity(temp)=2.0002, Src Intensity(main)=2.0002
----
N=13
Iteration #2 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #3 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #8 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #11 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #10 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #9 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #7 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #13 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #12 of 13 complete. Src Intensity(temp)=3.0001, Src Intensity(main)=3.0001
It therefore appears that W is updated WITHIN the parfor loop after each completed cycle of 6 workers. However, upon completion of the parfor loop, only the W_temp object is updated. Hence I need to "manually" update the properties of W with a serial for loop, which is OK. The problem is that I do not want W_temp or W to be updated within the parfor loop once the iteration count passes a multiple of the size of the parallel pool. I want all workers to see the original W object for every iteration. Is this possible? Thank you for your continued assistance.
% pre-allocation
maxSegPerRay = W.maxSegments*W.maxBranches;
rayListAll_origin = zeros(3,N,maxSegPerRay);
rayList_length = zeros(N,1);
% W is the handle object whose properties include other handle classes that we want to update
W_temp(N,1) = W;
for i=1:N
    W_temp(i) = W;
end
parfor i=1:N
    % Convert broadcast variable into temporary variable
    W_temp2 = W;
    % Call main function
    [~,rayList] = IterTrace_oneParent_par(W_temp2,inputRays(i));
    % Convert updated rayList properties to numeric array (not a problem)
    rayList_length(i) = length(rayList);
    zeroList_length = maxSegPerRay - rayList_length(i);
    rayListAll_origin(:,i,:) = [rayList.origin, zeros(3,zeroList_length)];
    % Update W_temp object and display Src Intensity of W_temp and W (should always be ~1)
    W_temp(i) = W_temp2;
    disp(['Iteration #', num2str(i), ' of ', num2str(N), ' complete. Src Intensity(temp)=', num2str(sum([W_temp(i).objects{5}.rays.intensity])),...
        ', Src Intensity(main)=', num2str(sum([W.objects{5}.rays.intensity]))]);
end
% Note: W_temp and W are both updated WITHIN the parfor loop after subRange is complete, but only W_temp is updated on completion of the parfor loop
% "Manually" update properties of detector objects contained in W
for i=1:N
    for j=1:W.numObj
        if class(W.objects{j})=="Detector"
            if ~isempty(W_temp(i).objects{j}.rays)
                W.objects{j}.rays = W_temp(i).objects{j}.rays;
            end
        end
    end
end
Joseph Smalley on 11 Aug 2020
Just wanted to say that I accepted this answer because, overall, the problem is addressed more easily by switching to a parfor loop, as Edric first proposed. Additionally, I switched all my classes from handle classes to value classes. That switch is a compromise for my application, and was first motivated by the requirements of codegen for MEX files. Nonetheless, the combination of parfor with value classes has been working for several weeks now, with pretty good scalability of 12x at 24 cores.
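The difference between the two kinds of class can be seen in a small sketch. Assigning a handle object to a new variable copies only the reference, so a line such as `W_temp2 = W` would not produce an independent object, and a broadcast handle that is modified on a worker can stay modified for later iterations on that same worker. That would be consistent with the step-like output reported earlier, though the thread shown here does not confirm the diagnosis. The sketch uses the built-in handle class `containers.Map` and a struct purely as stand-ins for the user-defined classes; the `intensity` field is hypothetical.

```matlab
% containers.Map is a built-in handle class, used here only as a stand-in
% for a user-defined handle class.
h1 = containers.Map();
h1('intensity') = 1;
h2 = h1;                                  % copies the handle, not the underlying object
h2('intensity') = h2('intensity') + 1;
disp(h1('intensity'))                     % the change made through h2 is visible through h1

% A struct is a value type: assignment makes an independent copy.
s1.intensity = 1;
s2 = s1;
s2.intensity = s2.intensity + 1;
disp(s1.intensity)                        % s1 is unchanged
```

For handle classes that derive from matlab.mixin.Copyable, calling `copy` on the object is another way to obtain an independent object, although by default that copy is shallow with respect to handle-valued properties.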

Release

R2020a
