MATLAB Answers

Unexpectedly slow performance of eig with parfor

Asked by David Crosby on 28 May 2019
Latest activity Answered by David Crosby on 6 Jun 2019
I have a computation that requires the repeated evaluation of the eigenvalues and eigenvectors of many 800x800 matrices. Profiling shows that around 70% of the execution time is spent in the eig() command, so this is my first target for improving execution speed. The matrices are single precision, dense, and in general lack any convenient symmetry (so the eigenvalues are complex). I am trying to use parallel processing to speed up execution, and whilst I am seeing improvements, the increases in speed are rather less than I was expecting (or hoping for) - at best I'm getting a factor of 5-6x, despite having available to me an AMD Ryzen 2990WX CPU with 32 cores.
I realise that the speed-up one can expect with parallel processing depends on many factors, including the overhead of setting up the task, communication between processes, competition for shared resources such as memory bandwidth, and trade-offs against any implicit parallelism from MATLAB's own multithreaded calculations. However, I believe that for this task the overhead requirements are small compared to the calculation itself, and from what I can tell eig() is not a function that benefits hugely from multithreading (see below - the optimum appears to be a few threads rather than many, but the differences are relatively small).
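(For reference, the claim above about eig's limited gains from multithreading can be checked in isolation, without a pool, by timing it at different thread counts. A minimal sketch; timeit's second argument requests two outputs so the eigenvector computation is included in the timing:)

```matlab
% Sketch: how eig scales with computational threads in one MATLAB process.
m = single(rand(800));
for nt = [1 2 4 8 16]
    maxNumCompThreads(nt);
    t = timeit(@() eig(m), 2); % 2 outputs => eigenvectors computed too
    fprintf('%2d threads: %.3f s per eig\n', nt, t);
end
```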
My hope is that there is something wrong with what I'm doing. I've done some benchmarking with some very simple code which adequately demonstrates the disappointing scaling with cores. The results below show both the modest variation with the number of computational threads and the disappointing speed-up with respect to the number of workers. Given the simplicity of the code I'm struggling to see what else I could do or what I could be doing wrong, but I was hoping for much better. I would be very grateful for any suggestions on how this situation may be improved.
Test code:
m = single(rand(800)); % note the same m is used for all tests throughout this trial
%%
nT = 1; % number of threads: 1,2,4,8, or 16 (32 & 64 also done for the for loop)
nP = 4; % number of workers: 1,2,4,8,16 or 32
p = parpool(nP)
tic; parfor il=1:32; maxNumCompThreads(nT); [a,b] = eig(m); end; t = toc, t/32
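(One way to check whether communication overhead is actually significant in this test is PCT's ticBytes/tocBytes, which report how much data a parfor loop transfers to and from each worker. A sketch only; it assumes a pool is already open:)

```matlab
% Sketch: measure pool data traffic for the parfor test above.
m = single(rand(800));
ticBytes(gcp);          % start counting bytes transferred to/from the pool
parfor il = 1:32
    [a, b] = eig(m);
end
tocBytes(gcp)           % prints bytes sent to / received from each worker
```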
Test Results: [figure not shown]

Release: R2019a

2 Answers

Answer by Edric Ellis on 5 Jun 2019 (Accepted Answer)

As you rightly point out, there are many factors that could be at play here. Here are the factors that spring to mind when looking at the parallel performance here.
  1. There is a serial portion to the work - in this case, one potentially significant portion is serializing and sending m to each worker. This serial portion obviously places an upper bound on the speedup you can achieve, in line with Amdahl's Law.
  2. There are other overheads related to the Parallel Computing Toolbox infrastructure such as dividing the work up into portions, sending and receiving messages, etc.
  3. The underlying hardware of your machine can be constrained by factors other than simply the number of CPU cores available. For instance, the specs for your CPU show that there are 4 channels to main memory. (I'm not an expert in this area, and I must admit I had expected eig performance to be bound more by CPU than by memory, but maybe I'm wrong.)
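(Regarding point 1, a common way to bound the cost of shipping m is parallel.pool.Constant, which serializes the data to each worker once and reuses the worker-local copy. A minimal sketch of the original test rewritten that way; note this mainly pays off when the same matrix is reused across several parfor loops:)

```matlab
% Sketch: broadcast m once per worker via parallel.pool.Constant.
m = single(rand(800));
C = parallel.pool.Constant(m);  % m is sent to each worker a single time
tic;
parfor il = 1:32
    [a, b] = eig(C.Value);      % C.Value is the worker-local copy of m
end
toc
```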
To explore (3) a bit further, I wrote some simple code that runs the kernel of that computation multiple times simultaneously on PCT workers, varying both the number of workers executing simultaneously, and the number of threads that they employ. Here's the code:
%% Perform the computations
p = gcp();
nwMax = p.NumWorkers;
ntMax = 4; % max threads/worker
out = nan(ntMax, nwMax);
for nw = 1:nwMax
    spmd (nw)
        times = iTimes(ntMax);
    end
    out(:, nw) = times{1} % no semicolon: displays progress for each pool size
    clear times
end
%% Plot the results
eigsPerSecond = (1:nwMax) ./ out;
plot(eigsPerSecond' / eigsPerSecond(1));
legend((1:ntMax) + " threads / worker", "Location", "southeast");
xlabel("Num workers")
ylabel("Relative EIGs / second")
%% Timing kernel - tries to avoid timing anything other than the call to "eig"
function times = iTimes(ntMax)
times = nan(ntMax, 1);
m = single(rand(800));
[a, b] = eig(m); % warm-up call, excluded from the timings
for nt = 1:ntMax
    maxNumCompThreads(nt);
    labBarrier;
    t = tic();
    for ni = 1:10
        [a, b] = eig(m);
    end
    labBarrier;
    t = toc(t) / 10;
    times(nt) = t;
end
end
On my CPU (running R2019a on GLNXA64) which has 6 cores (12 "threads" when employing hyperthreading), I see the following results from that experiment:
[Figure: rel-eigs-per-second.png - relative EIGs per second vs. number of workers, one line per thread count]
The results in that plot are scaled to show the "relative EIGs / second" - i.e. relative speed-up compared to a single worker running in single-threaded mode.
This shows that my machine is able to keep the 6 cores busy when running in single-threaded-worker mode, and that running more threads does not help at all in this case. I also note that when using something like your original reproduction steps
m = single(rand(800));
tic; for il=1:32; maxNumCompThreads(1); [a,b] = eig(m); end; t = toc; tser = t/32;
tic; parfor il=1:32; [a,b] = eig(m); end; t = toc; tpar = t/32;
speedup = tser / tpar
I see a speedup of around 5 when using 6 single-threaded local workers.



Answer by David Crosby on 6 Jun 2019

Thank you for the detailed response. I've run your code on the AMD Ryzen machine and I've also enlisted a couple of colleagues and run the code on a 14-core Intel i9-9940X based machine and a twin Xeon Gold 6130 machine (32 cores total). Here are the results:
AMD Ryzen 2990WX (32 cores)
[Figure: Ryzen performance scaling 2.png]
i9-9940X (14 cores)
[Figure: i9 eig test scaling.png]
Twin Xeon Gold 6130 (2*16 cores)
[Figure: 19-06-05_Beast6_eigTest(new).png]
My observation that extra threads don't help, and can in fact lead to slower performance, seems to be borne out, but the overall speed-up is much greater with your code than with mine. The maximum speed-up compared to a single thread and worker is ~20 for the Ryzen PC, ~11 for the i9 PC and ~18 for the twin Xeon PC. I'm getting somewhat out of my depth here, but perhaps this is due to the overhead issues that are reduced by your approach. I think I'll have to take a closer look at spmd rather than parfor, although surely that can't be the whole reason for the difference here. It's good to confirm that the sort of speed-ups I was hoping for from parallelization are available - I was feeling rather disappointed after my attempt.
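(For comparison with the original parfor test, a minimal spmd version might look like the sketch below. It assumes the 32 iterations divide evenly across the pool; each worker runs its share of the eig calls single-threaded, as in the experiments above.)

```matlab
% Sketch: the original 32-iteration parfor test rewritten with spmd.
m = single(rand(800));
p = gcp();
nPer = 32 / p.NumWorkers;   % iterations per worker; assumes no remainder
tic;
spmd
    maxNumCompThreads(1);   % single-threaded workers
    for il = 1:nPer
        [a, b] = eig(m);
    end
end
t = toc
```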
