Parallelizing computation with memory restrictions

Asked by Henry Shackleton

on 19 Jun 2019
Latest activity Answered by Edric Ellis

on 20 Jun 2019
There's a program that I would like to run in parallel, as I have about a dozen cores available to me. However, I only have 128GB of RAM, which puts some constraints on how I want to parallelize the code.
A is a cell array of 50 matrices. Each matrix (and every other matrix involved) takes up about 1GB of memory, which is where the memory constraint comes in. Schematically, I want to execute the code
for i = 1:1000
    B = longCalculation(i); % This is the step that takes a lot of time
    for j = 1:50
        shorterCalculation(A{j}, B);
    end
end
Since longCalculation takes the longest to run, I would like to parallelize that, i.e., convert the outer for loop into a parfor loop. However, each worker needs access to all of A, and I can't just make a copy for each worker due to memory constraints. Parallelizing the inner for loop, and only giving each worker access to a small part of A, won't speed up the code that much. Any suggestions on modifying this code so that it can be run in parallel? Thanks!
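The memory constraint can be made concrete with a rough estimate. The sketch below assumes 12 workers (standing in for "about a dozen cores") and exactly 1 GB per matrix; both figures are approximations of the description above, and the variable names are illustrative only.

```matlab
nA = 50;           % number of matrices in A
gbPerMatrix = 1;   % approximate size of each matrix, in GB
nWorkers = 12;     % assumption: one worker per core, "about a dozen" cores
ramGB = 128;       % available RAM

% In a plain parfor over i, A is a broadcast variable, so every worker
% would hold its own full copy.
replicatedGB = nWorkers * nA * gbPerMatrix;
fprintf('A replicated on every worker: ~%d GB (limit %d GB)\n', replicatedGB, ramGB);

% A single copy of A, by contrast, fits comfortably.
singleCopyGB = nA * gbPerMatrix;
fprintf('A single copy of A: ~%d GB\n', singleCopyGB);
```

Under these assumptions the replicated layout needs several times the available RAM, while one copy of A uses well under half of it.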

Release

R2019a

Answer by Edric Ellis

on 20 Jun 2019

OK, this depends somewhat on what you need to do with the results, but here's one way to avoid replicating A on each worker, using a combination of spmd and for-drange. The basic idea is:
1. Partition A so that each worker stores only a piece
2. Perform longCalculation in batches
3. Reduce the result using for-drange and then gplus.
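To see how the batching in step 2 maps loop indices to workers, the schedule can be worked out serially, without a parallel pool. This is only an illustrative sketch: numWorkers is an assumed pool size (inside spmd it would be numlabs), and schedule and valid are hypothetical names that do not appear in the code further down.

```matlab
nLoop = 1000;
numWorkers = 12;   % assumption: stands in for numlabs inside spmd

% First loop index of each batch; every batch covers numWorkers indices.
batchStarts = 1:numWorkers:nLoop;
numBatches = numel(batchStarts);

% schedule(b, w) is the loop index that worker w handles in batch b
% (uses implicit expansion, available from R2016b).
schedule = batchStarts(:) + (0:numWorkers-1);

% Slots past nLoop in the final batch have no work to do.
valid = schedule <= nLoop;
idleSlots = nnz(~valid);

fprintf('%d batches, %d idle worker slots in the final batch\n', ...
    numBatches, idleSlots);
```

Each valid entry of schedule corresponds to one longCalculation call, so every index from 1 to nLoop is computed exactly once, and only the last batch leaves some workers idle.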
%% Step 1: build A, but ensure each worker only gets a portion.
% Use for-drange to achieve that. This presumes that you can build
% pieces of 'A' directly on the workers.
nA = 50;
nLoop = 1000;
spmd
    A = cell(1, nA);
    for idx = drange(1,nA)
        A{idx} = ones(1000) * idx;
    end
end
% At this point, each worker has an independent 'A' where only some of the
% cells are filled in.
%% Step 2: perform the calculations in parallel.
spmd
    % Allocate the full output cell array.
    output = cell(1, nLoop);
    % Loop over the full range in batches of 'numlabs'.
    for idx = 1:numlabs:nLoop
        % Each worker performs one longCalculation. In the final batch,
        % workers whose index falls past nLoop skip the calculation.
        myIdx = idx + (labindex - 1);
        if myIdx <= nLoop
            myB = longCalculation(myIdx);
        else
            myB = [];
        end
        % Next, we need to work with each 'myB', and perform
        % shorterCalculation. So, loop over 'numlabs', and use
        % labBroadcast to give each worker the value of B.
        for bIdx = 1:numlabs
            % Make sure we don't exceed the loop range
            outIdx = (idx + bIdx - 1);
            if outIdx > nLoop
                break;
            end
            % Get the value of B to each worker.
            B = labBroadcast(bIdx, myB);
            % Reduce the result on each worker using shorterCalculation
            partialResult = 0;
            for aIdx = drange(1, nA)
                partialResult = partialResult + shorterCalculation(A{aIdx}, B);
            end
            % Combine the overall result into 'output' on worker 1.
            output{outIdx} = gplus(partialResult, 1);
        end
    end
end
% 'output' is a Composite; the reduced results live on worker 1.
x = output{1};
x = [x{:}]
%% Dummy "longCalculation".
function x = longCalculation(x)
    pause(0.1);
    x = -x;
end
%% Dummy "shorterCalculation".
function x = shorterCalculation(Ai, B)
    x = Ai(1) * B;
end
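To confirm that Step 1 really leaves each worker with only a portion of A, you can count the populated cells on each worker. The following is a minimal standalone sketch, assuming Parallel Computing Toolbox with a pool available; it uses small 10-by-10 matrices as stand-ins for the 1GB ones, writes the loop in the documented colon form drange(1:nA), and introduces the names filled and totalFilled purely for illustration.

```matlab
nA = 50;
spmd
    A = cell(1, nA);
    for idx = drange(1:nA)
        A{idx} = ones(10) * idx;   % small stand-in for the 1GB matrices
    end
    % Indices of the cells this worker actually populated.
    filled = find(~cellfun(@isempty, A));
    fprintf('Worker %d of %d holds %d cells\n', labindex, numlabs, numel(filled));
    % Sum the per-worker counts; every worker receives the total.
    totalFilled = gplus(numel(filled));
end
% totalFilled is a Composite; read worker 1's copy on the client.
assert(totalFilled{1} == nA);
```

If the assertion passes, every cell of A has been built on exactly one worker, so the total memory used for A stays at roughly one copy regardless of pool size.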