Vectorize and use parfor in a nested loop with five levels
4 views (last 30 days)
Show older comments
Hi all,
I'm working on an application that, for now, requires computing a nested loop with five levels. I'm trying to vectorize some parts and to use a parfor for one of the loops. However, I haven't been very successful. Any suggestions would be great. A simplified version of the code looks as follows,
clear all
clc
p1=(1:1:50)';
p2=p1;
np=max(size(p1));
[P1,P2]=meshgrid(p1,p2);
DP=unique(P1-P2);
K=max(size(DP));
ns=1000;
ne=10;
S1=randn(ns,ne);
S2=randn(ns,ne);
S3=randn(ns,ne);
S4=randn(ns,ne);
L1=rand(ns,ne);
L2=rand(ns,ne);
L3=rand(ns,ne);
L4=rand(ns,ne);
a=2;
BR=zeros(np,K,K,K,K);
P=BR;
for i=1:K,
DP1=DP(i);
C1=mean(a*S1*DP1>=0,2);
e1=bsxfun(@times,L1,(1-2*C1));
for j=1:K,
DP2=DP(j);
C2=mean(a*S2*DP2>=0,2);
e2=bsxfun(@times,L2,(1-2*C2));
for l=1:K,
DP3=DP(l);
C3=mean(a*S3*DP3>=0,2);
e3=bsxfun(@times,L3,(1-2*C3));
for m=1:K,
DP4=DP(m);
C4=mean(a*S4*DP4>=0,2);
e4=bsxfun(@times,L4,(1-2*C4));
for pp=1:np,
pr=p2(pp);
dp1=p1-pr;
A=reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
N=bsxfun(@minus,A,e4);
f=exp(N);
PR=f./(1+f);
rev=zeros(ns,ne,np);
for zz=1:np,
rev(:,:,zz)=PR(:,:,zz)*p1(zz);
end
er=reshape(sum(mean(rev,2)),np,1);
[mpr,ind]=max(er);
BR(pp,m,l,j,i)=p1(ind);
P(pp,m,l,j,i)=mpr;
end
end
end
end
end
In this version the problems start when evaluating BR and P in the inner loop. Even though all pp, m, l, j and i are defined, it takes a lot of memory and time to evaluate BR(pp,m,l,j,i)=p1(ind), and I really don't know why. Then, after solving that, I need to improve on vectorizing what might be possible to vectorize, and choose a loop to use the parfor.
Thanks,
0 Comments
Accepted Answer
Jan
on 28 Jun 2013
Edited: Jan
on 30 Jun 2013
Some small improvements:
% Original code:
rev=zeros(ns,ne,np);
for zz=1:np,
rev(:,:,zz)=PR(:,:,zz)*p1(zz);
end
% Vectorized (is this faster?):
rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));
Calculating sum(mean(rev,2)) contains 2 summations. Perhaps one sum is faster:
er = sum(reshape(rev, np, []), 2) / ne;
Avoir some multiplications by -1:
% Original:
dp1 = p1-pr;
A = reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
% New
dp1 = pr - p1;
A = reshape(bsxfun(@times, a.*C4, dp1'), ns, 1, np);
In these lines:
C1 = mean(a*S1*DP1>=0,2);
some simplifications are possible: a * S1 * DP1 >= 0 does not depend on the positivbe constant a. DP1 is a scalar and only its sign is interesting. So the meanS1 = mean(S1>=0, 2) can be calculated once and in case of a negative sign of DP1 1 - meanS1 can be calculated fast. DP1==0 must be caught also.
I'd start a parallelization after these mathematical simplifications have been performed. Avoiding calculations is simply more efficient than trying to do them faster.
[EDITED] Now I've found the time to run your code on my computer. Beside the fact that 4GB are not enough to run your code (how much RAM do you have?), a simple test revealed the bottleneck: Use the profiler to find out, which lines consumes the most time. You will see, that all my ideas are meaningless, because they improve parts of your code, which require only one percent of the total time. Only vectorizing the creation of rev matters significantly. But the rest is dominated by the very expensive calculation of:
f = exp(N);
So this is the point where improvements are required at first. N is created by BSXFUN and you can exploit, that exp(a - b) equals exp(a) / exp(b), and that exp(a * b) equals exp(a) ^ b. Then the innermost loop becomes:
exp_e4_inv = exp(e4);
exp_a_C4 = exp(a * C4);
for pp = 1:np
dp1 = p2(pp) - p1;
Ax = reshape(bsxfun(@power,exp_a_C4, dp1'), ns, 1, np);
f = bsxfun(@times, Ax, exp_e4_inv);
PR = f./(1+f);
rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));
er = reshape(sum(reshape(rev, [], 3)) / ne, np, 1);
[mpr,ind]=max(er); % No changes here...
BR(pp,m,l,j,i)=p1(ind);
P(pp,m,l,j,i)=mpr;
end
This reduces the total number of elements the exponential function is applied to. For the much smaller problem defined by p1=(1:3)', I get 1.45 seconds runtime instead of 3.02 of the original version. The results differ by rounding errors inside the expected limits from the original. So some basic mathematical simplifications gained the double speed.
Well, double speed is not impressing by such a large problem. But I hesitate to start to parallelize code without considering this. Now this is a good point to come back to the original question: How do we PARFOR this?
5 Comments
More Answers (0)
See Also
Categories
Find more on Loops and Conditional Statements in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!