Applying vectorization techniques to speed up dividing a 3D matrix by a 2D matrix

I'm working on removing a for loop in my Matlab code to improve performance. My original code has one for loop (over j=1:Nx) that is harmful to performance (in my production code, this loop is executed over 20 million times in large simulations). I am curious whether I can remove this for loop through vectorization, repmat, or a similar technique. My original Matlab implementation is given below.
clc; clear all;
% Test Data
% I'm trying to remove the for loop for j in the code below
N = 10;
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Original Code (Partially vectorized)
tic
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A = alphaalpha*D2 + betabeta*D + permute(gammagamma,[3,2,1]).*identy;
A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A(end,end,:) = A(end,end,:) + d_min;
A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A(1,1,:) = A(1,1,:) + permute(d_max,[2,3,1]);
b(1,:) = r_max;
% Non-vectorized code - can this part be vectorized?
for j=1:Nx
utilde = linsolve(A(:,:,j),b(:,j)); % A\b
Uhat(j,:) = utilde.';
end
end
end
toc
Here is my attempt at vectorizing the code (and removing the for loop for j).
% Same test data as original code
% New Code (completely vectorized but incorrect)
tic
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A2 = alphaalpha*D2 + betabeta*D + permute(gammagamma,[3,2,1]).*identy;
A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A2(end,end,:) = A2(end,end,:) + d_min;
A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A2(1,1,:) = A2(1,1,:) + permute(d_max,[2,3,1]);
b(1,:) = r_max;
% Non-vectorized code - can this part be vectorized?
%for j=1:Nx
% utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A2\b
% Uhat_2(j,:) = utilde_2.';
%end
% My attempt - this doesn't work since I don't loop through the index j
% in repmat
utilde_2 = squeeze(repmat(linsolve(A2(:,:,Nx),b(:,Nx)),[1,1,Nx]));
utilde_2 = utilde_2(:,1);
Uhat_2 = squeeze(repmat(utilde_2',[1,1,Nx]));
Uhat_2 = Uhat_2';
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % is 0 if correct
I'm curious if repmat (or a different builtin Matlab function) can speed up this part of the code:
for j=1:Nx
utilde = linsolve(A(:,:,j),b(:,j)); % A\b
Uhat(j,:) = utilde.';
end
Is the for loop for j absolutely necessary or can it be removed?
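The j-loop above is a batched linear solve: Nx independent systems of the same shape. Any environment with a batched solver removes the loop directly; NumPy's linalg.solve, for example, accepts a stacked leading dimension. A minimal sketch of the same structure (illustrative only, not the Matlab code above):

```python
import numpy as np

# Stack of Nx independent (n x n) systems, mirroring A(:,:,j) \ b(:,j)
rng = np.random.default_rng(0)
Nx, n = 32, 33
A = rng.standard_normal((Nx, n, n))      # batch of coefficient matrices
b = rng.standard_normal((Nx, n, 1))      # one right-hand side per matrix

# Looped version, equivalent to the MATLAB j-loop
U_loop = np.stack([np.linalg.solve(A[j], b[j]) for j in range(Nx)])

# Batched version: solve vectorizes over the leading (page) dimension
U_batch = np.linalg.solve(A, b)

print(np.allclose(U_loop, U_batch))  # True
```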

 Accepted Answer

If you have a C compiler, the fastest methods are perhaps mmx and MultipleQR, available on the File Exchange.

9 Comments

Oddly enough, it appears that the for loop is faster than MultipleQR (at least on my machine). I could be testing something wrong. I implemented the following test scenario:
clc; clear all;
% Test Data
N = 10;
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A(end,end,:) = A(end,end,:) + d_min;
A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
for j=1:Nx
utilde = linsolve(A(:,:,j),b(:,j)); % A\b
Uhat(j,:) = utilde.';
end
end
end
end
toc
% New Code with MultipleQRSolve
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A2(end,end,:) = A2(end,end,:) + d_min;
A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
b = reshape(b,Nx+1,1,Nx);
Uhat_2 = squeeze(MultipleQRSolve(A2,b,1e-12));
Uhat_2 = Uhat_2';
end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % returns 15.7722, I might have written something wrong above.
which returned
Elapsed time of method 1 (for loop) is 14.830669 seconds.
Elapsed time of method 2 (MultipleQRSolve) is 15.170965 seconds.
I'm also testing mmx (and am working through a setup error).
MultipleQR is faster for small matrices (say < 10). You might try MMX.
@Bruno Luong: (This is a very long comment that I might want to put in an answer.) I fixed the installation error with MMX. I tested four different solutions:
  1. The for loop A(:,:,j)\b(:,j), with the Matlab divide replaced by linsolve (which is faster with my data).
  2. The MultipleQRSolve located from the File Exchange.
  3. The Multiple same size linear solver from the File Exchange.
  4. MMX which is now moved to Github.
My test code is below.
clc; clear all;
% Test Data
N = 8;
M = 8;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
% Method 1: Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A(end,end,:) = A(end,end,:) + d_min;
A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
for j=1:Nx
utilde = linsolve(A(:,:,j),b(:,j)); % A\b
Uhat(j,:) = utilde.';
end
end
end
end
toc
% Method 2: New Code with MultipleQRSolve
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A2(end,end,:) = A2(end,end,:) + d_min;
A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
b = reshape(b,Nx+1,1,Nx);
Uhat_2 = squeeze(MultipleQRSolve(A2,b,1e-12));
Uhat_2 = Uhat_2';
end
end
end
toc
% Method 3: New Code with MultiSolver
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A3 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A3(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A3(end,end,:) = A3(end,end,:) + d_min;
A3(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A3(1,1,:) = A3(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A3(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
Uhat_3 = squeeze(MultiSolver(A3, b));
Uhat_3 = Uhat_3';
end
end
end
toc
% Method 4: New Code with MMX
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A4 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A4(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A4(end,end,:) = A4(end,end,:) + d_min;
A4(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A4(1,1,:) = A4(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A4(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
b = reshape(b,Nx+1,1,Nx);
Uhat_4 = squeeze(mmx('backslash',A4,b)); % this turns complex double to type double
Uhat_4 = Uhat_4';
end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % returns 15.7722, I might have written something wrong above.
diff2 = norm(Uhat - Uhat_3,inf); % returns 15.7722, I might have written something wrong above.
diff3 = norm(Uhat - Uhat_4,inf); % returns 474 since mmx turns the complex double to type double!
The timings are
Elapsed time of method 1 (for loop) is 10.034689 seconds.
Elapsed time of method 2 (MultipleQRSolve) is 10.101741 seconds.
Elapsed time of method 3 (MultiSolver) is 19.232889 seconds.
Elapsed time of method 4 (MMX) is 3.241713 seconds. Changes data from type complex double to type double.
Therefore, MMX is the fastest. However, the values of diff, diff2, and diff3 are all nonzero. The mmx 'backslash' command changes the data in my matrix A from a complex double to a double (which destroys my real test data). Reading the documentation for MMX, complex numbers aren't supported. In the discussion section of the File Exchange page, user Israel Vaughn suggests writing a new function called mmxc, which I have copied below.
function [ C ] = mmxc( actionStr,A,B, transString )
% complex mmx
% not very optimal, but hey, it works
if (strcmp(actionStr,'square') && isempty(B))
At = real(A);
Ai = imag(A);
clear A;
C = mmx('square',At,[]);
C = C - 1i.*mmx('mult',At,Ai,'nt');
C = C + 1i.*mmx('mult',Ai,At,'nt');
C = C + mmx('square',Ai,[]);
elseif (exist('transString','var'))
At = real(A);
Ai = imag(A);
clear A;
Bt = real(B);
Bi = imag(B);
clear B;
C = mmx(actionStr,At,Bt,transString);
C = C + 1i.*mmx(actionStr,At,Bi,transString);
clear At;
C = C + 1i.*mmx(actionStr,Ai,Bt,transString);
clear Bt;
C = C - mmx(actionStr,Ai,Bi,transString);
else
At = real(A);
Ai = imag(A);
clear A;
Bt = real(B);
Bi = imag(B);
clear B;
C = mmx(actionStr,At,Bt);
C = C + 1i.*mmx(actionStr,At,Bi);
clear At;
C = C + 1i.*mmx(actionStr,Ai,Bt);
clear Bt;
C = C - mmx(actionStr,Ai,Bi);
end
end
This new function appears to be incorrect since replacing
Uhat_4 = squeeze(mmxc('backslash',A4,b)); % Changed to complex mmxc
still returns a non-zero diff3. Reviewing this question on Stack Exchange, it looks like there is a potential "hack" in the answer provided by user Alec Jacobson:
% Method 5: Try the "hack" on Stack Exchange
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A5 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A5(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A5(end,end,:) = A5(end,end,:) + d_min;
A5(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A5(1,1,:) = A5(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A5(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
% Slightly esoteric solution from Alec Jacobson
w = size(A5,1);
k = size(A5,3);
AS = reshape(permute(A5,[1 3 2]),w*k,w);
S = sparse( ...
repmat(1:w*k,w,1)', ...
bsxfun(@plus,reshape(repmat(1:w:w*k,w,1),[],1),0:w-1), ...
AS,w*k,w*k);
% The line below blows up since b is of size(Nx+1,Nx) and code assumes
% it is of size(Nx,1)
utilde_5 = reshape(S\repmat(b,k,1),w,k);
utilde_5 = utilde_5';
end
end
end
toc
I'm going to try to fix this. It seems that the MMX backslash command works for double data types (as opposed to complex doubles). Maybe I can get the hack above to work. I will also look at the Parallel Computing Toolbox.
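One reason the mmxc 'backslash' branch cannot work: matrix multiplication is bilinear in the real and imaginary parts, so the four-term split is valid for 'mult', but the solution of A\b is not a bilinear function of those parts, so the same split produces garbage. A NumPy sketch of the distinction (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Ar, Ai, Br, Bi = A.real, A.imag, B.real, B.imag

# Multiplication IS bilinear, so the mmxc-style split is exact:
C = Ar @ Br - Ai @ Bi + 1j * (Ar @ Bi + Ai @ Br)
print(np.allclose(C, A @ B))          # True

# The solve is NOT bilinear: the same split applied to backslash fails
slv = np.linalg.solve
X_bad = slv(Ar, Br) + 1j * slv(Ar, Bi) + 1j * slv(Ai, Br) - slv(Ai, Bi)
print(np.allclose(X_bad, slv(A, B)))  # False
```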
You can write a complex N x N linear system as a real 2N x 2N linear system, with some speed penalty unfortunately.
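The embedding described here: with A = Ar + i*Ai, x = xr + i*xi, b = br + i*bi, the system A*x = b is equivalent to the real block system [Ar, -Ai; Ai, Ar]*[xr; xi] = [br; bi]. A NumPy check of the identity (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Real 2n x 2n embedding of the complex n x n system A x = b
AA = np.block([[A.real, -A.imag],
               [A.imag,  A.real]])
bb = np.concatenate([b.real, b.imag])

xx = np.linalg.solve(AA, bb)
x = xx[:n] + 1j * xx[n:]          # reassemble the complex solution

print(np.allclose(A @ x, b))  # True
```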
I think a viable solution is to change the function mmxc to (something similar to)
function [ C ] = mmxc(actionStr,A,B,transString )
% complex mmx
% complex division
a = real(A);
b = imag(A);
c = real(B);
d = imag(B);
% z_1 = a+i*b
% z_2 = c+i*d
% z_1/z_2 = [(a*c + b*d)+1i*(b*c - a*d)]/(c^2 +d^2)
% This isn't quite correct
% Numerator
Num = mmx('mult',a,c);
Num = Num + mmx('mult',b,d);
Num = Num + 1i.*mmx('mult',b,c);
Num = Num - 1i.*mmx('mult',a,d);
% Denominator
D1 = c.*c; % mmx('mult',c,c);
D2 = d.*d; % mmx('mult',d,d);
Den = D1 + D2;
% Division
C = mmx('backslash',Num,Den);
end
This isn't correct either, though. It could also be that the reshape in method 4 isn't necessary:
b = reshape(b,Nx+1,1,Nx);
Uhat_4 = squeeze(mmx('backslash',A4,b));
Uhat_4 = Uhat_4';
This should allow the user to apply the MMX backslash and divide functions to complex matrices.
Here is how I would use mmx to solve complex systems
m=3;
n=3;
p=1;
q=1e5;
% for-loop
tic
Ac=randn(m,n,q)+1i*randn(m,n,q);
Bc=randn(m,p,q)+1i*rand(m,p,q);
z=zeros(n,p,q);
for k=1:q
z(:,:,k) = Ac(:,:,k)\Bc(:,:,k);
end
toc % Elapsed time is 0.905283 seconds.
% mmx
tic
Ar = real(Ac);
Ai = imag(Ac);
Br = real(Bc);
Bi = imag(Bc);
AA = [Ar,-Ai;Ai,Ar];
BB = [Br;Bi];
zz = mmx('backslash', AA, BB);
z=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
toc % Elapsed time is 0.085284 seconds.
% MultipleQRSolve
tic
z = MultipleQRSolve(Ac,Bc);
toc % Elapsed time is 0.043745 seconds.
I think that the performance of individual methods heavily depends on the size of the input matrices. For larger matrices, the for loop appears to win every time:
% These parameters mimic the real data in my code
m=33;
n=33;
p=1;
q=32; % don't run out of memory!
ntests = 100000; % One Hundred Thousand
% for-loop
tic
for ii=1:ntests
Ac=randn(m,n,q)+1i*randn(m,n,q); % A is a complex double of size (33,33,32)
Bc=randn(m,p,q)+1i*rand(m,p,q); % B is a double (not complex) of size (33,32). I reshape it to size (33,1,32) for the backslash/divide function.
z=zeros(n,p,q);
for k=1:q
z(:,:,k) = Ac(:,:,k)\Bc(:,:,k);
end
end
toc % Elapsed time is 236.284139 seconds.
% mmx
tic
for ii=1:ntests
Ar = real(Ac);
Ai = imag(Ac);
Br = real(Bc);
Bi = imag(Bc);
AA = [Ar,-Ai;Ai,Ar];
BB = [Br;Bi];
zz = mmx('backslash', AA, BB);
z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
end
toc % Elapsed time is 331.318650 seconds.
% MultipleQRSolve
tic
for ii=1:ntests
z3 = MultipleQRSolve(Ac,Bc);
end
toc % Elapsed time is 272.109689 seconds.
I'm on a slower computer and will be on a faster one tomorrow morning. I will do some more testing with real Matlab code. It is strange that the for loop appears to beat all of the other solutions with larger data (outside of the parallel computing toolbox, which I will investigate).
MMX and MultipleQRSolve parallelize the loop over pages, whereas the MATLAB for-loop lets linsolve parallelize internally within each solve.
That's why the matrix size matters, and possibly the number of physical processor cores.
MultipleQRSolve is less efficient for large matrices because my implementation of QR is less efficient than the Lapack routines called by MMX and by MATLAB's native for-loop.
I think that the MMX call can be optimized when B is not a complex double. In my data, B is real, so it may be possible to speed up the MMX calculation. Here is how I would implement the three different methods in my real Matlab code.
% These parameters mimic the real data in my code
m=33;
n=33;
p=1;
q=32;
ntests = 10000;
% My code calculates Ac and Br before going into the for loop
Ac = rand(m,n,q)+1i*rand(m,n,q); % A is a complex double of size (33,33,32)
Br = rand(m,q); % B is a (real) double of size (33,32)
% Before I decide to use a for loop/mmx/MultipleQRSolve my code
% "understands" that A is a complex double of size (33,33,32) and B is a
% (real) double of size (33,32). I don't need to calculate what A or B are inside
% the for loop. I only reshape B inside MMX and MultipleQRSolve because I
% have to for the divides operation.
% Here is how I would write the three functions below in my "real" code.
% for-loop
tic
for ii=1:ntests
z1 = zeros(q,m);
for j=1:q
% This is how my code currently computes A\b
utilde = linsolve(Ac(:,:,j),Br(:,j)); % A\b
z1(j,:) = utilde.';
end
end
toc % Elapsed time is 14.231135 seconds.
% mmx
tic
for ii=1:ntests
Bnew = reshape(Br,m,1,q); % Make Br size(33,1,32) to apply MMX
Ar = real(Ac);
Ai = imag(Ac);
Br = real(Bnew);
Bi = imag(Bnew); % is zero as b is a real double
% z_1 = Ar+Ai*i
% z_2 = Br+Bi*i
% z_1/z_2 = [(Ar*Br + Ai*Bi) + 1i*(Ai*Br - Ar*Bi)]/(Br^2 + Bi^2);
% Since Bi == 0, this is simplified to
% z_1/z_2 = [(Ar*Br) + 1i*(Ai*Br)]/(Br^2);
% I think that this makes the code below
%AA = [Ar,-Ai;Ai,Ar];
%BB = [Br;Bi];
%zz = mmx('backslash', AA, BB);
%z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
% Into the faster version
Num = mmx('mult', Ar, Br);
Num = Num + 1i*mmx('mult', Ai, Br);
Den = Br.^2;
z2 = mmx('backslash',Num,Den);
z2 = permute(z2,[3 1 2]);
end
toc % Elapsed time is 2.441799 seconds.
% MultipleQRSolve
tic
for ii=1:ntests
Bnew_2 = reshape(Br,m,1,q); % Make Br size(33,1,32) to apply MultipleQRSolve
z3 = MultipleQRSolve(Ac,Bnew_2);
z3 = permute(z3,[3 1 2]);
end
toc % Elapsed time is 25.991396 seconds.
diff = norm(z1-z2,inf); % Not zero since my code for z_1/z_2 isn't correct.
diff2 = norm(z1-z3,inf);
If the code for
AA = [Ar,-Ai;Ai,Ar];
BB = [Br;Bi];
zz = mmx('backslash', AA, BB);
z2=zz(1:n,:,:)+1i*zz(n+1:end,:,:);
isn't needed (since B is not a complex double), then MMX would "beat" the for loop. Thanks for all of your help with this question (and for writing MultipleQRSolve).
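A closing observation on the real-B case: since bi = 0, the right-hand side of the 2N x 2N embedding reduces to [Br; 0], so no elementwise Num/Den trick is needed (matrix solves are not elementwise divisions, which is why that attempt fails). A NumPy check of the real-b simplification (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
br = rng.standard_normal(n)              # real right-hand side, like Br

# With b real, the 2n x 2n embedding's RHS is simply [br; 0]
AA = np.block([[A.real, -A.imag],
               [A.imag,  A.real]])
bb = np.concatenate([br, np.zeros(n)])

xx = np.linalg.solve(AA, bb)
x = xx[:n] + 1j * xx[n:]

print(np.allclose(x, np.linalg.solve(A, br)))  # True
```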


More Answers (2)

Another idea.
clc; clear all;
% Test Data
% I'm trying to remove the for loop for j in the code below
N = 10;
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
AA=kron(speye(Nx),ones(Nx+1));
map=logical(AA);
% Original Code (Partially vectorized)
tic
for n=0:N
for m=0:M
% ... (setup of A and b as in the original code) ...
%Vectorized code
AA(map)=A(:);
Uhat=reshape(AA\b(:),Nx+1,Nx).';
end
end
toc
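The construction in this answer stacks the Nx small systems into one block-diagonal sparse system and calls backslash once. A SciPy sketch of the same construction (illustrative only; assumes scipy is installed):

```python
import numpy as np
from scipy.sparse import block_diag
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(4)
Nx, n = 8, 5
A = rng.standard_normal((Nx, n, n))      # Nx independent n x n systems
b = rng.standard_normal((Nx, n))

# One big block-diagonal sparse system, solved in a single call
AA = block_diag([A[j] for j in range(Nx)], format='csc')
x_big = spsolve(AA, b.ravel())           # concatenated right-hand sides
U = x_big.reshape(Nx, n)

# Matches the per-block loop
U_loop = np.stack([np.linalg.solve(A[j], b[j]) for j in range(Nx)])
print(np.allclose(U, U_loop))  # True
```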

5 Comments

In a local test, kron appears to be slower than the for-loop version:
clc; clear all;
% Test Data
N = 10;
M = 10;
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_2 = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
D_start = D(1,:);
D_end = D(end,:);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 100;
AA=kron(speye(Nx),ones(Nx+1));
map=logical(AA);
% Method 1: Original Code (Partially vectorized)
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A(end,end,:) = A(end,end,:) + d_min;
A(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A(1,1,:) = A(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
for j=1:Nx
utilde = linsolve(A(:,:,j),b(:,j)); % A\b
Uhat(j,:) = utilde.';
end
end
end
end
toc
% Method 2: New Code with kron
tic
for ii=1:ntests
for n=0:N
for m=0:M
b = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
A2 = alphaalpha*D2 + betabeta*D + reshape(gammagamma,1,1,Nx).*identy;
A2(end,:,:) = repmat(n_min*D_end,[1,1,Nx]);
b(end,:) = r_min;
A2(end,end,:) = A2(end,end,:) + d_min;
A2(1,:,:) = repmat(n_max*D_start,[1,1,Nx]);
A2(1,1,:) = A2(1,1,:) + reshape(d_max,1,1,Nx);
b(1,:) = r_max;
% Non-vectorized code
%for j=1:Nx
% utilde_2 = linsolve(A2(:,:,j),b(:,j)); % A\b
% Uhat_2(j,:) = utilde_2.';
%end
AA(map)=A2(:);
Uhat_2=reshape(AA\b(:),Nx+1,Nx).';
end
end
end
toc
diff = norm(Uhat - Uhat_2,inf); % is near 1e-14
The elapsed times are
Elapsed time for method 1 (for loops) is 16.629913 seconds.
Elapsed time for method 2 (kron) is 42.182836 seconds.
The line
AA(map)=A2(:);
appears to take more time than the for loop.
But kron is only executed once. Your original post says the for-loop is executed millions of times...
That is true. I need to test a larger execution loop; a preliminary test (with ntests = 100) showed that the internal for loop is faster. It may not be the same with ntests = 1,000,000.
AA(map)=A2(:)
Well, I know someone who wonders why (:) can be much slower than reshape. ;-)
Yeah, I didn't see that A was complex-valued. So,
AA(map)=reshape(A,[],1);
Uhat=reshape(AA\b(:),Nx+1,Nx).';


On the GPU (i.e. if A and b are gpuArrays), the for-loop can be removed:
Uhat = permute( pagefun(@mldivide,A,reshape(b,[],1,Nx)) ,[2,1,3]);

1 Comment

This approach requires the Parallel Computing Toolbox. I will investigate getting this toolbox. Is there another approach that doesn't require a separate toolbox?



Release: R2020a
