Improving the performance of matrix multiplication and division with nested for loops

Question

Matthew Kehoe on 25 Jul 2021

0
Link

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/885254-improving-the-performance-of-matrix-multiplication-and-division-with-nested-for-loops

Commented: Matthew Kehoe on 31 Jul 2021

My Matlab code has a section of code which repeatedly performs matrix multiplication and division (through the backslash operator =A\b). When testing large datasets, a certain function named solvebvp_colloc.m is called millions of times. As this function contains multiple matrix multiplications and divisions, it tends to take up a lot of runtime. I'm curious if I could rewrite solvebvp_collect.m (or the code before solvebvp_collect.m) in order to improve performance. I'm not sure if matrix decomposition will help or if I could remove one of the for loops referencing j=1:Nx completely.

My original implementation is shown below. The function solvebvp_colloc.m is harmful to the overall performance of my Matlab code.

% Test Data 
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150; % I'm creating a ntests variable to test running the code below multiple times. The 
% ntests variable is not part of my production code (and is only used for testing performance).
tic
% Original Implementation
for ii=1:ntests
 for n=0:N
   for m=0:M
    for j=1:Nx
      Fnmhat_p = Fnmhat(j,:).';
      alphaalpha = 1.0;
      betabeta = 0.0; % this could be non-zero
      gammagamma = gamma*gamma - p(j)^2 - 2*alpha*p(j);
      d_min = 1.0;
      n_min = 0.0; % this could be non-zero
      r_min = xi_n_m_hat(j,n+1,m+1);
      d_max = -1i*gammap(j);
      n_max = 1.0;
      r_max = Jnmhat(j);
      
      % This subroutine is expensive for large N,M. The problem is that
      % solvebvp_collec.m needs to run against three for loops from
      % 1:N,1:M,1:Nx and can be called millions of times.
      uhat_p = solvebvp_colloc(Fnmhat_p,alphaalpha,betabeta,gammagamma,...
        d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2);
      Uhat(j,:) = uhat_p.';
    end
   end
 end
end
toc

where the function solvebvp_colloc is of the form

function [utilde] = solvebvp_colloc(ftilde,alpha,beta,gamma,d_min,n_min,...
    r_min,d_max,n_max,r_max,identy,D,D2)
A = alpha*D2 + beta*D + gamma*identy;
b = ftilde;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end) = r_min;
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max;
b(1) = r_max;
utilde = A\b;
return;

This code is slow for large N,M (large being anything over N=M=10. I would consider N=M=30 as a resonable test case for testing large datasets.) I think that my code design has three major performance flaws including:

The function solvebvp_colloc.m is processed a large number of times (as it needs to run for all values of the for loops 1:N, 1:M, and 1:Nx). When N and M are large (say N=M=20), the profiler shows that this function is called 20,829,312 times and takes a total runtime of 1042 seconds or 17.3 minutes. The operation utilde = A\b is expensive and calling this millions of times takes up a lot of runtime. Calling the matrix multiplication operations in solvebvp_colloc.m is also expensive.
Running an extra for loop with j=1:Nx creates larger runtime.
I don't use matrix decomposition. Since the matrix A changes in every iteration of j (because of the value of gammagamma is different for different values of j), I'm not sure if I can use matrix decomposition to speed up this calculation.

My initial thought was to avoid writing the additional for loop from j=1:Nx. I implemented this in a second impementation below. I wasn't able to figure out how to move the matrix A out of the for loop because the size of gammagamma is (Nx,1) and the size of the identity matrix (Nx+1,Nx+1) are different.

% Test Data 
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150;
% New Implementation
tic
for ii=1:ntests
  for n=0:N
    for m=0:M
      % Moved outside of the for loop
      Fnmhat_p = Fnmhat.';
      alphaalpha = 1.0;
      betabeta = 0.0;  % this could be non-zero
      gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
      d_min = 1.0;
      n_min = 0.0;  % this could be non-zero
      r_min = xi_n_m_hat(:,n+1,m+1);
      d_max = -1i.*gammap;
      n_max = 1.0;
      r_max = Jnmhat;
      % Moved b outside of the for loop
      b = Fnmhat_p;
      
      % I don't know how to move A outside of the for loop as gammagamma
      % is of size(Nx,1) and identy is of size(Ny+1,Ny+1) so writing
      %     A = alphaalpha*D2 + betabeta*D + gammagamma.*identy 
      % wouldn't work since the two matrices are of different sizes.
       
      for j=1:Nx
        uhat_p_new = solvebvp_colloc_new(b,alphaalpha,betabeta,gammagamma,...
            d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2,j);
        Uhat_new(j,:) = uhat_p_new.';
      end
    end
  end
end
toc
diff = norm(Uhat - Uhat_new,inf); % is zero

where the new function solvebvp_colloc_new.m is written as

function [utilde] = solvebvp_colloc_new(b,alpha,beta,gamma,d_min,n_min,...
    r_min,d_max,n_max,r_max,identy,D,D2,j)
A = alpha*D2 + beta*D + gamma(j)*identy;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end,j) = r_min(j);
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max(j);
b(1,j) = r_max(j);
utilde = A\b(:,j);
return;

A local test shows that the second implementation is "slightly faster."

Elapsed time of original implementation is 4.087457 seconds.
Elapsed time of new implementation  3.934619 seconds.

For large M,N (say N=M=20 or N=M=30) this difference would be larger. However, I wasn't able to significantly speed up the calculation because I still do a large amount of matrix multiplications and divisions inside the new function solvebvp_colloc_new.m. I'm interested in analyzing how I can speed up this calculation (I will need to test N=M=20 and N=M=30) and the profiler shows that this calculation takes up about 1/3 of the total runtime. I'm interested in analyzing the following performance improvements:

Can Matlab matrix decomposition help here even though A changes for every value of j (since gammagamma is different for every value of j)?
Is it possible to avoid writing the for loop from j=1:Nx in order to perform utilde = A\b less and matrix multiplication less?
Are there any other performance improvements that would speed up this calculation (the function solvebvp_colloc.m is called millions of times for large N,M)? I think that bsxfun may help where the backslash operator becomes utilde = bsxfun(@rdivide, A, b); although this doesn't match the dimensions of the inner for loop through j.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Chunru on 25 Jul 2021

0
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/885254-improving-the-performance-of-matrix-multiplication-and-division-with-nested-for-loops#answer_753289

Edited: Chunru on 25 Jul 2021

Analyse your program to see if the matrix A is independent of which loop variables (it seems A is independent to some loop variables at the first look of the program). Compute A outside the loops of the independent variables. In inner-loop contactate the right-hand b into a matrix B. Then solve multiple linear systems with the same A in one-go: X = A \ B. This way you will only do one matrix "division" for all the independent loops (if any) and hopefully the performance can be improved.

If A is changing over all loop variable, then you may not be able to improve the performance drastically.

4 Comments
Show 2 older commentsHide 2 older comments

Matthew Kehoe on 25 Jul 2021

Edited: Matthew Kehoe on 25 Jul 2021

The ii loop is not part of my code (it is only used for testing purposes). I only include the ii loop to see how long it takes to run multiple simulations (I also could make N and M larger). I understand that deleting it would improve performance (however, this wouldn't accurately reflect my production code). In my production code, the section of the code referencing solvebvp_colloc.m is called thousands of times by other functions with other for loops. I did not put all of my code in the original post (as it wouldn't be self-contained).

Matthew Kehoe on 31 Jul 2021

I've optimized some of the code and have asked a different question. I am going to close this question by accepting your answer.

Sign in to comment.

Improving the performance of matrix multiplication and division with nested for loops

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

4 Comments
Show 2 older commentsHide 2 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Improving the performance of matrix multiplication and division with nested for loops

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

4 Comments Show 2 older commentsHide 2 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

4 Comments
Show 2 older commentsHide 2 older comments