help with parallelization of matrix operations

Question

Daniel Ackerberg on 23 Sep 2023


https://se.mathworks.com/matlabcentral/answers/2024727-help-with-parallelization-of-matrix-operations

Edited: James Tursa on 23 Sep 2023

Hi. I'm trying to fully utilize a multiprocessor machine, and I'm running into a problem: some matrix multiplications seem to be parallelized and others do not. The code below is a simple illustration. Calculating either x or y requires 1 billion multiplications. When it is an element-by-element multiplication of two 1 billion x 1 vectors (i.e. x = a.*b), it is highly parallelized (I can see all CPUs being used). But when it is an outer product of a 100 million x 1 vector and a 1 x 10 vector (i.e. y = c*d), MATLAB does not appear to parallelize the operation, and it takes about 4 times as long.

Since both y = c*d and x = a.*b do 1 billion multiplications, it seems there should be a way to get y = c*d done in parallel and at least as quickly as x = a.*b (the actual problem I am working on is of the form y = c*d). kron(c,d') (also 1 billion multiplications) does a little better and does seem to parallelize, but it is still not as fast as a.*b. Thanks in advance for any help.

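% Operands for the element-wise product (1e9 multiplications)
a = rand(1000000000,1);
b = rand(1000000000,1);
% Operands for the outer product (also 1e9 multiplications)
c = rand(100000000,1);
d = rand(1,10);
% Time the element-wise product
for t=1:10
    tic
    x=a.*b;
    toc
end
% Time the outer product
for t=1:10
    tic
    y=c*d;
    toc
end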


Answer 1

James Tursa on 23 Sep 2023


https://se.mathworks.com/matlabcentral/answers/2024727-help-with-parallelization-of-matrix-operations#answer_1316577

Edited: James Tursa on 23 Sep 2023


Both of these operations can be multi-threaded in the background, but this kind of timing test is not straightforward because the arrays are so large that memory access and CPU caching heavily affect the results. Also, I don't know the rules the MATLAB BLAS library functions use for deciding when and how to multi-thread based on input sizes (or even whether the outer product calls a BLAS function at all). The element-wise multiply is certainly the most cache-efficient case you can get: linear memory access, with each element used only once. And if I "square" the outer product (a 10000 x 1 vector times a 1 x 10000 vector, so the result has the same 100 million elements as the element-wise test), I get somewhat comparable timings (tests slightly altered to make sure no left-over cache gets used in the next iteration):

ti = zeros(10,1);
for t=1:10 
    a = rand(100000000,1); 
    b = rand(100000000,1);
    tic
    x=a.*b;
    ti(t) = toc;
end
mean(ti)
ans = 0.0899
to = zeros(10,1);
for t=1:10 
    c = rand(10000,1);
    d = rand(1,10000);
    tic 
    y=c*d;
    to(t) = toc;
end
mean(to)
ans = 0.1695

So, the outer product takes longer than I would have expected compared to the element-wise multiply (about double the time), but the fact that it takes longer is to be expected (at least by me) because of the cache issue. I would guess some memory gets pulled into the cache more than once. I haven't looked at core usage, but I would be surprised if all cores were not used in both cases. The large sizes would certainly justify using all cores in the background.
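One way to check the core-usage question directly is to time the outer product with the computational thread count limited to one and then restored. The sketch below is only an illustration under stated assumptions: it uses a smaller problem than the question (1e7 x 10, so each result needs about 0.8 GB), and it also times `c.*d`, which relies on implicit expansion (R2016b or later) and is an alternative formulation not discussed above. Whether either form is faster will depend on the machine and MATLAB release.

```matlab
% Sketch: compare the outer product written as a matrix multiply (c*d)
% and as an element-wise multiply with implicit expansion (c.*d),
% at one thread and at the default thread count.
% Sizes are an assumption, reduced from the question to limit memory use.
c = rand(1e7, 1);
d = rand(1, 10);

nOrig = maxNumCompThreads;            % remember the current setting

for n = unique([1, nOrig])            % 1 thread, then the original count
    maxNumCompThreads(n);
    tMtimes = timeit(@() c * d);      % matrix multiply form
    tTimes  = timeit(@() c .* d);     % implicit-expansion form
    fprintf('%2d thread(s): c*d %.4f s, c.*d %.4f s\n', n, tMtimes, tTimes);
end

maxNumCompThreads(nOrig);             % restore the original setting

% Sanity check: both forms produce the same 1e7-by-10 result
% (compared with a small tolerance rather than exact equality).
y1 = c * d;
y2 = c .* d;
assert(isequal(size(y1), size(y2)));
assert(max(abs(y1(:) - y2(:))) <= eps);
```

If the single-thread and multi-thread timings for `c*d` come out nearly identical, that would suggest the operation is limited by memory bandwidth or is not being threaded for this shape. `timeit` calls the function several times and reports a typical time, which tends to be less noisy than a single `tic`/`toc` pair.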
