**You are now following this question**

- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.

# Without resorting to FEX/MEX, can the overhead of repeated calls to sparse() be avoided with operations on I, J, V directly?

5 views (last 30 days)

Show older comments

I have a time-marching problem bottlenecked by matrix multiplication , where the sparse matrix is the sum of many time-independent sparse matrices with time-dependent scalar coefficients:

I'm aware of FEX solutions that would enable a 3D to be defined once prior to time-marching, though I would prefer to rule out native MATLAB first. My current thinking is to supplement the I, J, V vectors with an index vector K, and do one of the following:

- Generate a cell array of sparse matrices along index K, with multiplied with bsxfun, similar to this. (Probably the slowest option)
- Sparse matrix construction with B=sparse(I, J, V.*beta(K), N, N)*X
- Avoid sparse matrix entirely with B=accumarray(I, V.*beta(K).*X(J), N, N)

I haven't tested option 1, but option 3 with accumarray runs about 50x slower than option 2 (with or without issparse option). This makes sense, as the multiplication with X in accumarray is done before accumulation, hence on a much larger array.

However, option 2 still appears to be wasting performance. For example, sparse(I, J, V, N, N)*X runs about 50x slower than basic A*X. My initial interpretation of this is that much of the time in sparse is spent on unavoidable summation of common terms in I,J,V. I cant profile sparse internally, but tested this by accumulating V prior to sparse(I, J, V, N, N)*X, which comes out to ~7x slower than A*X.

My conclusion is that there is still some significant & (potentially) avoidable overhead from calling sparse many times unrelated to summation of duplicate indices, and the larger piece of summation that is unavoidable due to the summation of time-dependent terms co-located in .

To boil this down, I have two questions I cannot answer:

- Considering the pattern of summation of duplicate indices is always the same (static I/J), can the summation overhead in sparse be improved?
- What is the source of the remaining overhead in sparse and can it be avoided?

##### 20 Comments

Bruno Luong
on 6 Aug 2022

To get a feeling, can you provide

- typical size of A0
- typical density of Ak
- typical density of A0
- typical length of k (number of matrices)
- typical length of t

Joel Lynch
on 6 Aug 2022

Edited: Joel Lynch
on 6 Aug 2022

I attached a small reference dataset (example.mat), but it may be of limited use/function because I had to really shrink it to get to 5MB. I/J/V/K is stored in a cell array for each A_k, you'd have to cat these to get the full I/J/V/K as described in the question. "example_A0" is an example of a full A_0 matrix. (Note, this A matrix is pre-processed before mtimes)

- A_0 is square, N ~2000 (on a side), possibly up to 20k. Density of <4%.
- Size of A_k matches A_0, density obviously less than A_0 but varies considerably.
- Typically N_k of 50-100 matrices, but possibly up to 500. Set at runtime.
- Time steps in the thousands or more, enough to make anything outside the time-march negligible, and anything in it worth optimizing as much as possible. Unavoidably serial.

Joel Lynch
on 6 Aug 2022

Edited: Joel Lynch
on 6 Aug 2022

FYI, I think I may have a solution, working on it now. I'll detail once I test it, but basically define a new dense or sparse matrix outside the loop where Z=sub2ind([N, N], I, J), then set V=M*beta inside the loop.

Then B=accumarray(I, V.*X(J), N, N) may exceed B=sparse(I, J, V, N, N)*X as they now both have the same number of multiplications. Essentially just ensuring the sum precedes the multiply.

Bruno Luong
on 6 Aug 2022

Bruno Luong
on 6 Aug 2022

BTW does X change with time? If not you can precompute X(J) ready for accumarray.

You can even precompute

P(l,k) := M(z,k).*X(J)

then

B = accumarray(I, P*beta,N, N)

Bruno Luong
on 6 Aug 2022

Edited: Bruno Luong
on 6 Aug 2022

The question can be summarized as: Is

B=accumarray(I, V.*beta(K).*X(J), N, N);

faster than

S = sparse(I, J, V, [N N]) * X;

? It's not clear to me.

Joel Lynch
on 6 Aug 2022

Edited: Joel Lynch
on 6 Aug 2022

That's not quite right, you would need beta in the sparse version for an apples-to-apples comparison.

But more importantly, accumarray(I, V.*beta(K).*X(J), N, N) is much slower becuse X(J) is multiplied onto many duplicate I/J coordinates that cannot be pre-summed in time, as they are from each A_k*beta(k) matrix. Using sparse, these sums are performed before A*X, but they are performed second in accumarray.

I would say my question is best summarized by asking:

Which of the following is faster, and is there still any avoidable overhead?

B=accumarray(I2, M*beta.*X(J2), sz);

% Where:

% sz = [N, N]

% Z = unique(sub2ind(sz, I, J)) "aka, index of unique non-zero I/J pairs in V"

% M is size numel(Z) x N_k (sparse or dense)

% I2/J2 are: [I2, J2] = ind2sub(sz, Z)

or

B=sparse(I, J, V.*beta(K), N, N)*X

EDIT: If M is sparse, both should have the same number of operations, but accumarray should also have less overhead, because sparse's summation of like I/J terms requires sorting or indexing overhead, whereas M*beta is a simple sum along each row.

James Tursa
on 7 Aug 2022

Joel Lynch
on 7 Aug 2022

Edited: Joel Lynch
on 7 Aug 2022

Technically no, the goal is to generate B=Ax, so the actual A_0 sparse matrix does not need to be generated & can be skipped. But generally yes, A_0 is time-dependent with constant A_k matrices whose sparsity pattern does not change. A_0 sparsity could change with zero beta's only.

I'm stuck without an understanding or ability to mod sparse's behavior, but it seems reasonable to me that there is some overhead in sparse to organize indices that ideally is only done once and stored.

James Tursa
on 7 Aug 2022

Edited: James Tursa
on 7 Aug 2022

Joel Lynch
on 7 Aug 2022

X is a full column vector, Nx1. X changes with each iteration, so is time dependent.

Bruno Luong
on 7 Aug 2022

James Tursa
on 7 Aug 2022

Edited: James Tursa
on 7 Aug 2022

Joel Lynch
on 7 Aug 2022

A detail I didn't emphasize in my question is that the occupancy of is similar for all k (otherwise sparsity of would be destroyed for large k), so there's a significant number of points that must be summed inside the time loop. I tried your solution in native MATLAB, but it was slower (I think) because summation is occurring after multiplication, which is more expensive than the other way around.

So as I understand it, the nut of this problem is how to leverage the fixed occupancy (because sparse can't) to efficiently sum co-located indices before multiplication with x. The solution Bruno & I came up with was to pre-calculate a sparse matrix Y, where each row contains co-located I/J points with the column indexed to beta, which ostensibly moves the indexing overhead when adding N_k sparse matrices outside the time-loop, replacing it with a more regular Y*beta.

James Tursa
on 7 Aug 2022

Edited: James Tursa
on 7 Aug 2022

This is the math as I understand it, although this is not how I would physically implement it. Were I to write code for this, e.g. I would not physically replicate the into one long vector and multiply all the elements by β as I show, but do the equivalent math virtually and only do the necessary non-zero multiplies. If speed is the issue and I knew there were no inf or NaN present, could skip those checks.

= the individual NxN matrices (static),

Abig = [] % concatenate the horizontally into NxNM matrix (static)

x = Nx1 column vector (time dependent)

beta = 1xM row vector (time dependent)

Xbig = x .* beta % a NxM matrix with each column using virtual array expansion

B = Abig * Xbig(:) % the result of one big matrix*vector multiply

The above is just the notional math. The actual coding would combine the last two steps virtually, and could even be multi-threaded. The coding would also depend on how the are actually stored (as separate matrices, as one big matrix, as sparse, or as I,J,V triplets).

Does this match the calculations you want, or am I missing something?

Joel Lynch
on 7 Aug 2022

Edited: Joel Lynch
on 7 Aug 2022

Yes, I think we've arrived at effectively the same solution. Your method hides the X multiplication in with beta, so ABig*XBig is on all elements A_k, but at the same cost as Y*beta in Bruno's solution. Bruno's solution does cut down on the overall size of Y by ignoring elements sparse for all A_k (columns in your ABig matrix), but you could perform the same optimization with your scheme.

My best guess is that any differences between these solutions would be splitting hairs compared to MEX implementation, so it's probably the end of the road for native MATLAB.

Bruno Luong
on 7 Aug 2022

@James Tursa yes you get it right. However

- In the range of density/size and N_k (m in your notation) your method is still 50% slower than the accumarray solution we come up with,
- The runtime scale with N_k and is not bounded by the assumption of limited "bandwidth" (density of A0) stated by @Joel Lynch. It is probably an important requirement for his application.

### Accepted Answer

Bruno Luong
on 7 Aug 2022

Edited: Bruno Luong
on 7 Aug 2022

EDIT: Here is the final code. Method used in B2 is the fatest;

M = 10000;

N = 10000;

N_k = 100;

A_k = cell(1,N_k);

for k=1:N_k

A_k{k} = sprand(M,N,0.0005);

end

N_k = length(A_k);

[M,N] = size(A_k{1});

beta = rand(N_k,1);

X = rand(N,1);

I = cell(1,N_k);

J = cell(1,N_k);

V = cell(1,N_k);

K = cell(1,N_k);

for k=1:N_k

[I{k},J{k},V{k}] = find(A_k{k});

K{k} = k + zeros(size(I{k}));

end

I = cat(1,I{:});

J = cat(1,J{:});

K = cat(1,K{:});

V = cat(1,V{:});

[IJ,~,ic] = unique([I,J],'rows');

I2 = IJ(:,1);

J2 = IJ(:,2);

Y = sparse(ic, K, V, size(IJ,1), N_k);

tic

B1=sparse(I, J, V.*beta(K), M, N)*X ;

toc

Elapsed time is 0.252217 seconds.

tic

B3 = sparse(I2,J2,Y*beta,M,N)*X;

toc

Elapsed time is 0.195761 seconds.

tic

B2 = accumarray(I2,(Y*beta).*X(J2),[M,1]);

toc

Elapsed time is 0.096573 seconds.

##### 5 Comments

Joel Lynch
on 7 Aug 2022

Edited: Joel Lynch
on 7 Aug 2022

Thanks for this!

The problem I see is that your Y matrix has a row for every corresponding I/J coordinate in the MxN matrix, which is excessive when A_0 is still sparse. Also, using sprand with large N_k will probably fill up A_0 beyond what is reasonable (I got density of 63% with your script).

Nevertheless, this is what I came up with, which was about 2x faster than your first test (for B1). The basic idea here is to ensure Y's rows are limited to I/J points with at least 1 non-zero A_k value (i.e., all unique Z indicies). Then use the ia/ic pointers from unique to yo-yo the Y matrix into a smaller mtimes operation with X, and back out into the correct size for B.

Z = sub2ind([M,N],I,J);

[Zred, ia, ic] = unique(Z);

I2 = I(ia);

J2 = J(ia);

Y = sparse(ic, K, V, numel(Zred), N_k);

tic

B2 = accumarray(I2,(Y*beta).*X(J2),[N,1]);

toc

I don't have a quick way to test it yet, but I suspect the 2x improvement is still rather conservative. This is because aligning A_k sparsity in a more realistic way (for my problem anyway) where A_0 remains reasonably sparse would mean with the same number of non-zero elements in A_k, but more terms summed in Y*beta before multiplication by X.

Joel Lynch
on 7 Aug 2022

Sounds good! I did a quick test with A_k with sparsity that aligns along some random diagonals, and found improvement of ~23x.

diag = [1000, 990, 500, 490, 48, 3, 2, 1];

diag = [-diag, 0, flip(diag)];

for k=1:N_k

A_k{k} = spdiags(rand(M, numel(diag)),diag,M,N);

end

This gives two extremes between the two, and both are improvements. This seems like the way forward.

Bruno Luong
on 7 Aug 2022

Edited: Bruno Luong
on 7 Aug 2022

Joel Lynch
on 7 Aug 2022

Yeah, at this point I'm just documenting for future viewers. Feel free to post this up top, I cleaned some things up and added comments. Also added integer type casting to reduced memory.

M = 10000; % A_0/A_k rows

N = 10000; % A_0/A_k columns

N_k = 100; % Number of A_k matricies

% Random beta/X

beta = rand(N_k,1);

X = rand(N,1);

% Pick smallest integer type for indices

int_type = 'uint64';

if max(N,M) < intmax("uint16")

int_type = 'uint16';

elseif max(N,M) < intmax("uint32")

int_type = 'uint32';

end

% Generate A_k matricies in single stream

I = []; J=[]; K=[]; V=[];

bandwith = round(M*0.25); % Controls final density of A_0

A_k_ref_density = 1e-4; % density of each A_k

for k=1:N_k

A_k = sprand(M, N, A_k_ref_density);

[Inew, Jnew] = find(A_k);

ind = abs(Inew-Jnew)<bandwith; % Keep i/j's within band

I = [I; cast(Inew(ind), int_type)];

J = [J; cast(Jnew(ind), int_type)];

V = [V; A_k(ind)];

K = [K; repmat(cast(k, int_type), sum(ind), 1)];

% spy(sparse(I,J,K, M,N)) % Spy A_k

end

%spy(sparse(I,J,K, M,N)) % Spy Final A_0 for non-zero Beta's

% Smaller vector of any non-zero i/j's in all A_k's

% numel(ic) = size([I, J], 1)

% ic points to the location in the reduced/unique array of [I, J] pairs

% (i.e. ic holds the row index where each nonzero I/J in A_0 resides)

[IJ, ~, ic] = unique([I, J], 'rows', 'stable');

Iu = IJ(:,1); % unique I in [I, J]

Ju = IJ(:,2); % unique J in [I, J]

Nrow_u = size(IJ, 1);

% Time-Independent Y matrix, such that Y*beta is a column vector of

% non-zero elements in A_0, with I/J defined by Iu, Ju

Y = sparse(ic, double(K), V, Nrow_u, N_k);

% Native approach, construct A_0 within time loop, then A*x

t1 = timeit(@() sparse(I, J, V.*beta(K), M, N)*X)

% Construct sparse, but with the reduced Y

t2 = timeit(@() sparse(Iu, Ju, Y*beta, M, N)*X)

% Optimal Approach, accumarray which avoids sparse()

t3 = timeit(@() accumarray(Iu, (Y*beta).*X(Ju), [M,1]))

### More Answers (0)

### See Also

### Tags

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!**An Error Occurred**

Unable to complete the action because of changes made to the page. Reload the page to see its updated state.

Select a Web Site

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

You can also select a web site from the following list:

## How to Get Best Site Performance

Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.

### Americas

- América Latina (Español)
- Canada (English)
- United States (English)

### Europe

- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)

- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)

### Asia Pacific

- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)