Intersect() with with repetition

The syntax [C,ia,ib] = intersect(A,B,'rows') returns elements without repetitions. However, I need every potential combination of ia and ib that gives C. For example:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
I need the output to be:
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[2;3;2;3];
The answer here generates indcies into A and B that are intersecting elements: https://www.mathworks.com/matlabcentral/answers/338649-arrays-intersection-with-repetition However there are no correspondace between ia and ib generated by this method. e.g., A(ia,:)~=B(ib,:). it also doesn't give every potential combination of indices.
Any ideas?
Thanks

2 Comments

C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[1;2;1;2];
This output does not satisfy the definition of interesct: C = A(ia,:), C = B(ib,:) . Therefore it is unclear, what the realtion between your A,B and C and ia and ib is.
This does not match "every potential combination of ia and ib that gives C" also.
Fixed the typo in ib. Good catch!

Sign in to comment.

 Accepted Answer

If you really need the redundant information in iAB_C, iAB_Cprime, idx1, idx2, this is faster than the original version tested with the data:
A = randi([0, 200], 1e6, 2);
B = randi([0, 200], 1e6, 2);
I get 1.12 s instead of 1.86 s.
function [C, iAB_C, iAB_Cprime, idx1, idx2] = repintersect_1b(A2, B2)
% Join data for faster sorting:
A = A2 * [4294967296; 1];
B = B2 * [4294967296; 1];
[uA, ~, iAx] = unique(A, "stable");
[a, idx] = sort(iAx);
aV = 1:numel(A);
aV = aV(idx).';
aI = RunLen(a);
[uB,~,iBx] = unique(B, "stable");
[b, idx] = sort(iBx);
bV = 1:numel(B);
bV = bV(idx).';
bL = cumsum([1, RunLen(b)]);
[C2, iuA, iuB] = intersect(uA, uB, "stable");
% Split data for the output:
C = [floor(C2 / 4294967296), rem(C2, 4294967296)];
iAB_C = cell(numel(iuA), 1);
a0 = 1;
for ii = 1:numel(iuA)
a1 = a0 + aI(ii); % Easier indexing for A
aa = aV(a0:a1 - 1);
a0 = a1;
na = numel(aa);
b0 = bL(iuB(ii)); % Need accumulated RunLength for B
b1 = bL(iuB(ii) + 1) - 1;
bb = bV(b0:b1);
nb = numel(bb);
qa = repmat(aa.', nb, 1); % Replace MESHGRID
qb = repmat(bb, 1, na);
iAB_C{ii} = [qa(:), qb(:)];
end
iAB_Cprime = cell2mat(iAB_C);
idx2 = cumsum(cellfun('size', iAB_C, 1)); % Faster than cellfun(@(x) size(x,1), C)
idx1 = [1; idx2(1:end-1) + 1];
end
% Helper function:
function n = RunLen(x)
d = [true; diff(x(:)) ~= 0]; % TRUE if values change
n = diff(find([d.', true])); % Indices of changes
end
I've omitted the temporarily used cell arrays.
Further time could be saved, if you do not create iAB_Cprime and the set of iAB_C and the indices, because both contain the same information.

3 Comments

Yi-xiao Liu
Yi-xiao Liu on 11 Sep 2022
Edited: Yi-xiao Liu on 11 Sep 2022
Thank you @Jan
It is indeed a lot faster after getting rid of mat2cell().
Do you think it is possible/worthwhile to further vecterize the loop, assuming that the na and nb only fluctuate within a very small margin at each iteration? (e.g., for 95% of the time na<=2, and 5% 2<na<=5, same goes for nb). I can see that a0, a1, b0, b1 can be easily moved outside of the loop, but qa and qb will be more difficult
PS: why did you replace the meshgrid method? It doesn't seem to provide any performance advantage (unlike the RunLen(x) vs accumarray(x,1) )
Vectorizing is not faster, if large arrays must be stored temporarily. I assume that vectorizing is not an advantage here.
"e.g., for 95% of the time na<=2, and 5% 2<na<=5, same goes for nb" - I've asked repeatedly for typical input arguments, because the runtimes for different approachs depend on the number and size of the overlaps.
"why did you replace the meshgrid method?" - see above: I've moved the code, which is performed inside meshgrid directly to the code to avoid the overhead of the input checks. Now it matters, how often meshgrid is called. In the input data I have invented, I see at least a small advantage. But maybe this is different for the original input data.
Another option is, that I have Matlab R2018b only. Maybe the current version uses a faster implementation of meshgrid.
As said before, the posted function wastes time and memory with producing 2 sets of redundant outputs, but this is, what you are asking for. Having realistic input data would allow for adjusting the code to increase the speed. A general purpose algorithm cannot exploit the nature of specific inputs.
Yi-xiao Liu
Yi-xiao Liu on 14 Sep 2022
Edited: Yi-xiao Liu on 14 Sep 2022
That (the distribution of na/nb) was the observation when I was trying your function on the data.
However, now come to think about it, the majority of na/nb being small means that the size of repetition is consistent for most intersections. So the function can (I can) be optimized to use vectorized code to cover those na==nb==1, and then loop over only the case when na>1|nb>1. (or depending on the data, also vectorize (na==1&nb==2)|(na==2&nb==1), etc.)
I know the output is redundant, and I only need one (set) of them. I was just saying eihter is acceptable.
In short, thanks for your help. I now know how to tailor it for my data.

Sign in to comment.

More Answers (3)

Jan
Jan on 18 Aug 2022
Edited: Jan on 18 Aug 2022
A simple loop approach:
A = [1,1; ...
1,1; ...
1,2];
B = [0,1; ...
1,1; ...
1,1];
[C, iA, iB] = RepIntersectRows(A, B)
C = 4×2
1 1 1 1 1 1 1 1
iA = 4×1
1 1 2 2
iB = 4×1
2 3 2 3
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
C = zeros(nA * nB, w);
iA = zeros(nA * nB, 1);
iB = zeros(nA * nB, 1);
% Find matchs:
iC = 0;
for kA = 1:nA
a = A(kA, :);
for kB = 1:nB
if isequal(a, B(kB, :))
iC = iC + 1;
C(iC, :) = a;
iA(iC) = kA;
iB(iC) = kB;
end
end
end
% Crop unused elements:
C = C(1:iC, :);
iA = iA(1:iC);
iB = iB(1:iC);
end
If A and B have 1e4 rows, the runtime is 4 seconds already. But it is thought to clarify,what you exactly need. The output of your approach does not match the original question exactly. Before I optimize my code, I want to know, if it matchs your needs at all.

4 Comments

I think your output and mine can be easily translated into each other: see the example I gave above.
So I think either format of output is fine. However you make a good point on speed, in my real data I expect C to have ~1e10 rows, so any way to vectorize/optimize this will be critical
The loops can beat unqiue/intersect for small inputs with about 100 elements. For large arrays a sorting and binary search is obligatory.
This is 5 times faster than the version in my answer, but no satifying approach for large inputs:
A = randi([0,5], 1e4, 2);
B = randi([0,5], 1e4, 2);
tic
[C, iA, iB] = RepIntersectRows(A, B);
toc
Elapsed time is 0.825147 seconds.
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
iAC = cell(nA, 1);
iBC = cell(nA, 1);
% Find matchs:
for kA = 1:nA
a = A(kA, :);
M = find(all(a == B, 2));
iAC{kA} = repmat(kA, numel(M), 1);
iBC{kA} = M;
end
% Crop unused elements:
iA = cat(1, iAC{:});
iB = cat(1, iBC{:});
C = A(iA, :);
end
Yi-xiao Liu
Yi-xiao Liu on 18 Aug 2022
Edited: Yi-xiao Liu on 18 Aug 2022
@Jan How do you propose to use binary search? It doesn't sound sth that should be done in MATLAB
@Yi-xiao Liu: Matlab's unique and intersect are based on binary searchs, while the loops (e.g. in my prove of concept code) perform a linear search.
Binary search means, that the data are sorted at first, such that you do not have to compare all elements, but log2 of the elements only by dividing the inteval of interest by 2 in each step.
I'm try to find a shorter and faster solution, when I'm at home.

Sign in to comment.

I frankly have not quite figured out the logic of what you want as output, but I'm pretty sure that you can construct what you want by using the ismember command instead of intersect. You'll probably need both "directions", and possibly all outputs:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[tf_ab, loc_ab] = ismember(A,B,"rows")
tf_ab = 3×1 logical array
1 1 0
loc_ab = 3×1
2 2 0
[tf_ba, loc_ba] = ismember(B,A,"rows")
tf_ba = 3×1 logical array
0 1 1
loc_ba = 3×1
0 1 1

2 Comments

The problem I have with ismember, is that loc only gives the first index. For example, if one element of A occurs twice in B, I need both indcies. I suppose I can loop through every element in A, but that would be rather inefficient.
I have a slightly improved version that loops every unique and intersecting element to do what I need below. Hopefully that will make my goal a bit more clear.
Ah, I get it now. Also, in my mind I was thinking that ismember had that third output like unique does, that gives the mapping back to all elements of the original vector.

Sign in to comment.

Here is my current solution.
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[uA,~,iA]=unique(A,"rows","stable");
iA=sortrows([iA,(1:size(A,1))'],1);
iA=mat2cell(iA(:,2),accumarray(iA(:,1),1));
[uB,~,iB]=unique(B,"rows","stable");
iB=sortrows([iB,(1:size(B,1))'],1);
iB=mat2cell(iB(:,2),accumarray(iB(:,1),1));
[C,iuA,iuB]=intersect(uA,uB,"rows","stable");
iA_C=cell(size(iuA));iB_C=cell(size(iuA));iAB_C=cell(size(iuA));
for ii=1:numel(iAB_C)
iA_C{ii}=iA{iuA(ii)};
iB_C{ii}=iB{iuB(ii)};
[iAB_C_iiA,iAB_C_iiB]=meshgrid(iA_C{ii},iB_C{ii});
iAB_C{ii}=[iAB_C_iiA(:),iAB_C_iiB(:)];
end
disp(iAB_C)
{4×2 double}
disp(iAB_C{1})
1 2 1 3 2 2 2 3
s

11 Comments

This is does not reply exactly the output you have mentioned in the question.
No it doesn't. However it's easily to modify it make it so:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[uA,~,iA]=unique(A,"rows","stable");
iA=sortrows([iA,(1:size(A,1))'],1);
iA=mat2cell(iA(:,2),accumarray(iA(:,1),1));
[uB,~,iB]=unique(B,"rows","stable");
iB=sortrows([iB,(1:size(B,1))'],1);
iB=mat2cell(iB(:,2),accumarray(iB(:,1),1));
[C,iuA,iuB]=intersect(uA,uB,"rows","stable");
iA_C=cell(size(iuA));iB_C=cell(size(iuA));iAB_C=cell(size(iuA));C_rep=cell(size(iuA));
for ii=1:numel(iAB_C)
iA_C{ii}=iA{iuA(ii)};
iB_C{ii}=iB{iuB(ii)};
[iAB_C_iiA,iAB_C_iiB]=meshgrid(iA_C{ii},iB_C{ii});
iAB_C{ii}=[iAB_C_iiA(:),iAB_C_iiB(:)];
C_rep{ii}=repmat(C(ii,:),size(iAB_C{ii},1),1);
end
C_rep=cell2mat(C_rep)
C_rep = 4×2
1 1 1 1 1 1 1 1
iAB_C=cell2mat(iAB_C);
ia=iAB_C(:,1)
ia = 4×1
1 1 2 2
ib=iAB_C(:,2)
ib = 4×1
2 3 2 3
For the timings it matters, which output you exactly need.
What are usual inputs? If the values are e.g. positive integers of a certain range, it might be much cheaper to work with UINT8 arrays or with A(:, 1)*10000+A(:,2) (and the same for B).
I have thought about using uint data types, however they are intergers in the range of 1 to a dozen millions. I guess that needs at least 24bit, which doesn't come that off from the standard double?
BTW in my profiling I found mat2cell() to be (one of) the bottleneck. How curious. I thought this just add a layer of pointers to the original data stored in RAM.
As for the output, ideally the output can be a m-by-2n array. where every 2n and 2n-1 column contains the data of iAB_C{n}, this should make vectorzing subsequents steps easier. However I cannot guarantee iAB_C{n} has the same length so I guess this is not possible?
No, mat2cell creates new arrays with headers of about 100 bytes per element.
Again, which output are required exactly? Are iA_C and iB_C needed outside the loop?
I do not understand the meaning of: "the output can be a m-by-2n array. where every 2n and 2n-1 column contains the data of iAB_C{n}".
"At least 24 bits" does not clarify, if the input is guarantee to match into 32 bits. Then you can either operate with UINT32, which might increase the speed by a factor of 2, if the memory bandwidth is the bottleneck, or you can combine the columns of A and B to a vector and omit the 'rows' flag for unique and ismember.
I've done a bunch of experiments but as long as I am not sure, what you want as output, it would be a waste of time to post the codes. In the original question you ask for C, ia and ib, but in the code you produce iA, iB, iA_C, iB_C, iAB_C and C_rep. I cannot optimize code efficiently, if I have to guess, what is needed as output.
What I need as output is C and iAB_C of my first code. However, considering the siginificant overhead of cell datatypes, I am not so sure that is a good idea. I don't really need it to be a cell array, for example, take the outout of my first example:
if there are 2 vectors, idx1 and idx2, that satisfy
iAB_C_prime=cell2mat(iAB_C);
iAB_C_prime(idx1(i):idx2(i),:)==iAB_C{i}
and producing iAB_C_prime, idx1, idx2 is faster than the cell version iAB_C, it will do.
Alternatively, if
iAB_C_prime2=[iAB_C{1},iAB_C{2},...,iAB_C{end}];
can be produced faster, it will also do. It may be necessary to pad the end of each (2) columns with NaN though.
the 24bit comment was that all elements in A and B should be integers smaller than 2^24-1. I am not sure how you propose to omit the 'rows' flag and get the same result.
@Yi-xiao Liu: "Alternatively" is not the right way. I cannot decide, what you need and I do not want to optimze code based on my assumptions about what you need.
If you post a function with inputs and outputs, which reply exactly, what you need, I can try to improve the speed without guessing. Something like this (see my answer):
function [C, iA, iB] = RepIntersectRows(A, B)
...
Is this a valid input:
A = randi([0, 2^24-1], 1e6, 2);
B = randi([0, 2^24-1], 1e6, 2);
The algorithms have different speeds depending on the number of overlapping pairs. So the faster method might be different, if such data are more relaisttic:
A = randi([0, 32], 1e6, 2);
B = randi([0, 32], 1e6, 2);
If both columns of A and B contain 24 bit integers, you can combine them to one column with 64 bit: AA = uint64(A(:, 1) * 2^24 + A(:, 2)). If the original variables are stored as doubles, this reduced the memory consumption by a factor 2. The sorting is faster also:
A = randi([0, 2^24-1], 1e6, 2);
tic; Au = unique(A, 'rows'); toc
Elapsed time is 0.115813 seconds.
tic; AA = uint64(A(:, 1) * 2^24 + A(:, 2)); AAu = unique(AA); toc
Elapsed time is 0.048099 seconds.
Do you see it?
Here it is the wrapped function. (Besides C) either iAB_C, as a cell array,
or iAB_Cprime, idx1, and idx2, as numerical arrays will be acceptable as outputs. I just thought producing the latter can be faster given how clumsy you made cell datatype sounds.
In fact, you can se that iAB_C{ii}==iAB_Cprime(idx1(ii):idx2(ii),:)
A=round(5*rand(20,2));
B=round(5*rand(20,2));
[C,iAB_C,iAB_Cprime,idx1,idx2]=repintersect(A,B);
for ii=1:size(C,1)
all(iAB_C{ii}==iAB_Cprime(idx1(ii):idx2(ii),:),"all")
end
ans = logical
1
ans = logical
1
ans = logical
1
ans = logical
1
function [C,iAB_C,iAB_Cprime,idx1,idx2]=repintersect(A,B)
[uA,~,iA]=unique(A,"rows","stable");
iA=sortrows([iA,(1:size(A,1))'],1);
iA=mat2cell(iA(:,2),accumarray(iA(:,1),1));
[uB,~,iB]=unique(B,"rows","stable");
iB=sortrows([iB,(1:size(B,1))'],1);
iB=mat2cell(iB(:,2),accumarray(iB(:,1),1));
[C,iuA,iuB]=intersect(uA,uB,"rows","stable");
iA_C=cell(size(iuA));iB_C=cell(size(iuA));iAB_C=cell(size(iuA));
for ii=1:numel(iAB_C)
iA_C{ii}=iA{iuA(ii)};
iB_C{ii}=iB{iuB(ii)};
[iAB_C_iiA,iAB_C_iiB]=meshgrid(iA_C{ii},iB_C{ii});
iAB_C{ii}=[iAB_C_iiA(:),iAB_C_iiB(:)];
end
iAB_Cprime=cell2mat(iAB_C);
idx1=cellfun(@(x) size(x,1),iAB_C);
idx2=cumsum(idx1);
idx1=[1;idx2(1:end-1)+1];
end
I am not sure what
uint64(A(:, 1) * 2^24 + A(:, 2))
exactly does. Trying to juice my rusty CS knowledge, does it shift A(:,1) leftward, by 24 bits, and leave the last 24 bits to A(:,2) to make up a new number to compare? Clever.
Although why is this faster than 'rows' flag? I thought MATLAB would do this internally in its built-in functions.
Another question is why not shift 32 bits? you are using uint64 anyway.
Jan
Jan on 9 Sep 2022
Edited: Jan on 9 Sep 2022
@Yi-xiao Liu: The shown code multiplies the first column by 2^24 and adds the 2nd column. Of course you can do this with 2^32 also.
unique(A, 'rows') has to process 2 columns, while unique(AA) operates on 1 column only. This reduces the memory access and comparisons by 50% and in consequence uses about half of the processing time. The overhead by adding the columns and splitting them afterwards is negligible in this case, because it is linear in time (double data size => double computing time), while the sorting and the binary search are more expensive with growing data sizes.
Again: Should I guess, that
A = randi([0, 2^24-1], 1e6, 2);
B = randi([0, 2^24-1], 1e6, 2);
are typical inputs or is there a higher rate of overlaps as in:
A = randi([0, 32], 1e6, 2);
B = randi([0, 32], 1e6, 2);
You have mentioned the output size, but what are typical input sizes?
It is a certain amount of work to obtain exact explanations from you.
Sorry. I was posting late last night just before sleep.
For typical inputs, (elements of) A and B are integers in range of 1 to 2^24. They are expected to have ~10^10 rows each, in the same scale of output C. If it helps, A usually is a subset of B (not always).
And about our enivorment, the typical node we can request has 48 CPU cores and 177GB memory. We can request large memory (~1TB) nodes if necessary.

Sign in to comment.

Products

Release

R2019b

Asked:

on 17 Aug 2022

Edited:

on 14 Sep 2022

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!