Intersect() with with repetition
Show older comments
The syntax [C,ia,ib] = intersect(A,B,'rows') returns elements without repetitions. However, I need every potential combination of ia and ib that gives C. For example:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
I need the output to be:
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[2;3;2;3];
The answer here generates indcies into A and B that are intersecting elements: https://www.mathworks.com/matlabcentral/answers/338649-arrays-intersection-with-repetition However there are no correspondace between ia and ib generated by this method. e.g., A(ia,:)~=B(ib,:). it also doesn't give every potential combination of indices.
Any ideas?
Thanks
2 Comments
Jan
on 18 Aug 2022
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[1;2;1;2];
This output does not satisfy the definition of interesct: C = A(ia,:), C = B(ib,:) . Therefore it is unclear, what the realtion between your A,B and C and ia and ib is.
This does not match "every potential combination of ia and ib that gives C" also.
Yi-xiao Liu
on 18 Aug 2022
Accepted Answer
More Answers (3)
A simple loop approach:
A = [1,1; ...
1,1; ...
1,2];
B = [0,1; ...
1,1; ...
1,1];
[C, iA, iB] = RepIntersectRows(A, B)
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
C = zeros(nA * nB, w);
iA = zeros(nA * nB, 1);
iB = zeros(nA * nB, 1);
% Find matchs:
iC = 0;
for kA = 1:nA
a = A(kA, :);
for kB = 1:nB
if isequal(a, B(kB, :))
iC = iC + 1;
C(iC, :) = a;
iA(iC) = kA;
iB(iC) = kB;
end
end
end
% Crop unused elements:
C = C(1:iC, :);
iA = iA(1:iC);
iB = iB(1:iC);
end
If A and B have 1e4 rows, the runtime is 4 seconds already. But it is thought to clarify,what you exactly need. The output of your approach does not match the original question exactly. Before I optimize my code, I want to know, if it matchs your needs at all.
4 Comments
Yi-xiao Liu
on 18 Aug 2022
The loops can beat unqiue/intersect for small inputs with about 100 elements. For large arrays a sorting and binary search is obligatory.
This is 5 times faster than the version in my answer, but no satifying approach for large inputs:
A = randi([0,5], 1e4, 2);
B = randi([0,5], 1e4, 2);
tic
[C, iA, iB] = RepIntersectRows(A, B);
toc
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
iAC = cell(nA, 1);
iBC = cell(nA, 1);
% Find matchs:
for kA = 1:nA
a = A(kA, :);
M = find(all(a == B, 2));
iAC{kA} = repmat(kA, numel(M), 1);
iBC{kA} = M;
end
% Crop unused elements:
iA = cat(1, iAC{:});
iB = cat(1, iBC{:});
C = A(iA, :);
end
Yi-xiao Liu
on 18 Aug 2022
Edited: Yi-xiao Liu
on 18 Aug 2022
Jan
on 18 Aug 2022
@Yi-xiao Liu: Matlab's unique and intersect are based on binary searchs, while the loops (e.g. in my prove of concept code) perform a linear search.
Binary search means, that the data are sorted at first, such that you do not have to compare all elements, but log2 of the elements only by dividing the inteval of interest by 2 in each step.
I'm try to find a shorter and faster solution, when I'm at home.
I frankly have not quite figured out the logic of what you want as output, but I'm pretty sure that you can construct what you want by using the ismember command instead of intersect. You'll probably need both "directions", and possibly all outputs:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[tf_ab, loc_ab] = ismember(A,B,"rows")
[tf_ba, loc_ba] = ismember(B,A,"rows")
2 Comments
Yi-xiao Liu
on 18 Aug 2022
the cyclist
on 18 Aug 2022
Ah, I get it now. Also, in my mind I was thinking that ismember had that third output like unique does, that gives the mapping back to all elements of the original vector.
11 Comments
Jan
on 18 Aug 2022
This is does not reply exactly the output you have mentioned in the question.
Jan
on 31 Aug 2022
For the timings it matters, which output you exactly need.
What are usual inputs? If the values are e.g. positive integers of a certain range, it might be much cheaper to work with UINT8 arrays or with A(:, 1)*10000+A(:,2) (and the same for B).
Yi-xiao Liu
on 31 Aug 2022
Jan
on 4 Sep 2022
No, mat2cell creates new arrays with headers of about 100 bytes per element.
Again, which output are required exactly? Are iA_C and iB_C needed outside the loop?
I do not understand the meaning of: "the output can be a m-by-2n array. where every 2n and 2n-1 column contains the data of iAB_C{n}".
"At least 24 bits" does not clarify, if the input is guarantee to match into 32 bits. Then you can either operate with UINT32, which might increase the speed by a factor of 2, if the memory bandwidth is the bottleneck, or you can combine the columns of A and B to a vector and omit the 'rows' flag for unique and ismember.
I've done a bunch of experiments but as long as I am not sure, what you want as output, it would be a waste of time to post the codes. In the original question you ask for C, ia and ib, but in the code you produce iA, iB, iA_C, iB_C, iAB_C and C_rep. I cannot optimize code efficiently, if I have to guess, what is needed as output.
Yi-xiao Liu
on 7 Sep 2022
@Yi-xiao Liu: "Alternatively" is not the right way. I cannot decide, what you need and I do not want to optimze code based on my assumptions about what you need.
If you post a function with inputs and outputs, which reply exactly, what you need, I can try to improve the speed without guessing. Something like this (see my answer):
function [C, iA, iB] = RepIntersectRows(A, B)
...
Is this a valid input:
A = randi([0, 2^24-1], 1e6, 2);
B = randi([0, 2^24-1], 1e6, 2);
The algorithms have different speeds depending on the number of overlapping pairs. So the faster method might be different, if such data are more relaisttic:
A = randi([0, 32], 1e6, 2);
B = randi([0, 32], 1e6, 2);
If both columns of A and B contain 24 bit integers, you can combine them to one column with 64 bit: AA = uint64(A(:, 1) * 2^24 + A(:, 2)). If the original variables are stored as doubles, this reduced the memory consumption by a factor 2. The sorting is faster also:
A = randi([0, 2^24-1], 1e6, 2);
tic; Au = unique(A, 'rows'); toc
tic; AA = uint64(A(:, 1) * 2^24 + A(:, 2)); AAu = unique(AA); toc
Do you see it?
Yi-xiao Liu
on 9 Sep 2022
@Yi-xiao Liu: The shown code multiplies the first column by 2^24 and adds the 2nd column. Of course you can do this with 2^32 also.
unique(A, 'rows') has to process 2 columns, while unique(AA) operates on 1 column only. This reduces the memory access and comparisons by 50% and in consequence uses about half of the processing time. The overhead by adding the columns and splitting them afterwards is negligible in this case, because it is linear in time (double data size => double computing time), while the sorting and the binary search are more expensive with growing data sizes.
Again: Should I guess, that
A = randi([0, 2^24-1], 1e6, 2);
B = randi([0, 2^24-1], 1e6, 2);
are typical inputs or is there a higher rate of overlaps as in:
A = randi([0, 32], 1e6, 2);
B = randi([0, 32], 1e6, 2);
You have mentioned the output size, but what are typical input sizes?
It is a certain amount of work to obtain exact explanations from you.
Yi-xiao Liu
on 9 Sep 2022
Categories
Find more on Matrix Indexing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!