Matlab comparison of two large matricies

1 view (last 30 days)
I am trying to retrieve the index of exact matches (row-specific) between two large matricies. I have a n x 61 matrix A containing values from 0 to 9 and another n x 61 matrix B , whereas each row here contains values from 0 to 9 but mostly NaN (only 2 to 8 columns in each row of matrix B contain actual numbers). Matrix A can be expected to have between 1.5 million and 3 million rows, whereas matrix B has around 0.2 to 0.5 million rows. Here is an example of the setup:
% create matrix a with random data
dataSample = [0 9];
numRows = 1000000;
numCols = 61;
A = randi(dataSample,numRows,numCols);
% create matrix B with random data
numRows = 100000;
numCols = 61;
numColsUse = 2:8;
dataRange = 0:9;
B = NaN(numRows,numCols);
for i = 1:size(B,1)
% randomly selet number of columns to fill
numColsFill = datasample(numColsUse,1);
% randomly select column index from available columns
colIdx = datasample([1:numCols],numColsFill);
% randomly select values from 0 to 9
numFill = datasample([0:9],numColsFill);
% insert numbers at respective column in matrix B
B(i,colIdx) = numFill;
end
I want to compare every single row of matrix A with the entire matrix B and find exact matches, where the numbers of matrix B match the numbers of matrix A at their respective positions (columns) - hence the NaN in matrix B are to be ignored.
I can achieve the desired result using cellfun, where I slice matrix A in several subsets and then use a custom function to compare the rows of the subset with each row in matrix B, like so:
% put all rows of matrix B in single cell
cellB = {B};
% take subset of matrix A and convert to cell array
subA = A(1000:5000,:);
subA = num2cell(subA,2);
% prepare cellB to meet cellfun conditions
cellB = repmat(cellB, [size(subA,1) 1]);
% apply cellfun to retrieve index of each exact match
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
Function findMatch looks as follows:
function [ idx ] = LTableEval( cellB, subA )
idxCheckLT = lt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheckGT = gt(cellB, repmat(subA, [size(cellB,1) 1]));
idxCheck = idxCheckLT + idxCheckGT;
idxSum = sum(idxCheck,2);
idx = find(idxSum == 0);
end
This approach works, but it appears to be very inefficient, especially RAM-wise, as the cellfun requires all inputs to have the same size and hence a multiplication of the same data sets. Any ideas on how to tackle this problem in a more efficient way? Many thanks!

Accepted Answer

Guillaume
Guillaume on 28 May 2017
This is how I'd do it:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
Bcols = find(~isnan(B(Brow, :)));
matches{Brow} = find(all(A(:, Bcols) == B(Brow, Bcols), 2)); %requires R2016b or later
end
It's certainly a lot more efficient than any of the solutions you already have.
Note: in R2015b or earlier replace the relevant line by:
matches{Brow} = find(all(bsxfun(@eq, A(:, Bcols), B(Brow, Bcols)), 2));
  4 Comments
Guillaume
Guillaume on 29 May 2017
Try this:
matches = cell(size(B, 1), 1);
for Brow = 1:size(B, 1)
Bcols = find(~isnan(B(Brow, :)));
matchedrows = find(all(A(:, Bcols) == B(Brow, Bcols), 2));
matches{Brow} = [matchedrows, repmat(Brow, size(matchedrows))];
end
matches = cell2mat(matches);
finalScores = accumarray(matches(:, 1), matches(:, 2), [size(A, 1), 1], @(Brows) mean(Bscores(Brows)), nan);
However, if most rows of A all match at least some rows of B it may be more efficient to have just one loop over the rows of A. (I regard accumarray as another loop)
Benvaulter
Benvaulter on 29 May 2017
It works like a charm - absolutely brillinat - thank you so much!

Sign in to comment.

More Answers (1)

Matthew Eicholtz
Matthew Eicholtz on 26 May 2017
A couple comments:
1. Did you mean to convert to the cell array in this manner?
subA = num2cell(subA);
If you want to look at each row as its own cell, I think you need:
subA = num2cell(subA,2);
2. I am not sure how much more efficient this solution will be, but you can replace
idxContainer = cellfun(@findMatch, cellB, subA, 'UniformOutput', false);
with
idxContainer = cellfun(@(x,y) find(all(isnan(x)|x==y,2)), cellB, subA, 'UniformOutput', false);
Let me know if this helps at all.
  1 Comment
Benvaulter
Benvaulter on 28 May 2017
Hi Matthew,
thanks for your suggestions. First of all, you are correct, I has to be
subA = num2cell(subA,2);
as I want to conduct the operation on each row. Your alternative suggestion regarding the cellfunction returns the following error for me:
Error using ==
Matrix dimensions must agree.
Error in @(x,y)find(all(isnan(x)|x==y,2))
I have found other solutions suggesting matrix substraction, which makes sense to me. However, I am facing two challenges:
1. How to maximize performance when comparing each single row (i.e. what you suggested)
2. How to best setup the overall code (i.e. using cellfun vs. for-loops, vs. other ideas) to maximize speed overall (with these amounts of data, I also believe RAM might be a factor in what I am able to do.
Someone suggested the following solution:
for br = 1:size(B,1)
abs_diff = abs(repmat(B(br,:),[size(A,1) 1]) - A);
abs_diff(isnan(abs_diff)) = 0;
match = abs_diff == 0;
ind = find(sum(match,2)==size(match,2));
matches{br} = [repmat(br,[length(ind) 1]) ind];
end
matches = cell2mat(matches');
which is already a step into the right correction and which does not use "slicing the dataset" into smaller parts. However, each iteration of the loop here is still over a second long and I am wondering whether there are still better ways to do this. Any help is much appreciated - thanks!

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!