How do I compare a cell array containing string arrays against a string array without using loops?

I'm comparing two variables (data attached):
  • tableOfTextByTime.("tweetUniqueMentions"), which is a 500x1 cell array. Each cell holds a string array of zero or more words, and the number of words can differ from cell to cell.
  • tableOfUsers{:,1}, which is a 334x1 string array
The code below works using a for loop, an anonymous function, and cellfun, but it's slow.
It's ok for a small test dataset, but when running on a real data set (20,000 x 1 cell array) and (5,000 x 1 string array) it takes way too long.
for i = height(tableOfUsers):-1:1
    % create a wrapped strcmp anon fcn that takes each cell element and
    % each string element
    wStrcmp = @(anonInp1) any(strcmp(anonInp1, tableOfUsers{i,1}));
    % create matrix of indices for the entries that match the criteria (500x334)
    idxMat(:,i) = cellfun(wStrcmp, tableOfTextByTime.("tweetUniqueMentions"), 'UniformOutput', false);
    % grab the relevant text that match the criteria
    correspondingText{i,1} = tableOfTextByTime(cell2mat(idxMat(:,i)),:);
end
How can I get an equivalent result while drastically speeding up the code? Is there a way to do this in a vectorized or element-wise manner? bsxfun and arrayfun seem to have limitations when working with strings. The Parallel Computing Toolbox is not an option :)

3 Comments

Can you upload the data? You can use the paper clip icon in the INSERT section of the toolbar. We can't work with a picture of the data.
Please describe in words what the desired outcome is.
  • for each cell, you need to know whether at least one string in the cell appears anywhere in the string array?
  • for each string in each cell, you need to know if the string appears anywhere in the string array?
  • for each string in the string array, you need to know which cells it appears in?
@the cyclist - uploaded sample data that can be shared (sizes may be slightly different than described in the question).
@Walter Roberson - option 3 - the desired outcome, captured in the matrix of indices (idxMat), is to know:
  • for each string in the string array, what are the cells that contain that string?
  • with that knowledge, extract the table rows that match that criteria (tableOfTextByTime(cell2mat(idxMat(:,i)),:))


 Accepted Answer

S = load('answersData.mat');
A = S.tableOfTextByTime.("tweetUniqueMentions")
A = 100×1 cell array
    {[""]}
    {["vids_v"]}
    {["ItsRoshni08070"]}
    {[""]}
    {["shradhaarao"]}
    {[""]}
    {["GemsOfBollywood"]}
    {[""]}
    {[""]}
    {["simmyxchauhan"]}
    {[""]}
    {[""]}
    {[""]}
    {["cheerlights"]}
    {["Sumanth_077"]}
    {2×1 string}
B = S.tableOfUsers{:,1}
B = 88×1 string array
    "ArylieSumaan"
    "BeingSalmanKhan"
    "ColorsTV"
    "MaximZiatdinov"
    "RRejeleene"
    "ANINewsUP"
    "Arsalan418296"
    "Aslam29Munawar"
    "BOLNETWORK"
    "BiggBoss"
    "DramebaazPorgi"
    "FIFAcom"
    "FawadAhsanFawad"
    "GemsOfBollywood"
    "HarisRauf14"
    "High_735"
    "ItsRoshni08070"
    "Keth_2000"
    "KhadkaDeepali"
    "KhudaJaane_"
    "MahuaMoitra"
    "MathWorks"
    "NVIDIAAI"
    "NaeemRehmanEngr"
    "OrmaxMedia"
    "PSushreesangita"
    "RashamiXmagic"
    "Rsumaiya"
    "SUBWAY"
    "SabirMehmood26"
tic
T = vertcat(A{:});
X = repelem(1:numel(A),cellfun(@numel,A)).';
[Y,Z] = ismember(T,B);
F = @(x) {S.tableOfTextByTime(x,:)};
C = accumarray(Z(Y),X(Y),[],F);
toc
Elapsed time is 0.043696 seconds.
C
C = 88×1 cell array
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
Let's compare against a loop (as the cyclist mentioned, it is faster because the indexing is moved out of the loop):
tic
for kk = numel(B):-1:1
    % create a wrapped strcmp anon fcn that takes each cell element and
    % each string element
    wStrcmp = @(anonInp1) any(strcmp(anonInp1, B{kk}));
    % create matrix of indices for the entries that match the criteria (500x334)
    idxMat(:,kk) = cellfun(wStrcmp, A, 'UniformOutput', false);
    %idx = cell2mat(idxMat(:,kk))
    % grab the relevant text that match the criteria
    correspondingText{kk,1} = S.tableOfTextByTime(cell2mat(idxMat(:,kk)),:);
end
toc
Elapsed time is 0.066449 seconds.
correspondingText
correspondingText = 88×1 cell array
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {2×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
    {1×8 table}
isequal(C,correspondingText)
ans = logical
1
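
For anyone reading along, here is a minimal sketch of what the first three steps produce, run on made-up toy data rather than answersData.mat. The variable names mirror the answer above; the values are invented for illustration, and the sketch assumes every cell holds a column string array (as in the uploaded data) so that vertcat works.

```matlab
% Toy data, invented for illustration (not taken from answersData.mat)
A = {["x";"y"]; "z"; ["y";"w"]};   % 3x1 cell array of column string arrays
B = ["y";"z"];                     % 2x1 string array to look up

% Stack the contents of all cells into one long string column
T = vertcat(A{:});                 % ["x";"y";"z";"y";"w"]

% For each stacked string, record which cell of A it came from
X = repelem(1:numel(A),cellfun(@numel,A)).';   % [1;1;2;3;3]

% Y: is the string in B?  Z: its position in B (0 when absent)
[Y,Z] = ismember(T,B);             % Y = [0;1;1;1;0], Z = [0;1;2;1;0]
```

After this, Z(Y) says which entry of B was matched and X(Y) says which cell of A the match came from, which is the pair of vectors handed to accumarray.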

1 Comment

Thank you @the cyclist and @Stephen23!
Using vertcat, repelem, ismember, an anonymous function, and accumarray results in a massive speedup.
For a real dataset of size 14,025 by 4,444 this now completes in 0.501984 seconds. That's unimaginable compared to what I was seeing before. I'll need to read up on accumarray and what it's doing, but it all works as intended. Happy holidays.
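
On the accumarray question, a small sketch with made-up numbers may help: accumarray groups the values by their subscript and applies a function to each group. Here subs and vals are hypothetical stand-ins for Z(Y) and X(Y) from the accepted answer.

```matlab
subs = [1;2;1;3;2];        % group index of each value (role of Z(Y))
vals = [10;20;30;40;50];   % values to be grouped      (role of X(Y))

% Default behaviour: sum the values that share a subscript
s = accumarray(subs,vals);                    % [40;70;40]

% With a function that returns a cell, each group is collected instead.
% The order within a group is not guaranteed in general, hence the sort.
g = accumarray(subs,vals,[],@(v){sort(v)});   % {[10;30];[20;50];40}
```

In the accepted answer the function handle F does the same kind of collecting, except that it uses each group of row numbers to index into the table and returns the matching rows. With [] as the size argument the output length follows the largest subscript, so passing an explicit size such as [numel(B) 1] may be worth considering if the last users can have no matches.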


More Answers (1)

I think more can be done, but here are a couple of improvements that make the small test case faster. Hopefully it is an even larger speed-up on your real problem.
load("answersData.mat")
tic
for i = height(tableOfUsers):-1:1
    % create a wrapped strcmp anon fcn that takes each cell element and
    % each string element
    wStrcmp = @(anonInp1) any(strcmp(anonInp1, tableOfUsers{i,1}));
    % create matrix of indices for the entries that match the criteria (500x334)
    idxMat(:,i) = cellfun(wStrcmp, tableOfTextByTime.("tweetUniqueMentions"), 'UniformOutput', false);
    % grab the relevant text that match the criteria
    correspondingText{i,1} = tableOfTextByTime(cell2mat(idxMat(:,i)),:);
end
toc
Elapsed time is 0.544876 seconds.
tic
% Preallocate, and pull out desired subset of data (so indexing doesn't need to be done repeatedly)
idxMat2 = false(height(tableOfTextByTime),height(tableOfUsers));
C = tableOfTextByTime.("tweetUniqueMentions");
T = tableOfUsers{:,1};
for i = height(tableOfUsers):-1:1
    % create a wrapped strcmp anon fcn that takes each cell element and
    % each string element
    wStrcmp2 = @(anonInp1) any(strcmp(anonInp1, T(i)));
    % create matrix of indices for the entries that match the criteria (500x334)
    idxMat2(:,i) = cellfun(wStrcmp2, C);
    % grab the relevant text that match the criteria
    correspondingText2{i,1} = tableOfTextByTime((idxMat2(:,i)),:);
end
toc
Elapsed time is 0.062504 seconds.
% Test that the two methods result in the same output
isequal(correspondingText,correspondingText2)
ans = logical
1
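
If the logical match matrix itself is what you are after, it can probably be built without any loop as well. The following is only a sketch on invented toy data, assuming each cell holds a column string array; A, B, and idxMat here are illustrative names, not the variables from answersData.mat.

```matlab
% Toy data, invented for illustration
A = {["x";"y"]; "z"; ["y";"w"]};   % cells of mentions
B = ["y";"z"];                     % users to look up

T = vertcat(A{:});                                 % all mentions stacked
X = repelem((1:numel(A)).',cellfun(@numel,A));     % source cell of each mention
[Y,Z] = ismember(T,B);                             % match flag and position in B

% Preallocate the logical matrix, then switch on every (cell, user) match
idxMat = false(numel(A),numel(B));
idxMat(sub2ind(size(idxMat),X(Y),Z(Y))) = true;    % [1 0; 0 1; 1 0]
```

Column i of idxMat then plays the same role as idxMat2(:,i) in the loop above, so tableOfTextByTime(idxMat(:,i),:) would pick out the rows for user i.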

Release: R2022b
Asked: 21 Dec 2022
Edited: 21 Dec 2022