Find indices of multiple strings within another string

I am trying to efficiently find which strings (character vectors) match between two cell arrays.
One cell array contains ~1000 equations written as strings that I'm trying to parse by matching to strings in another array (100,000 items). I need to know the indices from the 100,000 items that are found within the ~1000 equations. There may be multiple of the 100,000 items found within each of the 1000 equations.
My current implementation looks like this:
Equations.Equation % this is a list of ~1000 equations, a cell array of character vectors
OutputData.DataName % list of ~100,000 possible strings I'm looking for in the equations (my variable names)
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
This is fairly slow. Is there a way to more efficiently find within Equations(ii).Equation which items within OutputData.DataName are found and the index of those items?
  4 Comments
Paul on 9 Apr 2022
Something's not working with this example data and the code in the question. Is there a typo somewhere?
Equations.Equation = { '(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'}
Equations = struct with fields:
Equation: {3×1 cell}
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
Error using cellfun
Non-scalar in Uniform output, at index 1, output 1.
Set 'UniformOutput' to false.
Voss on 9 Apr 2022
It seems like Equations is actually a struct array:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'})
Equations = 3×1 struct array with fields:
Equation
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
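A minimal sketch of why the two ways of building Equations behave differently. The names eqs, S1, S2, r1 and r2 are illustrative and not from the thread. Assigning a cell array to a field gives a 1×1 struct whose field holds the whole cell, so contains returns one result per equation and cellfun rejects the non-scalar output. Passing the same cell array to struct() gives one struct element per equation, so each call returns a scalar.

```matlab
% Illustrative names only (eqs, S1, S2, r1, r2 are not from the thread).
eqs = {'(123X + 123Y).^2'; ...
       '500 + 456X + 123Z'; ...
       '200 * abs(789Z * pi) + 123X'};

% Field assignment: a 1x1 struct whose field holds the whole 3x1 cell.
S1.Equation = eqs;

% struct() with a cell array value: a 3x1 struct array, one equation each.
S2 = struct('Equation', eqs);

% contains on a cell array returns one logical per cell (3x1 here),
% which is what triggers the 'Non-scalar in Uniform output' error.
r1 = contains(S1(1).Equation, '123X');

% contains on a single character vector returns a scalar logical.
r2 = contains(S2(1).Equation, '123X');

disp(size(S1))   % size of the struct built by field assignment
disp(size(S2))   % size of the struct array built by struct()
```

With the struct array form, length(Equations) is also the number of equations, so the loop in the question visits each equation once.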


Accepted Answer

Paul on 10 Apr 2022
It looks like using string variables with an inner loop is much faster than a cell array with cellfun, at least here on Answers with the data provided.
Original code, as modified by @Voss:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
Convert the cell arrays to strings and implement an inner loop to compute matches. Verify that the results are the same.
equations = string({Equations.Equation});
dataname = string(OutputData.DataName);
matches = false(1,numel(dataname)); % preallocate as logical, one entry per data name
for ii = 1:numel(equations)
    for jj = 1:numel(dataname)
        matches(jj) = contains(equations(ii),dataname(jj));
    end
    matches
    indices = find(matches)
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
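Rather than comparing the printed output by eye, the agreement can also be checked in code. This is only a sketch on the example data from this thread; the names viaCellfun, viaLoop and allAgree are made up for illustration.

```matlab
% Example data from the thread.
Equations = struct('Equation', { ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
                       '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

% String versions used by the inner-loop approach.
equations = string({Equations.Equation});
dataname  = string(OutputData.DataName);

allAgree = true;   % illustrative flag, not from the thread
for ii = 1:numel(Equations)
    % Approach 1: cellfun over the cell array of names.
    viaCellfun = cellfun(@(x) contains(Equations(ii).Equation, x), ...
                         OutputData.DataName).';

    % Approach 2: explicit inner loop over string elements.
    viaLoop = false(1, numel(dataname));
    for jj = 1:numel(dataname)
        viaLoop(jj) = contains(equations(ii), dataname(jj));
    end

    % Both should be 1x12 logical rows with identical entries.
    allAgree = allAgree && isequal(viaCellfun, viaLoop);
end
disp(allAgree)
```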
Wrap an outer loop around the original code to test timing.
ntrials = 1e5;
tic
for trials = 1:ntrials
    for ii = 1:length(Equations)
        matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).';
        indices = find(matches);
        % do some other stuff with the matches found, then move onto the next iteration of the loop
    end
end
toc
Elapsed time is 15.236180 seconds.
tic
for trials = 1:ntrials
    for ii = 1:numel(equations)
        for jj = 1:numel(dataname)
            matches(jj) = contains(equations(ii),dataname(jj));
        end
        matches;
        indices = find(matches);
    end
end
toc
Elapsed time is 2.448469 seconds.
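To put the two toc readings in perspective, here is some back-of-the-envelope arithmetic. It assumes the cost per contains call stays constant and scales linearly to the full problem (~1000 equations by 100,000 names), which may well not hold for longer equations or on a different machine, so treat the estimates as rough orders of magnitude only. The variable names are illustrative.

```matlab
% Timings reported by toc in this thread (seconds).
tCellfun = 15.236180;
tString  = 2.448469;

% Size of the benchmark: trials x equations x data names.
ntrials = 1e5;
nEq     = 3;
nNames  = 12;
nCallsBench = ntrials * nEq * nNames;   % contains calls per benchmark

% Relative speedup of the string loop over cellfun (about 6.2x).
speedup = tCellfun / tString;

% Average cost of one contains call in each approach.
perCallCellfun = tCellfun / nCallsBench;
perCallString  = tString  / nCallsBench;

% ASSUMPTION: linear scaling to the full problem size from the question.
nCallsFull  = 1000 * 100000;
estCellfun  = perCallCellfun * nCallsFull;   % estimated seconds, cellfun
estString   = perCallString  * nCallsFull;   % estimated seconds, string loop

fprintf('speedup: %.2fx\n', speedup);
fprintf('estimated full run: %.0f s (cellfun), %.0f s (string loop)\n', ...
        estCellfun, estString);
```

Under these assumptions the full problem goes from several minutes to roughly one minute, so the string loop helps but does not change the overall scaling.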
I was actually surprised that there isn't a string function that can replace that inner loop, but I couldn't find one. Maybe it can be done using a particular pattern, but I couldn't figure that out either.
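One possible way to avoid the inner loop altogether, sketched here under an assumption that is not stated in the question: every variable name appears in an equation as a whole token made of letters, digits or underscores. In that case each equation can be split into tokens with regexp and the tokens looked up with ismember, so the work per equation depends on the number of tokens rather than on the 100,000 names. This is not equivalent to contains: a name that is only a substring of a longer token (for example a hypothetical name 'X1' inside 'X12') would match with contains but not here. The names eqs, names, tokens and indices are illustrative.

```matlab
% Example data from the thread, kept as cell arrays of character vectors.
eqs = {'(123X + 123Y).^2'; ...
       '500 + 456X + 123Z'; ...
       '200 * abs(789Z * pi) + 123X'};
names = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
         '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

indices = cell(numel(eqs), 1);   % one index list per equation
for ii = 1:numel(eqs)
    % ASSUMPTION: names are whole tokens of letters, digits, underscores.
    tokens = regexp(eqs{ii}, '\w+', 'match');

    % Look up every token in the name list in one call.
    [tf, loc] = ismember(tokens, names);

    % Keep the positions of tokens that are known names, sorted and
    % without repeats (a name may occur several times in an equation).
    indices{ii} = unique(loc(tf));
end

celldisp(indices)
```

Tokens such as '500', 'abs' and 'pi' simply fail the lookup and are dropped, so they do not need to be filtered out beforehand.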

Release

R2016b
