How to extract matches from results of a regexp match

Bill Tubbs on 8 Jun 2022
Edited: Stephen23 on 19 Jun 2022
I'm trying to find the columns of a table that match a pattern. This works:
col_names = {'X_est_9', 'X_est_10', 'Y_est_9', 'Y_est_10', 'E_obs_9', 'E_obs_10'};
result = regexp(col_names, 'E_obs_\d*', 'match')
But the result is a cell array of cells (not sure why):
result =
1×6 cell array
{0×0 cell} {0×0 cell} {0×0 cell} {0×0 cell} {1×1 cell} {1×1 cell}
I just want a cell array of the matched results:
matched_col_names =
1×2 cell array
{'E_obs_9'} {'E_obs_10'}
There must be an easier way than this:
matched_col_names = cellfun(@(x) x, result(~cellfun(@isempty, result)))

Accepted Answer

Stephen23 on 8 Jun 2022
Edited: Stephen23 on 19 Jun 2022
"But the result is a cell array of cells (not sure why):"
Summary: you need to use the ONCE option.
Explanation: There are two things going on in your question. Firstly, you used the default ALL option, which returns every occurrence of the regular expression in the input text, and there could be two or more of them. Because there could be multiple matches, all of the outputs are nested in cell arrays (you can see this by reading through the output descriptions in the REGEXP documentation, which are too many to copy here).
Because you only want to match the regular expression once (not multiple times), you should specify the ONCE option. This removes one level of nested cell arrays from the output. If you are planning on using REGEXP, you will find the ONCE option very useful.
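As a minimal sketch of that difference, here is the same pattern applied to a single character vector with and without 'once'. The variable names all_matches and first_match are only illustrative.

```matlab
% Default (ALL) behaviour: every match is returned, wrapped in a cell array.
all_matches = regexp('E_obs_9 E_obs_10', 'E_obs_\d*', 'match');

% With 'once': only the first match is returned, as a plain character vector.
first_match = regexp('E_obs_9 E_obs_10', 'E_obs_\d*', 'match', 'once');

disp(class(all_matches))   % cell
disp(class(first_match))   % char
```

With a cell array input, each of these outputs is wrapped in one more cell level, which is why the default gives a cell array of cells.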
Secondly, the MATCH output cell array always has the same size as the input cell array. If you provide it with a six-element cell array, then you will get a six-element cell array at the output. So your expected output size is not supported by REGEXP (and for reasons of traceability it should not be, because you could no longer tell which input each output belongs to).
But you can remove the empty elements yourself. This is quite easy and much more efficient than your code:
col_names = {'X_est_9', 'X_est_10', 'Y_est_9', 'Y_est_10', 'E_obs_9', 'E_obs_10'};
result = regexp(col_names, 'E_obs_\d*', 'match', 'once')
result = 1×6 cell array
{0×0 char} {0×0 char} {0×0 char} {0×0 char} {'E_obs_9'} {'E_obs_10'}
result(cellfun('isempty',result)) = []
result = 1×2 cell array
{'E_obs_9'} {'E_obs_10'}
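Because the output keeps the size of the input, the same emptiness test can also report which columns matched. The following is a small sketch of that idea rather than part of the original answer. The names is_match and col_idx are made up for illustration.

```matlab
col_names = {'X_est_9', 'X_est_10', 'Y_est_9', 'Y_est_10', 'E_obs_9', 'E_obs_10'};
result = regexp(col_names, 'E_obs_\d*', 'match', 'once');

% Logical mask with the same size as col_names: true where a match was found.
is_match = ~cellfun('isempty', result);

% Positions of the matching columns in the original list.
col_idx = find(is_match);

% The matched text itself, without the empty placeholders.
matched_col_names = result(is_match);
```

Keeping the mask around is useful when the positions are needed later, for example to index the table columns directly.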
Bonus: You might find this tool useful when developing regular expressions:
  1 Comment
the cyclist on 8 Jun 2022
Today I learned about the 'once' option (which I did not find, despite looking through the docs). But will I remember?!
:-)


More Answers (2)

the cyclist on 8 Jun 2022
Even when using a single character array input along with the 'match' option, MATLAB has to return outputs in a cell array, to be able to handle cases where there are multiple matches within a single input:
regexp('E_obs_9 E_obs_10','E_obs_\d*','match')
ans = 1×2 cell array
{'E_obs_9'} {'E_obs_10'}
Because you are passing in a cell array of character arrays, you get out a cell array of cell arrays. You get the empty ones because MATLAB has no way of "knowing" that you don't want them. In particular, if it only output two cells, you would have no way of knowing which two input elements those two outputs corresponded to.
So, I'm afraid that you are stuck doing the post-processing step, as far as I can tell.
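One possible way to keep that post-processing step short, offered here as a sketch and not as something the answer itself proposes, is to concatenate the inner cell arrays. Empty cells contribute nothing to the concatenation, so only the matches remain.

```matlab
col_names = {'X_est_9', 'X_est_10', 'Y_est_9', 'Y_est_10', 'E_obs_9', 'E_obs_10'};

% Default output: a 1x6 cell array whose elements are themselves cell arrays.
result = regexp(col_names, 'E_obs_\d*', 'match');

% Expand the outer cell array and concatenate the inner ones.
% The 0x0 inner cells are dropped by the concatenation.
matched_col_names = [result{:}];
```

Note that this flattens everything: if one name contained the pattern twice, both matches would appear in the output, and the link back to the input positions is lost.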

Bill Tubbs on 19 Jun 2022
Edited: Bill Tubbs on 19 Jun 2022
Here is a one-line solution. It is based on the answer of Stephen23, but instead of returning the matched text, it gets the starting indices of any matches (the default first output of regexp), converts them to a logical mask of matches/non-matches, and uses that mask to index the original cell array.
matched_col_names = col_names(~cellfun('isempty', regexp(col_names, 'E_obs_\d*')))
matched_col_names =
1×2 cell array
{'E_obs_9'} {'E_obs_10'}
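One caveat with indexing the original names this way: the pattern 'E_obs_\d*' is not anchored, so any name that merely contains 'E_obs_' is selected, and \d* also allows zero digits. The sketch below uses made-up column names (not from the question) to compare it with an anchored pattern, assuming whole-name matches are what is wanted.

```matlab
% Hypothetical column names, chosen to show the difference.
names = {'E_obs_9', 'E_obs_', 'XE_obs_10', 'E_obs_10'};

% Unanchored pattern: selects every name containing 'E_obs_'.
loose = names(~cellfun('isempty', regexp(names, 'E_obs_\d*')));

% Anchored pattern: the whole name must be 'E_obs_' followed by at least one digit.
strict = names(~cellfun('isempty', regexp(names, '^E_obs_\d+$')));
```

Here loose keeps all four names, while strict keeps only 'E_obs_9' and 'E_obs_10'. If only a fixed prefix matters, startsWith(names, 'E_obs_') should give a similar logical mask without a regular expression.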

Release

R2019b
