Combine Multiple Tokens to Match Using regexp

Hello, I need to match two single quotes in a string using regexp(); unless I missed it in the official documentation, I have only found this mentioned on an older non-MathWorks website, where the single detail given is this:
\(\)\(CA\)+combines multiple tokens into one
This isn't massively helpful and doesn't provide the results I need; I had assumed it would work like this:
string2 = string1(regexp(string1, '[otherstufftomatch\(''\)]'));
therefore causing the function to try to match two successive quotes. All of the other characters are matched correctly and assigned to string2, but I still get only a single quote, whereas I need both. It doesn't flag a syntax error or anything, so my assumption is that this doesn't do what I think it does, and I'm just matching each of the characters '\()' individually.
For context, here is my code now, which is working but isn't returning both quotes. I have tried some of the additional features of regexp, like grouping the quotes together using parentheses, surrounding them with non-word escape characters ('\W''\W'), and using * to match them multiple times. The difficulty is that putting one ' into the characters to match terminates the character vector, so I've had to put two in there, but I suspect this isn't doing what I expect:
rawString = 'I just can''t seem to get this working correctly.';
matchThis = '[AEIOUaeiou., (\W''\W)]';
vowelsOnly = rawString(regexp(rawString, '[AEIOUaeiou., (\W''*\W)]'));
Any chance anybody knows how to do what I need here? Thanks in advance!
Stephen23 on 2 Feb 2020
Edited: Stephen23 on 2 Feb 2020
"I need to match two single quotes in a string using regexp(); unless I missed it in the official documentation..."
The single quote character has no special meaning at all for regular expressions, so you won't find it mentioned in the regular expression documentation (just like you won't find every other non-special character listed by name). But because the single quote is used to define a character vector in MATLAB, it needs to be escaped (doubled) within a character vector in order to define one single quote character, as the documentation explains: "If the text includes single quotes, use two single quotes within the definition."
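To make the doubling rule concrete, here is a minimal sketch using the character vector from the question. The names numQuotes and pattern are only illustrative; the point is that the doubled quote in the source code is stored as one character, both in the text and in the regular expression.

```matlab
% Two quote characters in the source define ONE quote character in the data.
rawString = 'I just can''t seem to get this working correctly.';

% '''' is a 1-by-1 character vector holding a single quote character.
numQuotes = nnz(rawString == '''');
disp(numQuotes)            % count of quote characters actually stored

% The same rule applies to the pattern: this class holds [ ' x ] only.
pattern = '[''x]';
disp(length(pattern))      % number of characters in the pattern
```

So the regular expression engine never sees two quotes here; it sees one, and a character class containing it matches one quote at a time.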
Also note that your code:
string2 = string1(regexp(string1,'...'));
will return at most one character from each match, because the default first output of regexp is "startIndex", which by definition is one index (a subvector starts in one location). I suspect that you might find the "match" output more useful. E.g. here is a simple example of a regular expression that matches multiple digits:
>> str = 'abc456xyz';
>> str(regexp(str,'\d+')) % what you are doing
ans = 4
>> regexp(str,'\d+','match','once') % all matched characters
ans = 456
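The same difference shows up with several matches. This is a small sketch on made-up sample text (str, starts and matches are illustrative names): without 'once', the default output is a vector of start indices, while 'match' returns a cell array with the full text of every match.

```matlab
str = 'ab12cd345';

% Default first output: one start index per match.
starts = regexp(str, '\d+');

% Indexing with the start indices keeps only the first character of each match.
firstChars = str(starts);

% 'match' returns every matched substring in full, as a cell array.
matches = regexp(str, '\d+', 'match');

disp(starts)
disp(firstChars)
disp(matches)
```

If a single character vector is wanted, the cell array can be concatenated with [matches{:}].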


Answers (1)

Walter Roberson on 2 Feb 2020
' normally terminates a character vector but '' encodes a single ' inside a character vector, not two of them in a row.
I suggest either switching to string or using {2} after the ''
Caution: inside [] you cannot construct patterns. () have no special meaning inside []
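A quick way to see this caution in action, on a made-up sample (str, inClass and grouped are illustrative names): inside square brackets the parentheses are just two more literal characters to match, whereas outside the brackets they group a sequence.

```matlab
str = 'call(ab)';

% Inside [], each of the characters ( a b ) is matched individually.
inClass = regexp(str, '[(ab)]', 'match');

% Outside [], (ab) is a group that matches the two-character sequence ab.
grouped = regexp(str, '(ab)', 'match');

disp(inClass)
disp(grouped)
```

This is why '[... (\W''\W)]' in the question adds the characters ( and ) to the set instead of building a sub-pattern.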
You could potentially code
"([AEIOUaeiou]|'{2})"
though it does seem odd to me that '' would be considered a vowel?
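As a rough check of that pattern, here is a sketch written with character vectors instead of a string (so the quote inside the pattern has to be doubled). The sample text txt is invented for illustration and really does contain two quote characters in a row; rawString is the text from the question, which contains only one.

```matlab
% Same pattern as above, written as a character vector.
pattern = '([AEIOUaeiou]|''{2})';

% Sample text with two consecutive quote characters around hi.
txt = 'say ''''hi'''' to Al';
matches = regexp(txt, pattern, 'match');
joined = [matches{:}];
disp(joined)

% The text from the question stores a single quote, so '{2} cannot match it.
rawString = 'I just can''t seem to get this working correctly.';
rawMatches = regexp(rawString, pattern, 'match');
rawJoined = [rawMatches{:}];
disp(any(rawJoined == ''''))
```

The second part suggests why the quote disappears entirely when this pattern is run on text such as rawString: there is only one quote character to find.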
Rowan Lawrence on 2 Feb 2020
Edited: Rowan Lawrence on 2 Feb 2020
Hi Walter, thank you for the comment! I'm still not very proficient with MATLAB, so I didn't know that there was such a distinction between ' and " as a character vector or string; the book I'm reading said I could use either, and that they are basically interchangeable. Potentially trying to prevent some confusion for people who haven't coded before, I imagine.
I think I need to work on my variable-naming a little more! The idea was to match vowels while also preserving the punctuation and whitespace from the original string. Unfortunately as written, your solution still doesn't seem to match two single-quotes; it omits them completely while every other individual character inside the square brackets is still matched. I've tried using the * operator again but this still just omits the quotes from vowelsOnly, regardless of whether the whole string is grouped with (). So I'm a little stumped again, unfortunately!
Thanks again though.
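For the goal described in this comment (keep vowels plus the punctuation and whitespace), one possible sketch is shown below. It assumes the text holds a single apostrophe, as rawString from the question does, and that the wanted set is exactly vowels, period, comma, space and apostrophe; the names pattern, parts and kept are illustrative, not from the thread.

```matlab
rawString = 'I just can''t seem to get this working correctly.';

% Character class: vowels, period, comma, space, and ONE quote character.
% The quote is doubled only because the pattern is a character vector.
pattern = '[AEIOUaeiou., '']';

% 'match' returns each matched character; concatenate them back together.
parts = regexp(rawString, pattern, 'match');
kept = [parts{:}];

disp(kept)
```

Because every match here is a single character, rawString(regexp(rawString, pattern)) should give the same result; the 'match' form just keeps working if the pattern later matches longer runs.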
Stephen23 on 2 Feb 2020
@Rowan Lawrence : please upload a .mat file containing some of the strings/character vectors that you are trying to parse, together with the expected output.

Release

R2019b
