Combine Multiple Tokens to Match Using regexp
Hello, I need to match two single quotes in a string using regexp(); unless I missed it in the official documentation, I have only found this mentioned on an older non-MathWorks website, where the single detail given is this:
\(\)\(CA\)+ combines multiple tokens into one
This isn't massively helpful and doesn't provide the results I need; I had assumed it would work like this:
string2 = string1(regexp(string1, '[otherstufftomatch\(''\)]'));
therefore causing the function to try to match two successive quotes. All of the other characters are matched correctly and assigned to string2, but I still get only a single quote, whereas I need both. It doesn't flag a syntax error or anything, so my assumption is that this doesn't do what I think it does, and I'm just matching each of the characters '\()' individually.
For context, here is my current code, which works but doesn't return both quotes. I have tried some of the additional features of regexp, such as grouping with parentheses, surrounding the quotes with non-word escape sequences ('\W''\W'), and using * to indicate matching multiple times. The difficulty is that putting one ' into the characters to match terminates the character vector, so I've had to put two in there, but I suspect this isn't doing what I expect:
rawString = 'I just can''t seem to get this working correctly.';
matchThis = '[AEIOUaeiou., (\W''\W)]';
vowelsOnly = rawString(regexp(rawString, '[AEIOUaeiou., (\W''*\W)]'));
Any chance anybody knows how to do what I need here? Thanks in advance!
1 Comment
Stephen23
on 2 Feb 2020
Edited: Stephen23
on 2 Feb 2020
"I need to match two single quotes in a string using regexp(); unless I missed it in the official documentation..."
The single quote character has no special meaning at all for regular expressions, so you won't find it mentioned in the regular expression documentation (just like you won't find every other non-special character listed by name). But because the single quote is used to define a character vector in MATLAB, it needs to be escaped/doubled within a character vector in order to define one single quote character, as the documentation explains: "If the text includes single quotes, use two single quotes within the definition."
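To make the doubling rule concrete, here is a small sketch (the variable names are made up for illustration) that counts the characters and quote characters actually stored in the vector:

```matlab
% '' inside a character vector literal defines ONE quote character.
oneQuote  = 'can''t';      % stored text: can't
twoQuotes = 'can''''t';    % stored text: can''t

nOne = numel(oneQuote);    % number of characters stored
nTwo = numel(twoQuotes);

% char(39) is the single quote character; count how many each vector holds.
countOne = sum(oneQuote  == char(39));
countTwo = sum(twoQuotes == char(39));
```

So to get two consecutive quotes into the text itself, the literal needs four quote characters in a row.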
Also note that your code:
string2 = string1(regexp(string1,'...'));
will return at most one character from each match, because the default first output of regexp is "startIndex", which by definition is one index (a subvector starts in one location). I suspect that you might find the "match" output more useful. E.g. here is a simple example of a regular expression that matches multiple digits:
>> str = 'abc456xyz';
>> str(regexp(str,'\d+')) % what you are doing
ans = 4
>> regexp(str,'\d+','match','once') % all matched characters
ans = 456
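As a further sketch of the same point, using a made-up input with two runs of digits, this compares indexing with the start indices against asking for the 'match' output:

```matlab
str = 'abc456xyz78';

% Default first output: one start index per match.
startIdx = regexp(str, '\d+');

% Indexing with the start indices keeps only the first character of each match.
onePerMatch = str(startIdx);

% The 'match' option returns the full text of every match as a cell array.
allMatches = regexp(str, '\d+', 'match');

% Adding 'once' returns only the first match, as a character vector.
firstMatch = regexp(str, '\d+', 'match', 'once');
```

Here startIdx holds two indices, so onePerMatch has two characters, while allMatches contains both complete digit runs.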
Answers (1)
Walter Roberson
on 2 Feb 2020
' normally terminates a character vector but '' encodes a single ' inside a character vector, not two of them in a row.
I suggest either switching to string or using {2} after the ''
Caution: inside [] you cannot construct patterns. () have no special meaning inside []
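A quick way to check this, sketched with a made-up test string: inside square brackets, parentheses and the * quantifier are treated as ordinary characters to match, whereas outside the brackets the parentheses form a group.

```matlab
txt = 'a(b)*c';

% Inside [], the characters ( ) and * are plain literals in a character set.
classIdx   = regexp(txt, '[(*)]');   % start indices of each single-character match
classChars = txt(classIdx);          % the characters that were matched

% Outside [], parentheses group the pattern and are not matched as text.
groupMatch = regexp(txt, '(b)', 'match');
```

This is why the bracketed pattern in the question matches one character at a time rather than a sequence.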
You could potentially code
"([AEIOUaeiou]|'{2})"
though it does seem odd to me that '' would be considered a vowel?
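For illustration only, here is that pattern written as a character vector (so the quote is doubled in the literal) and applied to made-up inputs. Whether the real data actually contains two consecutive quote characters is an assumption here.

```matlab
% Pattern text is ([AEIOUaeiou]|'{2}): one vowel OR two consecutive quotes.
pattern = '([AEIOUaeiou]|''{2})';

withTwo = 'can''''t do';   % stored text: can''t do  (two quotes in a row)
withOne = 'can''t do';     % stored text: can't do   (a single quote)

matchesTwo = regexp(withTwo, pattern, 'match');
matchesOne = regexp(withOne, pattern, 'match');

% Concatenate the matched pieces back into one character vector.
joinedTwo = [matchesTwo{:}];
joinedOne = [matchesOne{:}];
```

With the 'match' output the pair of quotes comes back as one two-character match, and a lone quote is not matched at all because '{2} requires exactly two in a row.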
2 Comments
Stephen23
on 2 Feb 2020
@Rowan Lawrence : please upload a .mat file containing some of the strings/character vectors that you are trying to parse, together with the expected output.