regexp
Match regular expression (case sensitive)
Syntax
Description
startIndex = regexp(str,expression) returns the starting index of each substring of str that matches the character patterns specified by the regular expression. If there are no matches, startIndex is an empty array. If there are substrings that match overlapping pieces of text, only the index of the first match is returned.
[startIndex,endIndex] = regexp(str,expression) returns the starting and ending indices of all matches.
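As a minimal sketch of how the two index outputs pair up, the snippet below uses a made-up input string and slices each match out of str by its start and end index. The text and the variable names firstMatch and secondMatch are illustrative assumptions.

```matlab
% Hypothetical input text and pattern, chosen only for illustration.
str = 'bat cat can car coat court';
expression = 'c[aeiou]+t';

% Request both outputs to get the extent of every match.
[startIndex,endIndex] = regexp(str,expression);

% Use each index pair to slice the matching substring out of str.
firstMatch = str(startIndex(1):endIndex(1));
secondMatch = str(startIndex(2):endIndex(2));
```

Under these assumptions, startIndex and endIndex have the same length, and element k of each describes the kth match.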
out = regexp(str,expression,outkey) returns the output specified by outkey. For example, if outkey is 'match', then regexp returns the substrings that match the expression rather than their starting indices.
[out1,...,outN] = regexp(str,expression,outkey1,...,outkeyN) returns the outputs specified by multiple output keywords, in the specified order. For example, if you specify 'match','tokens', then regexp returns substrings that match the entire expression and tokens that match parts of the expression.
___ = regexp(___,'forceCellOutput') returns
each output argument as a scalar cell. The cells contain the numeric
arrays or substrings that are described as the outputs of the previous
syntaxes. You can include any of the inputs and request any of the
outputs from previous syntaxes.
Examples
Find words that start with c, end with t, and contain one or more vowels between them. 
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';
startIndex = regexp(str,expression)
startIndex = 1×2
     5    17
The regular expression 'c[aeiou]+t' specifies this pattern: 
- c must be the first character.
- c must be followed by one of the characters inside the brackets, [aeiou].
- The bracketed pattern must occur one or more times, as indicated by the + operator.
- t must be the last character, with no characters between the bracketed pattern and the t.
Values in startIndex indicate the index of the first character of each word that matches the regular expression. The matching word cat starts at index 5, and coat starts at index 17. The words CUT and CAT do not match because they are uppercase. 
Find the location of capital letters and spaces within character vectors in a cell array.
str = {'Madrid, Spain','Romeo and Juliet','MATLAB is great'};
capExpr = '[A-Z]';
spaceExpr = '\s';
capStartIndex = regexp(str,capExpr);
spaceStartIndex = regexp(str,spaceExpr);
capStartIndex and spaceStartIndex are cell arrays because the input str is a cell array.
View the indices for the capital letters.
celldisp(capStartIndex)
 
capStartIndex{1} =
 
     1     9
 
 
capStartIndex{2} =
 
     1    11
 
 
capStartIndex{3} =
 
     1     2     3     4     5     6
 
View the indices for the spaces.
celldisp(spaceStartIndex)
 
spaceStartIndex{1} =
 
     8
 
 
spaceStartIndex{2} =
 
     6    10
 
 
spaceStartIndex{3} =
 
     7    10
 
Capture words within a character vector that contain the letter x. 
str = 'EXTRA! The regexp function helps you relax.';
expression = '\w*x\w*';
matchStr = regexp(str,expression,'match')
matchStr = 1×2 cell
    {'regexp'}    {'relax'}
The regular expression '\w*x\w*' specifies that the character vector: 
- Begins with any number of alphanumeric or underscore characters, \w*.
- Contains the lowercase letter x.
- Ends with any number of alphanumeric or underscore characters after the x, including none, as indicated by \w*.
Split a character vector into several substrings, where each substring is delimited by a ^ character. 
str = ['Split ^this text into ^several pieces'];
expression = '\^';
splitStr = regexp(str,expression,'split')
splitStr = 1×3 cell
    {'Split '}    {'this text into '}    {'several pieces'}
Because the caret symbol has special meaning in regular expressions, precede it with the escape character, a backslash (\). To split a character vector at other delimiters, such as a semicolon, you do not need to include the backslash. 
Capture parts of a character vector that match a regular expression using the 'match' keyword, and the remaining parts that do not match using the 'split' keyword. 
str = 'She sells sea shells by the seashore.';
expression = '[Ss]h.';
[match,noMatch] = regexp(str,expression,'match','split')
match = 1×3 cell
    {'She'}    {'she'}    {'sho'}
noMatch = 1×4 cell
    {0×0 char}    {' sells sea '}    {'lls by the sea'}    {'re.'}
The regular expression '[Ss]h.' specifies that: 
- S or s is the first character.
- h is the second character.
- The third character can be anything, including a space, as indicated by the dot (.).
When the first (or last) character in a character vector matches a regular expression, the first (or last) return value from the 'split' keyword is an empty character vector. 
Optionally, reassemble the original character vector from the substrings.
combinedStr = strjoin(noMatch,match)
combinedStr = 'She sells sea shells by the seashore.'
Find the names of HTML tags by defining a token within a regular expression. Tokens are indicated with parentheses, (). 
str = '<title>My Title</title><p>Here is some text.</p>';
expression = '<(\w+).*>.*</\1>';
[tokens,matches] = regexp(str,expression,'tokens','match');
The regular expression <(\w+).*>.*</\1> specifies this pattern: 
- <(\w+) finds an opening angle bracket followed by one or more alphanumeric or underscore characters. Enclosing \w+ in parentheses captures the name of the HTML tag in a token.
- .*> finds any number of additional characters, such as HTML attributes, and a closing angle bracket.
- </\1> finds the end tag corresponding to the first token (indicated by \1). The end tag has the form </tagname>.
View the tokens and matching substrings.
celldisp(tokens)
 
tokens{1}{1} =
 
title
 
 
tokens{2}{1} =
 
p
 
celldisp(matches)
 
matches{1} =
 
<title>My Title</title>
 
 
matches{2} =
 
<p>Here is some text.</p>
 
Parse dates that can appear with either the day or the month first, in these forms: mm/dd/yyyy or dd-mm-yyyy. Use named tokens to identify each part of the date. 
str = '01/11/2000 20-02-2020 03/30/2000 16-04-2020';
expression = ['(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|'...
              '(?<day>\d+)-(?<month>\d+)-(?<year>\d+)'];
tokenNames = regexp(str,expression,'names');
The regular expression specifies this pattern:
- (?<name>\d+) finds one or more numeric digits and assigns the result to the token indicated by name.
- | is the logical or operator, which indicates that there are two possible patterns for dates. In the first pattern, slashes (/) separate the tokens. In the second pattern, hyphens (-) separate the tokens.
View the named tokens.
for k = 1:length(tokenNames)
    disp(tokenNames(k))
end
    month: '01'
      day: '11'
     year: '2000'
    month: '02'
      day: '20'
     year: '2020'
    month: '03'
      day: '30'
     year: '2000'
    month: '04'
      day: '16'
     year: '2020'
Find both uppercase and lowercase instances of a word.
By default, regexp performs case-sensitive matching. 
str = 'A character vector with UPPERCASE and lowercase text.';
expression = '\w*case';
matchStr = regexp(str,expression,'match')
matchStr = 1×1 cell array
    {'lowercase'}
The regular expression specifies that the character vector:
- Begins with any number of alphanumeric or underscore characters, \w*.
- Ends with the literal text case.
The regexpi function uses the same syntax as regexp, but performs case-insensitive matching. 
matchWithRegexpi = regexpi(str,expression,'match')
matchWithRegexpi = 1×2 cell
    {'UPPERCASE'}    {'lowercase'}
Alternatively, disable case-sensitive matching for regexp using the 'ignorecase' option. 
matchWithIgnorecase = regexp(str,expression,'match','ignorecase')
matchWithIgnorecase = 1×2 cell
    {'UPPERCASE'}    {'lowercase'}
For multiple expressions, disable case-sensitive matching for selected expressions using the (?i) search flag. 
expression = {'(?-i)\w*case';...
              '(?i)\w*case'};
matchStr = regexp(str,expression,'match');
celldisp(matchStr) 
matchStr{1}{1} =
 
lowercase
 
 
matchStr{2}{1} =
 
UPPERCASE
 
 
matchStr{2}{2} =
 
lowercase
 
Create a character vector that contains a newline, \n, and parse it using a regular expression. Since regexp returns matchStr as a cell array containing text that has multiple lines, you can take the text out of the cell array to display all lines.
str = sprintf('abc\n de');
expression = '.*';
matchStr = regexp(str,expression,'match');
matchStr{:}
ans = 
    'abc
      de'
By default, the dot (.) matches every character, including the newline, and returns a single match that is equivalent to the original character vector.
Exclude newline characters from the match using the 'dotexceptnewline' option. This returns separate matches for each line of text.
matchStrNoNewline = regexp(str,expression,'match','dotexceptnewline')
matchStrNoNewline = 1×2 cell
    {'abc'}    {' de'}
Find the first or last character of each line using the ^ or $ metacharacters and the 'lineanchors' option.
expression = '.$';
lastInLine = regexp(str,expression,'match','lineanchors')
lastInLine = 1×2 cell
    {'c'}    {'e'}
Find matches within a piece of text and return the output in a scalar cell.
Find words that start with c, end with t, and contain one or more vowels between them. Return the starting indices in a scalar cell.
str = 'bat cat can car coat court CUT ct CAT-scan';
expression = 'c[aeiou]+t';
startIndex = regexp(str,expression,'forceCellOutput')
startIndex = 1×1 cell array
    {[5 17]}
To access the starting indices as a numeric array, index into the cell.
startIndex{1}
ans = 1×2
     5    17
Return the matching and nonmatching substrings. Each output is in its own scalar cell.
[match,noMatch] = regexp(str,expression,'match','split','forceCellOutput')
match = 1×1 cell array
    {1×2 cell}
noMatch = 1×1 cell array
    {1×3 cell}
To access the array of matches, index into match.
match{1}
ans = 1×2 cell
    {'cat'}    {'coat'}
To access the substrings that do not match, index into noMatch.
noMatch{1}
ans = 1×3 cell
    {'bat '}    {' can car '}    {' court CUT ct CAT-scan'}
Input Arguments
Input text (str), specified as a character vector, a cell array of character vectors, or a string array. Each character vector in a cell array, or each string in a string array, can be of any length and contain any characters.
If str and expression are
string arrays or cell arrays, they must have the same dimensions.
Data Types: string | char | cell
Regular expression (expression), specified as a character vector, a cell
array of character vectors, or a string array. Each expression can
contain characters, metacharacters, operators, tokens, and flags that
specify patterns to match in str.
The following tables describe the elements of regular expressions.
Metacharacters
Metacharacters represent letters, letter ranges, digits, and space characters. Use them to construct a generalized pattern of characters.
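| Metacharacter | Description | Example |
|---|---|---|
| . | Any single character, including white space | '..ain' matches sequences of five consecutive characters that end with 'ain'. |
| [c1c2c3] | Any character contained within the square brackets. The following characters are treated literally: $ \| . * + ? and - when not used to indicate a range. | '[rp.]ain' matches 'rain' or 'pain' or '.ain'. |
| [^c1c2c3] | Any character not contained within the square brackets. The following characters are treated literally: $ \| . * + ? and - when not used to indicate a range. | '[^*rp]ain' matches all four-letter sequences that end in 'ain', except 'rain', 'pain', and '*ain'. For example, it matches 'gain', 'lain', or 'vain'. |
| [c1-c2] | Any character in the range of c1 through c2 | '[A-G]' matches a single character in the range of A through G. |
| \w | Any alphabetic, numeric, or underscore character. For English character sets, \w is equivalent to [a-zA-Z_0-9] | '\w*' identifies a word composed of any grouping of alphabetic, numeric, or underscore characters. |
| \W | Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9] | '\W*' identifies a run of characters that contains no alphabetic, numeric, or underscore characters. |
| \s | Any white-space character; equivalent to [ \f\n\r\t\v] | '\w*n\s' matches words that end with the letter n, followed by a white-space character. |
| \S | Any non-white-space character; equivalent to [^ \f\n\r\t\v] | '\d\S' matches a numeric digit followed by any non-white-space character. |
| \d | Any numeric digit; equivalent to [0-9] | '\d*' matches any number of consecutive digits. |
| \D | Any nondigit character; equivalent to [^0-9] | '\w*\D\>' matches words that do not end with a numeric digit. |
| \oN or \o{N} | Character of octal value N | '\o{40}' matches the space character, defined by octal 40. |
| \xN or \x{N} | Character of hexadecimal value N | '\x2C' matches the comma character, defined by hex 2C. |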
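The following sketch shows a few of these metacharacters applied to a made-up sentence. The input text and variable names are assumptions chosen for illustration.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'Order 66 shipped on 2020-04-16 to Lab_7';

% \d+ collects each run of consecutive digits.
digitRuns = regexp(str,'\d+','match');

% [A-Z]\w* finds words that start with an uppercase letter;
% \w also accepts the underscore and the digit in 'Lab_7'.
capWords = regexp(str,'[A-Z]\w*','match');

% \W matches every character that is not a letter, digit, or underscore
% (here, the spaces and the hyphens).
nonWordIdx = regexp(str,'\W');
```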
Character Representation
| Operator | Description |
|---|---|
| \a | Alarm (beep) |
| \b | Backspace |
| \f | Form feed |
| \n | New line |
| \r | Carriage return |
| \t | Horizontal tab |
| \v | Vertical tab |
| \char | Any character with special meaning in regular expressions that you want to match literally (for example, use \\ to match a single backslash) |
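To illustrate why escaping matters, this small sketch compares an unescaped dot with an escaped one. The file-like names in str are invented for the example.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'data.mat data_mat dataXmat';

% Unescaped, the dot is a metacharacter and matches any single character.
unescaped = regexp(str,'data.mat','match');

% Escaped with a backslash, the dot matches only a literal period.
escaped = regexp(str,'data\.mat','match');
```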
Quantifiers
Quantifiers specify the number of times a pattern must occur in the matching text.
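| Quantifier | Number of Times Expression Occurs | Example |
|---|---|---|
| expr* | 0 or more times consecutively. | '\w*' matches a word of any length. |
| expr? | 0 times or 1 time. | '\w*(\.m)?' matches words that optionally end with the extension .m. |
| expr+ | 1 or more times consecutively. | '<img src="\w+\.gif">' matches an <img> HTML tag when the file name contains one or more characters. |
| expr{m,n} | At least m times, but no more than n times consecutively. {0,1} is equivalent to ?. | '\S{4,8}' matches between four and eight non-white-space characters. |
| expr{m,} | At least m times consecutively. {0,} and {1,} are equivalent to * and +, respectively. | '<a href="\w{1,}\.html">' matches an <a> HTML tag when the file name contains one or more characters. |
| expr{n} | Exactly n times consecutively. Equivalent to {n,n}. | '\d{4}' matches four consecutive digits. |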
Quantifiers can appear in three modes, described in the following table. q represents any of the quantifiers in the previous table.
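| Mode | Description | Example |
|---|---|---|
| exprq | Greedy expression: match as many characters as possible. | Given the text '<tr><td><p>text</p></td>', the expression '</?t.*>' matches all characters between <tr and /td>: '<tr><td><p>text</p></td>' |
| exprq? | Lazy expression: match as few characters as necessary. | Given the text '<tr><td><p>text</p></td>', the expression '</?t.*?>' ends each match at the first occurrence of the closing angle bracket (>): '<tr>', '<td>', '</td>' |
| exprq+ | Possessive expression: match as much as possible, but do not rescan any portions of the text. | Given the text '<tr><td><p>text</p></td>', the expression '</?t.*+>' does not return any matches, because the closing angle bracket is captured by .* and is not rescanned. |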
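The three quantifier modes can be compared side by side. This is a minimal sketch that applies the same base pattern in each mode to one short HTML fragment; the variable names are assumptions.

```matlab
% Short HTML fragment used to compare the three modes.
str = '<tr><td><p>text</p></td>';

% Greedy: .* consumes as much as possible, then backtracks to the last '>'.
greedy = regexp(str,'</?t.*>','match');

% Lazy: .*? stops at the first '>' that completes a match.
lazy = regexp(str,'</?t.*?>','match');

% Possessive: .*+ consumes the rest of the text and never gives any back,
% so the final '>' in the pattern has nothing left to match.
possessive = regexp(str,'</?t.*+>','match');
```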
Grouping Operators
Grouping operators allow you to capture tokens, apply one operator to multiple elements, or disable backtracking in a specific group.
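| Grouping Operator | Description | Example |
|---|---|---|
| (expr) | Group elements of the expression and capture tokens. | 'Joh?n\s(\w*)' captures a token that contains the last name of any person with the first name John or Jon. |
| (?:expr) | Group, but do not capture tokens. | '(?:[aeiou][^aeiou]){2}' matches two consecutive patterns of a vowel followed by a nonvowel, such as 'anon'. Without grouping, '[aeiou][^aeiou]{2}' matches a vowel followed by two nonvowels. |
| (?>expr) | Group atomically. Do not backtrack within the group to complete the match, and do not capture tokens. | 'A(?>.*)Z' does not match 'AtoZ', although 'A(?:.*)Z' does. Using the atomic group, Z is captured by .* and is not rescanned. |
| (expr1\|expr2) | Match expression expr1 or expression expr2. If there is a match with expr1, then expr2 is ignored. You can include ?: or ?> after the opening parenthesis to suppress tokens or group atomically. | '(let\|tel)\w+' matches words that contain, but do not end with, let or tel. |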
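As a rough illustration of capturing versus noncapturing groups, the sketch below requests tokens from two patterns that match the same text. The names in str are invented for the example.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'John Smith, Jon Doe';

% Both groups capture, so each match yields two tokens.
tokensBoth = regexp(str,'(Joh?n)\s(\w*)','tokens');

% (?:...) still groups the first name but does not capture it,
% so each match yields only the last-name token.
tokensLast = regexp(str,'(?:Joh?n)\s(\w*)','tokens');
```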
Anchors
Anchors in the expression match the beginning or end of the input text or word.
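| Anchor | Matches the... | Example |
|---|---|---|
| ^expr | Beginning of the input text. | '^M\w*' matches a word starting with M at the beginning of the text. |
| expr$ | End of the input text. | '\w*m$' matches words ending with m at the end of the text. |
| \<expr | Beginning of a word. | '\<n\w*' matches any words starting with n. |
| expr\> | End of a word. | '\w*e\>' matches any words ending with e. |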
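A small sketch of the four anchors, applied to a made-up sentence; the text and variable names are illustrative assumptions.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'Many men now mention momentum';

% ^ anchors the match to the beginning of the text.
atTextStart = regexp(str,'^M\w*','match');

% $ anchors the match to the end of the text.
atTextEnd = regexp(str,'\w*m$','match');

% \< requires the match to begin at the start of a word.
startsWithN = regexp(str,'\<n\w*','match');

% \> requires the match to finish at the end of a word.
endsWithN = regexp(str,'\w*n\>','match');
```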
Lookaround Assertions
Lookaround assertions look for patterns that immediately precede or follow the intended match, but are not part of the match.
The pointer remains at the current location, and characters
that correspond to the test expression are not
captured or discarded. Therefore, lookahead assertions can match overlapping
character groups.
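| Lookaround Assertion | Description | Example |
|---|---|---|
| expr(?=test) | Look ahead for characters that match test. | '\w*(?=ing)' matches terms that are followed by ing, such as 'Fly' and 'fall' in the input text 'Flying, not falling.' |
| expr(?!test) | Look ahead for characters that do not match test. | 'i(?!ng)' matches instances of the letter i that are not followed by ng. |
| (?<=test)expr | Look behind for characters that match test. | '(?<=re)\w*' matches terms that follow 're', such as 'new', 'use', and 'cycle' in the input text 'renew, reuse, recycle' |
| (?<!test)expr | Look behind for characters that do not match test. | '(?<!\d)(\d)(?!\d)' matches single-digit numbers (digits that do not precede or follow other digits). |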
If you specify a lookahead assertion before an
expression, the operation is equivalent to a logical AND.
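| Operation | Description | Example |
|---|---|---|
| (?=test)expr | Match both test and expr. | '(?=[a-z])[^aeiou]' matches consonants. |
| (?!test)expr | Match expr and do not match test. | '(?![aeiou])[a-z]' matches consonants. |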
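The sketch below exercises each kind of lookaround on short made-up inputs. It uses \w+ rather than \w* so that zero-length matches do not come into play; the inputs and variable names are assumptions.

```matlab
% Lookahead: word characters that are immediately followed by 'ing'.
beforeIng = regexp('Flying, not falling.','\w+(?=ing)','match');

% Lookbehind: word characters that immediately follow 're'.
afterRe = regexp('renew, reuse, recycle','(?<=re)\w+','match');

% Negative lookbehind and lookahead: digits with no digit on either side.
singleDigits = regexp('a1 b22 c3','(?<!\d)\d(?!\d)','match');

% Lookahead placed before the expression acts as a logical AND:
% a lowercase letter AND not a vowel.
consonants = regexp('matlab','(?=[a-z])[^aeiou]','match');
```

In every case the text tested by the assertion is excluded from the returned match.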
Logical and Conditional Operators
Logical and conditional operators allow you to test the state
of a given condition, and then use the outcome to determine which
pattern, if any, to match next. These operators support logical OR,
and if or if/else conditions.
Conditions can be tokens, lookaround operators, or dynamic expressions
of the form (?@cmd). Dynamic expressions must return
a logical or numeric value.
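| Conditional Operator | Description | Example |
|---|---|---|
| expr1\|expr2 | Match expression expr1 or expression expr2. If there is a match with expr1, then expr2 is ignored. | '(let\|tel)\w+' matches words that contain, but do not end with, let or tel. |
| (?(cond)expr) | If condition cond is true, then match expr. | '(?(?@ispc)[A-Z]:\\)' matches a drive name, such as C:\, when run on a Windows system. |
| (?(cond)expr1\|expr2) | If condition cond is true, then match expr1. Otherwise, match expr2. | 'Mr(s?)\..*?(?(1)her\|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr. |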
Token Operators
Tokens are portions of the matched text that you define by enclosing part of the regular expression in parentheses. You can refer to a token by its sequence in the text (an ordinal token), or assign names to tokens for easier code maintenance and readable output.
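| Ordinal Token Operator | Description | Example |
|---|---|---|
| (expr) | Capture in a token the characters that match the enclosed expression. | 'Joh?n\s(\w*)' captures a token that contains the last name of any person with the first name John or Jon. |
| \N | Match the Nth token. | '<(\w+).*>.*</\1>' captures tokens for HTML tags, such as 'title' from the text '<title>Some text</title>'. |
| (?(N)expr1\|expr2) | If the Nth token is found, then match expr1. Otherwise, match expr2. | 'Mr(s?)\..*?(?(1)her\|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr. |

| Named Token Operator | Description | Example |
|---|---|---|
| (?<name>expr) | Capture in a named token the characters that match the enclosed expression. | '(?<month>\d+)-(?<day>\d+)-(?<yr>\d+)' creates named tokens for the month, day, and year in an input date of the form mm-dd-yy. |
| \k<name> | Match the token referred to by name. | '<(?<tag>\w+).*>.*</\k<tag>>' captures tokens for HTML tags, such as 'title' from the text '<title>Some text</title>'. |
| (?(name)expr1\|expr2) | If the named token is found, then match expr1. Otherwise, match expr2. | 'Mr(?<sex>s?)\..*?(?(sex)her\|his) \w*' matches text that includes her when the text begins with Mrs, or that includes his when the text begins with Mr. |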
Note
If an expression has nested parentheses, MATLAB® captures
tokens that correspond to the outermost set of parentheses. For example,
given the search pattern '(and(y|rew))', MATLAB creates
a token for 'andrew' but not for 'y' or 'rew'.
Dynamic Regular Expressions
Dynamic expressions allow you to execute a MATLAB command or a regular expression to determine the text to match.
The parentheses that enclose dynamic expressions do not create a capturing group.
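| Operator | Description | Example |
|---|---|---|
| (??expr) | Parse expr and include the resulting term in the match expression. When parsed, expr must correspond to a complete, valid regular expression. Dynamic expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of expr, and one for the complete match. | '^(\d+)((??\\w{$1}))' determines how many characters to match by reading a digit at the beginning of the match. |
| (??@cmd) | Execute the MATLAB command represented by cmd, and include the output returned by the command in the match expression. | '(.{2,}).?(??@fliplr($1))' finds palindromes that are at least four characters long, such as 'abba'. |
| (?@cmd) | Execute the MATLAB command represented by cmd, but discard any output the command returns. (Helpful for diagnosing regular expressions.) | '\w*?(\w)(?@disp($1))\1\w*' matches words that include double letters (such as pp), and displays intermediate results. |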
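As a hedged sketch of the (??@cmd) form, the snippet below looks for palindromic words by asking MATLAB to reverse the first token with fliplr and then matching the reversed text. The input words are made up, and the pattern restricts itself to word characters (\w) as an illustrative choice.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'abba noon level tree';

% (\w{2,})            captures at least two word characters as token 1.
% \w?                 allows one optional middle character.
% (??@fliplr($1))     runs fliplr on token 1 and matches the reversed text.
expression = '(\w{2,})\w?(??@fliplr($1))';

palindromes = regexp(str,expression,'match');
```

Because the command runs each time the engine reaches that point in the pattern, dynamic expressions can be noticeably slower than static ones on long inputs.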
Within dynamic expressions, use the following operators to define replacement text.
| Replacement Operator | Description |
|---|---|
| $& or $0 | Portion of the input text that is currently a match |
| $` | Portion of the input text that precedes the current match |
| $' | Portion of the input text that follows the current match (use $'' to represent $') |
| $N | Nth token |
| $<name> | Named token |
| ${cmd} | Output returned when MATLAB executes the command, cmd |
Comments
| Characters | Description | Example |
|---|---|---|
| (?#comment) | Insert a comment in the regular expression. The comment text is ignored when matching the input. | '(?# Initial digit)\<\d\w+' includes a comment, and matches words that begin with a number. |
Search Flags
Search flags modify the behavior for matching expressions. An
alternative to using a search flag within an expression is to pass
an option input argument.
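| Flag | Description |
|---|---|
| (?-i) | Match letter case (default for regexp and regexprep). |
| (?i) | Do not match letter case (default for regexpi). |
| (?s) | Match dot (.) in the pattern with any character (default). |
| (?-s) | Match dot in the pattern with any character that is not a newline character. |
| (?-m) | Match the ^ and $ metacharacters at the beginning and end of text (default). |
| (?m) | Match the ^ and $ metacharacters at the beginning and end of a line. |
| (?-x) | Include space characters and comments when matching (default). |
| (?x) | Ignore space characters and comments when matching. Use '\ ' and '\#' to match space and # characters. |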
The expression that the flag modifies can appear either after the parentheses, such as
(?i)\w*
or inside the parentheses and separated from the flag with a
colon (:), such as
(?i:\w*)
The latter syntax allows you to change the behavior for part of a larger expression.
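To make the difference between the two placements concrete, here is a minimal sketch in which the flag covers either the whole expression or only its first letter. The input text and variable names are assumptions.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'Case CASE case';

% Flag before the expression: the whole pattern ignores letter case.
wholeFlag = regexp(str,'(?i)case','match');

% Flag inside the parentheses: only the letter c ignores case,
% while 'ase' must still be lowercase.
partFlag = regexp(str,'(?i:c)ase','match');
```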
Data Types: char | cell | string
Keyword that indicates which outputs to return (outkey), specified as one of the following character vectors.
| Output Keyword | Returns |
|---|---|
| 'start' | Starting indices of all matches, startIndex |
| 'end' | Ending indices of all matches, endIndex |
| 'tokenExtents' | Starting and ending indices of all tokens |
| 'match' | Text of each substring that matches the pattern in expression |
| 'tokens' | Text of each captured token in str |
| 'names' | Name and text of each named token |
| 'split' | Text of nonmatching substrings of str |
Data Types: char | string
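Several keywords can be combined in one call. The sketch below requests both the token text and the token extents for a date pattern; the variable names tok and ext are assumptions.

```matlab
str = 'Here is a date: 01-Apr-2020';
expression = '(\d+)-(\w+)-(\d+)';

% Outputs are returned in the same order as the keywords.
[tok,ext] = regexp(str,expression,'tokens','tokenExtents');

% There is one match, so tok{1} holds its three tokens and ext{1} holds
% a 3-by-2 array with the start and end index of each token in str.
dayText = str(ext{1}(1,1):ext{1}(1,2));
```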
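Search option (option), specified as a character vector. Options come in pairs: one option that corresponds to the default behavior, and one option that allows you to override the default. Specify only one option from a pair. Options can appear in any order.
| Default | Override | Description |
|---|---|---|
| 'all' | 'once' | Match the expression as many times as possible (default), or only once. |
| 'nowarnings' | 'warnings' | Suppress warnings (default), or display them. |
| 'matchcase' | 'ignorecase' | Match letter case (default), or ignore case. |
| 'noemptymatch' | 'emptymatch' | Ignore zero length matches (default), or include them. |
| 'dotall' | 'dotexceptnewline' | Match dot with any character (default), or all except newline (\n). |
| 'stringanchors' | 'lineanchors' | Apply ^ and $ metacharacters to the beginning and end of a character vector (default), or to the beginning and end of a line. |
| 'literalspacing' | 'freespacing' | Include space characters and comments when matching (default), or ignore them. With freespacing, use '\ ' and '\#' to match space and # characters. |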
Data Types: char | string
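The 'once' override changes the shape of the output as well as the number of matches. This is a minimal sketch with made-up inputs; the variable names are assumptions.

```matlab
% Hypothetical input text and pattern, chosen only for illustration.
str = 'cat coat';
expression = 'c[aeiou]+t';

% Default ('all'): every match, returned in a cell array.
allMatches = regexp(str,expression,'match');

% 'once': only the first match, returned as a character vector.
firstMatch = regexp(str,expression,'match','once');

% With no output keyword, 'once' returns a single starting index.
firstStart = regexp(str,expression,'once');

% When nothing matches, 'once' returns an empty value.
noMatch = regexp('dog',expression,'match','once');
```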
Output Arguments
Starting indices of each match (startIndex), returned as a row vector or cell array, as follows:
- If str and expression are both character vectors or string scalars, the output is a row vector (or, if there are no matches, an empty array).
- If either str or expression is a cell array of character vectors or a string array, and the other is a character vector or a string scalar, the output is a cell array of row vectors. The output cell array has the same dimensions as the input array.
- If str and expression are both cell arrays or string arrays, they must have the same dimensions. The output is a cell array with the same dimensions.
Ending indices of each match (endIndex), returned as a row vector or cell array, as follows:
- If str and expression are both character vectors or string scalars, the output is a row vector (or, if there are no matches, an empty array).
- If either str or expression is a cell array of character vectors or a string array, and the other is a character vector or a string scalar, the output is a cell array of row vectors. The output cell array has the same dimensions as the input array.
- If str and expression are both cell arrays or string arrays, they must have the same dimensions. The output is a cell array with the same dimensions.
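The three cases above can be sketched with small made-up inputs. The variable names are illustrative assumptions.

```matlab
% Both inputs are character vectors: the output is a numeric row vector.
idxChar = regexp('cat coat','c');

% str is a 2-by-2 cell array and expression is a character vector:
% the output is a 2-by-2 cell array of row vectors.
idxCellStr = regexp({'cat','coat';'cut','act'},'t');

% str is a character vector and expression is a cell array:
% each expression is applied to the same text.
idxCellExpr = regexp('cat coat',{'c','t'});

% Both inputs are cell arrays of the same size: they are matched pairwise.
idxPairwise = regexp({'abc','def'},{'b','f'});
```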
Information about matches (out), returned as a numeric, cell, string,
or structure array. The information in the output depends upon the
value you specify for outkey, as follows.
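| Output Keyword | Output Description | Output Type and Dimensions |
|---|---|---|
| 'start' | Starting indices of matches | For both 'start' and 'end': a row vector if str and expression are both character vectors or string scalars (an empty array if there are no matches); a cell array of row vectors with the same dimensions as the input array if one input is a cell array of character vectors or a string array and the other is a character vector or string scalar; a cell array with the same dimensions as the inputs if both are cell arrays or string arrays. |
| 'end' | Ending indices of matches | Same as for 'start'. |
| 'tokenExtents' | Starting and ending indices of all tokens | By default, when returning all matches: if str and expression are both character vectors or string scalars, the output is a 1-by-n cell array, where n is the number of matches. Each cell contains an m-by-2 numeric array of indices, where m is the number of tokens in the match. When you specify the 'once' option to return only one match, the output is either an m-by-2 numeric array or a cell array with the same dimensions as str and/or expression. If a token is expected at a particular index N, but is not found, then MATLAB returns extents for that token of [N,N-1]. |
| 'match' | Text of each substring that matches the pattern in expression | By default, when returning all matches: if str and expression are both character vectors or string scalars, the output is a 1-by-n cell array, where n is the number of matches. Each cell contains matching text. When you specify the 'once' option to return only one match, the output is either a character vector, a string array, or a cell array with the same dimensions as str and expression. |
| 'tokens' | Text of each captured token in str | By default, when returning all matches: if str and expression are both character vectors or string scalars, the output is a 1-by-n cell array, where n is the number of matches. Each cell contains a 1-by-m cell array of token text, where m is the number of tokens in the match. When you specify the 'once' option to return only one match, the output is a 1-by-m array of token text or a cell array with the same dimensions as str and/or expression. If a token is expected at a particular index, but is not found, then MATLAB returns an empty value for the token, '' for character vectors or "" for strings. |
| 'names' | Name and text of each named token | For all matches: if str and expression are both character vectors or string scalars, the output is a 1-by-n structure array, where n is the number of matches. The structure has one field for each named token, and each field contains the matching text. |
| 'split' | Text of nonmatching substrings of str | For all matches: if str and expression are both character vectors or string scalars, the output is a 1-by-n cell array, where n is the number of nonmatching substrings. |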
More About
Tokens are portions of the matched text that correspond to portions of the regular expression. To create tokens, enclose part of the regular expression in parentheses.
For example, this expression finds a date of the form dd-mmm-yyyy,
including tokens for the day, month, and year.
str = 'Here is a date: 01-Apr-2020';
expression = '(\d+)-(\w+)-(\d+)';
mydate = regexp(str,expression,'tokens');
mydate{:}
ans =
  1×3 cell array
    {'01'}    {'Apr'}    {'2020'}
You can associate names with tokens so that they are more easily identifiable:
str = 'Here is a date: 01-Apr-2020';
expression = '(?<day>\d+)-(?<month>\w+)-(?<year>\d+)';
mydate = regexp(str,expression,'names')
mydate = 
  struct with fields:
      day: '01'
    month: 'Apr'
     year: '2020'
For more information, see Tokens in Regular Expressions.
Tips
Algorithms
MATLAB parses each input character vector or string from left to right, attempting to match the text in the character vector or string with the first element of the regular expression. During this process, MATLAB skips over any text that does not match.
When MATLAB finds the first match, it continues parsing to match the second piece of the expression, and so on.
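One consequence of this left-to-right scan is that reported matches do not overlap. The sketch below contrasts a plain pattern with a lookahead version that leaves the second character unconsumed; the input and variable names are assumptions.

```matlab
% Hypothetical input text, chosen only for illustration.
str = 'aaaa';

% After matching 'aa' at index 1, the scan resumes at index 3,
% so the overlapping candidate at index 2 is never reported.
plainIdx = regexp(str,'aa');

% A lookahead does not consume the second 'a', so every 'a' that is
% followed by another 'a' is reported.
lookaheadIdx = regexp(str,'a(?=a)');
```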
Extended Capabilities
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
Version History
Introduced before R2006a