Patterns to search and match text
A pattern defines rules for matching text with text-searching functions like contains, matches, and extract. You can build a pattern expression using pattern functions, operators, and literal text. For example, MATLAB® release names start with "R", followed by the four-digit year, and then either "a" or "b". Define a pattern to match the format of the release names:
pat = "R" + digitsPattern(4) + ("a"|"b");
Match that pattern in a string:
str = ["String was introduced in R2016b."
       "Pattern was added in R2020b."];
extract(str,pat)
ans = 2x1 string array
    "R2016b"
    "R2020b"
Patterns are composed of literal text and other patterns using the +, |, and ~ operators. You also can create common patterns using Object Functions, which use rules often associated with regular expressions:
Character-Matching Patterns – Ranges of letters or digits, wildcards, or whitespaces, such as lettersPattern.
Search Rules – How many times the pattern must occur, case sensitivity, optional patterns, and named expressions, such as asManyOfPattern and optionalPattern.
Boundaries – Boundaries at the start or end of a run of specific characters, such as alphanumericBoundary. Boundary patterns can be negated using the ~ operator so that a match to the boundary prevents a match of the pattern expression.
Pattern Organization – Define pattern structure and specify how pattern expressions are displayed, such as maskedPattern and namedPattern.
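The operators described above can be sketched in a few lines. This is an illustrative example only: the sample text and the variable names wholeWord and insideWord are made up for this sketch, and it assumes R2020b or later.

```matlab
% Hypothetical sample text for this sketch
txt = "A bat, a cat, and concat.";

% "|" matches either alternative; "+" matches patterns in sequence.
% Parentheses are needed because "+" binds more tightly than "|".
animal = ("b"|"c") + "at";

% Require a letter boundary on both sides: matches whole words only
wholeWord = letterBoundary + animal + letterBoundary;

% "~" negates the boundary: match only where the preceding position
% is NOT a letter boundary, that is, inside a longer word
insideWord = ~letterBoundary + animal;

nWhole  = count(txt, wholeWord)    % "bat" and "cat"
nInside = count(txt, insideWord)   % the "cat" inside "concat"
```

Here the whole-word pattern skips the "cat" in "concat" because the position between "n" and "c" is not a letter boundary, which is exactly the position the negated pattern requires.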
The pattern function also creates a pattern with the syntax pat = pattern(txt), where txt is literal text that pat matches. The pattern function is useful for specifying the pattern type in function argument validation. However, it is rarely needed in other cases because MATLAB text-matching functions accept text inputs.
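As a minimal sketch of the argument-validation use case, the function below declares its second input as type pattern. The function name countMatches is hypothetical, and the sketch assumes that the arguments block converts text inputs to the declared pattern class.

```matlab
function n = countMatches(txt, pat)
% countMatches  Hypothetical helper: count occurrences of pat in txt.
    arguments
        txt string
        pat pattern   % assumed: text inputs are converted to a pattern
    end
    n = count(txt, pat);
end
```

With this function on the path, a call such as countMatches("R2019a and R2020b", "R" + digitsPattern(4)) should count two matches, and passing plain text for pat is expected to behave the same way as passing pattern(txt).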
Object Functions
Search and Match Text
|contains|Determine if pattern is in strings|
|matches|Determine if pattern matches strings|
|count|Count occurrences of pattern in strings|
|endsWith|Determine if strings end with pattern|
|startsWith|Determine if strings start with pattern|
|extract|Extract substrings from strings|
|replace|Find and replace one or more substrings|
|replaceBetween|Replace substrings between start and end points|
|split|Split strings at delimiters|
|erase|Delete substrings within strings|
|eraseBetween|Delete substrings between start and end points|
|extractAfter|Extract substrings after specified positions|
|extractBefore|Extract substrings before specified positions|
|extractBetween|Extract substrings between start and end points|
|insertAfter|Insert strings after specified substrings|
|insertBefore|Insert strings before specified substrings|
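Character-Matching Patterns
|digitsPattern|Match digit characters|
|lettersPattern|Match letter characters|
|alphanumericsPattern|Match letter and digit characters|
|characterListPattern|Match characters from list|
|whitespacePattern|Match whitespace characters|
|wildcardPattern|Matches as few characters of any type|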
Search Rule Patterns
|optionalPattern|Make pattern optional to match|
|possessivePattern|Match pattern without backtracking|
|caseSensitivePattern|Match pattern with case sensitivity|
|caseInsensitivePattern|Match pattern regardless of case|
|asFewOfPattern|Match pattern as few times as possible|
|asManyOfPattern|Match pattern as many times as possible|
Boundary Patterns
|alphanumericBoundary|Match boundary between alphanumeric and non-alphanumeric characters|
|digitBoundary|Match boundary between digit characters and non-digit characters|
|letterBoundary|Match boundary between letter characters and non-letter characters|
|whitespaceBoundary|Match boundary between whitespace characters and non-whitespace characters|
|lineBoundary|Match start or end of line|
|textBoundary|Match start or end of text|
|lookAheadBoundary|Match boundary before specified pattern|
|lookBehindBoundary|Match boundary following specified pattern|
Regular Expression Patterns
|regexpPattern|Pattern that matches specified regular expression|
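For readers coming from regular expressions, the release-name pattern from the introduction can also be written with regexpPattern. This is a sketch for comparison only; the variable names rePat and releases are illustrative.

```matlab
% Same sample text as the release-name example, as a column string array
str = ["String was introduced in R2016b."
       "Pattern was added in R2020b."];

% Regular-expression equivalent of "R" + digitsPattern(4) + ("a"|"b")
rePat = regexpPattern("R\d{4}[ab]");

releases = extract(str, rePat)
```

A regexpPattern can be passed to the same text-matching functions as any other pattern, so it is a convenient way to reuse an existing regular expression.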
Search Text Using Patterns
lettersPattern is a typical character-matching pattern that matches letter characters. Create a pattern that matches one or more letter characters.
txt = ["This" "is a" "1x6" "string" "array" "."];
pat = lettersPattern;
Use contains to determine if characters matched by pat are present in each string. The output logical array shows that the first five of the strings in txt contain letters, but the sixth string does not.
contains(txt,pat)
ans = 1x6 logical array
   1   1   1   1   1   0
Determine if text starts with the specified pattern. The output logical array shows that four of the strings in txt start with letters, but two strings do not.
startsWith(txt,pat)
ans = 1x6 logical array
   1   1   0   1   1   0
Determine if each string fully matches the specified pattern. The output logical array shows which of the strings in txt contain nothing but letters.
matches(txt,pat)
ans = 1x6 logical array
   1   0   0   1   1   0
Count the number of times a pattern matched. The output numeric array shows how many times lettersPattern matched in each element of txt. Note that lettersPattern matches one or more letters, so a group of consecutive letters is a single match.
count(txt,pat)
ans = 1×6
     1     2     1     1     1     0
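Because contains, startsWith, and matches return logical arrays, their outputs can be used directly as index masks. The sketch below reuses the example data; the variable names onlyLetters and withLetters are illustrative.

```matlab
txt = ["This" "is a" "1x6" "string" "array" "."];
pat = lettersPattern;

% Keep only the elements that consist entirely of letters
onlyLetters = txt(matches(txt, pat))

% Keep every element that contains at least one letter
withLetters = txt(contains(txt, pat))
```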
Edit Text Using Patterns
digitsPattern is a typical character-matching pattern that matches digit characters. Create a pattern that matches digit characters.
txt = ["1 fish" "2 fish" "[1,0,0] fish" "[0,0,1] fish"];
pat = digitsPattern;
Use replace to edit pieces of text that match the pattern.
replace(txt,pat,"#")
ans = 1x4 string
    "# fish"    "# fish"    "[#,#,#] fish"    "[#,#,#] fish"
Create a new piece of text by inserting an "!" character after each match of the digit pattern.
insertAfter(txt,pat,"!")
ans = 1x4 string
    "1! fish"    "2! fish"    "[1!,0!,0!] fish"    "[0!,0!,1!] fish"
Patterns can be created using the OR operator,
|, with text. Erase text matched by the specified pattern.
txt = erase(txt,"," | "]" | "[")
txt = 1x4 string
    "1 fish"    "2 fish"    "100 fish"    "001 fish"
Extract pat from the new text.
extract(txt,pat)
ans = 1x4 string
    "1"    "2"    "100"    "001"
Count Characters in Text
Use patterns to count the occurrences of individual characters in a piece of text.
txt = "She sells sea shells by the sea shore.";
Create pat as a pattern object that matches individual letters using alphanumericsPattern. Extract the pattern.
pat = alphanumericsPattern(1);
letters = extract(txt,pat);
Display a histogram of the number of occurrences of each letter.
letters = lower(letters);
letters = categorical(letters);
histogram(letters)
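When the numeric counts are needed rather than a plot, the same categorical array can be tallied directly. This is a sketch that assumes the standard categorical functions categories and countcats; the variable names names, counts, and nS are illustrative.

```matlab
txt = "She sells sea shells by the sea shore.";

% Extract single alphanumeric characters and fold case
letters = categorical(lower(extract(txt, alphanumericsPattern(1))));

names  = categories(letters);   % distinct letters, as a cell array
counts = countcats(letters);    % occurrences of each letter, same order

% Look up the count for one letter
nS = counts(strcmp(names, 's'))
```

For this sentence the lookup returns 8, since "s" appears eight times once "She" is converted to lowercase.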
Hide Details When Displaying Complicated Patterns
Use maskedPattern to display a variable in place of a complicated pattern expression.
Build a pattern that matches simple arithmetic expressions composed of numbers and arithmetic operators.
mathSymbols = asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)
mathSymbols = pattern
  Matching:
    asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)
Build a pattern that matches arithmetic expressions with whitespaces between characters using mathSymbols.
longExpressionPat = asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols
longExpressionPat = pattern
  Matching:
    asManyOfPattern(asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1) + whitespacePattern) + asManyOfPattern(digitsPattern | characterListPattern("+-*/="),1)
The displayed pattern expression is long and difficult to read. Use
maskedPattern to display the variable name,
mathSymbols, in place of the pattern expression.
mathSymbols = maskedPattern(mathSymbols);
shortExpressionPat = asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols
shortExpressionPat = pattern
  Matching:
    asManyOfPattern(mathSymbols + whitespacePattern) + mathSymbols
  Show all details
Create a string containing some arithmetic expressions, and then extract the pattern from the text.
txt = "What is the answer to 1 + 1? Oh, I know! 1 + 1 = 2!";
arithmetic = extract(txt,shortExpressionPat)
arithmetic = 2x1 string
    "1 + 1"
    "1 + 1 = 2"
Specify Names and Descriptions for Complicated Patterns
Create a pattern from two named patterns. Naming patterns adds context to the display of the pattern.
Build two patterns: one that matches words that begin and end with the letter D, and one that matches words that begin and end with the letter R.
dWordsPat = letterBoundary + caseInsensitivePattern("d" + lettersPattern + "d") + letterBoundary;
rWordsPat = letterBoundary + caseInsensitivePattern("r" + lettersPattern + "r") + letterBoundary;
Build a pattern using the named patterns that finds a word that starts and ends with D followed by a word that starts and ends with R.
dAndRWordsPat = dWordsPat + whitespacePattern + rWordsPat
dAndRWordsPat = pattern
  Matching:
    letterBoundary + caseInsensitivePattern("d" + lettersPattern + "d") + letterBoundary + whitespacePattern + letterBoundary + caseInsensitivePattern("r" + lettersPattern + "r") + letterBoundary
This pattern is hard to read and does not convey much information about its purpose. Use
namedPattern to designate the patterns as named patterns that display specified names and descriptions in place of the pattern expressions.
dWordsPat = namedPattern(dWordsPat,"dWords", "Words that start and end with D");
rWordsPat = namedPattern(rWordsPat,"rWords", "Words that start and end with R");
dAndRWordsPat = dWordsPat + whitespacePattern + rWordsPat
dAndRWordsPat = pattern
  Matching:
    dWords + whitespacePattern + rWords
  Using named patterns:
    dWords: Words that start and end with D
    rWords: Words that start and end with R
  Show more details
Create a string and extract the text that matches the pattern.
txt = "Dad, look at the divided river!";
words = extract(txt,dAndRWordsPat)
words = "divided river"
Match Email Addresses
Build an easy-to-read pattern to match email addresses.
Email addresses follow the structure username@domain.TLD, where username and domain are made up of identifiers separated by periods. Build a pattern that matches identifiers composed of any combination of alphanumeric characters and "_" characters. Use maskedPattern to name this pattern identifier.
identifier = asManyOfPattern(alphanumericsPattern(1) | "_", 1);
identifier = maskedPattern(identifier);
Build patterns to match domains and subdomains composed of identifiers. Create a pattern that matches TLDs from a specified list.
subdomain = asManyOfPattern(identifier + ".") + identifier;
domainName = namedPattern(identifier,"domainName");
tld = "com" | "org" | "gov" | "net" | "edu";
Build a pattern for matching the local part of an email, which matches one or more identifiers separated by periods. Build a pattern for matching the domain, TLD, and any potential subdomains by combining the previously defined patterns. Use
namedPattern to assign each of these patterns to a named pattern.
username = asManyOfPattern(identifier + ".") + identifier;
domain = optionalPattern(namedPattern(subdomain) + ".") + ...
            domainName + "." + ...
            namedPattern(tld);
Combine all of the patterns into a single pattern expression. Use namedPattern to designate username, domain, and emailAddress as named patterns.
emailAddress = namedPattern(username) + "@" + namedPattern(domain);
emailPattern = namedPattern(emailAddress)
emailPattern = pattern
  Matching emailAddress:
    username + "@" + domain
  Using named patterns:
    emailAddress  : username + "@" + domain
      username    : asManyOfPattern(identifier + ".") + identifier
      domain      : optionalPattern(subdomain + ".") + domainName + "." + tld
        subdomain : asManyOfPattern(identifier + ".") + identifier
        domainName: identifier
        tld       : "com" | "org" | "gov" | "net" | "edu"
  Show all details
Create a string that contains an email address, and then extract the pattern from the text.
txt = "You can reach me by email at John.Smith@department.organization.org";
extract(txt,emailPattern)
ans = "John.Smith@department.organization.org"
Named patterns allow dot-indexing to access named subpatterns. Use dot-indexing to assign a specific value to the named pattern domain.
emailPattern.emailAddress.domain = "mathworks.com"
emailPattern = pattern
  Matching emailAddress:
    username + "@" + domain
  Using named patterns:
    emailAddress: username + "@" + domain
      username  : asManyOfPattern(identifier + ".") + identifier
      domain    : "mathworks.com"
  Show all details
Introduced in R2020b