Build Pattern Expressions
Patterns are a tool to aid in searching for and modifying text. Similar to regular expressions, a pattern defines rules for matching text. Patterns can be used with text-searching functions like extract to specify which portions of text these functions act on. You can build a pattern expression in a way similar to how you would build a mathematical expression, using pattern functions, operators, and literal text. Because building pattern expressions is open-ended, patterns can become quite complicated. Building patterns in steps and using functions like namedPattern can help organize complicated patterns.
Building Simple Patterns
The simplest pattern is built from a single pattern function. For example, lettersPattern matches any letter characters. There are many pattern functions for matching different types of characters and other features of text. A list of these functions can be found on the pattern reference page.
txt = "abc123def";
pat = lettersPattern;
extract(txt,pat)

ans = 2x1 string
    "abc"
    "def"
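As a small sketch of the point made in the introduction, the same pattern can be handed to other text functions besides extract. The example below assumes that contains and replace accept a pattern argument, as they do in releases that support patterns; the variable names are illustrative only.

```matlab
% One pattern reused with several text functions.
txt = "abc123def";
pat = digitsPattern;                 % matches a run of digit characters

hasDigits = contains(txt,pat);       % true, because "123" occurs in txt
digitRuns = extract(txt,pat);        % "123"
masked    = replace(txt,pat,"#");    % "abc#def"
```

The pattern is defined once and the choice of function decides whether the match is tested, returned, or replaced.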
Patterns combine with other patterns and literal text by using the plus(+) operator. This operator appends patterns and text together in the order they are defined in the pattern expression. The combined patterns only match text in the same order. In this example, "YYYY/MM/DD" is not a match because the pattern requires the four-letter group to come last, after the two two-letter groups.
txt = "Dates can be expressed as MM/DD/YYYY, DD/MM/YYYY, or YYYY/MM/DD";
pat = lettersPattern(2) + "/" + lettersPattern(2) + "/" + lettersPattern(4);
extract(txt,pat)

ans = 2x1 string
    "MM/DD/YYYY"
    "DD/MM/YYYY"
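The order sensitivity of the plus operator can be seen in a second, smaller sketch. The text and variable names here are made up for illustration; only lettersPattern, digitsPattern, and extract come from the material above.

```matlab
% The + operator is order sensitive: letters, then "-", then digits.
txt = "ID-42 and 42-ID";
pat = lettersPattern + "-" + digitsPattern;

found = extract(txt,pat);   % "ID-42"; "42-ID" has the parts in the wrong order
```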
Patterns used with the or(|) operator specify that only one of the two specified patterns needs to match a section of text. If neither pattern is able to match, then the pattern expression fails to match.
txt = "123abc";
pat = lettersPattern|digitsPattern;
extract(txt,pat)

ans = 2x1 string
    "123"
    "abc"
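The or operator also accepts literal text as one of its operands, in the same way that the identifier example later in this topic combines a pattern with "_". The following sketch uses a hypothetical score list to show a pattern and a literal used as alternatives.

```matlab
% Either a run of digits or the literal text "N/A" counts as a match.
txt = "Scores: 10, N/A, 7";
pat = digitsPattern | "N/A";

scores = extract(txt,pat);   % ["10"; "N/A"; "7"]
```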
Some pattern functions take patterns as their input and modify them in some way. For example, optionalPattern makes a specified pattern match if possible, but the pattern is not required for a successful match.
txt = ["123abc" "abc"];
pat = optionalPattern(digitsPattern) + lettersPattern;
extract(txt,pat)

ans = 1x2 string
    "123abc"    "abc"
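Another way to picture optionalPattern is an optional sign in front of a number. This is only a sketch with made-up input; it treats a leading "-" as part of the number when one is present.

```matlab
% A number with an optional leading minus sign.
txt = "5 -12 7";
pat = optionalPattern("-") + digitsPattern;

nums = extract(txt,pat);   % ["5"; "-12"; "7"]
```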
Boundary patterns are a special type of pattern that do not match characters but rather match the boundaries between a designated character type and other characters or the start or end of that piece of text. For example, digitBoundary matches the boundaries between digit characters and nondigit characters and between digit characters and the start or end of the text. It does not match digit characters themselves. Boundary patterns are useful as delimiters for functions like split.
txt = "123abc";
pat = digitBoundary;
split(txt,pat)

ans = 3x1 string
    ""
    "123"
    "abc"
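The empty string in the result above comes from the boundary between the start of the text and the first digit. As a contrasting sketch with a different, made-up input, text that starts and ends with letters has digit boundaries only in its interior, so no empty piece is produced.

```matlab
% Digit boundaries fall between "c" and "1" and between "3" and "d".
parts = split("abc123def",digitBoundary);   % ["abc"; "123"; "def"]
```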
Boundary patterns are special among patterns because they can be negated using the not(~) operator. A negated boundary pattern matches every position that the original boundary pattern does not match. For example, ~digitBoundary matches the boundary between:
characters that are both digits
characters that are both nondigits
a nondigit character and the start or end of a piece of text
Use replace to mark the locations matched by ~digitBoundary with a "|" character.
txt = "123abc";
pat = ~digitBoundary;
replace(txt,pat,"|")

ans = "1|2|3a|b|c|"
Building Complicated Patterns in Steps
Sometimes a simple pattern is not sufficient to solve a problem and a more complicated pattern is needed. As a pattern expression grows it can become difficult to understand what it is matching. One way to simplify building a complicated pattern is building each part of the pattern separately and then combining the parts together into a single pattern expression.
For instance, email addresses use the form local_part@domain.TLD. Each of the three identifiers (local_part, domain, and TLD) must be a combination of digits, letters, and underscore characters. To build the full pattern, start by defining a pattern for the identifiers. Build a pattern that matches one letter or digit character or one underscore character.
identCharacters = alphanumericsPattern(1) | "_";
Use asManyOfPattern to match one or more consecutive instances of identCharacters.
identifier = asManyOfPattern(identCharacters,1);
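Before combining it with other parts, the identifier pattern can be checked on its own. The sketch below repeats the two definitions so that it runs by itself; the test strings are hypothetical.

```matlab
% Rebuild the identifier pattern and try it on a few inputs.
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);   % one or more identifier characters

% matches requires the whole string to match the pattern.
isIdent = matches(["abc_1" "a-b" ""],identifier);  % [true false false]

% extract returns each maximal run of identifier characters.
pieces = extract("user_01@example",identifier);    % ["user_01"; "example"]
```

The empty string is rejected because the second argument of asManyOfPattern sets a minimum of one repetition.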
Next, build a pattern that matches an email containing multiple identifiers.
emailPattern = identifier + "@" + identifier + "." + identifier;
Test the pattern by seeing how well it matches the following example emails.
exampleEmails = ["firstname.lastname@example.org"; "email@example.com"; "firstname.lastname@example.org"];
matches(exampleEmails,emailPattern)

ans = 3x1 logical array
   0
   1
   0
The pattern fails to match several of the example emails even though all the emails are valid. Both the local_part and domain can be made of a series of identifiers that are separated by periods. Use the identifier pattern to build a pattern that is capable of matching a series of identifiers. asManyOfPattern matches as many consecutive appearances of the specified pattern as possible, but if there are none, the rest of the pattern is still able to match successfully.
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
Use this pattern to build a new emailPattern that can match all of the example emails.
emailPattern = identifierSeries + "@" + identifierSeries + "." + identifier;
matches(exampleEmails,emailPattern)

ans = 3x1 logical array
   1
   1
   1
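The same pattern can also pull addresses out of running text rather than only validating whole strings. The following sketch rebuilds the pattern from the steps above so that it is self-contained; the sample sentence and addresses are invented for illustration.

```matlab
% Rebuild the email pattern step by step.
identCharacters  = alphanumericsPattern(1) | "_";
identifier       = asManyOfPattern(identCharacters,1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;
emailPattern     = identifierSeries + "@" + identifierSeries + "." + identifier;

% Extract every address found in a sentence.
txt = "Contact first.last@example.org or admin@mail.example.com for help.";
found = extract(txt,emailPattern);
% found is ["first.last@example.org"; "admin@mail.example.com"]
```

This pattern is deliberately simple: it accepts only letters, digits, underscores, and periods, so it is a teaching example rather than a complete email validator.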
Organizing Pattern Display
Complex patterns can sometimes be difficult to read and interpret, especially for people you share them with who are unfamiliar with the pattern's structure. For example, when displayed, emailPattern is long and difficult to read.
emailPattern = 
  pattern
  Matching:

    asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "@" + asManyOfPattern(asManyOfPattern(alphanumericsPattern(1) | "_",1) + ".") + asManyOfPattern(alphanumericsPattern(1) | "_",1) + "." + asManyOfPattern(alphanumericsPattern(1) | "_",1)
Part of the issue with the display is that there are many repetitions of the identifier pattern. If the exact details of this pattern are not important to users of the pattern, then the display of the identifier pattern can be concealed using maskedPattern. This function creates a new pattern where the display of identifier is masked and the variable name, "identifier", is displayed instead. Alternatively, you can specify a different name to be displayed. The details of patterns that are masked in this way can be accessed by clicking "Show all details" in the displayed pattern.
identifier = maskedPattern(identifier);
identifierSeries = asManyOfPattern(identifier + ".") + identifier

identifierSeries = 
  pattern
  Matching:

    asManyOfPattern(identifier + ".") + identifier

  Show all details
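As noted above, a different display name can be supplied. The sketch below assumes the two-argument form maskedPattern(pat,mask) and uses the hypothetical name "ident". Masking affects only how the pattern is displayed, so the matching behavior is unchanged.

```matlab
% Mask the identifier pattern under a custom display name.
identCharacters = alphanumericsPattern(1) | "_";
identifier = asManyOfPattern(identCharacters,1);
ident = maskedPattern(identifier,"ident");    % shown as "ident" in displays

% The display of series refers to ident instead of the full expression.
series = asManyOfPattern(ident + ".") + ident

% Matching is the same as with the unmasked pattern.
tf = matches(["a_1.b2" "a..b"],series);       % [true false]
```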
Patterns can be further organized using the namedPattern function. namedPattern designates a pattern as a named pattern that changes how the pattern is displayed when combined with other patterns. Email addresses have several important portions, local_part@domain.TLD, which each have their own matching rules. Create a named pattern for each section.
localPart = namedPattern(identifierSeries,"local_part");
Named patterns can be nested to further delineate parts of a pattern. To nest a named pattern, build a pattern using named patterns and then designate that pattern as a named pattern. For example, domain.TLD can be divided into the domain, subdomains, and the top-level domain (TLD). Create named patterns for each part of domain.TLD.
subdomain = namedPattern(identifierSeries,"subdomain");
domainName = namedPattern(identifier,"domainName");
tld = namedPattern(identifier,"TLD");
Nest the named patterns for the components of domain underneath a single named pattern, domain.
domain = optionalPattern(subdomain + ".") + ...
    domainName + "." + ...
    tld;
domain = namedPattern(domain);
Combine the patterns together into a single named pattern, emailPattern. In the display of emailPattern you can see each named pattern and what it matches, as well as the information on any nested named patterns.
emailPattern = localPart + "@" + domain

emailPattern = 
  pattern
  Matching:

    local_part + "@" + domain

  Using named patterns:

    local_part  : asManyOfPattern(identifier + ".") + identifier
    domain      : optionalPattern(subdomain + ".") + domainName + "." + TLD
      subdomain : asManyOfPattern(identifier + ".") + identifier
      domainName: identifier
      TLD       : identifier

  Show all details
You can access named patterns and nested named patterns by dot-indexing into a pattern. For example, you can access the nested named pattern subdomain by dot-indexing from emailPattern into domain and then dot-indexing again into subdomain.
emailPattern.domain.subdomain

ans = 
  pattern
  Matching:

    asManyOfPattern(identifier + ".") + identifier

  Show all details
Dot-assignment can be used to change named patterns without needing to rewrite the rest of the pattern expression.
emailPattern.domain = "mathworks.com"

emailPattern = 
  pattern
  Matching:

    local_part + "@" + domain

  Using named patterns:

    local_part: asManyOfPattern(identifier + ".") + identifier
    domain    : "mathworks.com"

  Show all details
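The effect of dot-assignment on matching can be checked with a short sketch. It rebuilds a simplified version of the named email pattern (one named domain part instead of the nested structure above, which is an assumption made to keep the example short) and compares results before and after the assignment. The addresses are made up.

```matlab
% Rebuild a small named pattern, then swap one part by dot-assignment.
identifier = asManyOfPattern(alphanumericsPattern(1) | "_",1);
identifierSeries = asManyOfPattern(identifier + ".") + identifier;

localPart = namedPattern(identifierSeries,"local_part");
domain    = namedPattern(identifierSeries + "." + identifier,"domain");
emailPattern = localPart + "@" + domain;

emails = ["someone@mathworks.com" "someone@example.org"];
before = matches(emails,emailPattern);    % both addresses match

emailPattern.domain = "mathworks.com";    % replace only the domain part
after = matches(emails,emailPattern);     % only the first address matches
```

Only the named part is replaced; the local_part rules stay as they were.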
Copyright 2020 The MathWorks, Inc.