bertTokenizer

WordPiece BERT tokenizer

Since R2023b

Description

A Bidirectional Encoder Representations from Transformers (BERT) neural network WordPiece tokenizer maps text data to sequences of integers.

Creation

Syntax

tokenizer = bertTokenizer(vocabulary)

tokenizer = bertTokenizer(vocabulary,Name=Value)

Description

tokenizer = bertTokenizer(vocabulary) creates a bertTokenizer object for the specified vocabulary.

tokenizer = bertTokenizer(vocabulary,Name=Value) sets additional properties using one or more name-value arguments.
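
As a minimal sketch of the name-value syntax, the following creates a case-sensitive tokenizer with a smaller context size. The vocabulary and the values `false` and `128` are illustrative choices, not recommendations from this page.

```matlab
% Illustrative vocabulary that includes the default special tokens.
vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];

% Set properties at creation time. They are read-only afterward.
tokenizer = bertTokenizer(vocabulary,IgnoreCase=false,ContextSize=128);
```

Because these properties are read-only after creation, changing a setting means creating a new bertTokenizer object.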

Input Arguments

`vocabulary` — Tokenizer vocabulary
string array | cell array of character vectors

Tokenizer vocabulary, specified as a string array or cell array of character vectors.

The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties.

Data Types: string | cell
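
The requirement above also applies when you change the special tokens. The sketch below uses hypothetical token names (`"[PADDING]"`, `"[START]"`, `"[UNKNOWN]"`, `"[END]"`) chosen only for illustration, and includes each of them in the vocabulary.

```matlab
% The vocabulary contains every special token that the tokenizer is told to use.
vocabulary = ["math" "science" "engineering" ...
    "[PADDING]" "[START]" "[UNKNOWN]" "[END]"];

tokenizer = bertTokenizer(vocabulary, ...
    PaddingToken="[PADDING]", ...
    StartToken="[START]", ...
    UnknownToken="[UNKNOWN]", ...
    SeparatorToken="[END]");
```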

Properties

`IgnoreCase` — Flag to ignore case
`true` or `1` (default) | `false` or `0`

Flag to ignore case, specified as 1 (true) or 0 (false).

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

`StripAccents` — Flag to strip accents
`true` or `1` (default) | `false` or `0`

Flag to strip accents, specified as 1 (true) or 0 (false).

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

`ContextSize` — Context size
`512` (default) | positive integer

Context size, specified as a positive integer.

The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

`PaddingToken` — Padding token
`"[PAD]"` (default) | string scalar

Padding token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

Data Types: char | string

`PaddingCode` — Padding code
positive integer

This property is read-only.

Padding code, specified as a positive integer.

Data Types: double

`StartToken` — Start token
`"[CLS]"` (default) | string scalar

Start token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

Data Types: char | string

`StartCode` — Start code
positive integer

This property is read-only.

Start code, specified as a positive integer.

Data Types: double

`UnknownToken` — Unknown token
`"[UNK]"` (default) | string scalar

Unknown token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

Data Types: char | string

`UnknownCode` — Unknown code
positive integer

This property is read-only.

Unknown code, specified as a positive integer.

Data Types: double

`SeparatorToken` — Separator token
`"[SEP]"` (default) | string scalar

Separator token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

Data Types: char | string

`SeparatorCode` — Separator code
positive integer

This property is read-only.

Separator code, specified as a positive integer.

Data Types: double

Object Functions

`encode`	Tokenize and encode text for transformer neural network
`decode`	Convert token codes to tokens
`encodeTokens`	Convert tokens to token codes
`subwordTokenize`	Tokenize text into subwords using BERT tokenizer
`wordTokenize`	Tokenize text into words using tokenizer
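
The following sketch shows how some of these functions might be called together. It assumes that each function takes the tokenizer as its first input and the text or token codes as its second. The input string is illustrative, and the exact form of each output is not described on this page, so inspect the results in your own session.

```matlab
vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
tokenizer = bertTokenizer(vocabulary);

str = "math science engineering";

% Split the text into words, then into subwords.
words = wordTokenize(tokenizer,str);
subwords = subwordTokenize(tokenizer,str);

% Map the text to token codes, then map the codes back to tokens.
tokenCodes = encode(tokenizer,str);
decoded = decode(tokenizer,tokenCodes);
```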

Examples

Create BERT Tokenizer

Create a BERT tokenizer that has a vocabulary of the words "math", "science", and "engineering". Include tokens to use as padding, start, unknown, and separator tokens.

vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
tokenizer = bertTokenizer(vocabulary)

tokenizer = 
  bertTokenizer with properties:

        IgnoreCase: 1
      StripAccents: 1
      PaddingToken: "[PAD]"
       PaddingCode: 4
        StartToken: "[CLS]"
         StartCode: 5
      UnknownToken: "[UNK]"
       UnknownCode: 6
    SeparatorToken: "[SEP]"
     SeparatorCode: 7
       ContextSize: 512
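
In the output above, each code coincides with the position of its token in the vocabulary: "[PAD]" is the fourth element and PaddingCode is 4. This page does not state that as a general rule, so treat the check below as an observation about this example only.

```matlab
vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
tokenizer = bertTokenizer(vocabulary);

% Position of the padding token in the vocabulary (1-based).
paddingPosition = find(vocabulary == tokenizer.PaddingToken);

% Compare the position with the code that the tokenizer reports.
disp(paddingPosition == tokenizer.PaddingCode)
```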

Algorithms

WordPiece Tokenization

The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

These steps outline how to create a WordPiece tokenizer:

1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.
2. Count token frequencies — Iterate through the training data and count the frequency of each token in the vocabulary.
3. Merge most frequent pairs — Identify the most frequent pair of adjacent tokens in the training data and merge them into a single token. Update the vocabulary accordingly.
4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.
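
The creation steps can be sketched in plain MATLAB as follows. This is a simplified illustration of the frequency-based merging described above, not the procedure used to build any particular pretrained vocabulary. The toy corpus, the word counts, and the fixed number of merges (`numMerges`) are assumptions made for the example.

```matlab
% Toy corpus (hypothetical data): each word with how often it occurs.
words  = ["low" "lower" "lowest" "new" "newer"];
counts = [5 2 1 6 3];
numMerges = 3;   % assumed stopping rule: a fixed number of merges

% Initialize: split every word into single-character tokens.
sequences = cell(size(words));
for i = 1:numel(words)
    sequences{i} = string(num2cell(char(words(i))));
end
vocabulary = unique([sequences{:}]);

for m = 1:numMerges
    % Count how often each adjacent token pair occurs, weighted by word count.
    pairCounts = containers.Map('KeyType','char','ValueType','double');
    for i = 1:numel(sequences)
        seq = sequences{i};
        for j = 1:numel(seq)-1
            key = char(seq(j) + " " + seq(j+1));
            if isKey(pairCounts,key)
                pairCounts(key) = pairCounts(key) + counts(i);
            else
                pairCounts(key) = counts(i);
            end
        end
    end
    if pairCounts.Count == 0
        break   % every word is a single token, so nothing is left to merge
    end

    % Pick the most frequent pair (ties go to the first key in sorted order).
    pairKeys = keys(pairCounts);
    [~,best] = max(cell2mat(values(pairCounts)));
    parts = split(string(pairKeys{best}));
    newToken = parts(1) + parts(2);
    vocabulary(end+1) = newToken;

    % Replace every occurrence of the pair with the merged token.
    for i = 1:numel(sequences)
        seq = sequences{i};
        merged = strings(1,0);
        j = 1;
        while j <= numel(seq)
            if j < numel(seq) && seq(j) == parts(1) && seq(j+1) == parts(2)
                merged(end+1) = newToken;
                j = j + 2;
            else
                merged(end+1) = seq(j);
                j = j + 1;
            end
        end
        sequences{i} = merged;
    end
end

disp(vocabulary)
```

The sketch starts from the eight unique characters in the corpus and appends one merged token per iteration, so the final vocabulary contains the characters plus up to `numMerges` longer tokens.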

These steps outline how a WordPiece tokenizer tokenizes new text:

1. Split text — Split text into individual words.
2. Identify OOV words — Identify any words that are not present in the pretrained vocabulary.
3. Replace OOV words — Replace the OOV words with their subword counterparts from the vocabulary, for example, by iteratively checking whether the remaining part of an OOV word starts with a vocabulary token.
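
The replacement step can be sketched as a greedy longest-match search in plain MATLAB. The sketch assumes the common BERT convention of marking word-internal subwords with a `"##"` prefix, which this page does not describe, and uses a hypothetical vocabulary. It illustrates the idea and is not the implementation behind bertTokenizer.

```matlab
% Hypothetical vocabulary. "##" marks subwords that continue a word (assumed convention).
vocabulary = ["math" "science" "engineer" "##ing" "##s" "[UNK]"];
word = "engineering";   % not in the vocabulary as a whole word

chars = char(word);
tokens = strings(1,0);
startIdx = 1;
while startIdx <= numel(chars)
    % Try the longest remaining substring first, then shorten it.
    endIdx = numel(chars);
    match = "";
    while endIdx >= startIdx
        candidate = string(chars(startIdx:endIdx));
        if startIdx > 1
            candidate = "##" + candidate;
        end
        if any(vocabulary == candidate)
            match = candidate;
            break
        end
        endIdx = endIdx - 1;
    end
    if match == ""
        % No vocabulary token fits, so the whole word maps to the unknown token.
        tokens = "[UNK]";
        break
    end
    tokens(end+1) = match;
    startIdx = endIdx + 1;
end

disp(tokens)
```

For this vocabulary, the word splits into "engineer" followed by "##ing". A word with no matching prefix at some position falls back to the unknown token as a whole.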

References

[1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

[2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

Version History

Introduced in R2023b
