bpeTokenizer

Byte pair encoding tokenizer

Since R2024a

Description

A byte pair encoding (BPE) tokenizer maps text data to sequences of integers.

Creation

Syntax

tokenizer = bpeTokenizer(vocabulary,mergelist)

tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value)

Description

tokenizer = bpeTokenizer(vocabulary,mergelist) creates a bpeTokenizer object for the specified vocabulary and merge list.

tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value) sets additional properties using one or more name-value arguments.

Input Arguments

`vocabulary` — Tokenizer vocabulary
string array | cell array of character vectors

Tokenizer vocabulary, specified as a string array or cell array of character vectors.

The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties. The vocabulary must also contain the tokens in the merge list and the tokens that result from merging the tokens in the merge list.

The vocabulary must contain the byte values that make up the tokens. To store the vocabulary and merge list in string arrays and text files, the byte values must be represented as printable non-whitespace characters. As a result, bpeTokenizer objects require that some byte values are represented by characters that differ from their byte encoding. The format of the byte values in the vocabulary and merge list must be consistent with the format used in GPT-2 and many other transformer neural networks:

- Bytes that correspond to printable non-whitespace ASCII characters (byte values 33 through 126) are represented by those characters in the vocabulary. For example, the character "a" is represented as "a" in the vocabulary.
- Bytes with values 0 through 32 are represented as their own byte value plus 256. For example, the space character " " (char(32)) is represented as "Ġ" (char(32+256)) in the vocabulary.
- Bytes with values 127 through 160 are represented as their own byte value plus 162 in the vocabulary. For example, the byte 140, which can appear in two-byte characters like "Č" (composed as [196 140] in UTF-8 representation), is represented as "Į" (char(140+162)) in the vocabulary.
- Bytes with values greater than 160, excluding 173, are represented as their own byte value in the vocabulary. For example, the byte 195, which can appear in two-byte characters like "é" (composed as [195 169] in UTF-8 representation), is represented as "Ã" (char(195)) in the vocabulary.
- The byte value 173 is represented as the value 323 in the vocabulary. That is, the byte 173, which can appear in two-byte characters like "ŭ" (composed as [197 173] in UTF-8 representation), is represented as "Ń" (char(323)) in the vocabulary.
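
The byte representation rules can be reproduced in a few lines of base MATLAB. This is a minimal sketch that assumes the GPT-2 convention, in which bytes without a printable representation are assigned the code points 256, 257, and so on in increasing byte order. The names `codePoints` and `byteToChar` are illustrative and are not part of the bpeTokenizer interface.

```matlab
% All possible byte values.
byteValues = 0:255;

% Bytes that are represented by their own value: printable non-whitespace
% ASCII (33-126) and the ranges 161-172 and 174-255.
keep = (byteValues >= 33 & byteValues <= 126) | ...
    (byteValues >= 161 & byteValues <= 172) | ...
    (byteValues >= 174);

% The remaining bytes (0-32, 127-160, and 173) are assigned the code
% points 256, 257, ... in increasing byte order.
codePoints = byteValues;
codePoints(~keep) = 256 + (0:nnz(~keep)-1);

% Hypothetical helper: vocabulary representation of one or more bytes.
byteToChar = @(b) string(char(codePoints(double(b)+1)));

spaceToken = byteToChar(32)       % same as char(32+256)
byte140Token = byteToChar(140)    % same as char(140+162)
byte173Token = byteToChar(173)    % same as char(323)
byte195Token = byteToChar(195)    % same as char(195)
```

Because the helper indexes into a lookup table, it also accepts a vector of bytes and returns their representations joined into one string.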

Some characters require multiple bytes to represent them. For example, the emoji character "😎" is represented by the sequence of bytes [240 159 152 142]. To include such characters in the vocabulary, also include the representation of the bytes that compose the character. For example, to include the emoji character "😎" in the vocabulary, also include char(240), char(159+162), char(152+162), and char(142+162). The merge list must also contain pairs such that the character is the result of a series of these merges.

To convert Unicode character representations to sequences of numeric bytes, use the unicode2native function with the encoding "UTF-8".
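
As an illustration of the multibyte case, this sketch converts the emoji character to UTF-8 bytes, maps the bytes to their vocabulary representation, and builds one possible set of merge pairs that produces the full character. The left-to-right merge order and the variable names are assumptions made for the example; a merge list produced by training can reach the same token through a different sequence of merges.

```matlab
% Convert the character to its UTF-8 bytes.
bytes = double(unicode2native("😎","UTF-8"));

% Apply the representation rules. Only the rule for bytes 127 through 160
% is needed for this character, because its other byte keeps its value.
codes = bytes;
shifted = bytes >= 127 & bytes <= 160;
codes(shifted) = bytes(shifted) + 162;
byteTokens = string(num2cell(char(codes)));

% Build the merges left to right (an illustrative choice).
mergePairs = strings(0,2);
mergedTokens = strings(1,0);
current = byteTokens(1);
for k = 2:numel(byteTokens)
    mergePairs(end+1,:) = [current byteTokens(k)];
    current = current + byteTokens(k);
    mergedTokens(end+1) = current;
end

% Entries to add to the vocabulary and rows to add to the merge list.
vocabularyEntries = [byteTokens mergedTokens]
mergePairs
```

The final merged token is the vocabulary representation of the whole character, and each intermediate merged token must also appear in the vocabulary.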

Data Types: string | cell

`mergelist` — Pairs of tokens to merge
string array | cell array of character vectors

Pairs of tokens to merge, specified as a numPairs-by-2 string array or cell array of character vectors, where numPairs is the number of pairs.

The vocabulary must also contain the tokens that result from merging tokens in the merge list.
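
Before creating a tokenizer, it can help to confirm that a vocabulary covers a merge list. This sketch uses a small made-up vocabulary and merge list and only base MATLAB functions; the variable names are illustrative.

```matlab
% Hypothetical vocabulary and merge list for illustration.
vocabulary = ["a" "b" "c" "ab" "abc"];
mergelist = [
    "a"  "b"
    "ab" "c"];

% Tokens that result from applying each merge.
mergedTokens = mergelist(:,1) + mergelist(:,2);

% Every token in the merge list and every merged token must be in the
% vocabulary.
requiredTokens = unique([mergelist(:); mergedTokens]);
missingTokens = requiredTokens(~ismember(requiredTokens,vocabulary));

if isempty(missingTokens)
    disp("Vocabulary covers the merge list.")
else
    disp("Missing from vocabulary: " + join(missingTokens,", "))
end
```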

Data Types: string | cell

Properties

`Pretokenizer` — Pretokenizer
`"gpt2"` (default) | `"gpt4"` | `"bert"` | `"whitespace"` | `"mecab"` | `"none"`

Pretokenizer, specified as one of these values:

"gpt2" — Use GPT-2 pretokenizer.
"gpt4" — Use GPT-4 pretokenizer.
"bert" — Use BERT pretokenizer.
"whitespace" — Use a whitespace pretokenizer, that initially splits words at whitespace characters.
"mecab" — Use MeCab pretokenizer.
"none" — Do not pretokenize.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

`IgnoreCase` — Flag to ignore case
`1` (`true`) (default) | `0` (`false`)

Flag to ignore case, specified as 1 (true) or 0 (false).

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

`StripAccents` — Flag to strip accents
`1` (`true`) (default) | `0` (`false`)

Flag to strip accents, specified as 1 (true) or 0 (false).

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

`ContextSize` — Context size
`512` (default) | positive integer

Context size, specified as a positive integer.

The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

`PaddingToken` — Padding token
`""` (default) | string scalar

Padding token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

Data Types: char | string

`PaddingCode` — Padding code
positive integer

This property is read-only.

Padding code, specified as a positive integer.

Data Types: double

`StartToken` — Start token
`""` (default) | string scalar

Start token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

Data Types: char | string

`StartCode` — Start code
positive integer

This property is read-only.

Start code, specified as a positive integer.

Data Types: double

`UnknownToken` — Unknown token
`""` (default) | string scalar

Unknown token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

Data Types: char | string

`UnknownCode` — Unknown code
positive integer

This property is read-only.

Unknown code, specified as a positive integer.

Data Types: double

`SeparatorToken` — Separator token
`""` (default) | string scalar

Separator token, specified as a string scalar.

To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

Data Types: char | string

`SeparatorCode` — Separator code
positive integer

This property is read-only.

Separator code, specified as a positive integer.

Data Types: double

Object Functions

`encode`	Tokenize and encode text for transformer neural network
`decode`	Convert token codes to tokens
`encodeTokens`	Convert tokens to token codes
`wordTokenize`	Tokenize text into words using tokenizer

Examples

Create BPE Tokenizer

Create a BPE tokenizer.

Create a vocabulary containing the characters "a" through "z" and the pairs of repeating vowels "aa", "ee", "ii", "oo", and "uu".

vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"  ...
    "aa" "ee" "ii" "oo" "uu"];

Create a merge list that indicates to merge repeating vowels.

mergelist = [
    "a" "a"
    "e" "e"
    "i" "i"
    "o" "o"
    "u" "u"];

Create a BPE tokenizer with the vocabulary and merge list and specify the whitespace pretokenizer.

tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace")

tokenizer = 
  bpeTokenizer with properties:

        IgnoreCase: 1
      StripAccents: 1
      PaddingToken: ""
       PaddingCode: NaN
        StartToken: ""
         StartCode: NaN
      UnknownToken: ""
       UnknownCode: NaN
    SeparatorToken: ""
     SeparatorCode: NaN
      Pretokenizer: "whitespace"
       ContextSize: 512

Encode the phrase "a cool breeze frees trees" as a sequence of integers using the tokenizer. These integers index into the tokenizer vocabulary.

str = "a cool breeze frees trees";
tokenCodes = encode(tokenizer,str)

tokenCodes = 1x1 cell array
    {[1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19]}
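
To see which tokens the codes refer to, index into the vocabulary with the codes. This sketch copies the vocabulary and the codes from the example above and uses plain array indexing, so it does not require the tokenizer object.

```matlab
vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"  ...
    "aa" "ee" "ii" "oo" "uu"];

% Token codes returned by encode in the example above.
codes = [1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19];

% Look up the token for each code.
tokens = vocabulary(codes)
```

The codes 30 and 28 correspond to the merged tokens "oo" and "ee", which shows where the merge list was applied.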

Algorithms

Byte Pair Encoding

Byte pair encoding (BPE) is a tokenization algorithm that allows transformer networks to handle a wide range of vocabulary without assigning individual tokens for every possible word. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

These steps outline the algorithm for training a BPE tokenizer:

1. Start with a corpus of text, for example, a corpus that includes phrases like "use byte pair encoding to tokenize text". Split the text data into words using a specified pretokenization algorithm.
2. Initialize a vocabulary of bytes. For example, start with a vocabulary of ["a" "b" "c" ... "z"]. For non-ASCII characters like emojis that comprise multiple bytes, start with the byte values that make up the character.
3. Encode each word in the text data as a sequence of bytes, and represent the words as sequences of integers that index into the vocabulary. For example, represent the word "use" as [21 19 5]. When the encoding of a character is more than one byte, the resulting sequence of bytes can have more elements than the number of characters in the word.
4. Count the frequency of all adjacent pairs of bytes in the corpus. For example, among the words ["use" "byte" "pair" "encoding" "to" "tokenize" "text"], the token pairs ["t" "e"], ["e" "n"], and ["t" "o"] appear twice, and the remaining pairs appear once.
5. Identify the most frequent pair, add this token pair to the merge list, and add the corresponding merged token to the vocabulary. In the words represented as sequences of vocabulary indices, replace the corresponding pairs with the index of the new merged token in the vocabulary. For example, append the token pair ["t" "e"] to the merge list and add the corresponding merged token "te" to the vocabulary so that it has the index 27. Then, in the text data represented as vocabulary indices, replace each pair of vocabulary indices [20 5] (which corresponds to ["t" "e"]) with the new vocabulary index 27:
   - The representation [2 25 20 5] for the word "byte" becomes [2 25 27].
   - The representation [20 5 24 20] for the word "text" becomes [27 24 20].
6. Repeat the frequency count and merge operations until you reach a specified number of iterations or vocabulary size. For example, repeating these steps several times leads to merging the pair ["b" "y"] to make the token "by", and then subsequently the pair ["by" "te"] to make the token "byte".
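
The training steps above can be sketched in base MATLAB for the toy corpus. This is a simplified illustration, not the implementation used by any particular library: it assumes an ASCII-only corpus (so characters and bytes coincide), performs a single merge, and breaks ties in favor of the pair that appears first. All variable names are illustrative.

```matlab
% Toy corpus, already split into words (assumes whitespace pretokenization).
words = ["use" "byte" "pair" "encoding" "to" "tokenize" "text"];

% Initial vocabulary and empty merge list.
vocabulary = string(num2cell('a':'z'));
mergelist = strings(0,2);

% Represent each word as a row of single-character tokens.
tokenized = arrayfun(@(w) string(num2cell(char(w))),words,UniformOutput=false);

numMerges = 1;  % illustrative stopping criterion
for iter = 1:numMerges
    % Count adjacent token pairs across all words. Tokens never contain
    % whitespace, so a space-separated key identifies a pair uniquely.
    keys = strings(1,0);
    for i = 1:numel(tokenized)
        t = tokenized{i};
        keys = [keys, t(1:end-1) + " " + t(2:end)];
    end
    [uniqueKeys,~,idx] = unique(keys,"stable");
    counts = accumarray(idx(:),1);

    % Select the most frequent pair; ties go to the pair seen first.
    [~,best] = max(counts);
    pair = split(uniqueKeys(best)," ")';
    mergelist(end+1,:) = pair;
    vocabulary(end+1) = pair(1) + pair(2);

    % Replace each occurrence of the pair with the merged token.
    for i = 1:numel(tokenized)
        t = tokenized{i};
        k = 1;
        while k < numel(t)
            if t(k) == pair(1) && t(k+1) == pair(2)
                t(k) = pair(1) + pair(2);
                t(k+1) = [];
            end
            k = k + 1;
        end
        tokenized{i} = t;
    end
end

% Words "byte" and "text" as vocabulary indices after the merge.
[~,byteCodes] = ismember(tokenized{2},vocabulary)
[~,textCodes] = ismember(tokenized{7},vocabulary)
```

Increasing `numMerges` repeats the count and merge operations, which grows the vocabulary and merge list by one entry per iteration.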

These steps outline how a BPE tokenizer tokenizes new text:

1. Pretokenization — Split text into individual words.
2. Byte-encoding — Encode each word into a sequence of bytes.
3. Merge — Starting at the top of the merge list and progressing through it, iteratively apply each merge to pairs of tokens when possible.
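
These tokenization steps can be sketched in base MATLAB using the vocabulary and merge list from the example on this page. This is a simplified illustration under stated assumptions: the text is ASCII-only (so the byte-encoding step reduces to splitting words into characters), pretokenization is a plain whitespace split, and options such as case and accent handling are not modeled.

```matlab
vocabulary = [string(num2cell('a':'z')) "aa" "ee" "ii" "oo" "uu"];
mergelist = ["a" "a"; "e" "e"; "i" "i"; "o" "o"; "u" "u"];
str = "a cool breeze frees trees";

% Pretokenization: split the text at whitespace.
words = split(str)';

tokenCodes = [];
for w = words
    % Byte-encoding: for ASCII text, one token per character.
    t = string(num2cell(char(w)));

    % Merge: apply each merge in the order given by the merge list.
    for m = 1:size(mergelist,1)
        a = mergelist(m,1);
        b = mergelist(m,2);
        k = 1;
        while k < numel(t)
            if t(k) == a && t(k+1) == b
                t(k) = a + b;
                t(k+1) = [];
            end
            k = k + 1;
        end
    end

    % Convert the tokens to indices into the vocabulary.
    [~,codes] = ismember(t,vocabulary);
    tokenCodes = [tokenCodes codes];
end
tokenCodes
```

For this input, the sketch produces the same sequence of codes as the encode call in the example.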

Version History

Introduced in R2024a
