bpeTokenizer

Byte pair encoding tokenizer

Since R2024a

    Description

    A byte pair encoding (BPE) tokenizer maps text data to sequences of integers.

    Creation

    Description

    tokenizer = bpeTokenizer(vocabulary,mergelist) creates a bpeTokenizer object for the specified vocabulary and merge list.

    tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value) sets additional properties using one or more name-value arguments.

    Input Arguments

    vocabulary

    Tokenizer vocabulary, specified as a string array or cell array of character vectors.

    The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties. The vocabulary must also contain the tokens in the merge list and the tokens that result from merging the tokens in the merge list.

    The vocabulary must contain the byte values that make up the tokens. So that the vocabulary and merge list can be stored in string arrays and text files, the bytes must be represented as printable non-whitespace characters. In particular, bpeTokenizer objects require that some bytes be represented by characters that differ from their byte values. The format of the byte values in the vocabulary and merge list must be consistent with the format used in GPT-2 and many other transformer neural networks:

    • Bytes that correspond to printable ASCII characters are represented by those characters in the vocabulary. For example, the character "a" is represented as "a" in the vocabulary.

    • The byte value 173 is represented as the value 323 in the vocabulary. That is, the byte 173, which can appear in two-byte characters like "ŭ" (composed as [197 173] in UTF-8 representation), is represented as "Ń" (char(323)) in the vocabulary.

    • Bytes with values 127 through 160 are represented as their own byte value plus 162 in the vocabulary. For example, the byte 140, which can appear in two-byte characters like "Č" (composed as [196 140] in UTF-8 representation), is represented as "Į" (char(140+162)) in the vocabulary.

    • Bytes with values greater than 160, excluding 173, are represented as their own byte value in the vocabulary. For example, the byte 195, which can appear in two-byte characters like "é" (composed as [195 169] in UTF-8 representation), is represented as "Ã" (char(195)) in the vocabulary.

    • Bytes with values 0 through 32 are represented as their own byte value plus 256 in the vocabulary. For example, the space character " " (char(32)) is represented as "Ġ" (char(32+256)) in the vocabulary.
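
    The byte representation rules can be sketched as a lookup table. This is a minimal illustration that assumes the GPT-2 byte-to-character convention (bytes 0 through 32 shifted by 256, bytes 127 through 160 shifted by 162, and byte 173 mapped to 323). The variable names are hypothetical and are not part of bpeTokenizer.

```matlab
% Sketch only: build the code point that represents each byte value 0-255
% in the vocabulary, assuming the GPT-2 convention.
byteValues = 0:255;
codePoints = byteValues;                  % printable bytes represent themselves

lowIdx = byteValues <= 32;                % control characters and space
midIdx = byteValues >= 127 & byteValues <= 160;
codePoints(lowIdx) = byteValues(lowIdx) + 256;
codePoints(midIdx) = byteValues(midIdx) + 162;
codePoints(byteValues == 173) = 323;      % byte 173 gets its own code point

% Look up representations (index = byte value + 1).
spaceRep = char(codePoints(32 + 1));      % "Ġ"
byte140Rep = char(codePoints(140 + 1));   % "Į"
byte173Rep = char(codePoints(173 + 1));   % "Ń"
```

    Each of the 256 byte values maps to a distinct printable character, so the mapping can be inverted when decoding.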

    Some characters require multiple bytes to represent them. For example, the emoji character "😎" is represented by the sequence of bytes [240 159 152 142]. To include such characters in the vocabulary, also include the representation of the bytes that compose the character. For example, to include the emoji character "😎" in the vocabulary, also include char(240), char(159+162), char(152+162), and char(142+162). The merge list must also contain pairs such that the character is the result of a series of these merges.

    To convert Unicode character representations to sequences of numeric bytes, use the unicode2native function with the encoding "UTF-8".
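
    For example, this sketch converts the emoji "😎" to its UTF-8 bytes and then to the characters that represent those bytes in the vocabulary. It assumes the GPT-2 byte representation convention. The variable names are illustrative only.

```matlab
% "😎" (U+1F60E) written as a UTF-16 surrogate pair, which is how MATLAB
% stores characters outside the Basic Multilingual Plane.
emoji = char([55357 56846]);

% UTF-8 bytes of the character: [240 159 152 142].
bytes = double(unicode2native(emoji,"UTF-8"));

% Map each byte to the code point that represents it in the vocabulary
% (assumed GPT-2 convention).
rep = bytes;
lowIdx = bytes <= 32;
midIdx = bytes >= 127 & bytes <= 160;
rep(lowIdx) = bytes(lowIdx) + 256;
rep(midIdx) = bytes(midIdx) + 162;
rep(bytes == 173) = 323;

% One vocabulary entry per byte, as a 1-by-4 string array.
byteTokens = string(num2cell(char(rep)));
```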

    Data Types: string | cell

    mergelist

    Pairs of tokens to merge, specified as a numPairs-by-2 string array or cell array of character vectors, where numPairs is the number of pairs.

    The vocabulary must also contain the tokens that result from merging tokens in the merge list.

    Data Types: string | cell
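
    Before creating a tokenizer, it can help to check that the vocabulary covers the merge list. This sketch uses a small made-up vocabulary and merge list and only base MATLAB string operations. It is not a check that bpeTokenizer is documented to perform.

```matlab
% Hypothetical toy vocabulary and merge list.
vocabulary = ["a" "b" "aa" "ab" "aab"];
mergelist = [
    "a"  "a"
    "a"  "b"
    "aa" "b"];

% Token produced by each merge: concatenation of the pair.
mergedTokens = mergelist(:,1) + mergelist(:,2);

% Every token in the merge list, and every merge result, must be in the vocabulary.
requiredTokens = unique([mergelist(:); mergedTokens]);
missingTokens = setdiff(requiredTokens,vocabulary);

if isempty(missingTokens)
    disp("Vocabulary covers the merge list.")
else
    disp("Missing tokens: " + join(missingTokens,", "))
end
```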

    Properties

    Pretokenizer

    Pretokenizer, specified as one of these values:

    • "gpt2" — Use GPT-2 pretokenizer.

    • "gpt4" — Use GPT-4 pretokenizer.

    • "bert" — Use BERT pretokenizer.

    • "whitespace" — Use a whitespace pretokenizer that initially splits the text at whitespace characters.

    • "mecab" — Use MeCab pretokenizer.

    • "none" — Do not pretokenize.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    IgnoreCase

    Flag to ignore case, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    StripAccents

    Flag to strip accents, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    ContextSize

    Context size, specified as a positive integer.

    The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    PaddingToken

    Padding token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    PaddingCode

    This property is read-only.

    Padding code, specified as a positive integer.

    Data Types: double

    StartToken

    Start token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    StartCode

    This property is read-only.

    Start code, specified as a positive integer.

    Data Types: double

    UnknownToken

    Unknown token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    UnknownCode

    This property is read-only.

    Unknown code, specified as a positive integer.

    Data Types: double

    SeparatorToken

    Separator token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    SeparatorCode

    This property is read-only.

    Separator code, specified as a positive integer.

    Data Types: double

    Object Functions

    encode          Tokenize and encode text for transformer neural network
    decode          Convert token codes to tokens
    encodeTokens    Convert tokens to token codes
    wordTokenize    Tokenize text into words using tokenizer
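
    A minimal usage sketch of the object functions follows. It assumes a MATLAB installation that provides bpeTokenizer (R2024a or later), that each function takes the tokenizer as its first input, and that decode accepts the cell array that encode returns. The output formats of decode and wordTokenize are not shown because they are not confirmed here.

```matlab
% Toy vocabulary: the letters "a" through "z" plus doubled vowels.
vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ...
    "aa" "ee" "ii" "oo" "uu"];
mergelist = [
    "a" "a"
    "e" "e"
    "i" "i"
    "o" "o"
    "u" "u"];
tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace");

str = "a cool breeze";
tokenCodes = encode(tokenizer,str);     % cell array of token code vectors
tokens = decode(tokenizer,tokenCodes);  % convert the codes back to tokens
words = wordTokenize(tokenizer,str);    % split the text into words
```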

    Examples

    Create a BPE tokenizer.

    Create a vocabulary containing the characters "a" through "z" and the pairs of repeating vowels "aa", "ee", "ii", "oo", and "uu".

    vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
        "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"  ...
        "aa" "ee" "ii" "oo" "uu"];

    Create a merge list that indicates to merge repeating vowels.

    mergelist = [
        "a" "a"
        "e" "e"
        "i" "i"
        "o" "o"
        "u" "u"];

    Create a BPE tokenizer with the vocabulary and merge list and specify the whitespace pretokenizer.

    tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace")
    tokenizer = 
      bpeTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: ""
           PaddingCode: NaN
            StartToken: ""
             StartCode: NaN
          UnknownToken: ""
           UnknownCode: NaN
        SeparatorToken: ""
         SeparatorCode: NaN
          Pretokenizer: "whitespace"
           ContextSize: 512
    
    

    Encode the phrase "a cool breeze frees trees" as a sequence of integers using the tokenizer. These integers index into the tokenizer vocabulary.

    str = "a cool breeze frees trees";
    tokenCodes = encode(tokenizer,str)
    tokenCodes = 1x1 cell array
        {[1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19]}
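
    To see where these token codes come from, the following sketch reproduces them with base MATLAB string operations. It is a simplified illustration of byte pair encoding, assuming whitespace pretokenization and merges applied in merge-list order. It is not the bpeTokenizer implementation, and the variable names are hypothetical.

```matlab
vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ...
    "aa" "ee" "ii" "oo" "uu"];
mergelist = [
    "a" "a"
    "e" "e"
    "i" "i"
    "o" "o"
    "u" "u"];

str = "a cool breeze frees trees";
words = split(str);                  % split at whitespace (5-by-1 string array)

codes = [];
for w = words'
    % Start from the individual characters of the word.
    symbols = string(num2cell(char(w)));

    % Apply each merge rule in order, scanning the word left to right.
    for k = 1:size(mergelist,1)
        i = 1;
        while i < numel(symbols)
            if symbols(i) == mergelist(k,1) && symbols(i+1) == mergelist(k,2)
                symbols(i) = symbols(i) + symbols(i+1);  % merge the pair
                symbols(i+1) = [];
            end
            i = i + 1;
        end
    end

    % Token codes are the positions of the tokens in the vocabulary.
    [~,idx] = ismember(symbols,vocabulary);
    codes = [codes idx];
end
```

    For example, "cool" splits into "c", "o", "o", and "l". The pair "o" "o" merges into "oo", which leaves the tokens "c", "oo", and "l" with codes 3, 30, and 12.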
    
    

    Algorithms

    Version History

    Introduced in R2024a