
visionTransformer

Pretrained vision transformer (ViT) neural network

Since R2023b

    Description


    [net,classNames] = visionTransformer returns a base-sized ViT neural network (86.8 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    This feature requires a Deep Learning Toolbox™ license and the Computer Vision Toolbox™ Model for Vision Transformer Network support package. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

    [net,classNames] = visionTransformer(modelName) returns the ViT neural network with the specified model name.

    [net,classNames] = visionTransformer(___,Name=Value) specifies additional options using one or more name-value arguments.

    Examples


    Load a pretrained ViT neural network using the visionTransformer function. If the Computer Vision Toolbox Model for Vision Transformer Network support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install.

    Load a pretrained ViT neural network and the class names. If the required support package is installed, then the function returns a dlnetwork object and a string array of class names.

    [net,classNames] = visionTransformer;

    View the neural network.

    net
    net = 
      dlnetwork with properties:
    
             Layers: [143x1 nnet.cnn.layer.Layer]
        Connections: [167x2 table]
         Learnables: [200x3 table]
              State: [0x3 table]
         InputNames: {'imageinput'}
        OutputNames: {'softmax'}
        Initialized: 1
    
      View summary with summary.
    
    

    View the number of classes.

    numClasses = numel(classNames)
    numClasses = 1000
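
    You can use the returned network to classify a new image. This is a minimal sketch with simplified preprocessing (resizing only); it assumes an RGB input image, and peppers.png is an example image that ships with MATLAB.

    I = imread("peppers.png");          % example image included with MATLAB
    I = imresize(I,[384 384]);          % match the network input resolution
    X = dlarray(single(I),"SSCB");      % spatial-spatial-channel-batch format
    scores = predict(net,X);            % softmax scores for the 1000 classes
    [~,idx] = max(extractdata(scores));
    label = classNames(idx)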
    

    Input Arguments


    modelName — Model name, specified as one of these values (see the example after this list):

    • "base-16-imagenet-384" — Base-sized model (86.8 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    • "small-16-imagenet-384" — Small-sized model (22.1 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    • "tiny-16-imagenet-384" — Tiny-sized model (5.7 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: visionTransformer(DropoutProbability=0.2) returns a pretrained vision transformer neural network with the dropout probability set to 0.2.

    DropoutProbability — Probability of dropping out input elements in dropout layers, specified as a scalar in the range [0, 1).

    When you train a neural network with dropout layers, the layer randomly sets input elements to zero using the dropout mask rand(size(X)) < p, where X is the layer input and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

    This operation helps to prevent the network from overfitting [2], [3]. A higher number results in the network dropping more elements during training. At prediction time, the output of the layer is equal to its input.
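
    For example, this short sketch reproduces the dropout operation described above on a sample matrix (illustrative only, not the internal layer implementation):

    p = 0.2;                            % dropout probability
    X = rand(4,4,"single");             % example layer input
    mask = rand(size(X)) < p;           % elements to drop
    Y = (X .* ~mask) ./ (1-p);          % zero dropped elements, rescale the rest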

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    AttentionDropoutProbability — Probability of dropping out input elements in attention layers, specified as a scalar in the range [0, 1).

    When you train a neural network with attention layers, the layer randomly sets attention scores to zero using the dropout mask rand(size(scores)) < p, where scores is the layer input and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

    This operation helps to prevent the network from overfitting [2], [3]. A higher number results in the network dropping more elements during training. At prediction time, the output of the layer is equal to its input.
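
    To show where this fits, here is a minimal sketch of scaled dot-product attention with score dropout (illustrative only; the dimensions and variable names are arbitrary, not the network internals):

    d = 16; n = 8; p = 0.1;                       % head size, sequence length, dropout probability
    Q = rand(n,d,"single"); K = rand(n,d,"single"); V = rand(n,d,"single");
    S = (Q*K.')/sqrt(d);                          % scaled dot-product scores
    A = exp(S - max(S,[],2)); A = A ./ sum(A,2);  % softmax over the key dimension
    mask = rand(size(A)) < p;                     % scores to drop
    A = (A .* ~mask) ./ (1-p);                    % zero dropped scores, rescale the rest
    Y = A*V;                                      % attention output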

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Output Arguments


    net — Pretrained ViT neural network, returned as a dlnetwork (Deep Learning Toolbox) object.

    classNames — Class names, returned as a string array.

    References

    [1] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." Preprint, submitted June 3, 2021. https://doi.org/10.48550/arXiv.2010.11929.

    [2] Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." The Journal of Machine Learning Research 15, no. 1 (January 1, 2014): 1929–58.

    [3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60, no. 6 (May 24, 2017): 84–90. https://doi.org/10.1145/3065386.

    Version History

    Introduced in R2023b
