# selfAttentionLayer

## Description

A self-attention layer computes single-head or multihead self-attention of its input.

The layer:

1. Computes the queries, keys, and values from the input.
2. Computes the scaled dot-product attention across heads using the queries, keys, and values.
3. Merges the results from the heads.
4. Performs a linear transformation on the merged result.
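The four steps above can be sketched in plain MATLAB for a single sequence. This is an illustrative sketch only, not the layer's internal implementation: the variable names, the random weights, the channels-by-time layout, and the omission of biases, masking, and dropout are all assumptions made for the example.

```matlab
% Illustrative multihead self-attention for one sequence (no batch dimension).
% All names and sizes below are assumptions chosen for this sketch.
rng(0)
numHeads = 2;
numKeyChannels = 4;
numValueChannels = 4;
numInputChannels = 6;
numTimeSteps = 5;
outputSize = numInputChannels;

X = randn(numInputChannels,numTimeSteps);    % input, channels-by-time

% Random projection weights (biases omitted for brevity).
Wq = randn(numKeyChannels,numInputChannels);
Wk = randn(numKeyChannels,numInputChannels);
Wv = randn(numValueChannels,numInputChannels);
Wo = randn(outputSize,numValueChannels);

% Step 1: compute the queries, keys, and values.
Q = Wq*X;
K = Wk*X;
V = Wv*X;

% Step 2: scaled dot-product attention, one head at a time.
dk = numKeyChannels/numHeads;      % key channels per head
dv = numValueChannels/numHeads;    % value channels per head
merged = zeros(numValueChannels,numTimeSteps);
for h = 1:numHeads
    qh = Q((h-1)*dk+1:h*dk,:);
    kh = K((h-1)*dk+1:h*dk,:);
    vh = V((h-1)*dv+1:h*dv,:);

    scores = (kh.'*qh)/sqrt(dk);                 % keys-by-queries
    expScores = exp(scores - max(scores,[],1));  % subtract max for stability
    weights = expScores./sum(expScores,1);       % softmax over the keys

    % Step 3: merge by writing each head into its own block of channels.
    merged((h-1)*dv+1:h*dv,:) = vh*weights;
end

% Step 4: linear transformation of the merged result.
Y = Wo*merged;
size(Y)
```

Splitting the channels into contiguous blocks per head is one common convention, which is why the number of key and value channels must be divisible by the number of heads.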

## Creation

### Syntax

`layer = selfAttentionLayer(numHeads,numKeyChannels)`

`layer = selfAttentionLayer(numHeads,numKeyChannels,Name=Value)`

### Description

`layer = selfAttentionLayer(numHeads,numKeyChannels)` creates a self-attention layer and sets the `NumHeads` and `NumKeyChannels` properties.

`layer = selfAttentionLayer(numHeads,numKeyChannels,Name=Value)` sets the optional `NumValueChannels`, `OutputSize`, `HasPaddingMaskInput`, `AttentionMask`, `DropoutProbability`, `HasScoresOutput`, Parameters and Initialization, Learning Rate and Regularization, and `Name` properties.
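As a brief illustration of the second syntax, the following sketch sets several optional properties at creation. The specific values are arbitrary choices for the example; note that both the key and value channel counts are divisible by the number of heads.

```matlab
% Create a layer with 4 heads and 64 key and query channels,
% and set a few optional properties using Name=Value arguments.
% The values here are illustrative only.
layer = selfAttentionLayer(4,64, ...
    NumValueChannels=32, ...
    OutputSize=128, ...
    AttentionMask="causal", ...
    DropoutProbability=0.1, ...
    Name="self-attention");

layer.NumHeads
layer.NumKeyChannels
```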

## Properties

### Self-Attention

`NumHeads`

— Number of attention heads

positive integer

This property is read-only.

Number of attention heads, specified as a positive integer that evenly divides `NumKeyChannels`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`NumKeyChannels`

— Number of channels for keys and queries

positive integer

This property is read-only.

Number of channels for the keys and queries, specified as a positive integer that is divisible by `NumHeads`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`NumValueChannels`

— Number of channels for values

`"auto"` (default) | positive integer

Number of channels for the values, specified as one of these values:

- `"auto"`: Use `NumKeyChannels`.
- Positive integer: Use the specified number of channels. This value must be divisible by `NumHeads`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`OutputSize`

— Number of channels of layer output

`"auto"` (default) | positive integer

Number of channels of the layer output, specified as one of these values:

- `"auto"`: Use the number of channels in the layer input.
- Positive integer: Use the specified number of channels.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

`HasPaddingMaskInput`

— Flag indicating whether layer has mask input

`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an input that represents the padding mask, specified as `0` (`false`) or `1` (`true`).

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `'in'`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `'in'` and `'mask'`, which correspond to the input data and the mask, respectively. In this case, the padding mask is an array of ones and zeros. The layer prevents or allows attention to elements in the key-value pairs when the corresponding element in the mask is zero or one, respectively.
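A short sketch of how this flag changes the layer inputs, based on the input names described above. The head and channel counts are arbitrary values for the example.

```matlab
% Enable the padding mask input and inspect the resulting layer inputs.
layer = selfAttentionLayer(4,16,HasPaddingMaskInput=true);

layer.NumInputs     % number of inputs to the layer
layer.InputNames    % input names, which include 'mask' when the flag is true
```

When you add such a layer to a network, the `'mask'` input needs its own connection in addition to the `'in'` connection.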

`AttentionMask`

— Mask preventing attention to elements in key-value pairs

`"none"` (default) | `"causal"`

Mask preventing attention to elements in key-value pairs, specified as one of these values:

- `"none"`: Do not prevent attention to elements based on their positions. If `HasPaddingMaskInput` is `1` (`true`), then the layer prevents attention to padding elements only.
- `"causal"`: Prevent elements in position `M` from attending to elements in position `N`, where `N` is greater than `M`. Use this option for auto-regressive models.
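To make the `"causal"` rule concrete, the following plain MATLAB sketch builds such a mask for four positions and applies it to placeholder scores. The keys-by-queries orientation and the variable names are assumptions for illustration and do not describe the layer's internal representation.

```matlab
numTimeSteps = 4;

% allowed(N,M) is true when the query at position M may attend to the key
% at position N, that is, when N is not greater than M.
allowed = triu(true(numTimeSteps));    % rows: key position N, columns: query position M

scores = zeros(numTimeSteps);          % placeholder keys-by-queries scores
scores(~allowed) = -Inf;               % masked entries receive zero weight

expScores = exp(scores);
weights = expScores./sum(expScores,1); % softmax over the keys
disp(weights)
```

With uniform placeholder scores, the query at position `M` spreads its attention evenly over positions 1 through `M` and gives zero weight to later positions.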

`DropoutProbability`

— Dropout probability for attention scores

`0` (default) | scalar in the range [0, 1)

Dropout probability for attention scores, specified as a scalar in the range [0, 1).

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`HasScoresOutput`

— Flag indicating whether layer has scores output

`0` (`false`) (default) | `1` (`true`)

Flag indicating whether the layer has an output that represents the attention scores, specified as one of these values:

- `0` (`false`): The layer does not have an output that represents the attention scores.
- `1` (`true`): The layer has an output with the name `"scores"` that represents the attention scores.
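A short sketch of how this flag changes the layer outputs, based on the output names described above. The head and channel counts are arbitrary values for the example.

```matlab
% Enable the attention scores output and inspect the resulting layer outputs.
layer = selfAttentionLayer(4,16,HasScoresOutput=true);

layer.NumOutputs     % number of outputs of the layer
layer.OutputNames    % output names, which include 'scores' when the flag is true
```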

`InputSize`

— Number of input channels

`"auto"` (default) | positive integer

This property is read-only.

Number of input channels, specified as one of these values:

- `"auto"`: Automatically determine the number of input channels when you initialize the network.
- Positive integer: Configure the layer for the specified number of input channels. `InputSize` and the number of channels in the layer input data must match.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `char` | `string`

### Parameters and Initialization

`WeightsInitializer`

— Function to initialize weights

`"glorot"` (default) | `"he"` | `"narrow-normal"` | `"zeros"` | `"ones"` | function handle

Function to initialize the query, key, value, and output weights, specified as one of these values:

- `"glorot"`: Initialize the weights with the Glorot initializer (also known as the Xavier initializer) [2]. The Glorot initializer independently samples from a uniform distribution with zero mean and a variance of `2/(numIn + numOut)`. The values of `numIn` and `numOut` depend on the weight matrix:

  | Weight | `numIn` | `numOut` |
  | --- | --- | --- |
  | Query | `InputSize` | `NumKeyChannels` |
  | Key | `InputSize` | `NumKeyChannels` |
  | Value | `InputSize` | `NumValueChannels` |
  | Output | `NumValueChannels` | `OutputSize` |

- `"he"`: Initialize the weights with the He initializer [3]. The He initializer samples from a normal distribution with zero mean and a variance of `2/numIn`. The values of `numIn` and `numOut` depend on the weight matrix:

  | Weight | `numIn` | `numOut` |
  | --- | --- | --- |
  | Query | `InputSize` | `NumKeyChannels` |
  | Key | `InputSize` | `NumKeyChannels` |
  | Value | `InputSize` | `NumValueChannels` |
  | Output | `NumValueChannels` | `OutputSize` |

- `"narrow-normal"`: Initialize the weights by independently sampling from a normal distribution with zero mean and a standard deviation of 0.01.
- `"zeros"`: Initialize the weights with zeros.
- `"ones"`: Initialize the weights with ones.
- Function handle: Initialize the weights with a custom function. If you specify a function handle, then the function must have the form `weights = func(sz)`, where `sz` is the size of the weights. For an example, see Specify Custom Weight Initialization Function.

The layer only initializes the weights when the corresponding weights property is empty.

**Data Types:** `char` | `string` | `function_handle`
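As a hedged illustration of the function handle option, the following sketch defines a custom initializer of the required form `weights = func(sz)`. The name `glorotLike` is hypothetical, and the sketch assumes that `sz` is a two-element vector `[numOut numIn]`, consistent with the weight matrix sizes listed for the weights properties. It samples from a uniform distribution on `[-b, b]` with `b = sqrt(6/(numIn + numOut))`, which has variance `b^2/3 = 2/(numIn + numOut)`.

```matlab
% Hypothetical custom initializer with the form weights = func(sz).
% Assumes sz = [numOut numIn]; this is an assumption made for the sketch.
glorotLike = @(sz) (2*rand(sz) - 1)*sqrt(6/(sz(1) + sz(2)));

% Try the function on an example size before passing it to the layer.
W = glorotLike([16 12]);
size(W)

% Pass the function handle to the layer at creation.
layer = selfAttentionLayer(4,16,WeightsInitializer=glorotLike);
```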

`BiasInitializer`

— Function to initialize biases

`"zeros"` (default) | `"narrow-normal"` | `"ones"` | function handle

Function to initialize the query, key, value, and output biases, specified as one of these values:

- `"zeros"`: Initialize the biases with zeros.
- `"ones"`: Initialize the biases with ones.
- `"narrow-normal"`: Initialize the biases by independently sampling from a normal distribution with zero mean and a standard deviation of 0.01.
- Function handle: Initialize the biases with a custom function. If you specify a function handle, then the function must have the form `bias = func(sz)`, where `sz` is the size of the biases.

The layer only initializes the biases when the corresponding bias property is empty.

**Data Types:** `char` | `string` | `function_handle`

`QueryWeights`

— Query weights

`[]` (default) | matrix

Query weights, specified as a `NumKeyChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`KeyWeights`

— Key weights

`[]` (default) | matrix

Key weights, specified as a `NumKeyChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`ValueWeights`

— Value weights

`[]` (default) | matrix

Value weights, specified as a `NumValueChannels`-by-`numInputChannels` matrix or `[]`, where `numInputChannels` is the number of channels in the layer input.

**Data Types:** `single` | `double`

`OutputWeights`

— Output weights

`[]` (default) | matrix

Output weights, specified as an `OutputSize`-by-`NumValueChannels` matrix or `[]`.

**Data Types:** `single` | `double`

`QueryBias`

— Query biases

`[]` (default) | vector

Query biases, specified as a `NumKeyChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`KeyBias`

— Key biases

`[]` (default) | vector

Key biases, specified as a `NumKeyChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`ValueBias`

— Value biases

`[]` (default) | vector

Value biases, specified as a `NumValueChannels`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`

`OutputBias`

— Output biases

`[]` (default) | vector

Output biases, specified as an `OutputSize`-by-`1` vector or `[]`.

**Data Types:** `single` | `double`
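The sizes of the learnable parameters can be checked after initialization. The following sketch assumes that creating a `dlnetwork` from a layer array initializes the learnable parameters (its default behavior) and that the initialized values are visible through the `Layers` property of the network. The input size and channel counts are arbitrary values for the example.

```matlab
% Build a small network so that the layer can infer its input size.
layers = [
    sequenceInputLayer(12)
    selfAttentionLayer(4,16)];
net = dlnetwork(layers);

% Inspect the initialized self-attention layer.
attn = net.Layers(2);
size(attn.QueryWeights)     % NumKeyChannels-by-numInputChannels
size(attn.OutputWeights)    % OutputSize-by-NumValueChannels
size(attn.QueryBias)        % NumKeyChannels-by-1
```

Because `NumValueChannels` and `OutputSize` are left at `"auto"` here, the value projection uses 16 channels and the output has 12 channels, matching the input.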

### Learning Rate and Regularization

`WeightLearnRateFactor`

— Learning rate factor for weights

`1` (default) | nonnegative scalar

Learning rate factor for the query, key, value, and output weights, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the weights in this layer. For example, if `WeightLearnRateFactor` is `2`, then the learning rate for the weights in this layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
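The arithmetic above can be worked through with a small sketch. The global learning rate of 0.001 is a hypothetical value chosen for the example; in practice it comes from the training options.

```matlab
globalLearnRate = 0.001;    % hypothetical global learning rate

% Double the learning rate for the weights and freeze the biases.
layer = selfAttentionLayer(4,16, ...
    WeightLearnRateFactor=2, ...
    BiasLearnRateFactor=0);

weightLearnRate = layer.WeightLearnRateFactor*globalLearnRate
biasLearnRate = layer.BiasLearnRateFactor*globalLearnRate
```

A factor of `0` gives an effective learning rate of zero, so the corresponding parameters do not change during training.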

`BiasLearnRateFactor`

— Learning rate factor for biases

`1` (default) | nonnegative scalar

Learning rate factor for the query, key, value, and output biases, specified as a nonnegative scalar.

The software multiplies this factor by the global learning rate to determine the learning rate for the biases in this layer. For example, if `BiasLearnRateFactor` is `2`, then the learning rate for the biases in the layer is twice the current global learning rate. The software determines the global learning rate based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`WeightL2Factor`

— *L*_{2} regularization factor for weights

`1` (default) | nonnegative scalar

$L_2$ regularization factor for the query, key, value, and output weights, specified as a nonnegative scalar.

The software multiplies this factor by the global $L_2$ regularization factor to determine the $L_2$ regularization for the weights in this layer. For example, if `WeightL2Factor` is `2`, then the $L_2$ regularization for the weights in this layer is twice the global $L_2$ regularization factor. You can specify the global $L_2$ regularization factor using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`BiasL2Factor`

— *L*_{2} regularization factor for biases

`0` (default) | nonnegative scalar

$L_2$ regularization factor for the query, key, value, and output biases, specified as a nonnegative scalar.

The software multiplies this factor by the global $L_2$ regularization factor to determine the $L_2$ regularization for the biases in this layer. For example, if `BiasL2Factor` is `2`, then the $L_2$ regularization for the biases in this layer is twice the global $L_2$ regularization factor. The software determines the global $L_2$ regularization factor based on the settings you specify using the `trainingOptions` function.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

### Layer

`Name`

— Layer name

`''` (default) | character vector | string scalar

Layer name, specified as a character vector or a string scalar. For `Layer` array input, the `trainNetwork`, `assembleNetwork`, `layerGraph`, and `dlnetwork` functions automatically assign names to layers with the name `''`.

**Data Types:** `char` | `string`

`NumInputs`

— Number of inputs

`1` | `2`

This property is read-only.

Number of inputs to the layer.

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `'in'`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `'in'` and `'mask'`, which correspond to the input data and the mask, respectively. In this case, the padding mask is an array of ones and zeros. The layer prevents or allows attention to elements in the key-value pairs when the corresponding element in the mask is zero or one, respectively.

**Data Types:** `double`

`InputNames`

— Input names

`{'in'}` | `{'in','mask'}`

This property is read-only.

Input names of the layer.

If the `HasPaddingMaskInput` property is `0` (`false`), then the layer has one input with the name `'in'`, which corresponds to the input data. In this case, the layer treats all elements as data.

If the `HasPaddingMaskInput` property is `1` (`true`), then the layer has two inputs with the names `'in'` and `'mask'`, which correspond to the input data and the mask, respectively. In this case, the padding mask is an array of ones and zeros. The layer prevents or allows attention to elements in the key-value pairs when the corresponding element in the mask is zero or one, respectively.

`NumOutputs`

— Number of outputs

`1` | `2`

This property is read-only.

Number of outputs of the layer.

If the `HasScoresOutput` property is `0` (`false`), then the layer has one output with the name `'out'`, which corresponds to the output data.

If the `HasScoresOutput` property is `1` (`true`), then the layer has two outputs with the names `'out'` and `'scores'`, which correspond to the output data and the attention scores, respectively.

**Data Types:** `double`

`OutputNames`

— Output names

`{'out'}` | `{'out','scores'}`

This property is read-only.

Output names of the layer.

If the `HasScoresOutput` property is `0` (`false`), then the layer has one output with the name `'out'`, which corresponds to the output data.

If the `HasScoresOutput` property is `1` (`true`), then the layer has two outputs with the names `'out'` and `'scores'`, which correspond to the output data and the attention scores, respectively.

## Examples

### Create Self-Attention Layer

Create a self-attention layer with eight heads and 256 key and query channels.

    layer = selfAttentionLayer(8,256)

    layer = 
      SelfAttentionLayer with properties:

                       Name: ''
              AttentionMask: 'none'
        HasPaddingMaskInput: 0
            HasScoresOutput: 0

       Hyperparameters
                  InputSize: 'auto'
                   NumHeads: 8
             NumKeyChannels: 256
           NumValueChannels: 'auto'
                 OutputSize: 'auto'
         DropoutProbability: 0

       Learnable Parameters
               QueryWeights: []
                 KeyWeights: []
               ValueWeights: []
              OutputWeights: []
                  QueryBias: []
                    KeyBias: []
                  ValueBias: []
                 OutputBias: []

Include a self-attention layer in a layer array.

    layers = [
        sequenceInputLayer(12)
        selfAttentionLayer(4,12)
        layerNormalizationLayer
        fullyConnectedLayer(9)
        softmaxLayer];
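As a possible next step, the layer array can be converted to a network and evaluated on sample data. This sketch is not part of the original example: the use of `dlnetwork` and `predict`, the random input, and the batch size and sequence length are assumptions chosen for illustration.

```matlab
% Same layer array as in the example above.
layers = [
    sequenceInputLayer(12)
    selfAttentionLayer(4,12)
    layerNormalizationLayer
    fullyConnectedLayer(9)
    softmaxLayer];

% Convert to a network, which initializes the learnable parameters.
net = dlnetwork(layers);

% Random input with 12 channels, 3 observations, and 10 time steps.
X = dlarray(rand(12,3,10),"CBT");

% Forward pass in inference mode.
Y = predict(net,X);
size(Y)
```

The output has 9 channels because of the fully connected layer, and the batch and time dimensions match those of the input.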

## References

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." *Advances in Neural Information Processing Systems* 30 (December 2017): 6000-6010. https://papers.nips.cc/paper/7181-attention-is-all-you-need.

[2] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In *Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics*, 249–256. Sardinia, Italy: AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

[3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In *Proceedings of the 2015 IEEE International Conference on Computer Vision*, 1026–1034. Washington, DC: IEEE Computer Vision Society, 2015. https://doi.org/10.1109/ICCV.2015.123

## Version History

**Introduced in R2023a**
