Attention layer: Number of parameters doesn't change when changing number of heads
Changing the number of heads of a selfAttentionLayer from the MATLAB Deep Learning Toolbox doesn't seem to affect the resulting number of learnable parameters.
The following code results in 1793 total parameters:
% Number of heads for multi head attention layer
num_heads = 1;
% Number of key channels for query, key and value
num_keys = 256;
% Number of classes
num_classes = 5;
% Define architecture
network_layers = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer
classificationLayer];
% Define layer graph
net = layerGraph;
net = addLayers(net,network_layers);
% Analyze network structure (shows learnable parameter counts)
analyzeNetwork(net)

When changing the number of heads to e.g. 16, the number of learnable parameters doesn't change:
% Number of heads for multi head attention layer
num_heads = 16;

Why is that?
Shouldn't the number of learnable parameters of the attention layer increase in proportion to the number of heads?
Any help is highly appreciated!
Answers (1)
Angelo Yeo
on 10 Jan 2024
This is expected. Increasing or decreasing the number of heads of multi-head attention does not change the total number of learnable parameters, because multi-head attention divides the embedding (model) dimensionality by the number of heads. The multiple heads enable parallel computation; they do not change the number of parameters.
Let's assume a single-head attention model with model dimensionality d. When this model projects embeddings to a single triplet of Q, K, V tensors, it will produce
3 * d * d = 3d^2
parameters excluding biases.
Let's also assume another model with multi-head attention with k attention heads. When this model projects embeddings to k triplets of (d/k)-dimensional Q, K, V tensors, it will produce
k * 3 * d * (d/k) = 3d^2
parameters excluding biases, i.e. exactly the same total.
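As a quick sanity check, here is a minimal sketch of that counting argument. The values d = 512 and k = 8 are illustrative only (they are not the network from the question), and biases are ignored:
% Minimal sketch of the counting argument above (illustrative values,
% not the network from the question; biases are ignored)
d = 512;    % model / embedding dimensionality
k = 8;      % number of attention heads
% Single head: the Q, K and V projections are each d-by-d
singleHead = 3 * d * d;
% k heads: each head projects to d/k channels, so each of the k heads
% contributes 3 * d * (d/k) parameters
multiHead = k * 3 * d * (d/k);
fprintf('single head: %d parameters, %d heads: %d parameters\n', ...
    singleHead, k, multiHead)
% Both counts are 786432, i.e. the totals are identical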
3 Comments
Angelo Yeo
on 5 Jul 2024 (edited)
Hi @铖, the code below might reproduce what you have experienced. So your question is why the learnable parameters of two models whose self-attention layers have different numbers of heads end up with the same dimensions.
% Number of heads for multi head attention layer
num_heads1 = 1;
num_heads2 = 4;
% Number of key channels for query, key and value
num_keys = 64;
% Number of classes
num_classes = 5;
% Define architectures
network_layers1 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads1,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
network_layers2 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads2,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
% Create dlnetwork objects
net1 = dlnetwork(network_layers1);
net2 = dlnetwork(network_layers2);
net1.Learnables
net2.Learnables
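If you prefer a single number to compare instead of inspecting the two Learnables tables, one way (assuming net1 and net2 from the snippet above are still in the workspace) is to total the elements of every learnable array:
% Total learnable parameters per network (net1 and net2 from the code above)
numParams1 = sum(cellfun(@numel, net1.Learnables.Value));
numParams2 = sum(cellfun(@numel, net2.Learnables.Value));
fprintf('net1: %d parameters, net2: %d parameters\n', numParams1, numParams2)
% Both totals are the same despite the different head counts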
To answer this question, it's good to understand what "multi-head attention" does. In essence, it splits the embedding dimension across the heads so that the attention calculations can run in parallel. Let's take a look at the picture below (from Attention Is All You Need, Vaswani et al., 2017).

Do you see the "h"? The "h" is the number of heads: it tells you how many parts the embedding dimension is split into. And do you see the "Concat" block? After computing the "Scaled Dot-Product Attention" for each head, the per-head results are concatenated back together.
Let's say the sentence "Anthony Hopkins admired Michael Bay as a great director." is input to the self-attention layer, and assume the embedding dimension is 512. The embedded sentence is then represented as a 9 x 512 matrix, where 9 is the number of tokens and 512 is the embedding dimension. If you set the number of heads to 8, there would be eight matrices of size 9 x (512/8) = 9 x 64. These eight matrices are concatenated again after the attention scores are calculated. The example and picture are taken from K_1.3. Multi-head Attention, deep dive_EN - Deep Learning Bible - 3. Natural Language Processing - Eng. (wikidocs.net).
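As a rough, shape-only sketch of that split-and-concatenate step (using the example values 9, 512 and 8 from above; this is not the actual attention computation):
% Shape-only illustration of splitting an embedded sequence into heads
numTokens = 9;                       % tokens in the example sentence
embedDim  = 512;                     % embedding dimension
numHeads  = 8;                       % number of heads
headDim   = embedDim / numHeads;     % 64 channels per head
X = rand(numTokens, embedDim);       % stand-in for the embedded sentence
% Split the 9 x 512 matrix into eight 9 x 64 slices, one per head
heads = reshape(X, numTokens, headDim, numHeads);
% ... scaled dot-product attention would run on each 9 x 64 slice ...
% Concatenate the per-head results back into a 9 x 512 matrix
Y = reshape(heads, numTokens, embedDim);
size(Y)   % returns [9 512]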

In conclusion, the number of heads does not affect the number of learnable parameters in the self-attention layer.
铖
on 5 Jul 2024
Hello, thank you very much for your reply. I have understood the dimension issue of the multi-head attention mechanism. However, my actual problem is that when training with layer = selfAttentionLayer(4,64) or with layer = selfAttentionLayer(1,64), they end up with exactly the same trained weights, meaning the values trained by these two different codes are exactly identical! This surprised me very much. I am using MATLAB R2023b. Thank you very much for your help!