Attention layer: Number of parameters doesn't change when changing number of heads
Changing the number of heads of a selfAttentionLayer from the MATLAB Deep Learning Toolbox doesn't seem to affect the resulting number of learnable parameters.
The following code results in 1793 total parameters:
% Number of heads for multi head attention layer
num_heads = 1;
% Number of key channels for query, key and value
num_keys = 256;
% Number of classes
num_classes = 5;
% Define architecture
network_layers = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer
classificationLayer];
% Define layer graph
net = layerGraph;
net = addLayers(net,network_layers);
% Analyze network structure (shows learnable parameter counts)
analyzeNetwork(net)

When changing the number of heads to e.g. 16, the number of learnable parameters doesn't change:
% Number of heads for multi head attention layer
num_heads = 16;

Why is that?
Shouldn't the number of learnable parameters of the attention layer increase in proportion to the number of heads?
Any help is highly appreciated!
Answers (1)
Angelo Yeo
on 10 Jan 2024
This is expected. Increasing or decreasing the number of heads of multi-head attention does not change the total number of learnable parameters, because multi-head attention divides the embedding (model) dimensionality by the number of heads. The multiple heads enable parallel computation; they do not change the number of parameters.
Let's assume a single-head attention model with model dimensionality d. When this model projects embeddings to a single triplet of Q, K, V tensors, it will produce
3 * d * d = 3d^2
parameters excluding biases.
Let's also assume another model with multi-head attention with k attention heads. When this model projects embeddings to k triplets of (d/k)-dimensional Q, K, V tensors, it will produce
k * 3 * d * (d/k) = 3d^2
parameters excluding biases, i.e. exactly the same total.
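As a quick sanity check, here is a minimal sketch of that counting argument. The values d = 512 and k = 8 are illustrative only (they are not the network from the question), and biases are ignored:
% Minimal sketch of the counting argument above (illustrative values,
% not the network from the question; biases are ignored)
d = 512;    % model / embedding dimensionality
k = 8;      % number of attention heads
% Single head: the Q, K and V projections are each d-by-d
singleHead = 3 * d * d;
% k heads: each head projects to d/k channels, so each of the k heads
% contributes 3 * d * (d/k) parameters
multiHead = k * 3 * d * (d/k);
fprintf('single head: %d parameters, %d heads: %d parameters\n', ...
    singleHead, k, multiHead)
% Both counts are 786432, i.e. the totals are identical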
3 Comments
Angelo Yeo
on 5 Jul 2024 (edited)
Hi @铖, the code below might reproduce what you have experienced. So your question is why the learnable parameters of two models whose self-attention layers have different numbers of heads end up with the same dimensions.
% Number of heads for multi head attention layer
num_heads1 = 1;
num_heads2 = 4;
% Number of key channels for query, key and value
num_keys = 64;
% Number of classes
num_classes = 5;
% Define architectures
network_layers1 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads1,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
network_layers2 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads2,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
% Create dlnetwork objects
net1 = dlnetwork(network_layers1);
net2 = dlnetwork(network_layers2);
net1.Learnables
net2.Learnables
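If you prefer a single number to compare instead of inspecting the two Learnables tables, one way (assuming net1 and net2 from the snippet above are still in the workspace) is to total the elements of every learnable array:
% Total learnable parameters per network (net1 and net2 from the code above)
numParams1 = sum(cellfun(@numel, net1.Learnables.Value));
numParams2 = sum(cellfun(@numel, net2.Learnables.Value));
fprintf('net1: %d parameters, net2: %d parameters\n', numParams1, numParams2)
% Both totals are the same despite the different head counts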
To answer this question, it's good to understand what "multi-head attention" does. In essence, it splits the embedding dimension across the heads so that the attention calculations can run in parallel. Let's take a look at the picture below (from Attention Is All You Need, Vaswani et al., 2017).

Do you see the "h"? The "h" is the number of heads: it tells you how many parts the embedding dimension is split into. And do you see the "Concat" block? After computing the "Scaled Dot-Product Attention" for each head, the per-head results are concatenated back together.
Let's say the sentence "Anthony Hopkins admired Michael Bay as a great director." is input to the self-attention layer, and assume the embedding dimension is 512. The embedded sentence is then represented as a 9 x 512 matrix, where 9 is the number of tokens and 512 is the embedding dimension. If you set the number of heads to 8, there would be eight matrices of size 9 x (512/8) = 9 x 64. These eight matrices are concatenated again after the attention scores are calculated. The example and picture are taken from K_1.3. Multi-head Attention, deep dive_EN - Deep Learning Bible - 3. Natural Language Processing - Eng. (wikidocs.net).
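As a rough, shape-only sketch of that split-and-concatenate step (using the example values 9, 512 and 8 from above; this is not the actual attention computation):
% Shape-only illustration of splitting an embedded sequence into heads
numTokens = 9;                       % tokens in the example sentence
embedDim  = 512;                     % embedding dimension
numHeads  = 8;                       % number of heads
headDim   = embedDim / numHeads;     % 64 channels per head
X = rand(numTokens, embedDim);       % stand-in for the embedded sentence
% Split the 9 x 512 matrix into eight 9 x 64 slices, one per head
heads = reshape(X, numTokens, headDim, numHeads);
% ... scaled dot-product attention would run on each 9 x 64 slice ...
% Concatenate the per-head results back into a 9 x 512 matrix
Y = reshape(heads, numTokens, embedDim);
size(Y)   % returns [9 512]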

In conclusion, the number of heads does not affect the number of learnable parameters in the self-attention layer.
铖
on 5 Jul 2024
Hello, thank you very much for your reply. I have understood the dimension issue of the multi-head attention mechanism. However, my actual problem is that when training with layer = selfAttentionLayer(4,64) or with layer = selfAttentionLayer(1,64), they end up with exactly the same trained weights, meaning the values trained by these two different codes are exactly identical! This surprised me very much. I am using MATLAB R2023b. Thank you very much for your help!