When you train a network using layers, layer graphs, or dlnetwork
objects, the software automatically initializes the learnable parameters according to the
layer initialization properties. When you define a deep learning model as a function, you
must initialize the learnable parameters manually.
How you initialize learnable parameters (for example, weights and biases) can have a big impact on how quickly a deep learning model converges.
Tip
This topic explains how to initialize learnable parameters for a deep learning model
defined as a function in a custom training loop. To specify the learnable
parameter initialization for a deep learning layer, use the corresponding layer
property. For example, to set the weights initializer of a convolution2dLayer
object, use the WeightsInitializer property.
This table shows the default initializations for the learnable parameters for each layer, and provides links that show how to initialize learnable parameters for model functions by using the same initialization.
Layer                      Learnable Parameter  Default Initialization
convolution2dLayer         Weights              Glorot Initialization
                           Bias                 Zeros Initialization
convolution3dLayer         Weights              Glorot Initialization
                           Bias                 Zeros Initialization
groupedConvolution2dLayer  Weights              Glorot Initialization
                           Bias                 Zeros Initialization
transposedConv2dLayer      Weights              Glorot Initialization
                           Bias                 Zeros Initialization
transposedConv3dLayer      Weights              Glorot Initialization
                           Bias                 Zeros Initialization
fullyConnectedLayer        Weights              Glorot Initialization
                           Bias                 Zeros Initialization
batchNormalizationLayer    Offset               Zeros Initialization
                           Scale                Ones Initialization
lstmLayer                  Input weights        Glorot Initialization
                           Recurrent weights    Orthogonal Initialization
                           Bias                 Unit Forget Gate Initialization
gruLayer                   Input weights        Glorot Initialization
                           Recurrent weights    Orthogonal Initialization
                           Bias                 Zeros Initialization
wordEmbeddingLayer         Weights              Gaussian Initialization, with mean 0 and standard deviation 0.01
When initializing learnable parameters for model functions, you must specify parameters of the correct size. The size of the learnable parameters depends on the type of deep learning operation.
Operation               Learnable Parameter  Size
batchnorm               Offset
                        Scale
dlconv                  Weights
                        Bias                 One of the following:
dlconv (grouped)        Weights
                        Bias                 One of the following:
dltranspconv            Weights
                        Bias                 One of the following:
dltranspconv (grouped)  Weights
                        Bias                 One of the following:
fullyconnect            Weights
                        Bias
gru                     Input weights
                        Recurrent weights
                        Bias
lstm                    Input weights
                        Recurrent weights
                        Bias
The Glorot (also known as Xavier) initializer [1] samples weights from the uniform distribution with bounds $$\left[-\sqrt{\frac{6}{{N}_{o}+{N}_{i}}},\sqrt{\frac{6}{{N}_{o}+{N}_{i}}}\right]$$, where the values of N_{o} and N_{i} depend on the type of deep learning operation.
Operation               Learnable Parameter  N_{o}                                        N_{i}
dlconv                  Weights
dlconv (grouped)        Weights
dltranspconv            Weights
dltranspconv (grouped)  Weights
fullyconnect            Weights              Number of output channels of the operation   Number of input channels of the operation
gru                     Input weights        3*numHiddenUnits                             Number of input channels of the operation
                        Recurrent weights    3*numHiddenUnits                             Number of hidden units of the operation
lstm                    Input weights        4*numHiddenUnits                             Number of input channels of the operation
                        Recurrent weights    4*numHiddenUnits                             Number of hidden units of the operation
Here, numHiddenUnits is the number of hidden units of the operation.
To initialize learnable parameters using the Glorot initializer, you can define
a custom function. The function initializeGlorot takes as input the size of the
learnable parameters sz and the values N_{o} and N_{i} (numOut and numIn,
respectively), and returns the sampled weights as a dlarray object with
underlying type 'single'.
function weights = initializeGlorot(sz,numOut,numIn)
% Sample uniformly from [-1, 1], then scale to [-bound, bound].
Z = 2*rand(sz,'single') - 1;
bound = sqrt(6 / (numIn + numOut));
weights = bound * Z;
weights = dlarray(weights);
end
Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.
filterSize = [5 5];
numChannels = 3;
numFilters = 128;
sz = [filterSize numChannels numFilters];
numOut = prod(filterSize) * numFilters;
% The fan-in depends on the number of input channels, not the number of filters.
numIn = prod(filterSize) * numChannels;
parameters.conv.Weights = initializeGlorot(sz,numOut,numIn);
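As a quick sanity check on this example, the following sketch computes the Glorot bound by hand and confirms that every sampled weight lies inside it. It assumes the usual fan-out and fan-in for a convolution (filter area times the number of filters, and filter area times the number of input channels), and it repeats the sampling steps inline so that it runs on its own.

```matlab
% Sanity check (sketch): Glorot bound for 128 filters of size 5-by-5 on 3 channels.
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

% Assumed fan-out and fan-in for a convolution operation.
numOut = prod(filterSize) * numFilters;    % 25*128 = 3200
numIn = prod(filterSize) * numChannels;    % 25*3 = 75

bound = sqrt(6 / (numIn + numOut));        % sqrt(6/3275), roughly 0.0428

% Same sampling scheme as initializeGlorot, written inline.
sz = [filterSize numChannels numFilters];
weights = dlarray(bound * (2*rand(sz,'single') - 1));

% Every weight should fall inside [-bound, bound].
w = extractdata(weights);
assert(all(abs(w(:)) <= bound))
```

The bound shrinks as either the fan-in or the fan-out grows, so operations with many filters start with smaller weights.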
The He initializer [2] samples weights from the normal distribution with zero mean and variance $$\frac{2}{{N}_{i}}$$, where the value N_{i} depends on the type of deep learning operation.
Operation      Learnable Parameter  N_{i}
dlconv         Weights
dltranspconv   Weights
fullyconnect   Weights              Number of input channels of the operation
gru            Input weights        Number of input channels of the operation
               Recurrent weights    Number of hidden units of the operation
lstm           Input weights        Number of input channels of the operation
               Recurrent weights    Number of hidden units of the operation
To initialize learnable parameters using the He initializer, you can define a
custom function. The function initializeHe takes as input the size of the
learnable parameters sz and the value N_{i} (numIn), and returns the sampled
weights as a dlarray object with underlying type 'single'.
function weights = initializeHe(sz,numIn)
% Scale unit normal samples so that the variance is 2/numIn.
weights = randn(sz,'single') * sqrt(2/numIn);
weights = dlarray(weights);
end
Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.
filterSize = [5 5];
numChannels = 3;
numFilters = 128;
sz = [filterSize numChannels numFilters];
% The fan-in depends on the number of input channels, not the number of filters.
numIn = prod(filterSize) * numChannels;
parameters.conv.Weights = initializeHe(sz,numIn);
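One way to see what the He initializer does is to compare the spread of the sampled weights with the target standard deviation sqrt(2/N_{i}). The sketch below does this for the same convolution sizes. It assumes a fan-in of filter area times the number of input channels, and it repeats the sampling step inline so that it runs on its own.

```matlab
% Sketch: check the spread of He-initialized weights empirically.
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

% Assumed fan-in for a convolution: filter area times input channels.
numIn = prod(filterSize) * numChannels;   % 75
targetStd = sqrt(2 / numIn);              % roughly 0.163

sz = [filterSize numChannels numFilters];
weights = dlarray(randn(sz,'single') * targetStd);

% With 9600 samples, the sample standard deviation should sit close to the target.
w = extractdata(weights);
sampleStd = std(w(:));
relativeError = abs(sampleStd - targetStd) / targetStd;
fprintf('target %.4f, sample %.4f\n', targetStd, sampleStd);
```

The sample value changes from run to run because the weights are random, but it should stay within a few percent of the target.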
The Gaussian initializer samples weights from a normal distribution.
To initialize learnable parameters using the Gaussian initializer, you can
define a custom function. The function initializeGaussian takes as input the
size of the learnable parameters sz, the distribution mean mu, and the
distribution standard deviation sigma, and returns the sampled weights as a
dlarray object with underlying type 'single'.
function weights = initializeGaussian(sz,mu,sigma)
% Scale and shift unit normal samples to mean mu and standard deviation sigma.
weights = randn(sz,'single')*sigma + mu;
weights = dlarray(weights);
end
Initialize the weights for an embedding operation with a dimension of 300 and vocabulary size of 5000 using the Gaussian initializer with mean 0 and standard deviation 0.01.
embeddingDimension = 300;
vocabularySize = 5000;
mu = 0;
sigma = 0.01;
sz = [embeddingDimension vocabularySize];
parameters.emb.Weights = initializeGaussian(sz,mu,sigma);
The uniform initializer samples weights from a uniform distribution.
To initialize learnable parameters using the uniform initializer, you can
define a custom function. The function initializeUniform takes as input the
size of the learnable parameters sz and the distribution bound bound, and
returns the sampled weights as a dlarray object with underlying type 'single'.
function parameter = initializeUniform(sz,bound)
% Sample uniformly from [-1, 1], then scale to [-bound, bound].
Z = 2*rand(sz,'single') - 1;
parameter = bound * Z;
parameter = dlarray(parameter);
end
Initialize the weights for an attention mechanism with size 100-by-100 and bound 0.1 using the uniform initializer.
sz = [100 100];
bound = 0.1;
parameters.attention.Weights = initializeUniform(sz,bound);
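The uniform and Glorot initializers in this topic use the same sampling step and differ only in how the bound is chosen. The sketch below illustrates this for a 100-by-100 parameter by drawing the same random numbers twice. The fan-out and fan-in values are hypothetical, and the sampling steps are repeated inline so that the example runs on its own.

```matlab
% Sketch: Glorot sampling written as uniform sampling with a computed bound.
numOut = 100;   % hypothetical fan-out
numIn = 100;    % hypothetical fan-in
sz = [numOut numIn];

glorotBound = sqrt(6 / (numIn + numOut));   % sqrt(6/200), roughly 0.173

rng(0);   % fix the seed so both draws use the same random numbers
viaUniform = glorotBound * (2*rand(sz,'single') - 1);

rng(0);
Z = 2*rand(sz,'single') - 1;
viaGlorot = sqrt(6 / (numIn + numOut)) * Z;

assert(isequal(viaUniform, viaGlorot))
```

For this size, the Glorot bound is larger than the fixed bound of 0.1 used in the attention example, so the two choices give different weight ranges.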
The orthogonal initializer returns the orthogonal matrix Q given by the QR decomposition of Z = QR, where Z is sampled from a unit normal distribution and the size of Z matches the size of the learnable parameter.
To initialize learnable parameters using the orthogonal initializer, you can
define a custom function. The function initializeOrthogonal takes as input the
size of the learnable parameters sz, and returns the orthogonal matrix as a
dlarray object with underlying type 'single'.
function parameter = initializeOrthogonal(sz)
Z = randn(sz,'single');
[Q,R] = qr(Z,0);             % economy-size QR decomposition
D = diag(R);
Q = Q * diag(D ./ abs(D));   % multiply each column of Q by the sign of diag(R)
parameter = dlarray(Q);
end
Initialize the recurrent weights for an LSTM operation with 100 hidden units using the orthogonal initializer.
numHiddenUnits = 100;
sz = [4*numHiddenUnits numHiddenUnits];
parameters.lstm.RecurrentWeights = initializeOrthogonal(sz);
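To check that the result is orthogonal in the expected sense, you can measure how far Q'*Q is from the identity matrix. The following sketch repeats the steps of initializeOrthogonal inline for the LSTM sizes used here. The tolerance is an assumption chosen to allow for single-precision round-off.

```matlab
% Sketch: confirm the columns of an orthogonally initialized matrix are orthonormal.
numHiddenUnits = 100;
sz = [4*numHiddenUnits numHiddenUnits];

% Same steps as initializeOrthogonal, written inline.
Z = randn(sz,'single');
[Q,R] = qr(Z,0);              % economy-size QR: Q is 400-by-100
D = diag(R);
Q = Q * diag(D ./ abs(D));    % multiply each column by the sign of diag(R)
weights = dlarray(Q);

% Q'*Q should equal the identity up to single-precision round-off.
Qdata = extractdata(weights);
orthoError = norm(Qdata' * Qdata - eye(numHiddenUnits,'single'), 'fro');
assert(orthoError < 1e-3)   % assumed tolerance
```

Because the matrix is tall (400-by-100), only its columns are orthonormal: Q'*Q is the identity, but Q*Q' is not.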
The unit forget gate initializer initializes the bias for an LSTM operation such that the forget gate components of the bias are ones and the remaining entries are zeros.
To initialize learnable parameters using the unit forget gate initializer, you can
define a custom function. The function initializeUnitForgetGate takes as input
the number of hidden units in the LSTM operation, and returns the bias as a
dlarray object with underlying type 'single'.
function bias = initializeUnitForgetGate(numHiddenUnits)
bias = zeros(4*numHiddenUnits,1,'single');
% The forget gate occupies the second block of numHiddenUnits entries.
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;
bias = dlarray(bias);
end
Initialize the bias of an LSTM operation with 100 hidden units using the unit forget gate initializer.
numHiddenUnits = 100;
parameters.lstm.Bias = initializeUnitForgetGate(numHiddenUnits);
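To see the layout of this bias vector, it can help to use a very small number of hidden units and reshape the vector so that each block of numHiddenUnits entries becomes one column. The sketch below uses a hypothetical value of 3 hidden units and repeats the initialization steps inline so that it runs on its own.

```matlab
% Sketch: inspect the unit forget gate bias for a small LSTM operation.
numHiddenUnits = 3;   % small hypothetical value so the layout is easy to read

bias = zeros(4*numHiddenUnits,1,'single');
bias(numHiddenUnits+1:2*numHiddenUnits) = 1;   % second block of the bias
bias = dlarray(bias);

% One column per block of numHiddenUnits entries.
layout = reshape(extractdata(bias), numHiddenUnits, 4);
disp(layout)
```

Only the second column contains ones, which matches the index range numHiddenUnits+1:2*numHiddenUnits that the function sets.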
To initialize learnable parameters with ones, you can define a custom function.
The function initializeOnes takes as input the size of the learnable parameters
sz, and returns the parameters as a dlarray object with underlying type
'single'.
function parameter = initializeOnes(sz)
parameter = ones(sz,'single');
parameter = dlarray(parameter);
end
Initialize the scale for a batch normalization operation with 128 input channels with ones.
numChannels = 128;
sz = [numChannels 1];
parameters.bn.Scale = initializeOnes(sz);
To initialize learnable parameters with zeros, you can define a custom
function. The function initializeZeros takes as input the size of the learnable
parameters sz, and returns the parameters as a dlarray object with underlying
type 'single'.
function parameter = initializeZeros(sz)
parameter = zeros(sz,'single');
parameter = dlarray(parameter);
end
Initialize the offset for a batch normalization operation with 128 input channels with zeros.
numChannels = 128;
sz = [numChannels 1];
parameters.bn.Offset = initializeZeros(sz);
It is recommended to store the learnable parameters for a given model function in a single object, such as a struct, table, or cell array. For an example showing how to initialize learnable parameters as a struct, see Train Network Using Model Function.
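As a rough illustration of the struct approach, the following sketch collects the parameters of a small hypothetical model (one convolution followed by batch normalization) in a nested struct and counts the learnable values. The layer names conv1 and bn1, the sizes, and the inline helpers glorot and zerosInit are assumptions made for this example. The convolution bias size of numFilters-by-1 is also assumed.

```matlab
% Sketch: collect the learnable parameters of a small hypothetical model in one struct.
% Inline helpers that mirror the Glorot and zeros initializers in this topic.
glorot = @(sz,numOut,numIn) dlarray(sqrt(6/(numIn + numOut)) * (2*rand(sz,'single') - 1));
zerosInit = @(sz) dlarray(zeros(sz,'single'));

% Convolution: 16 filters of size 3-by-3 over 3 input channels (hypothetical sizes).
filterSize = [3 3];
numChannels = 3;
numFilters = 16;
parameters = struct;
parameters.conv1.Weights = glorot([filterSize numChannels numFilters], ...
    prod(filterSize)*numFilters, prod(filterSize)*numChannels);
parameters.conv1.Bias = zerosInit([numFilters 1]);   % assumed bias size

% Batch normalization over the 16 convolution output channels.
parameters.bn1.Offset = zerosInit([numFilters 1]);
parameters.bn1.Scale = dlarray(ones([numFilters 1],'single'));

% Count the learnable values by visiting each field of each operation.
numLearnables = 0;
opNames = fieldnames(parameters);
for i = 1:numel(opNames)
    op = parameters.(opNames{i});
    paramNames = fieldnames(op);
    for j = 1:numel(paramNames)
        numLearnables = numLearnables + numel(op.(paramNames{j}));
    end
end
```

Grouping the parameters by operation name keeps the model function readable, because each operation can look up its own weights with a single field reference such as parameters.conv1.Weights.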
If you train your model using training data that is stored on the GPU, the learnable parameters of the model function are converted to gpuArray objects and stored on the GPU.
Before saving your learnable parameters, it is recommended practice to gather all
parameters, in case they are loaded onto a machine without a GPU. Use dlupdate
to gather learnable parameters stored as a struct, table, or cell array. For
example, if you have network learnable parameters stored on the GPU in the
struct, table, or cell array parameters, you can gather all parameters onto the
CPU by using the following code:
parameters = dlupdate(@gather,parameters);
If you load learnable parameters that are not on the GPU, you can move the
parameters onto the GPU using dlupdate. Doing so ensures that your network
executes on the GPU for training and inference, regardless of where the input
data is stored. To move the learnable parameters onto the GPU, use the dlupdate
function:
parameters = dlupdate(@gpuArray,parameters);
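The two dlupdate calls can be combined into a pattern that also runs on machines without a GPU. The sketch below is one possible arrangement, not the only one. It uses the canUseGPU function, which this topic does not otherwise mention, to decide whether to move the parameters. The struct contents and sizes are hypothetical.

```matlab
% Sketch: move parameters to the GPU only when one is usable, and gather before saving.
parameters = struct;
parameters.fc.Weights = dlarray(zeros(10,5,'single'));   % hypothetical sizes
parameters.fc.Bias = dlarray(zeros(10,1,'single'));

if canUseGPU
    % A supported GPU is available: store the learnable parameters on it.
    parameters = dlupdate(@gpuArray,parameters);
end

% ... training would happen here ...

% Gather so that the saved parameters also load on a machine without a GPU.
parameters = dlupdate(@gather,parameters);
isOnGPU = isa(extractdata(parameters.fc.Weights),'gpuArray');
```

After the gather step, isOnGPU is false on any machine, so the parameters can be saved and reloaded without a GPU.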
[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. Sardinia, Italy: AISTATS, 2010.
[2] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In Proceedings of the 2015 IEEE International Conference on Computer Vision, 1026–1034. Washington, DC: IEEE Computer Vision Society, 2015.
See also: dlarray, dlfeval, dlgradient, dlnetwork