Specify Layers of Convolutional Neural Network

The first step of creating and training a new convolutional neural network (ConvNet) is to define the network architecture. This topic explains the details of ConvNet layers, and the order they appear in a ConvNet. For a complete list of deep learning layers and how to create them, see List of Deep Learning Layers. To learn about LSTM networks for sequence classification and regression, see Long Short-Term Memory Neural Networks. To learn how to create your own custom layers, see Define Custom Deep Learning Layers.

The network architecture can vary depending on the types and numbers of layers included, which in turn depend on the particular application or data. For example, classification networks typically have a softmax layer and a classification layer, whereas regression networks must have a regression layer at the end of the network. A smaller network with only one or two convolutional layers might be sufficient to learn from a small number of grayscale images. On the other hand, for more complex data with millions of color images, you might need a more complicated network with multiple convolutional and fully connected layers.

To specify the architecture of a deep network with all layers connected sequentially, create an array of layers directly. For example, to create a deep network which classifies 28-by-28 grayscale images into 10 classes, specify the layer array

layers = [
    imageInputLayer([28 28 1])            % 28-by-28 grayscale input
    convolution2dLayer(3,16,'Padding',1)  % 16 3-by-3 filters, padding of 1
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)       % nonoverlapping 2-by-2 max pooling
    convolution2dLayer(3,32,'Padding',1)  % 32 3-by-3 filters, padding of 1
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)               % one output per class
    softmaxLayer
    classificationLayer];

layers is an array of Layer objects. You can then use layers as an input to the training function trainNetwork.

A layer array such as the one above suffices when all layers connect sequentially. To specify the architecture of a network where layers can have multiple inputs or outputs, use a LayerGraph object.
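
For example, a minimal training call might look like the following sketch. Here, XTrain (a 28-by-28-by-1-by-N array of images) and YTrain (a categorical vector of labels) are hypothetical training data, and the training options are illustrative only.

options = trainingOptions('sgdm','MaxEpochs',4);  % illustrative options
net = trainNetwork(XTrain,YTrain,layers,options); % XTrain, YTrain are hypothetical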

Image Input Layer

Create an image input layer using imageInputLayer.

An image input layer inputs images to a network and applies data normalization.

Specify the image size using the inputSize argument. The size of an image corresponds to the height, width, and the number of color channels of that image. For example, for a grayscale image, the number of channels is 1, and for a color image it is 3.
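
For example, this layer accepts the 28-by-28 grayscale images used in the earlier layer array; for RGB images, you would specify a third dimension of 3.

layer = imageInputLayer([28 28 1]);  % height, width, number of channels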

Convolutional Layer

A 2-D convolutional layer applies sliding convolutional filters to 2-D input. Create a 2-D convolutional layer using convolution2dLayer.

The convolutional layer consists of various components.¹

Filters and Stride

A convolutional layer consists of neurons that connect to subregions of the input images or the outputs of the previous layer. The layer learns the features localized by these regions while scanning through an image. When creating a layer using the convolution2dLayer function, you can specify the size of these regions using the filterSize input argument.

For each region, the trainnet and trainNetwork functions compute a dot product of the weights and the input, and then add a bias term. A set of weights that is applied to a region in the image is called a filter. The filter moves along the input image vertically and horizontally, repeating the same computation for each region. In other words, the filter convolves the input.

This image shows a 3-by-3 filter scanning through the input. The lower map represents the input and the upper map represents the output.

Animation showing a sliding 3-by-3 filter. At each step, the filter spans a patch of the input image (the lower map) and produces a single pixel of the output image (the upper map). The input is a 4-by-4 image. The output is a 2-by-2 image.

The step size with which the filter moves is called a stride. You can specify the step size with the Stride name-value pair argument. The local regions that the neurons connect to can overlap depending on the filterSize and 'Stride' values.

This image shows a 3-by-3 filter scanning through the input with a stride of 2. The lower map represents the input and the upper map represents the output.

Animation showing a sliding 3-by-3 filter with stride 2. At each step, the filter moves two pixels. The input is a 5-by-5 image. The output is a 2-by-2 image.

The number of weights in a filter is h * w * c, where h and w are the height and width of the filter, respectively, and c is the number of channels in the input. For example, if the input is a color image, the number of color channels is 3. The number of filters determines the number of channels in the output of a convolutional layer. Specify the number of filters using the numFilters argument with the convolution2dLayer function.
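
For example, the following layer (illustrative values) has eight 5-by-5 filters that move in steps of two pixels in each direction:

layer = convolution2dLayer(5,8,'Stride',2);  % filterSize 5, numFilters 8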

Dilated Convolution

A dilated convolution is a convolution in which the filters are expanded by spaces inserted between the elements of the filter. Specify the dilation factor using the 'DilationFactor' property.

Use dilated convolutions to increase the receptive field (the area of the input which the layer can see) of the layer without increasing the number of parameters or computation.

The layer expands the filters by inserting zeros between each filter element. The dilation factor determines the step size for sampling the input or equivalently the upsampling factor of the filter. It corresponds to an effective filter size of (Filter Size – 1) .* Dilation Factor + 1. For example, a 3-by-3 filter with the dilation factor [2 2] is equivalent to a 5-by-5 filter with zeros between the elements.
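
For example, this layer (illustrative values) applies 3-by-3 filters with a dilation factor of [2 2], giving an effective filter size of (3 – 1)*2 + 1 = 5 in each dimension while learning only 3*3 weights per channel per filter:

layer = convolution2dLayer(3,16,'DilationFactor',[2 2]);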

This image shows a 3-by-3 filter dilated by a factor of two scanning through the input. The lower map represents the input and the upper map represents the output.

Animation showing a sliding dilated 3-by-3 filter. The filter spans a 5-by-5 region because there is a one-pixel gap between its elements. The input is a 7-by-7 image. The output is a 3-by-3 image.

Feature Maps

As a filter moves along the input, it uses the same set of weights and the same bias for the convolution, forming a feature map. Each feature map is the result of a convolution using a different set of weights and a different bias. Hence, the number of feature maps is equal to the number of filters. The total number of parameters in a convolutional layer is (h*w*c + 1)*Number of Filters, where the 1 accounts for the bias of each filter.

Padding

You can also apply padding to input image borders vertically and horizontally using the 'Padding' name-value pair argument. Padding consists of values appended to the borders of the input to increase its size. By adjusting the padding, you can control the output size of the layer.
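
For example, the following layers (illustrative values) pad the input with one row or column of zeros on each side, or use 'same' padding so that the output has the same spatial size as the input when the stride is 1:

layer1 = convolution2dLayer(3,16,'Padding',1);       % explicit padding of 1
layer2 = convolution2dLayer(3,16,'Padding','same');  % output size matches input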

This image shows a 3-by-3 filter scanning through the input with padding of size 1. The lower map represents the input and the upper map represents the output.

Animation showing a sliding 3-by-3 filter over a padded image. The input image is padded such that it is one pixel larger in each direction. When the filter slides over the input image, it can cover the padding regions. The input is a 5-by-5 image. The padded input is a 7-by-7 image. The output is a 5-by-5 image.

Output Size

The output height and width of a convolutional layer is (Input Size – ((Filter Size – 1)*Dilation Factor + 1) + 2*Padding)/Stride + 1. This value must be an integer for the whole image to be fully covered. If the combination of these options does not fully cover the image, the software by default ignores the remaining part of the image along the right and bottom edges in the convolution.

Number of Neurons

The product of the output height and width gives the total number of neurons in a feature map, say Map Size. The total number of neurons (output size) in a convolutional layer is Map Size*Number of Filters.

For example, suppose that the input image is a 32-by-32-by-3 color image. For a convolutional layer with eight filters and a filter size of 5-by-5, the number of weights per filter is 5 * 5 * 3 = 75, and the total number of parameters in the layer is (75 + 1) * 8 = 608. If the stride is 2 in each direction and padding of size 2 is specified, then each feature map is 16-by-16. This is because (32 – 5 + 2 * 2)/2 + 1 = 16.5, and some of the outermost padding to the right and bottom of the image is discarded. Finally, the total number of neurons in the layer is 16 * 16 * 8 = 2048.
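
You can reproduce this arithmetic directly in MATLAB; the values below simply restate the worked example:

outputSize = floor((32 - 5 + 2*2)/2) + 1  % feature map is 16-by-16
numWeights = 5*5*3                        % 75 weights per filter
numParams  = (numWeights + 1)*8           % 608 parameters in the layer
numNeurons = outputSize^2*8               % 2048 neurons in the layer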

Usually, the results from these neurons pass through some form of nonlinearity, such as rectified linear units (ReLU).

Learning Parameters

You can adjust the learning rates and regularization options for the layer using name-value pair arguments while defining the convolutional layer. If you choose not to specify these options, then the trainnet and trainNetwork functions use the global training options defined with the trainingOptions function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.
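
For example, this layer (illustrative factor values) learns its weights at twice the global learning rate and applies a nonzero L2 regularization factor to its biases:

layer = convolution2dLayer(3,16, ...
    'WeightLearnRateFactor',2, ...  % doubles the global learning rate for weights
    'BiasL2Factor',1);              % regularizes the biases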

Number of Layers

A convolutional neural network can consist of one or multiple convolutional layers. The number of convolutional layers depends on the amount and complexity of the data.

Batch Normalization Layer

Create a batch normalization layer using batchNormalizationLayer.

A batch normalization layer normalizes a mini-batch of data across all observations for each channel independently. To speed up training of the convolutional neural network and reduce the sensitivity to network initialization, use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.

The layer first normalizes the activations of each channel by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Then, the layer shifts the input by a learnable offset β and scales it by a learnable scale factor γ. Both β and γ are updated during network training.

Batch normalization layers normalize the activations and gradients propagating through a neural network, making network training an easier optimization problem. To take full advantage of this fact, you can try increasing the learning rate. Since the optimization problem is easier, the parameter updates can be larger and the network can learn faster. You can also try reducing the L2 and dropout regularization. With batch normalization layers, the activations of a specific image during training depend on which images happen to appear in the same mini-batch. To take full advantage of this regularizing effect, try shuffling the training data before every training epoch. To specify how often to shuffle the data during training, use the 'Shuffle' name-value pair argument of trainingOptions.
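
For example, the following sketch shows a typical convolution–batch normalization–ReLU block, together with training options that shuffle the data every epoch; the learning rate value is illustrative only:

layers = [
    convolution2dLayer(3,16,'Padding','same')
    batchNormalizationLayer
    reluLayer];
options = trainingOptions('sgdm', ...
    'InitialLearnRate',0.1, ...     % illustrative; batch norm can tolerate larger rates
    'Shuffle','every-epoch');       % reshuffle before every training epoch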

ReLU Layer

Create a ReLU layer using reluLayer.

A ReLU layer performs a threshold operation to each element of the input, where any value less than zero is set to zero.

Convolutional and batch normalization layers are usually followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. A ReLU layer performs a threshold operation to each element, where any input value less than zero is set to zero, that is,

$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0. \end{cases}$$

The ReLU layer does not change the size of its input.
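
The thresholding is equivalent to taking the elementwise maximum with zero. For example:

x = [-2 -0.5 0 1 3];
y = max(x,0)   % returns [0 0 0 1 3]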

There are other nonlinear activation layers that perform different operations and can improve the network accuracy for some applications. For a list of activation layers, see Activation Layers.

Cross Channel Normalization (Local Response Normalization) Layer

Create a cross channel normalization layer using crossChannelNormalizationLayer.

A channel-wise local response (cross-channel) normalization layer carries out channel-wise normalization.

This layer performs a channel-wise local response normalization. It usually follows the ReLU activation layer. This layer replaces each element with a normalized value it obtains using the elements from a certain number of neighboring channels (elements in the normalization window). That is, for each element x in the input, the trainnet and trainNetwork functions compute a normalized value x' using

$$x' = \frac{x}{\left(K + \dfrac{\alpha \, ss}{\text{windowChannelSize}}\right)^{\beta}},$$

where K, α, and β are the hyperparameters in the normalization, and ss is the sum of squares of the elements in the normalization window [2]. You must specify the size of the normalization window using the windowChannelSize argument of the crossChannelNormalizationLayer function. You can also specify the hyperparameters using the Alpha, Beta, and K name-value pair arguments.

The previous normalization formula is slightly different from the one presented in [2]. You can obtain the equivalent formula by multiplying the alpha value by the windowChannelSize.
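
For example, this layer (illustrative hyperparameter values) normalizes over a window of five adjacent channels:

layer = crossChannelNormalizationLayer(5, ...
    'Alpha',0.0001,'Beta',0.75,'K',1);  % hyperparameters of the normalization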

Max and Average Pooling Layers

A 2-D max pooling layer performs downsampling by dividing the input into rectangular pooling regions, then computing the maximum of each region. Create a max pooling layer using maxPooling2dLayer.

A 2-D average pooling layer performs downsampling by dividing the input into rectangular pooling regions, then computing the average of each region. Create an average pooling layer using averagePooling2dLayer.

Pooling layers follow the convolutional layers for down-sampling, hence reducing the number of connections to the following layers. They do not perform any learning themselves, but reduce the number of parameters to be learned in the following layers. They also help reduce overfitting.

A max pooling layer returns the maximum values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of maxPooling2dLayer. For example, if poolSize is [2 3], then the layer returns the maximum value in regions of height 2 and width 3.

An average pooling layer outputs the average values of rectangular regions of its input. The size of the rectangular regions is determined by the poolSize argument of averagePooling2dLayer. For example, if poolSize is [2 3], then the layer returns the average value of regions of height 2 and width 3.

Pooling layers scan through the input horizontally and vertically in step sizes you can specify using the 'Stride' name-value pair argument. If the pool size is smaller than or equal to the stride, then the pooling regions do not overlap.

For nonoverlapping regions (Pool Size and Stride are equal), if the input to the pooling layer is n-by-n, and the pooling region size is h-by-h, then the pooling layer down-samples the regions by h [6]. That is, the output of a max or average pooling layer for one channel of a convolutional layer is n/h-by-n/h. For overlapping regions, the output of a pooling layer is (Input Size – Pool Size + 2*Padding)/Stride + 1.
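
For example, the first layer below performs nonoverlapping 2-by-2 max pooling, which halves the spatial dimensions, and the second (illustrative values) averages over overlapping 3-by-3 regions:

pool1 = maxPooling2dLayer(2,'Stride',2);      % pool size equals stride: no overlap
pool2 = averagePooling2dLayer(3,'Stride',2);  % pool size exceeds stride: overlap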

Dropout Layer

Create a dropout layer using dropoutLayer.

A dropout layer randomly sets input elements to zero with a given probability.

At training time, the layer randomly sets input elements to zero using the dropout mask rand(size(X)) < Probability, where X is the layer input, and then scales the remaining elements by 1/(1-Probability). This operation effectively changes the underlying network architecture between iterations and helps prevent the network from overfitting [7], [2]. A higher probability results in more elements being dropped during training. At prediction time, the output of the layer is equal to its input.

Similar to max or average pooling layers, no learning takes place in this layer.
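
For example, this layer drops 50% of its input elements at random during training:

layer = dropoutLayer(0.5);  % dropout probability of 0.5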

Fully Connected Layer

Create a fully connected layer using fullyConnectedLayer.

A fully connected layer multiplies the input by a weight matrix and then adds a bias vector.

As the name suggests, all neurons in a fully connected layer connect to all the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. This is the reason that the outputSize argument of the last fully connected layer of the network is equal to the number of classes of the data set. For regression problems, the output size must be equal to the number of response variables.
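
For example, for a hypothetical 10-class classification problem, the last fully connected layer has an output size of 10:

layer = fullyConnectedLayer(10);  % outputSize equals the number of classes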

You can also adjust the learning rate and the regularization parameters for this layer using the related name-value pair arguments when creating the fully connected layer. If you choose not to adjust them, then the software uses the global training parameters defined by the trainingOptions function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.

If the input to the layer is a sequence (for example, in an LSTM network), then the fully connected layer acts independently on each time step. For example, if the layer before the fully connected layer outputs an array X of size D-by-N-by-S, then the fully connected layer outputs an array Z of size outputSize-by-N-by-S. At time step t, the corresponding entry of Z is $WX_t + b$, where $X_t$ denotes time step t of X.

Output Layers

Softmax and Classification Layers

A softmax layer applies a softmax function to the input. Create a softmax layer using softmaxLayer.

A classification layer computes the cross-entropy loss for classification and weighted classification tasks with mutually exclusive classes. Create a classification layer using classificationLayer.

For classification problems, a softmax layer and then a classification layer usually follow the final fully connected layer.
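
For example, the final layers of a classification network typically look like the following, where numClasses is a hypothetical variable holding the number of classes:

numClasses = 10;  % hypothetical number of classes
finalLayers = [
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];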

The output unit activation function is the softmax function:

$$y_r(x) = \frac{\exp(a_r(x))}{\sum_{j=1}^{k}\exp(a_j(x))},$$

where $0 \le y_r \le 1$ and $\sum_{j=1}^{k} y_j = 1$.

The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:

$$P(c_r \mid x,\theta) = \frac{P(x,\theta \mid c_r)\,P(c_r)}{\sum_{j=1}^{k} P(x,\theta \mid c_j)\,P(c_j)} = \frac{\exp(a_r(x,\theta))}{\sum_{j=1}^{k}\exp(a_j(x,\theta))},$$

where $0 \le P(c_r \mid x,\theta) \le 1$ and $\sum_{j=1}^{k} P(c_j \mid x,\theta) = 1$. Moreover, $a_r = \ln\big(P(x,\theta \mid c_r)\,P(c_r)\big)$, where $P(x,\theta \mid c_r)$ is the conditional probability of the sample given class $r$, and $P(c_r)$ is the class prior probability.

The softmax function is also known as the normalized exponential and can be considered the multi-class generalization of the logistic sigmoid function [8].

For typical classification networks, the classification layer usually follows a softmax layer. In the classification layer, trainNetwork takes the values from the softmax function and assigns each input to one of the K mutually exclusive classes using the cross entropy function for a 1-of-K coding scheme [8]:

$$\text{loss} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{K} w_i \, t_{ni} \ln y_{ni},$$

where N is the number of samples, K is the number of classes, wi is the weight for class i, tni is the indicator that the nth sample belongs to the ith class, and yni is the output for sample n for class i, which in this case, is the value from the softmax function. In other words, yni is the probability that the network associates the nth input with class i.
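
As a sketch, you can evaluate this loss directly for a small example. The softmax outputs, one-hot targets, and class weights below are hypothetical values:

Y = [0.7 0.2 0.1; 0.1 0.1 0.8];    % softmax outputs, one row per sample
T = [1 0 0; 0 0 1];                % targets in 1-of-K coding
w = [1 1 1];                       % class weights
loss = -mean(sum(w.*T.*log(Y),2))  % weighted cross-entropy loss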

Regression Layer

Create a regression layer using regressionLayer.

A regression layer computes the half-mean-squared-error loss for regression tasks. For typical regression problems, a regression layer must follow the final fully connected layer.
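
For example, for a hypothetical problem with a single response variable, the final layers of a regression network look like this:

finalLayers = [
    fullyConnectedLayer(1)   % output size equals the number of responses
    regressionLayer];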

For a single observation, the mean-squared-error is given by:

$$\text{MSE} = \sum_{i=1}^{R} \frac{(t_i - y_i)^2}{R},$$

where R is the number of responses, ti is the target output, and yi is the network’s prediction for response i.

For image and sequence-to-one regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses, not normalized by R:

$$\text{loss} = \frac{1}{2}\sum_{i=1}^{R}(t_i - y_i)^2.$$

For image-to-image regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each pixel, not normalized by R:

$$\text{loss} = \frac{1}{2}\sum_{p=1}^{HWC}(t_p - y_p)^2,$$

where H, W, and C denote the height, width, and number of channels of the output respectively, and p indexes into each element (pixel) of t and y linearly.

For sequence-to-sequence regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses for each time step, not normalized by R:

$$\text{loss} = \frac{1}{2S}\sum_{i=1}^{S}\sum_{j=1}^{R}(t_{ij} - y_{ij})^2,$$

where S is the sequence length.

When training, the software calculates the mean loss over the observations in the mini-batch.

References

[1] Murphy, K. P. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press, 2012.

[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60, no. 6 (May 24, 2017): 84–90. https://doi.org/10.1145/3065386.

[3] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., et al. "Handwritten Digit Recognition with a Back-propagation Network." In Advances in Neural Information Processing Systems, 1990.

[4] LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based Learning Applied to Document Recognition." Proceedings of the IEEE. Vol. 86, pp. 2278–2324, 1998.

[5] Nair, V., and G. E. Hinton. "Rectified Linear Units Improve Restricted Boltzmann Machines." In Proc. 27th International Conference on Machine Learning, 2010.

[6] Nagi, J., F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. "Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition." IEEE International Conference on Signal and Image Processing Applications (ICSIPA2011), 2011.

[7] Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." The Journal of Machine Learning Research 15, no. 1 (January 1, 2014): 1929–58.

[8] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.

[9] Ioffe, Sergey, and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Preprint, submitted March 2, 2015. https://arxiv.org/abs/1502.03167.

¹ Image credit: Convolution arithmetic (License)