Create YOLO v2 Object Detection Network
This example shows how to modify a pretrained MobileNet v2 network to create a YOLO v2 object detection network.
The procedure to convert a pretrained network into a YOLO v2 network is similar to the transfer learning procedure for image classification:
Load the pretrained network.
Select a layer from the pretrained network to use for feature extraction.
Remove all layers after the feature extraction layer.
Add new layers to support the object detection task.
Load Pretrained Network
Load a pretrained MobileNet v2 network using mobilenetv2. This requires the Deep Learning Toolbox Model for MobileNet v2 Network™ support package. If this support package is not installed, then the function provides a download link. After you load the network, convert the network into a layerGraph object so that you can manipulate the layers.
net = mobilenetv2();
lgraph = layerGraph(net);
Update Network Input Size
Update the network input size to meet the training data requirements. For example, assume the training data are 300-by-300 RGB images. Set the input size.
imageInputSize = [300 300 3];
Next, create a new image input layer with the same name as the original layer.
imgLayer = imageInputLayer(imageInputSize,"Name","input_1")
imgLayer = 
  ImageInputLayer with properties:

                      Name: 'input_1'
                 InputSize: [300 300 3]

   Hyperparameters
          DataAugmentation: 'none'
             Normalization: 'zerocenter'
    NormalizationDimension: 'auto'
                      Mean:
Replace the old image input layer with the new image input layer.
lgraph = replaceLayer(lgraph,"input_1",imgLayer);
Select Feature Extraction Layer
A YOLO v2 feature extraction layer is most effective when the output feature width and height are between 8 and 16 times smaller than the input image. This amount of downsampling is a trade-off between spatial resolution and output-feature quality. You can use the analyzeNetwork function or the Deep Network Designer app to determine the output sizes of layers within a network. Note that selecting an optimal feature extraction layer requires empirical evaluation.
Set the feature extraction layer to "block_12_add". The output size of this layer is about 16 times smaller than the input image size of 300-by-300.
featureExtractionLayer = "block_12_add";
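To make the "about 16 times smaller" figure concrete, the following sketch estimates the feature map size by hand. It assumes that four stride-2 stages precede the feature extraction layer and that each stage rounds the spatial size up, as "same" padding would. Both are assumptions made for illustration, so confirm the actual sizes with analyzeNetwork.

```matlab
% Rough estimate of the spatial output size at the feature extraction layer.
% Assumption: four stride-2 stages precede the layer, each rounding up.
inputSize = [300 300];
numStride2Stages = 4;                      % assumed; nominal stride is 2^4 = 16

featureSize = inputSize;
for k = 1:numStride2Stages
    featureSize = ceil(featureSize / 2);   % 300 -> 150 -> 75 -> 38 -> 19
end

% Effective downsampling factor relative to the input image.
downsamplingFactor = inputSize ./ featureSize;
fprintf("Feature map: %d-by-%d, downsampling factor: %.1f\n", ...
    featureSize(1), featureSize(2), downsamplingFactor(1));
```

Under these assumptions the feature map is 19-by-19, so the effective factor is 300/19, slightly below the nominal stride of 16.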
Remove Layers After Feature Extraction Layer
Next, remove the layers after the feature extraction layer. You can do so by importing the network into the Deep Network Designer app, manually removing the layers, and exporting the modified network to your workspace.
For this example, load the modified network, which has been added to this example as a supporting file.
modified = load("mobilenetv2Block12Add.mat");
lgraph = modified.mobilenetv2Block12Add;
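As a possible alternative to the app or the supporting file, the layers can also be removed programmatically with removeLayers. The following is a minimal sketch, not the method used by this example. It starts from the network as originally loaded, and it assumes that lgraph.Layers is ordered so that every layer listed after the feature extraction layer is downstream of it. Check the result with analyzeNetwork before relying on it.

```matlab
% Sketch: remove every layer listed after the feature extraction layer.
% Assumption: all entries after featureExtractionLayer in lgraph.Layers
% are downstream of it, so removing them leaves a valid graph.
net = mobilenetv2();
lgraph = layerGraph(net);
featureExtractionLayer = "block_12_add";

% Collect the layer names and locate the feature extraction layer.
layerNames = {lgraph.Layers.Name};
idx = find(strcmp(layerNames, featureExtractionLayer));

% Remove all layers that follow it in the layer list.
layersToRemove = layerNames(idx+1:end);
lgraph = removeLayers(lgraph, layersToRemove);
```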
Create YOLO v2 Detection Sub-Network
The detection subnetwork consists of groups of serially connected convolution, batch normalization, and ReLU layers. These layers are followed by a yolov2TransformLayer and a yolov2OutputLayer.
First, create two groups of serially connected convolution, batch normalization, and ReLU layers. Set the convolution layer filter size to 3-by-3 and the number of filters to match the number of channels in the feature extraction layer output. Specify "same" padding in the convolution layer to preserve the input size.
filterSize = [3 3];
numFilters = 96;

detectionLayers = [
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv1","Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch1")
    reluLayer("Name","yolov2Relu1")
    convolution2dLayer(filterSize,numFilters,"Name","yolov2Conv2","Padding", "same", "WeightsInitializer",@(sz)randn(sz)*0.01)
    batchNormalizationLayer("Name","yolov2Batch2")
    reluLayer("Name","yolov2Relu2")
    ]
detectionLayers = 
  6x1 Layer array with layers:

     1   'yolov2Conv1'    Convolution           96 3x3 convolutions with stride [1 1] and padding 'same'
     2   'yolov2Batch1'   Batch Normalization   Batch normalization
     3   'yolov2Relu1'    ReLU                  ReLU
     4   'yolov2Conv2'    Convolution           96 3x3 convolutions with stride [1 1] and padding 'same'
     5   'yolov2Batch2'   Batch Normalization   Batch normalization
     6   'yolov2Relu2'    ReLU                  ReLU
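The "WeightsInitializer" function handle in the code above receives the size of the weights array and returns the initial weights. The following sketch calls the same anonymous function directly to show what it produces. The weights size used here, [3 3 96 96], is an assumption for illustration, based on a 3-by-3 filter with 96 input channels and 96 filters.

```matlab
% The custom initializer maps a size vector to Gaussian random weights
% scaled so that their standard deviation is about 0.01.
rng(0);                                  % fixed seed so the check is repeatable
initializer = @(sz) randn(sz)*0.01;

% Assumed weights size: [height width inputChannels numFilters].
weightsSize = [3 3 96 96];
W = initializer(weightsSize);

disp(size(W))
fprintf("Sample standard deviation: %.4f\n", std(W(:)));
```

Small initial weights of this kind keep the early outputs of the new detection layers close to zero, which is a common choice when new layers are attached to a pretrained backbone.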
Next, create the final portion of the detection subnetwork, which has a convolution layer followed by a yolov2TransformLayer and a yolov2OutputLayer. The output of the convolution layer predicts the following for each anchor box:
The object class probabilities.
The x and y location offset.
The width and height offset.
Specify the anchor boxes and number of classes and compute the number of filters for the convolution layer.
numClasses = 5;

anchorBoxes = [
    16 16
    32 16
    ];

numAnchors = size(anchorBoxes,1);
numPredictionsPerAnchor = 5;
numFiltersInLastConvLayer = numAnchors*(numClasses+numPredictionsPerAnchor);
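The filter count above can be checked by hand. The following sketch repeats the arithmetic with the anchor boxes written as an explicit 2-by-2 matrix, one row per anchor box. The breakdown of the five predictions per anchor into four box terms plus one objectness score is an interpretation of the usual YOLO v2 formulation rather than something this example states, and the 19-by-19 grid size is an assumed value used only for illustration.

```matlab
% Worked check of the number of filters in the last convolution layer.
numClasses = 5;
anchorBoxes = [16 16; 32 16];            % one row per anchor box
numAnchors = size(anchorBoxes,1);        % 2 anchor boxes

% Assumed breakdown: x, y, width, height offsets plus one objectness score.
numPredictionsPerAnchor = 5;

% 2 anchors * (5 classes + 5 predictions) = 20 filters.
numFiltersInLastConvLayer = numAnchors*(numClasses + numPredictionsPerAnchor);

% Assumed grid size; the prediction tensor then has one channel per filter.
gridSize = [19 19];
predictionSize = [gridSize numFiltersInLastConvLayer];
disp(predictionSize)
```

Note that size(anchorBoxes,1) counts rows, so writing the anchor boxes as a single row of four numbers would give one anchor instead of two.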
Add the convolution layer, yolov2TransformLayer, and yolov2OutputLayer to the detection subnetwork.
detectionLayers = [
    detectionLayers
    convolution2dLayer(1,numFiltersInLastConvLayer,"Name","yolov2ClassConv",...
    "WeightsInitializer", @(sz)randn(sz)*0.01)
    yolov2TransformLayer(numAnchors,"Name","yolov2Transform")
    yolov2OutputLayer(anchorBoxes,"Name","yolov2OutputLayer")
    ]
detectionLayers = 
  9x1 Layer array with layers:

     1   'yolov2Conv1'         Convolution              96 3x3 convolutions with stride [1 1] and padding 'same'
     2   'yolov2Batch1'        Batch Normalization      Batch normalization
     3   'yolov2Relu1'         ReLU                     ReLU
     4   'yolov2Conv2'         Convolution              96 3x3 convolutions with stride [1 1] and padding 'same'
     5   'yolov2Batch2'        Batch Normalization      Batch normalization
     6   'yolov2Relu2'         ReLU                     ReLU
     7   'yolov2ClassConv'     Convolution              20 1x1 convolutions with stride [1 1] and padding [0 0 0 0]
     8   'yolov2Transform'     YOLO v2 Transform Layer  YOLO v2 Transform Layer with 2 anchors
     9   'yolov2OutputLayer'   YOLO v2 Output           YOLO v2 Output with 2 anchors
Complete YOLO v2 Detection Network
Attach the detection subnetwork to the feature extraction network.
lgraph = addLayers(lgraph,detectionLayers);
lgraph = connectLayers(lgraph,featureExtractionLayer,"yolov2Conv1");
Use analyzeNetwork(lgraph) to check the network, and then train a YOLO v2 object detector.