Deploy Semantic Segmentation Network Using Dilated Convolutions on FPGA

This example shows how to deploy a trained semantic segmentation network that uses dilated convolutions to a Xilinx® Zynq® UltraScale+™ ZCU102 SoC development kit. Semantic segmentation networks like DeepLab [1] make extensive use of dilated convolutions, also known as atrous convolutions, because they can increase the receptive field of a layer without increasing the number of parameters or computations.
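To make the receptive-field claim concrete, here is a minimal sketch in plain MATLAB that is not part of the shipped example. It builds a dilated 3-by-3 kernel by spacing the taps d pixels apart. The variable names are illustrative. The effective extent grows as k + (k-1)(d-1), while the number of nonzero weights stays at nine.

```matlab
% Illustrative sketch: effect of the dilation factor on a 3-by-3 kernel.
baseKernel = ones(3);                      % 9 weights
k = size(baseKernel,1);
for d = [1 2 4]                            % dilation factors chosen for illustration
    kEff = k + (k-1)*(d-1);                % effective extent of the dilated kernel
    dilatedKernel = zeros(kEff);
    dilatedKernel(1:d:end,1:d:end) = baseKernel;   % taps spaced d pixels apart
    fprintf('dilation %d: extent %d-by-%d, nonzero weights %d\n', ...
        d, kEff, kEff, nnz(dilatedKernel));
end
```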

A semantic segmentation network classifies every pixel in an image, which results in an image that is segmented by class. Applications for semantic segmentation include road segmentation for autonomous driving and cancer cell segmentation for medical diagnosis.

The network attached to this example was created in the example Getting Started with Semantic Segmentation Using Deep Learning (Computer Vision Toolbox).

To improve frames per second (FPS) performance, you can quantize the network. This example shows how to calibrate and deploy a quantized network and then compare its performance with that of a single data type network.

Load the Pretrained Network

To load the pretrained semantic segmentation network, enter:

load("trainedSemanticSegmentationNet.mat")
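Calling load without an output places every variable from the MAT-file directly into the workspace. A slightly more defensive variant is sketched below. It assumes the file stores the network in a variable named net, which is the name the rest of this example uses. The name data is illustrative.

```matlab
% Sketch: load into a struct so existing workspace variables are not overwritten.
data = load('trainedSemanticSegmentationNet.mat');
assert(isfield(data,'net'), 'Expected a variable named net in the MAT-file.');
net = data.net;   % network object used by the rest of the example
```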

Use the analyzeNetwork function to view a graphical representation of the network and detailed parameter settings for the layers in the network.

analyzeNetwork(net)
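As a text-only alternative to the graphical view, the dilation factors can also be listed programmatically. This is a minimal sketch, assuming net is already in the workspace and that its convolution layers are standard Convolution2DLayer objects. The variable names idx, layer, and numConv are illustrative.

```matlab
% Sketch: print the dilation factor of every 2-D convolution layer in net.
numConv = 0;
for idx = 1:numel(net.Layers)
    layer = net.Layers(idx);
    if isa(layer,'nnet.cnn.layer.Convolution2DLayer')
        numConv = numConv + 1;
        fprintf('%-8s dilation [%d %d]\n', layer.Name, layer.DilationFactor);
    end
end
```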

Define FPGA Board Interface

Define the target FPGA board programming interface by using the dlhdl.Target object. Specify that the interface is for a Xilinx board with an Ethernet interface.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

To use the JTAG interface, install Xilinx™ Vivado™ Design Suite 2022.1. To set the Xilinx Vivado tool path, enter:

hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2022.1\bin\vivado.bat');
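If you choose JTAG instead of Ethernet, the target object would be created along these lines. This is a sketch, assuming the board is connected over JTAG and the tool path above has been set. The name hTargetJtag is illustrative; the rest of this example continues to use the Ethernet target hTarget.

```matlab
% Sketch: target object that uses the JTAG interface instead of Ethernet.
hTargetJtag = dlhdl.Target('Xilinx','Interface','JTAG');
```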

Prepare Network for Deployment

Prepare the network for deployment by creating a dlhdl.Workflow object and specifying the network and bitstream name. Ensure that the bitstream name matches the data type and FPGA board. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses a single data type.

wfObj = dlhdl.Workflow('Network', net, 'Bitstream', 'zcu102_single', 'Target', hTarget);

To run the example on a Xilinx ZC706 board, enter:

wfObj = dlhdl.Workflow('Network', net, 'Bitstream', 'zc706_single', 'Target', hTarget);

Compile Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment. Because the total number of frames exceeds the default value of 30, set the InputFrameNumberLimit to 64 to run predictions in chunks of 64 frames to prevent timeouts.

dn = compile(wfObj,'InputFrameNumberLimit',64)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input                  32×32×1 images with 'zerocenter' normalization                                         (SW Layer)
     2   'conv_1'        2-D Convolution              32 3×3×1 convolutions with stride [1  1] and padding 'same'                            (HW Layer)
     3   'relu_1'        ReLU                         ReLU                                                                                   (HW Layer)
     4   'conv_2'        2-D Convolution              32 3×3×32 convolutions with stride [1  1], dilation factor [2  2], and padding 'same'  (HW Layer)
     5   'relu_2'        ReLU                         ReLU                                                                                   (HW Layer)
     6   'conv_3'        2-D Convolution              32 3×3×32 convolutions with stride [1  1], dilation factor [4  4], and padding 'same'  (HW Layer)
     7   'relu_3'        ReLU                         ReLU                                                                                   (HW Layer)
     8   'conv_4'        2-D Convolution              2 1×1×32 convolutions with stride [1  1] and padding [0  0  0  0]                      (HW Layer)
     9   'softmax'       Softmax                      softmax                                                                                (SW Layer)
    10   'classoutput'   Pixel Classification Layer   Class weighted cross-entropy loss with classes 'triangle' and 'background'             (SW Layer)
                                                                                                                                           
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.PixelClassificationLayer' is implemented in software.
### Compiling layer group: conv_1>>conv_4 ...
### Compiling layer group: conv_1>>conv_4 ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"        
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"        
    "EndOffset"                 "0x03000000"     "Total: 48.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {{}  [-247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 0 -247.8812 0 0 … ]}
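The layer listing in the compile log shows dilation factors of 1, 2, and 4 on the three 3-by-3 convolutions. As a rough illustration of what that buys, the sketch below computes the receptive field after each convolution, assuming stride 1 throughout, as the listing reports. The variable names are illustrative.

```matlab
% Kernel sizes and dilation factors copied from the layer listing (conv_1 to conv_4).
kernelSize = [3 3 3 1];
dilation   = [1 2 4 1];

% With stride 1, each layer widens the receptive field by (k-1)*d pixels.
receptiveField = 1 + cumsum((kernelSize - 1) .* dilation);

% Same stack without dilation, for comparison.
receptiveFieldPlain = 1 + cumsum(kernelSize - 1);

disp(receptiveField)        % per-layer receptive field with dilation
disp(receptiveFieldPlain)   % per-layer receptive field without dilation
```

With dilation, the final receptive field is 15-by-15 pixels instead of 7-by-7, with the same number of weights.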

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy method of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board and download the network weights and biases. While it runs, the deploy function displays progress messages and the time required to deploy the network.

deploy(wfObj)
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 09-Nov-2022 12:03:53

Load Test Image

Read the example image.

imgTest = imread('triangleTest.jpg');
figure
imshow(imgTest)

Run Prediction for Test Image

Segment the test image using semanticseg_FPGA and display the results using labeloverlay.

networkInputSize = net.Layers(1).InputSize(1:2);
imgTestSize = size(imgTest);

% Each image dimension must be an exact multiple of the network input size.
assert(all(mod(imgTestSize(1:2), networkInputSize) == 0), 'The 2D image input size should be a multiple of network input size');

numberOfBlocks = imgTestSize(1:2)./networkInputSize;
totalBlocks = prod(numberOfBlocks);

% Split the image into tiles of networkInputSize(1)-by-networkInputSize(2) pixels.
splitImage = mat2cell(imgTest, networkInputSize(1)*ones(1, numberOfBlocks(1)), networkInputSize(2)*ones(1, numberOfBlocks(2)));
multiFrameInput = zeros([networkInputSize 1 totalBlocks]);
for i=1:totalBlocks
    multiFrameInput(:,:,:,i) = splitImage{i};
end


result = semanticseg_FPGA(multiFrameInput, wfObj.Network, wfObj);
### Finished writing input activations.
### Running in multi-frame mode with 64 inputs.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     254070                  0.00115                      64           16265936            865.6
    imageinput_norm           7574                  0.00003 
    conv_1                   26062                  0.00012 
    conv_2                   97124                  0.00044 
    conv_3                   97116                  0.00044 
    conv_4                   26175                  0.00012 
 * The clock frequency of the DL processor is: 220MHz

The performance of the single data type network is 865.6 frames per second. Concatenate the images to highlight the triangles identified in the input image.
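The Frames/s value is consistent with the other columns of the profiler table: the number of frames multiplied by the clock frequency, divided by the total latency in cycles. The sketch below redoes that arithmetic with the values copied from the table. The variable names are illustrative.

```matlab
% Values copied from the profiler table for the single data type network.
clockHz         = 220e6;      % DL processor clock frequency
totalCycles     = 16265936;   % Total Latency for all frames, in cycles
numFrames       = 64;         % FramesNum
lastFrameCycles = 254070;     % LastFrameLatency(cycles)

framesPerSecond  = numFrames * clockHz / totalCycles;
lastFrameSeconds = lastFrameCycles / clockHz;
fprintf('%.1f frames/s, last frame latency %.5f s\n', framesPerSecond, lastFrameSeconds);
```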

concatenatedResult = [];
for i=1:numberOfBlocks(2)
    subset = result(:,:,numberOfBlocks(1)*(i-1)+1:i*numberOfBlocks(1));
    verticalConcatenation = [];
    for j=1:numberOfBlocks(1)
        verticalConcatenation = [verticalConcatenation; subset(:,:,j)];
    end
    concatenatedResult = [concatenatedResult verticalConcatenation];
end

croppedFinal = labeloverlay(imgTest, concatenatedResult);
figure
imshow(croppedFinal)
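The split and concatenation steps rely on the same column-major tile order. The following self-contained sketch checks that round trip on a synthetic matrix. The 64-by-96 size, the 32-by-32 tile size, and all variable names are illustrative and are not taken from the example data.

```matlab
% Synthetic 64-by-96 "image" and a 32-by-32 tile size (illustrative values).
img = reshape(1:64*96, 64, 96);
tileSize = [32 32];
numBlocks = size(img) ./ tileSize;            % [2 3]

% Split into tiles; linear indexing into the cell array is column-major.
tiles = mat2cell(img, tileSize(1)*ones(1,numBlocks(1)), tileSize(2)*ones(1,numBlocks(2)));
frames = zeros([tileSize 1 prod(numBlocks)]);
for i = 1:prod(numBlocks)
    frames(:,:,1,i) = tiles{i};
end

% Reassemble: each group of numBlocks(1) frames forms one column of tiles.
rebuilt = [];
for c = 1:numBlocks(2)
    column = [];
    for r = 1:numBlocks(1)
        column = [column; frames(:,:,1,(c-1)*numBlocks(1)+r)]; %#ok<AGROW>
    end
    rebuilt = [rebuilt column]; %#ok<AGROW>
end
assert(isequal(rebuilt, img))   % the round trip reproduces the input
```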

Prepare the Quantized Network for Deployment

Load the data. The data set includes 32-by-32 triangle images.

dataFolder = fullfile(toolboxdir('vision'),'visiondata','triangleImages');
imageFolderTrain = fullfile(dataFolder,'trainingImages');

Create an image datastore for the images by using an imageDatastore object.

imdsTrain = imageDatastore(imageFolderTrain);

Create a quantized network by using dlquantizer. Set the target execution environment to FPGA.

dlQuantObj = dlquantizer(net,'ExecutionEnvironment','FPGA');

Use the calibrate function to exercise the network and collect range information for the learnable parameters in the network layers.

dlQuantObj.calibrate(imdsTrain); 
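The calibrate function also returns the collected range statistics as a table, which can be useful for spotting layers with unusually wide dynamic ranges. A minimal sketch, meant as an alternative to the call above rather than an extra step, and assuming dlQuantObj and imdsTrain exist as defined earlier (calResults is an illustrative name):

```matlab
% Sketch: keep the calibration statistics that calibrate returns.
calResults = calibrate(dlQuantObj, imdsTrain);

% Show the first few rows of the returned table.
disp(calResults(1:min(5,height(calResults)), :))
```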

Prepare the network for deployment by creating a dlhdl.Workflow object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and FPGA board. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses an int8 data type.

wfObj_int8 = dlhdl.Workflow('Network', dlQuantObj, 'Bitstream', 'zcu102_int8', 'Target', hTarget);

Compile and Deploy the Quantized Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment. Because the total number of frames exceeds the default value of 30, set the InputFrameNumberLimit to 64 to run predictions in chunks of 64 frames to prevent timeouts.

dn = compile(wfObj_int8,'InputFrameNumberLimit',64)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'imageinput'    Image Input                  32×32×1 images with 'zerocenter' normalization                                         (SW Layer)
     2   'conv_1'        2-D Convolution              32 3×3×1 convolutions with stride [1  1] and padding 'same'                            (HW Layer)
     3   'relu_1'        ReLU                         ReLU                                                                                   (HW Layer)
     4   'conv_2'        2-D Convolution              32 3×3×32 convolutions with stride [1  1], dilation factor [2  2], and padding 'same'  (HW Layer)
     5   'relu_2'        ReLU                         ReLU                                                                                   (HW Layer)
     6   'conv_3'        2-D Convolution              32 3×3×32 convolutions with stride [1  1], dilation factor [4  4], and padding 'same'  (HW Layer)
     7   'relu_3'        ReLU                         ReLU                                                                                   (HW Layer)
     8   'conv_4'        2-D Convolution              2 1×1×32 convolutions with stride [1  1] and padding [0  0  0  0]                      (HW Layer)
     9   'softmax'       Softmax                      softmax                                                                                (SW Layer)
    10   'classoutput'   Pixel Classification Layer   Class weighted cross-entropy loss with classes 'triangle' and 'background'             (SW Layer)
                                                                                                                                           
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.PixelClassificationLayer' is implemented in software.
### Compiling layer group: conv_1>>conv_4 ...
### Compiling layer group: conv_1>>conv_4 ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "0.0 MB"        
    "SystemBufferOffset"        "0x00800000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02400000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02800000"     "4.0 MB"        
    "EndOffset"                 "0x02c00000"     "Total: 44.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {}
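The external memory tables in the two compile logs can be sanity-checked by differencing consecutive offsets. The sketch below does this for both bitstreams with the offsets copied from the logs. The variable and helper names are illustrative.

```matlab
% Offsets copied from the two "Allocating external memory buffers" tables.
offsetsSingle = {'0x00000000','0x00400000','0x00800000','0x00c00000', ...
                 '0x02800000','0x02c00000','0x03000000'};
offsetsInt8   = {'0x00000000','0x00400000','0x00800000','0x00800000', ...
                 '0x02400000','0x02800000','0x02c00000'};

% Convert each hexadecimal offset (without the 0x prefix) to megabytes.
toMB = @(hexList) cellfun(@(h) hex2dec(h(3:end)), hexList) / 2^20;

sizesSingleMB = diff(toMB(offsetsSingle));   % buffer sizes, single bitstream
sizesInt8MB   = diff(toMB(offsetsInt8));     % buffer sizes, int8 bitstream

fprintf('single: %s (total %g MB)\n', mat2str(sizesSingleMB), sum(sizesSingleMB));
fprintf('int8:   %s (total %g MB)\n', mat2str(sizesInt8MB), sum(sizesInt8MB));
```

The only difference between the two maps is the scheduler data buffer, which is 4.0 MB for the single bitstream and 0.0 MB for the int8 bitstream.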

To deploy the quantized network on the Xilinx ZCU102 SoC hardware, run the deploy method of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board and download the network weights and biases. While it runs, the deploy function displays progress messages and the time required to deploy the network.

deploy(wfObj_int8)
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 09-Nov-2022 12:06:24

Run Prediction

Segment the test image using semanticseg_FPGA and display the results using labeloverlay.

networkInputSize = net.Layers(1).InputSize(1:2);
imgTestSize = size(imgTest);


% Each image dimension must be an exact multiple of the network input size.
assert(all(mod(imgTestSize(1:2), networkInputSize) == 0), 'The 2D image input size should be a multiple of network input size');

numberOfBlocks = imgTestSize(1:2)./networkInputSize;
totalBlocks = prod(numberOfBlocks);

% Split the image into tiles of networkInputSize(1)-by-networkInputSize(2) pixels.
splitImage = mat2cell(imgTest, networkInputSize(1)*ones(1, numberOfBlocks(1)), networkInputSize(2)*ones(1, numberOfBlocks(2)));
multiFrameInput = zeros([networkInputSize 1 totalBlocks]);
for i=1:totalBlocks
    multiFrameInput(:,:,:,i) = splitImage{i};
end


result_int8 = semanticseg_FPGA(multiFrameInput, wfObj_int8.Network, wfObj_int8);
### Finished writing input activations.
### Running in multi-frame mode with 64 inputs.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      85510                  0.00034                      64            5475905           2921.9
    conv_1                   13576                  0.00005 
    conv_2                   29679                  0.00012 
    conv_3                   29933                  0.00012 
    conv_4                   12303                  0.00005 
 * The clock frequency of the DL processor is: 250MHz

The quantized network has a performance of 2921.9 frames per second. Concatenate the images to highlight the triangles identified in the input image.
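To compare the two runs, note that the int8 bitstream also runs at a higher clock frequency (250 MHz versus 220 MHz), so the throughput gain has two sources. The sketch below separates them with values copied from the two profiler tables. The variable names are illustrative.

```matlab
% Values copied from the two profiler tables.
fpsSingle = 865.6;   cyclesSingle = 254070;   clockSingleHz = 220e6;
fpsInt8   = 2921.9;  cyclesInt8   = 85510;    clockInt8Hz   = 250e6;

speedup    = fpsInt8 / fpsSingle;           % overall throughput gain
cycleRatio = cyclesSingle / cyclesInt8;     % gain from fewer cycles per frame
clockRatio = clockInt8Hz / clockSingleHz;   % gain from the faster clock
fprintf('speedup %.2fx, cycles %.2fx, clock %.2fx\n', speedup, cycleRatio, clockRatio);
```

The overall gain of roughly 3.4x is approximately the product of the cycle-count reduction (about 3.0x) and the clock increase (about 1.1x).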

concatenatedResult_int8 = [];
for i=1:numberOfBlocks(2)
    subset = result_int8(:,:,numberOfBlocks(1)*(i-1)+1:i*numberOfBlocks(1));
    verticalConcatenation_int8 = [];
    for j=1:numberOfBlocks(1)
        verticalConcatenation_int8 = [verticalConcatenation_int8; subset(:,:,j)];
    end
    concatenatedResult_int8 = [concatenatedResult_int8 verticalConcatenation_int8];
end

croppedFinal_int8 = labeloverlay(imgTest, concatenatedResult_int8);
figure
imshow(croppedFinal_int8)
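Beyond comparing the two overlays visually, a per-pixel agreement rate gives a quick numeric check of how much quantization changed the segmentation. The sketch below uses small stand-in label maps so that it runs on its own. In the workflow above you would substitute concatenatedResult and concatenatedResult_int8, assuming both hold label maps of the same size and type.

```matlab
% Stand-in label maps (illustrative); replace with concatenatedResult and
% concatenatedResult_int8 from the example.
labelsSingle = [1 1 2; 2 2 1];
labelsInt8   = [1 1 2; 2 1 1];

% Fraction of pixels on which the two predictions agree.
agreement = mean(labelsSingle(:) == labelsInt8(:));
fprintf('Pixel agreement: %.1f%%\n', 100*agreement);
```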

References

[1] Chen, Liang-Chieh, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” arXiv, August 22, 2018. http://arxiv.org/abs/1802.02611.
