Custom Deep Learning Processor Generation to Meet Performance Requirements

This example shows how to create a custom processor configuration and estimate the performance of a pretrained series network. You can then modify parameters of the custom processor configuration and re-estimate the performance. Once the estimate meets your performance requirements, you can generate a custom bitstream from the custom processor configuration.

Prerequisites

Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model Compression Library
MATLAB Coder Interface for Deep Learning

Load Pretrained Series Network

To load the pretrained series network LogoNet, enter:

net = getLogoNetwork;

Define Training and Validation Data Sets

This example uses the logos_dataset data set, which consists of 320 images. Create an imageDatastore object for the images, then split it so that 70% of the images for each label are used for training and the rest for validation.

curDir = pwd;
unzip('logos_dataset.zip');

imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');

Create Custom Processor Configuration

To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

hPC = dlhdl.ProcessorConfig;
hPC.TargetFrequency = 220;
hPC

hPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                       WeightBlockGeneration: 'off'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'
                            UseVendorLibrary: 'on'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 220
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Estimate LogoNet Performance

To estimate the performance of the LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPC.estimatePerformance(net)

### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1  1] and padding [0  0  0  0]                (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                                             (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1  1] and padding [0  0  0  0]              (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                                             (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1  1] and padding [0  0  0  0]             (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                                             (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2  2] and padding [0  0  0  0]             (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                                             (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    15   'relu_5'        ReLU                    ReLU                                                                             (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    17   'relu_6'        ReLU                    ReLU                                                                             (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer                                                         (HW Layer)
    19   'softmax'       Softmax                 softmax                                                                          (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes                                (SW Layer)
                                                                                                                                
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   39245087                  0.17839                       1           39245087              5.6
    imageinput_norm         275732                  0.00125 
    conv_1                 6836112                  0.03107 
    maxpool_1              3706776                  0.01685 
    conv_2                10461413                  0.04755 
    maxpool_2              1174098                  0.00534 
    conv_3                 9392181                  0.04269 
    maxpool_3              1230834                  0.00559 
    conv_4                 1768564                  0.00804 
    maxpool_4                24482                  0.00011 
    fc_1                   2651287                  0.01205 
    fc_2                   1696631                  0.00771 
    fc_3                     26977                  0.00012 
 * The clock frequency of the DL processor is: 220MHz

The estimated performance is 5.6 Frames/s. To improve the network performance, modify the target frequency, the processor data type, the convolution module thread number, and the fully connected module thread number of the custom processor configuration. For more information about these processor parameters, see getModuleProperty and setModuleProperty.
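
The Frames/s figure follows directly from the cycle count and the clock frequency. As a minimal sketch in plain MATLAB, the values below are copied by hand from the estimator table for the 220 MHz configuration. The variable names are illustrative and are not part of the toolbox API.

```matlab
% Values copied from the estimator table for the 220 MHz configuration.
clockHz       = 220e6;      % DL processor clock frequency
networkCycles = 39245087;   % LastFrameLatency(cycles), Network row

% Per-layer cycle counts, in table order (imageinput_norm ... fc_3).
layerCycles = [275732 6836112 3706776 10461413 1174098 9392181 ...
               1230834 1768564 24482 2651287 1696631 26977];
convIdx = [2 4 6 8];        % positions of conv_1 ... conv_4 in layerCycles

latencySeconds  = networkCycles / clockHz;    % seconds per frame
framesPerSecond = clockHz / networkCycles;    % frames per second
convShare = sum(layerCycles(convIdx)) / sum(layerCycles);

fprintf('Latency %.5f s, %.1f Frames/s, conv share %.1f%%\n', ...
    latencySeconds, framesPerSecond, 100*convShare);
```

With these numbers, the four convolution layers account for roughly 72.5% of the network cycles, which is consistent with adjusting the convolution module first.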

Create Modified Custom Processor Configuration

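% Start from a default configuration, then change the parameters that affect performance.
hPCNew = dlhdl.ProcessorConfig;
hPCNew.TargetFrequency = 300;                              % MHz, up from 220
hPCNew.ProcessorDataType = 'int8';                         % was 'single'
hPCNew.setModuleProperty('conv', 'ConvThreadNumber', 64);  % was 16
hPCNew.setModuleProperty('fc', 'FCThreadNumber', 16);      % was 4
hPCNew.UseVendorLibrary = 'off';                           % was 'on'
hPCNew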

hPCNew = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                       WeightBlockGeneration: 'off'
                            ConvThreadNumber: 64
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 16
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'int8'
                            UseVendorLibrary: 'off'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 300
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Quantize LogoNet Series Network

To quantize the LogoNet network, enter:

imageData = imageDatastore(fullfile(curDir,'logos_dataset'),...
 'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = imageData.subset(1:20);
dlquantObj = dlquantizer(net,'ExecutionEnvironment','FPGA');
dlquantObj.calibrate(imageData_reduced)

Warning: Support for GPU devices with compute capability 6.1 will be removed in a future MATLAB release. For more information on GPU support, see GPU Computing Requirements.

ans=35×5 table
    Optimized Layer Name    Network Layer Name    Learnables / Activations    MinValue       MaxValue
    ____________________    __________________    ________________________    ___________    __________

    "conv_1_Weights"        'conv_1'              "Weights"                   -0.0490        0.0394
    "conv_1_Bias"           'conv_1'              "Bias"                      1.0000         1.0028
    "conv_2_Weights"        'conv_2'              "Weights"                   -0.0555        0.0619
    "conv_2_Bias"           'conv_2'              "Bias"                      -6.1171e-04    0.0023
    "conv_3_Weights"        'conv_3'              "Weights"                   -0.0459        0.0469
    "conv_3_Bias"           'conv_3'              "Bias"                      -0.0014        0.0015
    "conv_4_Weights"        'conv_4'              "Weights"                   -0.0460        0.0510
    "conv_4_Bias"           'conv_4'              "Bias"                      -0.0016        0.0038
    "fc_1_Weights"          'fc_1'                "Weights"                   -0.0514        0.0543
    "fc_1_Bias"             'fc_1'                "Bias"                      -5.2319e-04    8.4454e-04
    "fc_2_Weights"          'fc_2'                "Weights"                   -0.0502        0.0516
    "fc_2_Bias"             'fc_2'                "Bias"                      -0.0018        0.0019
    "fc_3_Weights"          'fc_3'                "Weights"                   -0.0507        0.0468
    "fc_3_Bias"             'fc_3'                "Bias"                      -0.0295        0.0249
      ⋮
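
The calibration output records the minimum and maximum value observed for each learnable. To give a feel for how such a range can translate into an 8-bit representation, here is a generic sketch of symmetric quantization with a power-of-two scale, applied to the fc_1_Weights row. The symmetric range, the power-of-two scale, and the variable names are all assumptions made for illustration. This is not a description of what dlquantizer does internally.

```matlab
% Range copied from the fc_1_Weights row of the calibration output.
minVal = -0.0514;
maxVal =  0.0543;

% Assumption: symmetric int8 range [-127, 127] with a power-of-two scale.
maxAbs   = max(abs([minVal maxVal]));
scaleExp = ceil(log2(maxAbs / 127));   % smallest exponent that avoids clipping
scale    = 2^scaleExp;

qMin = int8(round(minVal / scale));    % quantized representation of minVal
qMax = int8(round(maxVal / scale));    % quantized representation of maxVal
maxRoundingError = scale / 2;          % worst-case error for in-range values

fprintf('scale = 2^%d, range maps to [%d, %d]\n', scaleExp, qMin, qMax);
```

Under these assumptions, a narrower calibrated range allows a smaller scale and therefore a smaller rounding error, which is why a representative calibration data set matters.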

Estimate LogoNet Performance

To estimate the performance of the quantized LogoNet network on the modified configuration, pass the dlquantizer object to the estimatePerformance function of the dlhdl.ProcessorConfig object. As before, the function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPCNew.estimatePerformance(dlquantObj)

### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1  1] and padding [0  0  0  0]                (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                                             (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1  1] and padding [0  0  0  0]              (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                                             (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1  1] and padding [0  0  0  0]             (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                                             (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2  2] and padding [0  0  0  0]             (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                                             (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    15   'relu_5'        ReLU                    ReLU                                                                             (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    17   'relu_6'        ReLU                    ReLU                                                                             (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer                                                         (HW Layer)
    19   'softmax'       Softmax                 softmax                                                                          (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes                                (SW Layer)
                                                                                                                                
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   13791559                  0.04597                       1           13791559             21.8
    conv_1                 3488979                  0.01163 
    maxpool_1              1852524                  0.00618 
    conv_2                 2940919                  0.00980 
    maxpool_2               586833                  0.00196 
    conv_3                 2584863                  0.00862 
    maxpool_3               615201                  0.00205 
    conv_4                  612220                  0.00204 
    maxpool_4                12217                  0.00004 
    fc_1                    665265                  0.00222 
    fc_2                    425425                  0.00142 
    fc_3                      7113                  0.00002 
 * The clock frequency of the DL processor is: 300MHz

The estimated performance is 21.8 Frames/s, up from 5.6 Frames/s with the original configuration.
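
To see where the improvement comes from, the gain can be split into a cycle-count reduction and a clock-frequency increase. The sketch below uses only the Network rows of the two estimator tables. The variable names are illustrative. Note that the comparison is not strictly like-for-like, because the first estimate includes the imageinput_norm layer on hardware while the second implements the image input layer in software.

```matlab
% Network rows from the two estimator tables.
baseCycles = 39245087;  baseClockHz = 220e6;   % single, 16 conv / 4 FC threads
newCycles  = 13791559;  newClockHz  = 300e6;   % int8, 64 conv / 16 FC threads

cycleReduction = baseCycles / newCycles;       % fewer cycles per frame
clockGain      = newClockHz / baseClockHz;     % faster clock
speedup        = cycleReduction * clockGain;   % combined throughput gain

fprintf('Cycles: %.2fx, clock: %.2fx, combined: %.2fx\n', ...
    cycleReduction, clockGain, speedup);
```

Most of the estimated gain comes from the reduced cycle count (about 2.85x), with the higher clock contributing a further factor of about 1.36x.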

Generate Custom Processor and Bitstream

Use the new custom processor configuration to build and generate a custom processor and bitstream. Use the custom bitstream to deploy the LogoNet network to your target FPGA board.

hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');
dlhdl.buildProcessor(hPCNew);

To learn how to use the generated bitstream file, see Generate Custom Bitstream.

The generated bitstream in this example is similar to the zcu102_int8 bitstream. To deploy the quantized LogoNet network using the zcu102_int8 bitstream, see Classify Images on FPGA Using Quantized Neural Network.
