Generate Custom Bitstream to Meet Custom Deep Learning Network Requirements

This example uses:

Deep Learning HDL Toolbox
Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices
Deep Learning Toolbox

If your custom network has only layers with the convolution module output format, or only layers with the fully connected module output format, you can deploy it by generating a resource-optimized custom bitstream that satisfies your performance and resource requirements. A bitstream generated from the default deep learning processor configuration contains the convolution (conv), fully connected (fc), and adder modules. This default bitstream can exceed your resource utilization requirements, which can drive up costs. To generate a bitstream that contains only the modules that your custom deep learning network needs, modify the deep learning processor configuration by using the setModuleProperty function of the dlhdl.ProcessorConfig object.

In this example, the network has only layers with the fully connected module output format. Generate a custom bitstream that contains only the fully connected module by removing the convolution and adder modules from the deep learning processor configuration. The example removes these modules in two alternative ways (the first two items) and then checks the result (the third item):

Turn off the ModuleGeneration property for the individual modules in the deep learning processor configuration.
Use the optimizeConfigurationForNetwork function. The function takes the deep learning network object as the input and returns an optimized custom deep learning processor configuration.
Rapidly verify the resource utilization of the optimized deep learning processor configuration by using the estimateResources function.

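Set Up Synthesis Tool Path

To set up the Xilinx® Vivado® tool path, uncomment and run this command, replacing the tool path with the location of your own Vivado installation: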

% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2022.1\bin\vivado.bat');

Create Custom Processor Configuration

Create a processor configuration object that has the default settings. Save the configuration to hPC.

hPC = dlhdl.ProcessorConfig

hPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Optimize Processor Configuration for a Custom Fully Connected (FC) Layer Only Network

To optimize your processor configuration, create a custom fully connected layer only network. Call the custom network fcnet.

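% Two-layer network: an image input followed by one fully connected layer.
layers = [ ...
    imageInputLayer([28 28 3],'Normalization','none','Name','input')
    fullyConnectedLayer(10,'Name','fc')];
% Assign random weights and biases so that the network needs no training.
layers(2).Weights = rand(10,28*28*3);   % 10 outputs, 28*28*3 = 2352 inputs
layers(2).Bias = rand(10,1);
fcnet = dlnetwork(layers);
plot(fcnet);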

[Figure: layer graph plot of fcnet]
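
To make the shapes in the network definition concrete, the following sketch reproduces the arithmetic of the fully connected layer in plain MATLAB, without any toolbox. The variable names and the random stand-in image x are illustrative assumptions, and the sketch ignores how the toolbox orders the flattened input elements.

```matlab
inputSize  = [28 28 3];                 % same size as the imageInputLayer
numOutputs = 10;                        % same as fullyConnectedLayer(10)
numInputs  = prod(inputSize);           % flattened input elements

weights = rand(numOutputs, numInputs);  % same shape as layers(2).Weights
bias    = rand(numOutputs, 1);          % same shape as layers(2).Bias

x = rand(inputSize);                    % random stand-in for one input image
y = weights * x(:) + bias;              % matrix-vector product plus bias

numLearnables = numel(weights) + numel(bias);
fprintf('Weights: %d-by-%d, output: %d-by-1, learnables: %d\n', ...
    size(weights,1), size(weights,2), numel(y), numLearnables);
```

The weight matrix must have one column per input element, which is why the example sizes it as 10-by-28*28*3.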

Retrieve the resource utilization for the default processor configuration hPC by using estimateResources. Retrieve the estimated performance of the custom network fcnet by using estimatePerformance.

hPC.estimateResources

              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                389( 16%)        508( 56%)     216119( 79%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

hPC.estimatePerformance(fcnet)

### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images            (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer  (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error        (SW Layer)
                                                                    
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      16127                  0.00008                       1              16127          12401.6
    fc                       16127                  0.00008 
 * The clock frequency of the DL processor is: 200MHz
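
The Frames/s value in the table can be cross-checked from the latency in cycles and the processor clock. This is a minimal arithmetic sketch that uses only the numbers printed above; the variable names are illustrative, and it assumes a single frame (FramesNum is 1), so the throughput is simply the clock rate divided by the frame latency.

```matlab
% Values copied from the estimator output above.
clockHz       = 200e6;   % DL processor clock frequency: 200 MHz
latencyCycles = 16127;   % LastFrameLatency(cycles) for the network

% One frame takes latencyCycles clock ticks.
latencySeconds  = latencyCycles / clockHz;
framesPerSecond = clockHz / latencyCycles;

fprintf('Latency:    %.5f s\n', latencySeconds);
fprintf('Throughput: %.1f frames/s\n', framesPerSecond);
```

Rounded to the precision of the table, these values agree with the reported 0.00008 seconds and 12401.6 frames per second.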

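The resource budget for this example, which is lower than the Available counts that estimateResources reports for the device (2520 DSPs and 912 block RAM tiles), is: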

Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128

The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:

Digital signal processor (DSP) slice count — 389
Block random access memory (BRAM) count — 508

The estimated DSP slice and BRAM counts both exceed the resource budget. To reduce resource use, customize the processor configuration.
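
The comparison against the budget can be written as a small check. The sketch below uses plain MATLAB and the numbers quoted above; the struct layout and variable names are assumptions made for illustration and are not part of the toolbox API.

```matlab
% Budget and estimates copied from the text above.
budget    = struct('DSP', 240, 'BRAM', 128);
estimated = struct('DSP', 389, 'BRAM', 508);   % default configuration hPC

names = fieldnames(budget);
overBudget = false(numel(names), 1);
for k = 1:numel(names)
    name = names{k};
    % A resource is over budget when the estimate exceeds the allowance.
    overBudget(k) = estimated.(name) > budget.(name);
    fprintf('%-4s estimated %4d, budget %4d, over budget: %d\n', ...
        name, estimated.(name), budget.(name), overBudget(k));
end
```

Both entries of overBudget are true for the default configuration, which is the motivation for the customizations that follow.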

Customize Processor Configuration by Using `ModuleGeneration` Property

Create a deep learning processor configuration object and save it to hPC_moduleoff. Turn off the convolution module and the custom module, which contains the adder (shown as Addition under Processing Module "custom" in the configuration display).

hPC_moduleoff = dlhdl.ProcessorConfig;
hPC_moduleoff.setModuleProperty('conv','ModuleGeneration','off');
hPC_moduleoff.setModuleProperty('custom','ModuleGeneration','off');

Retrieve the resource utilization for the customized processor configuration hPC_moduleoff by using estimateResources. Retrieve the estimated performance of the custom network fcnet by using estimatePerformance.

hPC_moduleoff.estimateResources

              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                 17(  1%)         44(  5%)      25760( 10%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

hPC_moduleoff.estimatePerformance(fcnet)

### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images            (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer  (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error        (SW Layer)
                                                                    
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      16127                  0.00008                       1              16127          12401.6
    fc                       16127                  0.00008 
 * The clock frequency of the DL processor is: 200MHz

The resource budget for this example is:

Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128

The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:

Digital signal processor (DSP) slice count — 17
Block random access memory (BRAM) count — 44

The estimated resource use of the customized bitstream fits within the resource budget. The estimated performance is unchanged from the default configuration and matches the target network performance.

Customize Processor Configuration by Using `optimizeConfigurationForNetwork`

Create a deep learning processor configuration object and save it to hPC_optimized. Generate an optimized deep learning processor configuration by using the optimizeConfigurationForNetwork function.

hPC_optimized = dlhdl.ProcessorConfig;
hPC_optimized.optimizeConfigurationForNetwork(fcnet);

### Optimizing processor configuration for deep learning network...
### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Note: Processing module "conv" property "ModuleGeneration" changed from "true" to "false".
### Note: Processing module "fc" property "InputMemorySize" changed from "25088" to "2352".
### Note: Processing module "fc" property "OutputMemorySize" changed from "4096" to "128".
### Note: Processing module "custom" property "ModuleGeneration" changed from "true" to "false".

                    Processing Module "conv"
                            ModuleGeneration: 'off'

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 2352
                            OutputMemorySize: 128

                  Processing Module "custom"
                            ModuleGeneration: 'off'

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

### Optimizing processor configuration for deep learning network complete.
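
The new fc InputMemorySize in the log equals the number of elements in one 28-by-28-by-3 input image. That reading is an inference from the numbers in this example, not a documented sizing rule, but the arithmetic is easy to confirm in plain MATLAB. The variable names below are illustrative.

```matlab
inputSize = [28 28 3];          % input size of fcnet
numInputs = prod(inputSize);    % elements that feed the fc layer

oldInMem = 25088;               % default fc InputMemorySize from the log
newInMem = 2352;                % optimized fc InputMemorySize from the log

fprintf('fc input elements: %d\n', numInputs);
fprintf('Input memory reduced by a factor of %.2f\n', oldInMem / newInMem);
```

The smaller fc memories are the only difference between this configuration and the one that just turns modules off, which is consistent with the lower BRAM estimate that follows.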

Retrieve the resource utilization for the optimized processor configuration hPC_optimized by using estimateResources. Retrieve the estimated performance of the custom network fcnet by using estimatePerformance.

hPC_optimized.estimateResources

              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                 17(  1%)         20(  3%)      25760( 10%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

hPC_optimized.estimatePerformance(fcnet)

### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images            (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer  (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error        (SW Layer)
                                                                    
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      16127                  0.00008                       1              16127          12401.6
    fc                       16127                  0.00008 
 * The clock frequency of the DL processor is: 200MHz

The resource budget for this example is:

Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128

The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:

Digital signal processor (DSP) slice count — 17
Block random access memory (BRAM) count — 20

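The estimated resource use of the optimized bitstream fits within the resource budget, and it uses fewer BRAMs than the configuration that only turns modules off (20 compared to 44). The estimated performance is unchanged from the default configuration and matches the target network performance.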
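
The three configurations can be compared side by side. The sketch below uses only the DSP and BRAM estimates quoted in this example and the budget of 240 DSP slices and 128 BRAMs; it runs in plain MATLAB, and the variable names are illustrative.

```matlab
% Columns: default, ModuleGeneration off, optimizeConfigurationForNetwork.
configNames = {'hPC', 'hPC_moduleoff', 'hPC_optimized'};
dsp  = [389 17 17];    % estimated DSP slices
bram = [508 44 20];    % estimated BRAMs
budgetDSP  = 240;
budgetBRAM = 128;

% A configuration fits only if both resources are within budget.
fits = (dsp <= budgetDSP) & (bram <= budgetBRAM);

% Percentage saved relative to the default configuration.
dspSaving  = 100 * (1 - dsp  / dsp(1));
bramSaving = 100 * (1 - bram / bram(1));

for k = 1:numel(configNames)
    fprintf('%-14s DSP %3d (saves %5.1f%%)  BRAM %3d (saves %5.1f%%)  fits: %d\n', ...
        configNames{k}, dsp(k), dspSaving(k), bram(k), bramSaving(k), fits(k));
end
```

Only the default configuration fails the budget check. All three configurations report the same estimated 12401.6 frames per second, so in this example the resource savings come with no estimated performance cost.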