Generate Custom Bitstream to Meet Custom Deep Learning Network Requirements

Deploy a custom network that contains only layers with the convolution module output format, or only layers with the fully connected module output format, by generating a resource-optimized custom bitstream that satisfies your performance and resource requirements. A bitstream generated from the default deep learning processor configuration contains the convolution (conv), fully connected (fc), and custom modules. The custom module provides the addition (adder) and multiplication operations. A default bitstream can use more resources than your budget allows, which can drive up costs. To generate a bitstream that contains only the modules that your custom deep learning network needs, modify the deep learning processor configuration by using the setModuleProperty function of the dlhdl.ProcessorConfig object.

In this example, the network contains only layers that have the fully connected module output format. Generate a custom bitstream that contains only the fully connected module by removing the convolution and custom (adder) modules from the deep learning processor configuration. You can remove these modules in either of two ways, and then verify the result:

  • Turn off the ModuleGeneration property for the individual modules in the deep learning processor configuration.

  • Use the optimizeConfigurationForNetwork function. The function takes the deep learning network object as the input and returns an optimized custom deep learning processor configuration.

  • In either case, rapidly verify the resource utilization of the resulting deep learning processor configuration by using the estimateResources function.

Prerequisites

  • Deep Learning HDL Toolbox™ Support Package for Xilinx™ FPGA and SoC

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

Create Custom Processor Configuration

Create a processor configuration object that uses the default property values. Save the configuration to hPC.

hPC = dlhdl.ProcessorConfig
hPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Optimize Processor Configuration for a Custom Fully Connected (FC) Layer only Network

To optimize your processor configuration, create a custom fully connected layer only network. Call the custom network fcnet.

layers = [ ...
    imageInputLayer([28 28 3],'Normalization','none','Name','input')
    fullyConnectedLayer(10,'Name','fc')
    regressionLayer('Name','output')];
layers(2).Weights = rand(10,28*28*3);
layers(2).Bias = rand(10,1);
fcnet = assembleNetwork(layers);
plot(fcnet);

Figure: layer graph of fcnet showing the input, fc, and output layers.

Retrieve the resource utilization for the default processor configuration hPC by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.

hPC.estimateResources
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                381( 16%)        508( 56%)     216119( 79%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC.estimatePerformance(fcnet)
### The network includes the following layers:
     1   'input'    Image Input         28×28×3 images            (SW Layer)
     2   'fc'       Fully Connected     10 fully connected layer  (HW Layer)
     3   'output'   Regression Output   mean-squared-error        (SW Layer)
                                                                
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     137574                  0.00069                       1             137574           1453.8
    ____fc                  137574                  0.00069 
 * The clock frequency of the DL processor is: 200MHz
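
As a quick sanity check, the latency and throughput figures in the report can be reproduced from the cycle count and the clock frequency. The following is a minimal sketch in plain MATLAB arithmetic. The variable names are illustrative and not part of the toolbox, and the sketch assumes that throughput is the clock frequency divided by the single-frame latency, which is consistent with the report when FramesNum is 1.

```matlab
% Values copied from the estimatePerformance report above.
latencyCycles = 137574;   % LastFrameLatency(cycles) for the network
clockHz       = 200e6;    % DL processor clock frequency (200 MHz)

% Assumption: one frame is processed per latency period (FramesNum = 1).
latencySeconds  = latencyCycles / clockHz;
framesPerSecond = clockHz / latencyCycles;

fprintf('Latency:    %.5f s\n', latencySeconds);          % prints 0.00069
fprintf('Throughput: %.1f frames/s\n', framesPerSecond);  % prints 1453.8
```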

The target device resource budget for this example, which is smaller than the totals in the Available row of the report, is:

  • Digital signal processor (DSP) slice count — 240

  • Block random access memory (BRAM) count — 128

The estimated performance is 1454 frames per second (FPS). The estimated resource use counts are:

  • Digital signal processor (DSP) slice count — 381

  • Block random access memory (BRAM) count — 508

The estimated DSP slice count and BRAM count both exceed the target device resource budget. To reduce resource use, customize the processor configuration before you generate the bitstream.
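
The comparison against the budget can be written out explicitly. This is a small sketch that uses only the numbers listed above; the variable names are hypothetical and the code does not call any toolbox function.

```matlab
% Budget and estimates copied from the lists above.
budget    = [240 128];   % [DSP slices, BRAMs] allowed by the budget
estimated = [381 508];   % [DSP slices, BRAMs] reported by estimateResources

overBudget = estimated > budget;   % logical flag per resource
excess     = estimated - budget;   % amount by which each estimate overshoots

names = {'DSP', 'BRAM'};
for k = 1:numel(names)
    fprintf('%-4s estimated %3d, budget %3d, over budget: %d (excess %d)\n', ...
        names{k}, estimated(k), budget(k), overBudget(k), excess(k));
end
```

Both flags are true here, which is why the following sections shrink the configuration.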

Customize Processor Configuration by Using ModuleGeneration Property

Create a deep learning processor configuration object. Save it to hPC_moduleoff. Turn off the convolution and custom (adder) modules in the deep learning processor configuration by setting their ModuleGeneration property to 'off'.

hPC_moduleoff = dlhdl.ProcessorConfig;
hPC_moduleoff.setModuleProperty('conv','ModuleGeneration','off');
% The configuration display above lists this module as "custom".
hPC_moduleoff.setModuleProperty('custom','ModuleGeneration','off');

Retrieve the resource utilization for the customized processor configuration hPC_moduleoff by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.

hPC_moduleoff.estimateResources
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                 17(  1%)         44(  5%)      25760( 10%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC_moduleoff.estimatePerformance(fcnet)
### The network includes the following layers:
     1   'input'    Image Input         28×28×3 images            (SW Layer)
     2   'fc'       Fully Connected     10 fully connected layer  (HW Layer)
     3   'output'   Regression Output   mean-squared-error        (SW Layer)
                                                                
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     137574                  0.00069                       1             137574           1453.8
    ____fc                  137574                  0.00069 
 * The clock frequency of the DL processor is: 200MHz

The target device resource budget is:

  • Digital signal processor (DSP) slice count — 240

  • Block random access memory (BRAM) count — 128

The estimated performance is 1454 frames per second (FPS). The estimated resource use counts are:

  • Digital signal processor (DSP) slice count — 17

  • Block random access memory (BRAM) count — 44

The estimated resource use of the customized configuration fits within the target device resource budget. The estimated performance is unchanged at 1454 FPS and matches the target network performance.

Customize Processor Configuration by Using optimizeConfigurationForNetwork

Create a deep learning network processor configuration object. Save it to hPC_optimized. Generate an optimized deep learning processor configuration by using the optimizeConfigurationForNetwork function.

hPC_optimized = dlhdl.ProcessorConfig;
hPC_optimized.optimizeConfigurationForNetwork(fcnet);
### Optimizing processor configuration for deep learning network begin.
### Note: Processing module "conv" property "ModuleGeneration" changed from "true" to "false".
### Note: Processing module "fc" property "InputMemorySize" changed from "25088" to "2352".
### Note: Processing module "fc" property "OutputMemorySize" changed from "4096" to "128".
### Note: Processing module "custom" property "ModuleGeneration" changed from "true" to "false".

                    Processing Module "conv"
                            ModuleGeneration: 'off'

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 2352
                            OutputMemorySize: 128

                  Processing Module "custom"
                            ModuleGeneration: 'off'

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

### Optimizing processor configuration for deep learning network complete.
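
The new InputMemorySize of the fc module, 2352, coincides with the number of elements in the 28-by-28-by-3 network input. Whether the optimizer derives the value this way is an assumption; the sketch below only reproduces the arithmetic, and the variable names are illustrative.

```matlab
inputSize = [28 28 3];                 % size passed to imageInputLayer
numInputElements = prod(inputSize);    % elements fed to the fc layer per frame

fprintf('Flattened input elements: %d\n', numInputElements);   % prints 2352
```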

Retrieve the resource utilization for the optimized processor configuration hPC_optimized by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.

hPC_optimized.estimateResources
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                 17(  1%)         20(  3%)      25760( 10%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC_optimized.estimatePerformance(fcnet)
### The network includes the following layers:
     1   'input'    Image Input         28×28×3 images            (SW Layer)
     2   'fc'       Fully Connected     10 fully connected layer  (HW Layer)
     3   'output'   Regression Output   mean-squared-error        (SW Layer)
                                                                
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'output' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     137574                  0.00069                       1             137574           1453.8
    ____fc                  137574                  0.00069 
 * The clock frequency of the DL processor is: 200MHz

The target device resource budget is:

  • Digital signal processor (DSP) slice count — 240

  • Block random access memory (BRAM) count — 128

The estimated performance is 1454 frames per second (FPS). The estimated resource use counts are:

  • Digital signal processor (DSP) slice count — 17

  • Block random access memory (BRAM) count — 20

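The estimated resource use of the optimized configuration fits within the target device resource budget. The function also reduces the InputMemorySize and OutputMemorySize of the fc module, and the estimated BRAM count is 20, compared with 44 for hPC_moduleoff. The estimated performance is unchanged at 1454 FPS and matches the target network performance.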
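
To compare the three configurations side by side, the figures from the three estimateResources reports can be collected in one matrix. This is a sketch in plain MATLAB with hypothetical variable names; the numbers are copied from the reports in this example and are not recomputed by the toolbox.

```matlab
% Rows: default, ModuleGeneration off, optimizeConfigurationForNetwork.
% Columns: [DSPs, Block RAMs, LUTs] from the three estimateResources reports.
labels = {'default', 'moduleoff', 'optimized'};
used = [381 508 216119;
         17  44  25760;
         17  20  25760];
budget = [240 128];                               % [DSP, BRAM] budget

fits = all(used(:, 1:2) <= budget, 2);            % true where both resources fit
bramSavedPct = 100 * (1 - used(:, 2) / used(1, 2)); % BRAM saving relative to default

for k = 1:numel(labels)
    fprintf('%-10s DSP %3d  BRAM %3d  fits budget: %d  BRAM saved: %.1f%%\n', ...
        labels{k}, used(k, 1), used(k, 2), fits(k), bramSavedPct(k));
end
```

Only the two customized configurations fit the budget, and the optimized configuration saves the most BRAM.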

Generate Custom Bitstream

Generate a custom bitstream using the processor configuration that matches your performance and resource requirements.

To generate the bitstream from hPC_moduleoff, the configuration customized by using the ModuleGeneration property, uncomment this line of code:

%   dlhdl.buildProcessor(hPC_moduleoff)

To generate the bitstream from hPC_optimized, the configuration produced by the optimizeConfigurationForNetwork function, uncomment this line of code:

%   dlhdl.buildProcessor(hPC_optimized)
