Automatically Schedule `for`-Loops for Neighborhood Processing Subsystems


Code generated from Neighborhood Processing Subsystem, Pixel Processing Subsystem, and Array Processing Subsystem blocks contains for-loop nests, which can be computationally expensive. You can produce more efficient nested loop code by using automatic scheduling, which can significantly improve the execution speed of the generated code.

When you enable automatic scheduling, the code generator determines the optimal sequence of transformations for a loop nest created from a subsystem block. Depending on factors such as input size, data type, and layout, as well as memory access patterns and other computations, loop transformations might or might not enhance performance. The code generator evaluates the impact of these factors on parallelization and cache performance to determine which loop transformations, if any, to apply. Typically, speedup occurs when the input to the block is sufficiently large. Consequently, automatic scheduling is useful for high-throughput applications such as image processing and computer vision.

This example shows how to automatically schedule for-loops in the generated code of a Neighborhood Processing Subsystem block that sums elements of a large matrix.

Configure a Neighborhood Processing Subsystem Block to Sum Elements of a Matrix

Configure a Neighborhood Processing Subsystem block to return a matrix whose elements are the sum of the 3-by-3 neighborhood surrounding each input element.

1. Create or open a model that contains a Neighborhood Processing Subsystem block.

model = 'AutomaticallyScheduleForLoops';
open_system(model);

2. Inside the Neighborhood Processing Subsystem block, insert a Sum of Elements block between input and output.

3. In the Block Parameters: Neighborhood dialog box:

Set Neighborhood Size to [3 3].
Set Output Size to Valid.

4. In the top model, connect an Inport block and Outport block to the Neighborhood Processing Subsystem block.

5. Open the Block Parameters dialog box for the Inport block. In the Signal Attributes tab, set Port Dimensions to [2000 2000]. The input to the Neighborhood Processing Subsystem block is sufficiently large for the automatic scheduler to transform loop nests.
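The configuration above can be illustrated with a small hand-written sketch. It is not the code that the code generator emits. It assumes column-major storage, and the names valid_extent and neighborhood_sum_3x3 are hypothetical helpers introduced only for this illustration. With Output Size set to Valid, each output dimension shrinks by the neighborhood size minus one, so a 2000-by-2000 input yields a 1998-by-1998 output.

```c
#include <assert.h>
#include <stddef.h>

/* Output extent along one dimension for a "Valid" neighborhood:
 * only positions where the whole window fits inside the input count. */
size_t valid_extent(size_t input_extent, size_t window_extent)
{
    assert(window_extent >= 1 && input_extent >= window_extent);
    return input_extent - window_extent + 1;
}

/* Sum each 3-by-3 neighborhood of a rows-by-cols matrix.
 * Assumption: both in and out are stored in column-major order.
 * out must hold valid_extent(rows, 3) * valid_extent(cols, 3) elements. */
void neighborhood_sum_3x3(const double *in, size_t rows, size_t cols,
                          double *out)
{
    size_t out_rows = valid_extent(rows, 3);
    size_t out_cols = valid_extent(cols, 3);

    for (size_t c = 0; c < out_cols; c++) {
        for (size_t r = 0; r < out_rows; r++) {
            double sum = 0.0;
            /* Accumulate the 9 elements whose top-left corner is (r, c). */
            for (size_t dc = 0; dc < 3; dc++) {
                for (size_t dr = 0; dr < 3; dr++) {
                    sum += in[(c + dc) * rows + (r + dr)];
                }
            }
            out[c * out_rows + r] = sum;
        }
    }
}
```

For example, calling valid_extent(2000, 3) gives 1998, which matches the 1998 * 1998 = 3992004 output elements in the generated function signature shown later in this example.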

Generate Code Using Default Parameters

1. Open the Configuration Parameters dialog box. In the Code Generation pane, set the System Target File parameter to ert.tlc.

set_param(model,SystemTargetFile='ert.tlc');

2. Build the model and generate code.

slbuild(model);

### Starting build procedure for: AutomaticallyScheduleForLoops
### Successful completion of build procedure for: AutomaticallyScheduleForLoops

Build Summary

Top model targets built:

Model                          Action                        Rebuild Reason                                    
===============================================================================================================
AutomaticallyScheduleForLoops  Code generated and compiled.  Code generation information file does not exist.  

1 of 1 models built (0 models already up to date)
Build duration: 0h 0m 25.744s

If you do not select the Automatically schedule for-loops parameter, the code generator produces the following nested loop code.

cfile = fullfile('AutomaticallyScheduleForLoops_ert_rtw','AutomaticallyScheduleForLoops.c');
coder.example.extractLines(cfile,'/* Forward declaration for local functions */', '/* Model step function */', 1, 0);

/* Forward declaration for local functions */
static void AutomaticallyScheduleForLoo_lnt(const real_T Inport[4000000], real_T
  rtb_ImpAsg_InsertedFor_Outpor_0[3992004]);
static void AutomaticallyScheduleForLoo_lnt(const real_T Inport[4000000], real_T
  rtb_ImpAsg_InsertedFor_Outpor_0[3992004])
{
  real_T Img_Load[9];
  int32_T i;
  int32_T var1;
  int32_T var2;
  for (var1 = 0; var1 < 1998; var1++) {
    for (var2 = 0; var2 < 1998; var2++) {
      real_T tmp;
      Img_Load[0] = Inport[2000 * var2 + var1];
      Img_Load[3] = Inport[(var2 + 1) * 2000 + var1];
      Img_Load[6] = Inport[(var2 + 2) * 2000 + var1];
      Img_Load[1] = Inport[(2000 * var2 + var1) + 1];
      Img_Load[4] = Inport[((var2 + 1) * 2000 + var1) + 1];
      Img_Load[7] = Inport[((var2 + 2) * 2000 + var1) + 1];
      Img_Load[2] = Inport[(2000 * var2 + var1) + 2];
      Img_Load[5] = Inport[((var2 + 1) * 2000 + var1) + 2];
      Img_Load[8] = Inport[((var2 + 2) * 2000 + var1) + 2];

      /* Outputs for Iterator SubSystem: '<Root>/Neighborhood Processing Subsystem' */
      /* Sum: '<S1>/Sum of Elements' */
      tmp = -0.0;
      for (i = 0; i < 9; i++) {
        tmp += Img_Load[i];
      }

      rtb_ImpAsg_InsertedFor_Outpor_0[var1 + 1998 * var2] = tmp;

      /* End of Sum: '<S1>/Sum of Elements' */
      /* End of Outputs for SubSystem: '<Root>/Neighborhood Processing Subsystem' */
    }
  }
}

Generate Code Using Automatic Scheduling

To transform the above for-loop nest, the code generator employs the optimization techniques of tiling and parallelization. You must therefore enable parallel for-loops before running automatic scheduling.

1. In the Optimization pane, set Specify custom optimizations to on.

set_param(model,OptimizationCustomize='on');

Then set both Generate parallel for-loops and Automatically schedule for-loops to on.

set_param(model,MultiThreadedLoops='on');
set_param(model,AutoScheduleForLoops='on');

Doing so enables you to generate nested loop code using automatic scheduling.

2. Build the model and generate code.

slbuild(model);

### Starting build procedure for: AutomaticallyScheduleForLoops
### Successful completion of build procedure for: AutomaticallyScheduleForLoops

Build Summary

Top model targets built:

Model                          Action                        Rebuild Reason                   
==============================================================================================
AutomaticallyScheduleForLoops  Code generated and compiled.  Generated code was out of date.  

1 of 1 models built (0 models already up to date)
Build duration: 0h 0m 11.647s

The code generator produces the following nested loop code. It first tiles the 1998-by-1998 loop nest into 8-by-32 blocks: 250 tiles along the first dimension and 63 along the second, with the last tile in each dimension shifted back so that it stays within bounds. Tiling improves data locality, which speeds up memory access. The code generator then parallelizes the outermost loop over the tiles by using an OpenMP pragma. The resulting code is optimized for the large input size.

cfile = fullfile('AutomaticallyScheduleForLoops_ert_rtw','AutomaticallyScheduleForLoops.c');
coder.example.extractLines(cfile,'/* Forward declaration for local functions */', '/* Model step function */', 1, 0);

/* Forward declaration for local functions */
static void AutomaticallyScheduleForLoo_lnt(const real_T Inport[4000000], real_T
  rtb_ImpAsg_InsertedFor_Outpor_0[3992004]);
static void AutomaticallyScheduleForLoo_lnt(const real_T Inport[4000000], real_T
  rtb_ImpAsg_InsertedFor_Outpor_0[3992004])
{
  real_T Img_Load[9];
  real_T tmp;
  int32_T i;
  int32_T newLoop0;
  int32_T newLoop1;
  int32_T u0;
  int32_T u0_0;
  int32_T var1;
  int32_T var1_0;
  int32_T var2;
  int32_T var2_0;

#pragma omp parallel for num_threads(omp_get_max_threads()) private(u0,var2_0,u0_0,newLoop0,var1,newLoop1,var2,Img_Load,tmp,i)

  for (var1_0 = 0; var1_0 < 250; var1_0++) {
    u0 = var1_0 << 3;
    if (u0 > 1990) {
      u0 = 1990;
    }

    for (var2_0 = 0; var2_0 < 63; var2_0++) {
      u0_0 = var2_0 << 5;
      if (u0_0 > 1966) {
        u0_0 = 1966;
      }

      for (newLoop0 = 0; newLoop0 < 8; newLoop0++) {
        var1 = u0 + newLoop0;
        for (newLoop1 = 0; newLoop1 < 32; newLoop1++) {
          var2 = u0_0 + newLoop1;
          Img_Load[0] = Inport[2000 * var2 + var1];
          Img_Load[3] = Inport[(var2 + 1) * 2000 + var1];
          Img_Load[6] = Inport[(var2 + 2) * 2000 + var1];
          Img_Load[1] = Inport[(2000 * var2 + var1) + 1];
          Img_Load[4] = Inport[((var2 + 1) * 2000 + var1) + 1];
          Img_Load[7] = Inport[((var2 + 2) * 2000 + var1) + 1];
          Img_Load[2] = Inport[(2000 * var2 + var1) + 2];
          Img_Load[5] = Inport[((var2 + 1) * 2000 + var1) + 2];
          Img_Load[8] = Inport[((var2 + 2) * 2000 + var1) + 2];

          /* Outputs for Iterator SubSystem: '<Root>/Neighborhood Processing Subsystem' */
          /* Sum: '<S1>/Sum of Elements' */
          tmp = -0.0;
          for (i = 0; i < 9; i++) {
            tmp += Img_Load[i];
          }

          rtb_ImpAsg_InsertedFor_Outpor_0[var1 + 1998 * var2] = tmp;

          /* End of Sum: '<S1>/Sum of Elements' */
          /* End of Outputs for SubSystem: '<Root>/Neighborhood Processing Subsystem' */
        }
      }
    }
  }
}
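The tile arithmetic in the listing above can be checked with a short sketch. This is an illustration under stated assumptions, not part of the generated code: tile_count, tile_origin, and count_visits are hypothetical helper names, and the sketch covers one loop dimension at a time. It reproduces the constants in the listing (250 tiles of 8 with the origin clamped at 1990, and 63 tiles of 32 with the origin clamped at 1966) and counts how often each index is visited.

```c
#include <assert.h>

/* Number of tiles needed to cover extent iterations: ceil(extent / tile). */
int tile_count(int extent, int tile)
{
    assert(tile > 0 && extent >= tile);
    return (extent + tile - 1) / tile;
}

/* Start index of tile t. A tile that would run past the end is shifted
 * back so that it ends exactly at extent, as the if-statements in the
 * listing do with 1990 and 1966. */
int tile_origin(int t, int extent, int tile)
{
    int origin = t * tile;
    int last = extent - tile;
    return origin > last ? last : origin;
}

/* Count how many times the tiled loop visits each index in [0, extent).
 * visits must hold extent ints. */
void count_visits(int extent, int tile, int *visits)
{
    int n = tile_count(extent, tile);

    for (int i = 0; i < extent; i++) {
        visits[i] = 0;
    }
    for (int t = 0; t < n; t++) {
        int origin = tile_origin(t, extent, tile);
        for (int k = 0; k < tile; k++) {
            visits[origin + k]++;
        }
    }
}
```

Every index is visited at least once. Because 1998 is not a multiple of 8 or 32, the shifted last tile overlaps its predecessor, so a few indices are visited twice (2 along the first dimension and 18 along the second). The overlap only recomputes output elements with the same value, which keeps every tile a fixed size and avoids a separate remainder loop.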
