Train Mask R-CNN network to perform instance segmentation

Since R2022a


    trainedDetector = trainMaskRCNN(trainingData,network,options) trains a Mask R-CNN network. A trained Mask R-CNN network object can perform instance segmentation to detect and segment multiple object classes. This syntax supports transfer learning on a pretrained Mask R-CNN network and training an uninitialized Mask R-CNN network.

    This function requires that you have Deep Learning Toolbox™. It is recommended that you also have Parallel Computing Toolbox™ to use with a CUDA®-enabled NVIDIA® GPU. For information about the supported compute capabilities, see GPU Computing Requirements (Parallel Computing Toolbox).

    trainedDetector = trainMaskRCNN(trainingData,network,options,Name=Value) uses additional options specified by one or more name-value arguments.

    [trainedDetector,info] = trainMaskRCNN(trainingData,network,options) also returns information on the training progress, such as training loss and accuracy, for each iteration.

    Input Arguments


    Labeled ground truth training data (trainingData), specified as a datastore. Set up your data so that calling the read and readall functions on the datastore returns a cell array with four columns. This list describes the format of each column, in order.

    • Column 1 (image): RGB image that serves as a network input, specified as an H-by-W-by-3 numeric array.

    • Column 2 (bounding boxes): Bounding boxes, specified as an M-by-4 matrix, where M is the number of objects within the image. Each bounding box has the format [x y width height], where [x, y] is the coordinate of the top-left corner of the bounding box.

    • Column 3 (labels): Object class names, specified as an M-by-1 categorical vector. All categorical data returned by the datastore must contain the same categories.

    • Column 4 (masks): Binary masks, specified as a logical array of size H-by-W-by-M. Each mask is the segmentation of one instance in the image.

    You can create a datastore that returns data in the required format using these steps:

    1. Create an imageDatastore that returns RGB image data.

    2. Create a boxLabelDatastore that returns bounding box data and instance labels as a two-element cell array.

    3. Create an imageDatastore and specify a custom read function that returns mask data as a logical array of size H-by-W-by-M.

    4. Combine the three datastores using the combine function.

    For more information, see Getting Started with Mask R-CNN for Instance Segmentation.
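    The four steps above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the folder names, the boxLabelTable variable, the readMasks helper, and the assumption that each mask file is a MAT-file holding a variable named masks are all hypothetical and stand in for your own data layout.

```matlab
% Hypothetical locations; replace with your own data.
imageFolder = fullfile("data","images");   % RGB images
maskFolder  = fullfile("data","masks");    % one MAT-file of masks per image

% 1. Image datastore that returns RGB image data.
imds = imageDatastore(imageFolder);

% 2. Box label datastore that returns {boxes, labels} for each image.
%    boxLabelTable is an assumed two-column table with one row per image:
%    column 1 holds M-by-4 boxes, column 2 holds M-by-1 categorical labels.
blds = boxLabelDatastore(boxLabelTable);

% 3. Image datastore with a custom read function that returns the masks.
mds = imageDatastore(maskFolder,FileExtensions=".mat",ReadFcn=@readMasks);

% 4. Combine the three datastores so that each read returns four columns.
trainingData = combine(imds,blds,mds);

% Inspect one sample to check the image, boxes, labels, and masks.
sample = preview(trainingData);

function masks = readMasks(filename)
    % Assumes each MAT-file stores an H-by-W-by-M array in a variable "masks".
    data = load(filename,"masks");
    masks = logical(data.masks);
end
```

    The files in the image and mask folders must sort into the same order, because combine pairs the datastores read by read.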

    Mask R-CNN network to train (network), specified as a maskrcnn object.

    Training options (options), specified as a TrainingOptionsSGDM, TrainingOptionsRMSProp, or TrainingOptionsADAM object returned by the trainingOptions (Deep Learning Toolbox) function. To specify the solver name and other options for network training, use the trainingOptions function. You must set the ResetInputNormalization property to false.
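    As a minimal sketch of such a training options object, the following call uses the SGDM solver and turns off input normalization reset. The learning rate, epoch count, and mini-batch size are illustrative assumptions, not recommended settings.

```matlab
% Illustrative values only; tune them for your data and GPU memory.
options = trainingOptions("sgdm", ...
    InitialLearnRate=0.001, ...
    MaxEpochs=10, ...
    MiniBatchSize=2, ...
    ResetInputNormalization=false, ...   % required by trainMaskRCNN
    VerboseFrequency=50);
```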

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: trainedDetector = trainMaskRCNN(trainingData,network,options,NumRegionsToSample=64) samples 64 region proposals from each training image

    Bounding box overlap ratios for positive training samples, specified as a two-element numeric vector with values in the range [0, 1]. Region proposals that overlap with ground truth bounding boxes within the specified range are used as positive training samples.

    The overlap ratio for bounding boxes A and B is:

    $$\frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$


    Bounding box overlap ratios for negative training samples, specified as a two-element numeric vector with values in the range [0, 1]. Region proposals that overlap with the ground truth bounding boxes within the specified range are used as negative training samples.

    The overlap ratio for bounding boxes A and B is:

    $$\frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$
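    The overlap ratio used by both ranges is the area of the intersection of the two boxes divided by the area of their union. As a worked sketch, the following code computes it for two boxes in [x y width height] format, treating them as continuous rectangles. The box values and the two ranges are illustrative assumptions.

```matlab
% Two boxes in [x y width height] format (illustrative values).
A = [10 10 40 40];
B = [30 30 40 40];

% Edges of the intersection rectangle.
xLeft   = max(A(1),B(1));
yTop    = max(A(2),B(2));
xRight  = min(A(1)+A(3),B(1)+B(3));
yBottom = min(A(2)+A(4),B(2)+B(4));

% Clamp at zero so that non-overlapping boxes give an empty intersection.
interArea = max(0,xRight-xLeft) * max(0,yBottom-yTop);
unionArea = A(3)*A(4) + B(3)*B(4) - interArea;
overlapRatio = interArea / unionArea;    % 400/2800 = 1/7

% Compare against assumed positive and negative overlap ranges.
positiveRange = [0.5 1];
negativeRange = [0.1 0.5];
isPositive = overlapRatio >= positiveRange(1) && overlapRatio <= positiveRange(2);
isNegative = overlapRatio >= negativeRange(1) && overlapRatio <= negativeRange(2);
```

    With these values, a proposal B that overlaps ground truth box A with a ratio of about 0.14 falls in the assumed negative range and outside the assumed positive range.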


    Maximum number of strongest region proposals to use for generating training samples, specified as a positive integer. Reduce this value to speed up processing at the cost of training accuracy. To use all region proposals, set this value to Inf.

    Number of region proposals to randomly sample from each training image, specified as a positive integer. Reduce the number of regions to sample to reduce memory usage and speed up training. Reducing the value can also decrease training accuracy.

    Subnetworks to freeze during training, specified as one of these values:

    • "none" — Do not freeze subnetworks

    • "backbone" — Freeze the feature extraction subnetwork, including the layers following the ROI align layer

    • "rpn" — Freeze the region proposal subnetwork

    • ["backbone" "rpn"] — Freeze both the feature extraction and the region proposal subnetworks

    The weights of layers in frozen subnetworks do not change during training.
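    A call that freezes subnetworks might look like the following sketch. The argument name FreezeSubNetwork does not appear in this section and is an assumption here; check it against the function signature in your release. The variables trainingData, network, and options are assumed to exist already.

```matlab
% Freeze only the feature extraction subnetwork.
% FreezeSubNetwork is an assumed argument name; verify it in your release.
trainedDetector = trainMaskRCNN(trainingData,network,options, ...
    FreezeSubNetwork="backbone");

% Freeze both subnetworks by passing a string array instead.
trainedDetector = trainMaskRCNN(trainingData,network,options, ...
    FreezeSubNetwork=["backbone" "rpn"]);
```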

    Training experiment monitor, specified as an experiments.Monitor (Deep Learning Toolbox) object for use with the Experiment Manager (Deep Learning Toolbox) app. You can use this object to track the progress of training, update information fields in the training results table, record values of the metrics used by the training, and produce training plots.

    Information monitored during training:

    • Training loss at each iteration

    • Training accuracy at each iteration

    • Training root mean square error (RMSE) for the box regression layer

    • Training loss for the mask segmentation branch

    • Learning rate at each iteration

    Validation information when the training options input contains validation data:

    • Validation loss at each iteration

    • Validation accuracy at each iteration

    • Validation RMSE at each iteration

    • Validation loss for the mask segmentation branch

    Output Arguments


    Trained Mask R-CNN network (trainedDetector), returned as a maskrcnn object.

    Training progress information (info), returned as a structure. Each field corresponds to a stage of training.

    • TrainingLoss — Training loss at each iteration. The loss is the combination of the region proposal network (RPN), classification, regression, and mask losses used to train the Mask R-CNN network.

    • TrainingRPNLoss — Total RPN loss at the end of each iteration.

    • TrainingRMSE — Training root mean squared error (RMSE) for the box regression layer at the end of each iteration.

    • TrainingMaskLoss — Training cross-entropy loss for the mask segmentation branch at the end of each iteration.

    • LearnRate — Learning rate at each iteration.

    • ValidationLoss — Validation loss at each iteration.

    • ValidationRPNLoss — Validation RPN loss at each iteration.

    • ValidationRMSE — Validation RMSE at each iteration.

    • ValidationMaskLoss — Validation cross-entropy loss for the mask segmentation branch at each iteration.

    Each field is a numeric vector with one element per training iteration. Values that are not calculated at a specific iteration are assigned as NaN. The structure contains the ValidationLoss, ValidationRPNLoss, ValidationRMSE, and ValidationMaskLoss fields only when options specifies validation data.
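    The info structure described above can be inspected after training. The following is a minimal sketch, assuming that trainingData, network, and options already exist. It skips NaN entries when plotting validation loss and checks for the validation field before using it.

```matlab
[trainedDetector,info] = trainMaskRCNN(trainingData,network,options);

figure
plot(info.TrainingLoss)
hold on

% Validation fields exist only when options specifies validation data.
if isfield(info,"ValidationLoss")
    % Validation values are NaN at iterations where they are not calculated.
    iterations = find(~isnan(info.ValidationLoss));
    plot(iterations,info.ValidationLoss(iterations),"o-")
    legend("Training loss","Validation loss")
end

hold off
xlabel("Iteration")
ylabel("Loss")
grid on
```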


    Tips

    • The trainMaskRCNN function has a high GPU memory requirement. Train a Mask R-CNN network on a GPU with at least 12 GB of available memory.

    • To reduce the training memory consumption, try reducing the InputSize property of the network argument or the NumRegionsToSample name-value argument.

    • When you want to perform transfer learning on a data set with similar content to the COCO data set, freezing the feature extraction and region proposal subnetworks can help the network training converge faster.
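    The tips above can be combined as in the following sketch, which is one possible setup under stated assumptions rather than a recommended configuration. The class names and the input size are hypothetical, the FreezeSubNetwork argument name is an assumption, and trainingData and options are assumed to exist.

```matlab
% Hypothetical classes that also appear in the COCO data set.
classNames = ["person" "car"];

% Start from the pretrained network and use a reduced input size
% (illustrative value) to lower memory consumption.
network = maskrcnn("resnet50-coco",classNames,InputSize=[512 512 3]);

% Sample fewer regions per image and freeze the feature extraction and
% region proposal subnetworks (FreezeSubNetwork is an assumed name).
trainedDetector = trainMaskRCNN(trainingData,network,options, ...
    NumRegionsToSample=64, ...
    FreezeSubNetwork=["backbone" "rpn"]);
```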

    Version History

    Introduced in R2022a
