Main Content

RegressionPartitionedModel

Cross-validated regression model

Description

RegressionPartitionedModel is a set of regression models trained on cross-validated folds. You can estimate the quality of the regression by using one or more kfold functions: kfoldPredict, kfoldLoss, and kfoldfun.

Each kfold function uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, when you use kfoldPredict with a k-fold cross-validated model, the software estimates a response for every observation using the model trained without that observation. For more information, see Partitioned Models.

Creation

You can create a RegressionPartitionedModel object in two ways:

  • Create a cross-validated model from a RegressionTree model object by using the crossval object function.

  • Create a cross-validated model by using the fitrtree function and specifying one of the name-value arguments CrossVal, CVPartition, Holdout, KFold, or Leaveout.
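Both routes produce equivalent objects. As a sketch using the carsmall sample data (variable names are illustrative):

```matlab
% Two ways to obtain a RegressionPartitionedModel.
load carsmall
X = [Horsepower Weight];

% Route 1: train a full model, then cross-validate it with crossval.
mdl = fitrtree(X,MPG);
cvMdl1 = crossval(mdl,KFold=10);

% Route 2: request cross-validation directly when fitting.
cvMdl2 = fitrtree(X,MPG,KFold=10);
```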

Properties

Cross-Validation Properties

CrossValidatedModel

This property is read-only.

Name of the cross-validated model, returned as a character vector.

Data Types: char

KFold

This property is read-only.

Number of folds in the cross-validated model, returned as a positive integer.

Data Types: double

ModelParameters

This property is read-only.

Parameters used to train the cross-validated model, returned as an object.

Partition

This property is read-only.

Partition used in the cross-validation, returned as a cvpartition object.

Trained

This property is read-only.

Trained learners, returned as a cell array of compact regression models. For more information, see Partitioned Models.

Data Types: cell
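For example, you can extract an individual fold model from Trained and call its predict function directly. The following sketch assumes a cross-validated tree cvMdl trained on the carsmall data; the variable names are illustrative:

```matlab
% Sketch: inspect and use the model trained on fold 1.
load carsmall
cvMdl = fitrtree([Horsepower Weight],MPG,KFold=5);
foldMdl = cvMdl.Trained{1};         % CompactRegressionTree for fold 1
yhat = predict(foldMdl,[150 3000])  % predict with the fold-1 model only
```

Each element of Trained is a compact model, so it supports predict but does not store the training data.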

Other Regression Properties

BinEdges

This property is read-only.

Bin edges for numeric predictors, returned as a cell array of p numeric vectors, where p is the number of predictors. Each vector includes the bin edges for a numeric predictor. The element in the cell array for a categorical predictor is empty because the software does not bin categorical predictors.

The software bins numeric predictors only if you specify the NumBins name-value argument as a positive integer scalar when training a model with tree learners. The BinEdges property is empty if the NumBins value is empty (default).

You can reproduce the binned predictor data Xbinned by using the BinEdges property of the trained model mdl.

X = mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric 
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x) 
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]); 
    Xbinned(:,j) = xbinned;
end
Xbinned contains the bin indices, ranging from 1 to the number of bins, for the numeric predictors. Xbinned values are 0 for categorical predictors. If X contains NaNs, then the corresponding Xbinned values are NaNs.

Data Types: cell

CategoricalPredictors

This property is read-only.

Categorical predictor indices, returned as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

Data Types: single | double
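For example, you can combine CategoricalPredictors with PredictorNames to list the names of the categorical predictors. This sketch builds a small table from the carsmall data; the table and variable names are illustrative:

```matlab
% Sketch: map categorical predictor indices to predictor names.
load carsmall
tbl = table(Horsepower,Weight,categorical(cellstr(Origin)),MPG, ...
    VariableNames=["Horsepower" "Weight" "Origin" "MPG"]);
cvMdl = fitrtree(tbl,"MPG",KFold=5);
catNames = cvMdl.PredictorNames(cvMdl.CategoricalPredictors)
```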

NumObservations

This property is read-only.

Number of observations in the training data, returned as a positive integer. NumObservations can be less than the number of rows of input data when there are missing values in the input data or response data.

Data Types: double

PredictorNames

This property is read-only.

Predictor names in order of their appearance in the predictor data X, returned as a cell array of character vectors. The length of PredictorNames is equal to the number of columns in X.

Data Types: cell

ResponseName

This property is read-only.

Name of the response variable, returned as a character vector.

Data Types: char

ResponseTransform

Function for transforming the predicted response values, specified as "none" or a function handle. "none" means no transformation; equivalently, "none" means @(x)x. A function handle must accept a matrix of response values and return a matrix of the same size.

To change the function for transforming the predicted response values, use dot notation. For example, for a model Mdl and a function myfunction that you define, you can specify:

Mdl.ResponseTransform = @myfunction;

Data Types: char | string | function_handle
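As a minimal sketch, you can assign an anonymous function instead of a named one; the scaling factor here is purely illustrative:

```matlab
% Sketch: transform every predicted response with an anonymous function.
cvMdl.ResponseTransform = @(y) 2*y;  % doubles each predicted response
```

After this assignment, the kfold functions apply the transformation to the values they return.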

W

This property is read-only.

Scaled observation weights, returned as a numeric vector. W has length n, the number of rows in the training data. The sum of the elements of W is 1.

Data Types: double

X

This property is read-only.

Predictor values, returned as a real matrix or table. Each column of X represents one variable (predictor), and each row represents one observation.

Data Types: double | table

Y

This property is read-only.

Response data, returned as a numeric column vector with the same number of rows as X. Each entry in Y is the response to the data in the corresponding row of X.

Data Types: double

Object Functions

gather          Gather properties of Statistics and Machine Learning Toolbox object from GPU
kfoldLoss       Loss for cross-validated partitioned regression model
kfoldPredict    Predict responses for observations in cross-validated regression model
kfoldfun        Cross-validate function for regression

Examples


Load the sample data. Create a variable X containing the Horsepower and Weight data.

load carsmall
X = [Horsepower Weight];

Create a regression tree using the sample data.

cvtree = fitrtree(X,MPG,CrossVal="on");

Evaluate the cross-validation error of the carsmall data using Horsepower and Weight as predictor variables for mileage (MPG).

L = kfoldLoss(cvtree)
L = 
25.5338

Compute the loss and the predictions for a regression model, first partitioned using holdout validation and then partitioned using 5-fold cross-validation. Compare the two sets of losses and predictions.

Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Acceleration, Cylinders, Displacement, and so on, as well as the response variable MPG. View the first eight observations.

load carbig
cars = table(Acceleration,Cylinders,Displacement, ...
    Horsepower,Model_Year,Origin,Weight,MPG);
head(cars)
    Acceleration    Cylinders    Displacement    Horsepower    Model_Year    Origin     Weight    MPG
    ____________    _________    ____________    __________    __________    _______    ______    ___

          12            8            307            130            70        USA         3504     18 
        11.5            8            350            165            70        USA         3693     15 
          11            8            318            150            70        USA         3436     18 
          12            8            304            150            70        USA         3433     16 
        10.5            8            302            140            70        USA         3449     17 
          10            8            429            198            70        USA         4341     15 
           9            8            454            220            70        USA         4354     14 
         8.5            8            440            215            70        USA         4312     14 

Remove rows of cars where the table has missing values.

cars = rmmissing(cars);

Categorize the cars based on whether they were made in the USA.

cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan",...
    "Germany","Sweden","Italy","England"],"NotUSA");

Partition the data using cvpartition. First, create a partition for holdout validation, using approximately 80% of the observations for the training data and 20% for the validation data. Then, create a partition for 5-fold cross-validation.

rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(height(cars),Holdout=0.20);
kfoldPartition = cvpartition(height(cars),KFold=5);

holdoutPartition and kfoldPartition are both nonstratified random partitions. You can use the training and test functions of the partition objects to find the indices for the observations in the training and validation sets, respectively.
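For example, the following sketch retrieves the logical index vectors for the holdout partition created above:

```matlab
% Sketch: logical masks for the holdout training and validation sets.
idxTrain = training(holdoutPartition);  % true for training observations
idxTest = test(holdoutPartition);       % true for validation observations
sum(idxTest)/numel(idxTest)             % fraction held out, about 0.20
```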

Train a regression tree model using the cars data. Specify MPG as the response variable.

Mdl = fitrtree(cars,"MPG");

Create the partitioned regression models using crossval.

holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition)
holdoutMdl = 
  RegressionPartitionedModel
      CrossValidatedModel: 'Tree'
           PredictorNames: {'Acceleration'  'Cylinders'  'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
    CategoricalPredictors: 6
             ResponseName: 'MPG'
          NumObservations: 392
                    KFold: 1
                Partition: [1×1 cvpartition]
        ResponseTransform: 'none'

kfoldMdl = crossval(Mdl,CVPartition=kfoldPartition)
kfoldMdl = 
  RegressionPartitionedModel
      CrossValidatedModel: 'Tree'
           PredictorNames: {'Acceleration'  'Cylinders'  'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
    CategoricalPredictors: 6
             ResponseName: 'MPG'
          NumObservations: 392
                    KFold: 5
                Partition: [1×1 cvpartition]
        ResponseTransform: 'none'

holdoutMdl and kfoldMdl are RegressionPartitionedModel objects.

Compute the mean squared error (MSE) for holdoutMdl and kfoldMdl using kfoldLoss.

holdoutL = kfoldLoss(holdoutMdl)
holdoutL = 
10.3015
kfoldL = kfoldLoss(kfoldMdl)
kfoldL = 
14.3603

holdoutL is the MSE computed using the predictions for one validation set, while kfoldL is an average MSE computed using the predictions for five folds of validation data. Cross-validation metrics tend to be better indicators of a model's performance on unseen data.

Compute the validation data predictions for the two models using kfoldPredict.

holdoutPredictions = kfoldPredict(holdoutMdl);
kfoldPredictions = kfoldPredict(kfoldMdl);
predictions = table(holdoutPredictions,kfoldPredictions, ...
    VariableNames=["holdoutMdl","kfoldMdl"])
predictions=392×2 table
    holdoutMdl    kfoldMdl
    __________    ________

       NaN             17 
       NaN         13.429 
       NaN         15.444 
        16         16.333 
       NaN         17.333 
       NaN         13.444 
       NaN         12.286 
       NaN         14.714 
        10         10.333 
       NaN         14.833 
       NaN         15.444 
       NaN         15.444 
       NaN         14.222 
       NaN         15.444 
       NaN         22.667 
       NaN           22.7 
      ⋮

kfoldPredict returns NaN values for the observations used to train holdoutMdl.Trained. The function uses the trained model to return predictions for the validation set observations. Similarly, kfoldPredict returns each kfoldMdl prediction using the model in kfoldMdl.Trained that was trained without that observation.

To predict responses for unseen data, use the model trained on the entire data set (Mdl) and its predict function rather than a partitioned model such as holdoutMdl or kfoldMdl.
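As a sketch, continuing with the variables from this example:

```matlab
% Sketch: predict MPG for new data with the full-data model, not a
% partitioned model. Here an existing row stands in for new data.
newCar = cars(1,:);
predictedMPG = predict(Mdl,newCar)
```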

Algorithms


Extended Capabilities


Version History

Introduced in R2011a