RegressionPartitionedModel
Cross-validated regression model
Description
RegressionPartitionedModel is a set of regression models
trained on cross-validated folds. You can estimate the quality of the regression by
using one or more kfold functions: kfoldPredict, kfoldLoss, and kfoldfun.
Each kfold function uses models trained on training-fold (in-fold)
observations to predict the response for validation-fold (out-of-fold) observations. For
example, when you use kfoldPredict with a
k-fold cross-validated model, the software estimates a response for
every observation using the model trained without that observation. For more
information, see Partitioned Models.
Creation
You can create a RegressionPartitionedModel object in two ways:
Create a cross-validated model from a RegressionTree model object by using the crossval object function.
Create a cross-validated model by using the fitrtree function and specifying one of the name-value arguments CrossVal, CVPartition, Holdout, KFold, or Leaveout.
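As a sketch, both approaches look like this, using the carsmall sample data that appears in the examples below:

```matlab
load carsmall
X = [Horsepower Weight];

% Way 1: train a full model, then cross-validate it with crossval.
Mdl = fitrtree(X,MPG);
CVMdl1 = crossval(Mdl);           % 10-fold by default

% Way 2: cross-validate directly during training with a name-value argument.
CVMdl2 = fitrtree(X,MPG,KFold=5); % 5-fold cross-validation
```

Both CVMdl1 and CVMdl2 are RegressionPartitionedModel objects.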
Properties
Cross-Validation Properties
CrossValidatedModel
This property is read-only.
Name of the cross-validated model, returned as a character vector.
Data Types: char
KFold
This property is read-only.
Number of folds in the cross-validated model, returned as a positive integer.
Data Types: double
ModelParameters
This property is read-only.
Parameters of the cross-validated model, returned as an object.
Partition
This property is read-only.
Partition used in the cross-validation, returned as a cvpartition object.
Trained
This property is read-only.
Trained learners, returned as a cell array of compact regression models. For more information, see Partitioned Models.
Data Types: cell
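For example, you can inspect the compact model trained on one fold (this sketch uses the carsmall sample data; the variable names are illustrative):

```matlab
load carsmall
CVMdl = fitrtree([Horsepower Weight],MPG,KFold=5);
foldModel = CVMdl.Trained{1};  % compact model trained without fold 1
class(foldModel)               % for a tree learner, a CompactRegressionTree
```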
Other Regression Properties
BinEdges
This property is read-only.
Bin edges for numeric predictors, returned as a cell array of p numeric vectors, where p is the number of predictors. Each vector includes the bin edges for a numeric predictor. The element in the cell array for a categorical predictor is empty because the software does not bin categorical predictors.
The software bins numeric predictors only if you specify the NumBins
name-value argument as a positive integer scalar when training a model with tree learners.
The BinEdges property is empty if the NumBins value
is empty (default).
You can reproduce the binned predictor data Xbinned by using the
BinEdges property of the trained model
mdl.
X = mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x)
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]);
    Xbinned(:,j) = xbinned;
end
Xbinned contains the bin indices, ranging from 1 to the number of bins, for the numeric predictors. Xbinned values are 0 for categorical predictors. If X contains NaNs, then the corresponding Xbinned values are NaNs.
Data Types: cell
CategoricalPredictors
This property is read-only.
Categorical predictor
indices, returned as a vector of positive integers. CategoricalPredictors
contains index values indicating that the corresponding predictors are categorical. The index
values are between 1 and p, where p is the number of
predictors used to train the model. If none of the predictors are categorical, then this
property is empty ([]).
Data Types: single | double
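As an illustration, assuming a predictor matrix X whose sixth column holds categorical codes and a response vector Y (both hypothetical names), you can flag the categorical column when training; with table input, fitrtree detects categorical variables automatically:

```matlab
% With matrix input, flag categorical columns explicitly.
CVMdl = fitrtree(X,Y,KFold=5,CategoricalPredictors=6);
CVMdl.CategoricalPredictors  % 6
```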
NumObservations
This property is read-only.
Number of observations in the training data, returned as a positive integer.
NumObservations can be less than the number of rows of input data
when there are missing values in the input data or response data.
Data Types: double
PredictorNames
This property is read-only.
Predictor names in order of their appearance in the predictor data
X, returned as a cell array of
character vectors. The length of
PredictorNames is equal to the
number of columns in X.
Data Types: cell
ResponseName
This property is read-only.
Name of the response variable, returned as a character vector.
Data Types: char
ResponseTransform
Function for transforming the predicted response values, specified as
"none" or a function handle. "none" means no
transformation; equivalently, "none" means @(x)x.
A function handle must accept a matrix of response values and return a matrix of the
same size.
To change the function for transforming the predicted response values, use dot
notation. For example, for a model Mdl and a function
myfunction that you define, you can specify:
Mdl.ResponseTransform = @myfunction;
Data Types: char | string | function_handle
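For example, if you trained the model on a log-transformed response, you could undo the transformation at prediction time. The log-response setup below is hypothetical; the point is that kfoldPredict applies ResponseTransform to its output:

```matlab
load carsmall
% Train on the log of the response (hypothetical modeling choice).
CVMdl = fitrtree([Horsepower Weight],log(MPG),KFold=5);
% Map predictions back to the original MPG scale.
CVMdl.ResponseTransform = @(y) exp(y);
yhat = kfoldPredict(CVMdl);  % predictions on the original scale
```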
W
This property is read-only.
Scaled observation weights, returned as a numeric vector. W has length n, the number of rows in the training data. The sum of the elements of W is 1.
Data Types: double
X
This property is read-only.
Predictor values, returned as a real matrix or table. Each column of
X represents one variable (predictor), and each row represents
one observation.
Data Types: double | table
Y
This property is read-only.
Response data, returned as a numeric column vector with the same number of rows as
X. Each entry in Y is the response to the
data in the corresponding row of X.
Data Types: double
Object Functions
gather | Gather properties of Statistics and Machine Learning Toolbox object from GPU
kfoldLoss | Loss for cross-validated partitioned regression model
kfoldPredict | Predict responses for observations in cross-validated regression model
kfoldfun | Cross-validate function for regression
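As an illustration of kfoldfun, the least self-explanatory of these functions, the following sketch computes a per-fold mean absolute error. The MAE metric is our choice for the example, not something kfoldfun requires; the function hands each fold's compact model and its training and validation data to the supplied function handle:

```matlab
load carsmall
CVMdl = fitrtree([Horsepower Weight],MPG,KFold=5);
% fun receives (CMP,Xtrain,Ytrain,Wtrain,Xtest,Ytest,Wtest) per fold.
foldMAE = kfoldfun(CVMdl, ...
    @(CMP,Xtrain,Ytrain,Wtrain,Xtest,Ytest,Wtest) ...
        mean(abs(Ytest - predict(CMP,Xtest)),"omitnan"))
```

foldMAE is a 5-by-1 vector with one value per validation fold.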
Examples
Load the sample data. Create a variable X containing the Horsepower and Weight data.
load carsmall
X = [Horsepower Weight];
Create a cross-validated regression tree using the sample data.
cvtree = fitrtree(X,MPG,Crossval="on");
Evaluate the cross-validation error of the carsmall data using Horsepower and Weight as predictor variables for mileage (MPG).
L = kfoldLoss(cvtree)
L = 25.5338
Compute the loss and the predictions for a regression model, first partitioned using holdout validation and then partitioned using 5-fold cross-validation. Compare the two sets of losses and predictions.
Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Acceleration, Cylinders, Displacement, and so on, as well as the response variable MPG. View the first eight observations.
load carbig
cars = table(Acceleration,Cylinders,Displacement, ...
    Horsepower,Model_Year,Origin,Weight,MPG);
head(cars)
Acceleration Cylinders Displacement Horsepower Model_Year Origin Weight MPG
____________ _________ ____________ __________ __________ _______ ______ ___
12 8 307 130 70 USA 3504 18
11.5 8 350 165 70 USA 3693 15
11 8 318 150 70 USA 3436 18
12 8 304 150 70 USA 3433 16
10.5 8 302 140 70 USA 3449 17
10 8 429 198 70 USA 4341 15
9 8 454 220 70 USA 4354 14
8.5 8 440 215 70 USA 4312 14
Remove rows of cars where the table has missing values.
cars = rmmissing(cars);
Categorize the cars based on whether they were made in the USA.
cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan", ...
    "Germany","Sweden","Italy","England"],"NotUSA");
Partition the data using cvpartition. First, create a partition for holdout validation, using approximately 80% of the observations for the training data and 20% for the validation data. Then, create a partition for 5-fold cross-validation.
rng(0,"twister") % For reproducibility
holdoutPartition = cvpartition(height(cars),Holdout=0.20);
kfoldPartition = cvpartition(height(cars),KFold=5);
holdoutPartition and kfoldPartition are both nonstratified random partitions. You can use the training and test functions of the partition objects to find the indices for the observations in the training and validation sets, respectively.
Train a regression tree model using the cars data. Specify MPG as the response variable.
Mdl = fitrtree(cars,"MPG");
Create the partitioned regression models using crossval.
holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition)
holdoutMdl =
RegressionPartitionedModel
CrossValidatedModel: 'Tree'
PredictorNames: {'Acceleration' 'Cylinders' 'Displacement' 'Horsepower' 'Model_Year' 'Origin' 'Weight'}
CategoricalPredictors: 6
ResponseName: 'MPG'
NumObservations: 392
KFold: 1
Partition: [1×1 cvpartition]
ResponseTransform: 'none'
Properties, Methods
kfoldMdl = crossval(Mdl,CVPartition=kfoldPartition)
kfoldMdl =
RegressionPartitionedModel
CrossValidatedModel: 'Tree'
PredictorNames: {'Acceleration' 'Cylinders' 'Displacement' 'Horsepower' 'Model_Year' 'Origin' 'Weight'}
CategoricalPredictors: 6
ResponseName: 'MPG'
NumObservations: 392
KFold: 5
Partition: [1×1 cvpartition]
ResponseTransform: 'none'
Properties, Methods
holdoutMdl and kfoldMdl are RegressionPartitionedModel objects.
Compute the mean squared error (MSE) for holdoutMdl and kfoldMdl using kfoldLoss.
holdoutL = kfoldLoss(holdoutMdl)
holdoutL = 10.3015
kfoldL = kfoldLoss(kfoldMdl)
kfoldL = 14.3603
holdoutL is the MSE computed using the predictions for one validation set, while kfoldL is an average MSE computed using the predictions for five folds of validation data. Cross-validation metrics tend to be better indicators of a model's performance on unseen data.
Compute the validation data predictions for the two models using kfoldPredict.
holdoutPredictions = kfoldPredict(holdoutMdl);
kfoldPredictions = kfoldPredict(kfoldMdl);
predictions = table(holdoutPredictions,kfoldPredictions, ...
    VariableNames=["holdoutMdl","kfoldMdl"])
predictions=392×2 table
holdoutMdl kfoldMdl
__________ ________
NaN 17
NaN 13.429
NaN 15.444
16 16.333
NaN 17.333
NaN 13.444
NaN 12.286
NaN 14.714
10 10.333
NaN 14.833
NaN 15.444
NaN 15.444
NaN 14.222
NaN 15.444
NaN 22.667
NaN 22.7
⋮
kfoldPredict returns NaN values for the observations used to train the model in holdoutMdl.Trained. The function uses the trained model to return predictions for the validation set observations. Similarly, kfoldPredict returns each kfoldMdl prediction using the model in kfoldMdl.Trained that was trained without that observation.
To predict responses for unseen data, use the model trained on the entire data set (Mdl) and its predict function rather than a partitioned model such as holdoutMdl or kfoldMdl.
Algorithms
You can create partitioned models by using k-fold cross-validation, holdout validation, leave-one-out cross-validation, or resubstitution.
k-fold cross-validation — The software divides the observations into KFold disjoint folds, each of which has approximately the same number of observations. The software trains KFold models (Trained), and each model is trained on KFold – 1 of the folds. When you use kfoldPredict, each model predicts the response values for the remaining fold.
Holdout validation — The software partitions the observations into a training set and a validation set. The software trains one model (Trained) using the training set. When you use kfoldPredict, the model predicts the response values for the validation set.
Leave-one-out cross-validation — The software creates NumObservations folds, where each observation is a fold. The software trains NumObservations models (Trained), and each model is trained on NumObservations – 1 of the folds. When you use kfoldPredict, each model predicts the response for the remaining fold (observation).
Resubstitution — The software does not partition the data. The software trains one model (Trained) on the entire data set. When you use kfoldPredict, the model predicts the response values for all observations.
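The first three schemes map onto fitrtree name-value arguments as sketched below, again using the carsmall sample data; resubstitution is instead set up through a cvpartition object:

```matlab
load carsmall
X = [Horsepower Weight];

CVMdl = fitrtree(X,MPG,KFold=5);        % k-fold cross-validation
CVMdl = fitrtree(X,MPG,Holdout=0.2);    % holdout validation
CVMdl = fitrtree(X,MPG,Leaveout="on");  % leave-one-out cross-validation
CVMdl = fitrtree(X,MPG,CrossVal="on");  % default scheme: 10-fold
```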
Extended Capabilities
Usage notes and limitations:
RegressionPartitionedModel can be a cross-validated regression tree trained by using fitrtree with GPU array input arguments.
The object functions of a RegressionPartitionedModel model fully support GPU arrays.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History
Introduced in R2011a