cvpartition with different selection of input features per fold

7 views (last 30 days)
I am currently utilizing cvpartition to perform k-fold cross validation.
A simplified code I am using looks like below:
% create cross validation object
cvo = cvpartition(y, 'KFold', 5); % 5-fold cross validation
% perform cross validation using logistic regression
logisticRegression = fitglm(x, y, 'CVPartition', cvo, 'ClassNames', classOrder);
The issue I have is that I now want to perform some feature selection only using the training fold of each cross validation split to prevent data leakage (using some statistics from the training data in order to select which features to use for model training).
Procedure (very rough pseudo code) for what I am describing above looks like below:
create cvo
for each fold f:
perform feature selection using training portion of the fold (using some stats calculated based on the training portion of the fold)
train the model with reduced set of features
evaluate (validate) the model using the validation portion of the fold
end
Is there a way achieve above (using different feature subsets for each fold) with cvo? I have already built a pretty extensive pipeline around using cvo, so I'd like to keep using it to prevent myself from re-doing most of the work.
Any help will be greatly appreciated. Thank you,

Answers (1)

Drew
Drew on 3 Sep 2024
The Classification Learner app will generate code that does exactly that. Here is an example using two classes from the Fisher Iris data, to demonstrate how to generate the code:
t=readtable("fisheriris.csv");
t100=t(51:150,:); % Select a subset of the data with only two classes
% Start Classification Learner
classificationLearner
In the app,
(1) Load t100, choose 5-fold cross-validation.
(2) Select "Binary GLM Logistic Regression" from the models gallery
(3) Choose Feature Selection, pick a ranking method, and choose "top 3 predictors". In the example here, the MRMR ranking algorithm (fscmrmr) was selected.
(4) Train the model
(5) Select "Generate Function" in the export section
The generated code is below. This code is exactly as generated by the Classification Learner app. In the resulting model training function, the validation accuracy is calculated as specified in the question. There is a loop over the validation folds, feature selection on training data in each fold, model building on the selected features for that fold, and evaluation of the model for each fold. Note that the generated code could be simplified if desired. The code is longer due to things like normalization, possible handling of categorical predictors and missing values, the use of a function handle for the predict function to handle feature selection, etc.
function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% Returns a trained classifier and its accuracy. This code recreates the
% classification model trained in Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
% Input:
% trainingData: A table containing the same predictor and response
% columns as those imported into the app.
%
%
% Output:
% trainedClassifier: A struct containing the trained classifier. The
% struct contains various fields with information about the trained
% classifier.
%
% trainedClassifier.predictFcn: A function to make predictions on new
% data.
%
% validationAccuracy: A double representing the validation accuracy as
% a percentage. In the app, the Models pane displays the validation
% accuracy for each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
% [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
% yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
% trainedClassifier.HowToPredict
% Auto-generated by MATLAB on 03-Sep-2024 18:02:04
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
predictors = inputTable(:, predictorNames);
response = inputTable.Species;
isCategoricalPredictor = [false, false, false, false];
classNames = {'versicolor'; 'virginica'};
% Feature Ranking and Selection
% Replace Inf/-Inf values with NaN to prepare data for normalization
predictors = standardizeMissing(predictors, {Inf, -Inf});
% Normalize data for feature ranking
predictorMatrix = normalize(predictors, "DataVariable", ~isCategoricalPredictor);
% Rank features using MRMR algorithm
featureIndex = fscmrmr(...
predictorMatrix, ...
response);
numFeaturesToKeep = 3;
includedPredictorNames = predictors.Properties.VariableNames(featureIndex(1:numFeaturesToKeep));
predictors = predictors(:,includedPredictorNames);
isCategoricalPredictor = isCategoricalPredictor(featureIndex(1:numFeaturesToKeep));
% Train a classifier
% This code specifies all the classifier options and trains the classifier.
% For logistic regression, the response values must be converted to zeros
% and ones because the responses are assumed to follow a binomial
% distribution.
% 1 or true = 'successful' class
% 0 or false = 'failure' class
% NaN - missing response.
successClass = 'versicolor';
failureClass = 'virginica';
% Compute the majority response class. If there is a NaN-prediction from
% fitglm, convert NaN to this majority class label.
numSuccess = sum(strcmp(strtrim(response), successClass));
numFailure = sum(strcmp(strtrim(response), failureClass));
if numSuccess > numFailure
missingClass = successClass;
else
missingClass = failureClass;
end
successFailureAndMissingClasses = {successClass; failureClass; missingClass};
isMissing = cellfun(@(x) isempty(strtrim(x)), response);
zeroOneResponse = double(strcmp(strtrim(response), successClass));
zeroOneResponse(isMissing) = NaN;
% Prepare input arguments to fitglm.
concatenatedPredictorsAndResponse = [predictors, table(zeroOneResponse)];
% Train using fitglm.
GeneralizedLinearModel = fitglm(...
concatenatedPredictorsAndResponse, ...
'Distribution', 'binomial', ...
'link', 'logit');
% Convert predicted probabilities to predicted class labels and scores.
convertSuccessProbsToPredictions = @(p) successFailureAndMissingClasses( ~isnan(p).*( (p<0.5) + 1 ) + isnan(p)*3 );
returnMultipleValuesFcn = @(varargin) varargin{1:max(1,nargout)};
scoresFcn = @(p) [p, 1-p];
predictionsAndScoresFcn = @(p) returnMultipleValuesFcn( convertSuccessProbsToPredictions(p), scoresFcn(p) );
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
featureSelectionFcn = @(x) x(:,includedPredictorNames);
logisticRegressionPredictFcn = @(x) predictionsAndScoresFcn( predict(GeneralizedLinearModel, x) );
trainedClassifier.predictFcn = @(x) logisticRegressionPredictFcn(featureSelectionFcn(predictorExtractionFcn(x)));
% Add additional fields to the result struct
trainedClassifier.RequiredVariables = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
trainedClassifier.GeneralizedLinearModel = GeneralizedLinearModel;
trainedClassifier.SuccessClass = successClass;
trainedClassifier.FailureClass = failureClass;
trainedClassifier.MissingClass = missingClass;
trainedClassifier.ClassNames = {successClass; failureClass};
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2024b.';
trainedClassifier.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appclassification_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
predictors = inputTable(:, predictorNames);
response = inputTable.Species;
isCategoricalPredictor = [false, false, false, false];
classNames = {'versicolor'; 'virginica'};
% Perform cross-validation
KFolds = 5;
cvp = cvpartition(response, 'KFold', KFolds);
% Initialize the predictions to the proper sizes
validationPredictions = response;
numObservations = size(predictors, 1);
numClasses = 2;
validationScores = NaN(numObservations, numClasses);
for fold = 1:KFolds
trainingPredictors = predictors(cvp.training(fold), :);
trainingResponse = response(cvp.training(fold), :);
foldIsCategoricalPredictor = isCategoricalPredictor;
% Feature Ranking and Selection
% Replace Inf/-Inf values with NaN to prepare data for normalization
trainingPredictors = standardizeMissing(trainingPredictors, {Inf, -Inf});
% Normalize data for feature ranking
predictorMatrix = normalize(trainingPredictors, "DataVariable", ~foldIsCategoricalPredictor);
% Rank features using MRMR algorithm
featureIndex = fscmrmr(...
predictorMatrix, ...
trainingResponse);
numFeaturesToKeep = 3;
includedPredictorNames = trainingPredictors.Properties.VariableNames(featureIndex(1:numFeaturesToKeep));
trainingPredictors = trainingPredictors(:,includedPredictorNames);
foldIsCategoricalPredictor = foldIsCategoricalPredictor(featureIndex(1:numFeaturesToKeep));
% Train a classifier
% This code specifies all the classifier options and trains the classifier.
% For logistic regression, the response values must be converted to zeros
% and ones because the responses are assumed to follow a binomial
% distribution.
% 1 or true = 'successful' class
% 0 or false = 'failure' class
% NaN - missing response.
successClass = 'versicolor';
failureClass = 'virginica';
% Compute the majority response class. If there is a NaN-prediction from
% fitglm, convert NaN to this majority class label.
numSuccess = sum(strcmp(strtrim(trainingResponse), successClass));
numFailure = sum(strcmp(strtrim(trainingResponse), failureClass));
if numSuccess > numFailure
missingClass = successClass;
else
missingClass = failureClass;
end
successFailureAndMissingClasses = {successClass; failureClass; missingClass};
isMissing = cellfun(@(x) isempty(strtrim(x)), trainingResponse);
zeroOneResponse = double(strcmp(strtrim(trainingResponse), successClass));
zeroOneResponse(isMissing) = NaN;
% Prepare input arguments to fitglm.
concatenatedPredictorsAndResponse = [trainingPredictors, table(zeroOneResponse)];
% Train using fitglm.
GeneralizedLinearModel = fitglm(...
concatenatedPredictorsAndResponse, ...
'Distribution', 'binomial', ...
'link', 'logit');
% Convert predicted probabilities to predicted class labels and scores.
convertSuccessProbsToPredictions = @(p) successFailureAndMissingClasses( ~isnan(p).*( (p<0.5) + 1 ) + isnan(p)*3 );
returnMultipleValuesFcn = @(varargin) varargin{1:max(1,nargout)};
scoresFcn = @(p) [p, 1-p];
predictionsAndScoresFcn = @(p) returnMultipleValuesFcn( convertSuccessProbsToPredictions(p), scoresFcn(p) );
% Create the result struct with predict function
featureSelectionFcn = @(x) x(:,includedPredictorNames);
logisticRegressionPredictFcn = @(x) predictionsAndScoresFcn( predict(GeneralizedLinearModel, x) );
validationPredictFcn = @(x) logisticRegressionPredictFcn(featureSelectionFcn(x));
% Add additional fields to the result struct
% Compute validation predictions
validationPredictors = predictors(cvp.test(fold), :);
[foldPredictions, foldScores] = validationPredictFcn(validationPredictors);
% Store predictions in the original order
validationPredictions(cvp.test(fold), :) = foldPredictions;
validationScores(cvp.test(fold), :) = foldScores;
end
% Compute validation accuracy
correctPredictions = strcmp( strtrim(validationPredictions), strtrim(response));
isMissing = cellfun(@(x) all(isspace(x)), response, 'UniformOutput', true);
correctPredictions = correctPredictions(~isMissing);
validationAccuracy = sum(correctPredictions)/length(correctPredictions);
If this answer helps you, please remember to accept the answer.

Products


Release

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!