kFoldLoss output is different from R2018b to R2021b
I have some code that uses fitctree and a test suite that runs the code with test data and checks the output.
I'm updating from MATLAB R2018b to R2021b and everything passes except for one test. When I set a random seed of 1 the two versions give identical outputs (to within 1e-14), but with a seed of 2 the kFoldLoss result is significantly different (0.6449 vs 0.6308).
I set the random seed with
rng(seed, 'twister');
then the code to train the model and get the kFoldLoss (adapted from the code generated in the Classification Learner app) is
% Train a classifier
% This code specifies all the classifier options and trains the classifier.
classificationTree = fitctree(...
    predictors, ...
    response, ...
    'SplitCriterion', 'gdi', ...
    'MaxNumSplits', 100, ...
    'Surrogate', 'off', ...
    'MinLeafSize', obj.MinLeafSize, ...
    'ClassNames', obj.classes, ...
    'Cost', obj.misclassificationCost, ...
    'Weights', inputTable.Properties.UserData.instanceWeights);
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
treePredictFcn = @(x) predict(classificationTree, x);
trainedClassifier.predictFcn = @(x) treePredictFcn(predictorExtractionFcn(x));
% Add additional fields to the result struct
trainedClassifier.RequiredVariables = predictorNames;
trainedClassifier.ClassificationTree = classificationTree;
% duplicate of 'Extract predictors and response' section deleted
% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationTree, ...
    'KFold', obj.crossValidationFolds);
% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
I've looked through the release notes and can't see anything obvious that would cause the difference. I've tried adding the 'Reproducible', true argument to fitctree in R2021b with no effect. The actual classification trees generated are the same. Both versions are running under the same version of Windows 10, with R2018b on my laptop and R2021b in a virtual machine under VirtualBox on a different PC. I'm not using any parallel execution.
Is there anything I've missed? Any ideas please?
Answers (2)
Drew
on 22 Nov 2022
Differences in kfoldLoss are generally caused by differences in the k-fold partition, which produce different k-fold models because each fold is trained on different data. When the seed changes, the k-fold partition is expected to change. When the machine changes, with the same seed, the k-fold partition may differ; when the MATLAB version changes, with the same seed, the partition may also differ (apparently less often). There is no single correct k-fold partition. Statistically, testing a variety of k-fold partitions can give a more accurate estimate of expected accuracy on unseen data.
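A minimal illustration of this point, using cvpartition directly (with an arbitrary 150 observations and 4 folds, both just example values): the partition is determined entirely by the RNG state at the time it is created.

```matlab
% Sketch: the k-fold partition tracks the RNG seed within one session.
rng(1, 'twister');
c1 = cvpartition(150, 'KFold', 4);   % 150 observations, 4 folds
rng(2, 'twister');
c2 = cvpartition(150, 'KFold', 4);
rng(1, 'twister');
c3 = cvpartition(150, 'KFold', 4);
isequal(test(c1,1), test(c2,1))   % typically false: different seeds
isequal(test(c1,1), test(c3,1))   % true: same seed, same session
```

Across machines or releases, even the same seed does not guarantee the same partition, which is what the experiment code below demonstrates.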
The code below is based on the code you provided, but it runs experiments with seeds 1 through 20, two experiments per seed (two with seed = 1, then two with seed = 2, and so on). It also computes the loss of the full model on the full training set (which is not a good predictor of accuracy on future unseen data) and records the depth and number of leaves of the full model, to show that these quantities are stable as the seed changes.
As a solution for your goal of getting the same kfoldLoss even when the machine or release changes, you could take explicit control over the k-fold partition by passing a cvpartition object to the crossval method on the ClassificationTree object, via the "CVPartition" Name-Value pair. See https://www.mathworks.com/help/stats/classificationtree.crossval.html
That is, you could save the cvpartition object from an earlier MATLAB release (and/or a different machine), load it in a later release (and/or on a different machine), and then pass that cvpartition object to the crossval method.
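A sketch of that workflow, adapted to your variable names (the file name cvp_seed2.mat is hypothetical, and the stratified cvpartition call assumes your response variable is available):

```matlab
% --- On the earlier release / machine: create and save the partition ---
rng(2, 'twister');
cvp = cvpartition(response, 'KFold', obj.crossValidationFolds); % stratified by class
save('cvp_seed2.mat', 'cvp');   % hypothetical file name

% --- On the later release / machine: load and reuse the same partition ---
load('cvp_seed2.mat', 'cvp');
partitionedModel = crossval(trainedClassifier.ClassificationTree, ...
    'CVPartition', cvp);        % instead of 'KFold'
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
```

With the partition pinned this way, any remaining difference in kfoldLoss between releases would have to come from the tree-fitting itself rather than from the fold assignment.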
% Demonstration of different k-fold partitions
load('fisheriris.mat');
iris = array2table(meas);
% n is the number of observations
% d is the number of predictors
[n,d]=size(iris);
num_e=40; % Number of experiments
k = 4; % number of folds
%seed=[1:num_e]'; % If running one experiment for each seed
% Two experiments with each seed
seed = reshape( repmat( [1:num_e/2], 2,1 ), 1, [] )'; % 1 1 2 2 3 3 ...
[v, dateofrelease] = version; % Get the MATLAB version and release date
machine = "machinename";      % Identifier for the machine running the experiments
Toolkit = repmat(v, num_e, 1);
Machine = repmat(machine, num_e, 1);
% Create an experiment table to hold results
et=table(seed,Toolkit,Machine);
cvtestmatrix=zeros(k,n,num_e); % This is used to compare k-fold partitions
for e = 1:num_e
    % Build classification tree using iris data
    rng(et.seed(e), 'twister')
    classificationTree = fitctree(...
        iris, species, 'SplitCriterion', 'gdi', ...
        'MaxNumSplits', 4, 'Surrogate', 'off', ...
        'ClassNames', {'setosa'; 'versicolor'; 'virginica'});
    et.depth(e)  = treedepth(classificationTree);             % Depth of the tree
    et.leaves(e) = sum(classificationTree.IsBranchNode == 0); % Number of leaves
    et.Mdl(e) = {classificationTree}; % Store the model in the experiment table
    %view(classificationTree,'mode','graph'); % View the tree
    % Perform cross-validation
    pModel = crossval(classificationTree, 'KFold', k);
    et.cvpartition(e) = {pModel.Partition};
    % Determine whether this cvpartition is the same as the previous one
    for kIdx = 1:k
        % Create a k-by-n matrix in which row kIdx indicates the observations
        % in test fold kIdx, using the "test" method of the cvpartition object.
        % There is one such matrix for each experiment e.
        cvtestmatrix(kIdx,:,e) = test(et.cvpartition{e}, kIdx);
    end
    if (e > 1)
        % Compare the previous and current cvtestmatrices by multiplying the
        % previous matrix by the transpose of the current one. If the
        % partitions are the same, the resulting k-by-k matrix is non-zero
        % only on the diagonal.
        et.testdiff(e) = {cvtestmatrix(:,:,e-1)*cvtestmatrix(:,:,e)'};
        % If the sum along the diagonal equals n, the current and previous
        % partitions are the same, so set testdiffbool to true; otherwise false.
        et.testdiffbool(e) = (sum(diag(et.testdiff{e})) == n);
    end
    % Compute validation predictions and classification error
    [validationPredictions, validationScores] = kfoldPredict(pModel);
    classifError = kfoldLoss(pModel, 'LossFun', 'ClassifError');
    et.kfoldLoss(e) = classifError; % Store k-fold loss in the experiment table
    % Compute loss over the full training set, for the model trained on all data
    et.loss(e) = loss(classificationTree, iris, species);
end
% To write results to file:
filenamecsv=sprintf("Ver %s %s.csv",v,machine);
writetable(et,filenamecsv)
% To view results
et
% view the differences in k-fold partitions
% Experiment 2 has the same k-fold partition as Experiment 1
et.testdiff{2}
% Experiment 3 has different k-fold partition than Experiment 2
et.testdiff{3}
% helper function for observing tree depth
function depth = treedepth(tree)
parent = tree.Parent;
depth = 0;
node = parent(end);
while node~=0
depth = depth + 1;
node = parent(node);
end
end