kfoldLoss output is different from R2018b to R2021b

I have some code that uses fitctree, and a test suite that runs the code with test data and checks the output.
I'm updating from MATLAB R2018b to R2021b and everything passes except one test. With a random seed of 1 the two versions give identical outputs (to within 1e-14), but with a seed of 2 the kfoldLoss result is significantly different (0.6449 vs 0.6308).
I set the random seed with
rng(seed, 'twister');
and then the code that trains the model and computes the cross-validated loss (adapted from code generated by the Classification Learner app) is:
% Train a classifier
% This code specifies all the classifier options and trains the classifier.
classificationTree = fitctree( ...
    predictors, ...
    response, ...
    'SplitCriterion', 'gdi', ...
    'MaxNumSplits', 100, ...
    'Surrogate', 'off', ...
    'MinLeafSize', obj.MinLeafSize, ...
    'ClassNames', obj.classes, ...
    'Cost', obj.misclassificationCost, ...
    'Weights', inputTable.Properties.UserData.instanceWeights);
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
treePredictFcn = @(x) predict(classificationTree, x);
trainedClassifier.predictFcn = @(x) treePredictFcn(predictorExtractionFcn(x));
% Add additional fields to the result struct
trainedClassifier.RequiredVariables = predictorNames;
trainedClassifier.ClassificationTree = classificationTree;
% duplicate of 'Extract predictors and response' section deleted
% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationTree, ...
    'KFold', obj.crossValidationFolds);
% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
I've looked through the release notes and can't see anything obvious that would cause the difference. I've tried adding the 'Reproducible', true argument to fitctree in R2021b, with no effect. The classification trees generated are identical. Both versions run under the same version of Windows 10, with R2018b on my laptop and R2021b in a virtual machine under VirtualBox on a different PC. I'm not using any parallel execution.
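For reference, the R2021b attempt was along these lines (a sketch, with most of the arguments from the call above elided):
% Sketch: the fitctree call with the Reproducible flag added; other
% arguments as in the call above (this made no difference to the result)
classificationTree = fitctree( ...
    predictors, response, ...
    'SplitCriterion', 'gdi', ...
    'MaxNumSplits', 100, ...
    'Reproducible', true);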
Is there anything I've missed? Any ideas please?
  5 Comments
Tom Hawkins on 21 Oct 2022
I realised I have a snapshot of the VM with R2018b on it so that was easier to test than I thought. R2018b on the VM behaves identically to R2018b on the laptop.
Verifying the performance of R2021b in the VM was part of the process of confirming it's OK to upgrade to R2021b on the laptop, where I need the code and environment to be production-ready, so I was holding off doing that. Is it guaranteed that installing the newer version won't affect anything in the older version - are the maths libraries private to each version, for example?
Ultimately if the answer is 'randomisation may not be reproducible across versions or platforms' then that's OK, I just have to amend my tests accordingly. It would be good to confirm exactly why that is the case though.
Tom Hawkins on 9 Nov 2022
I'm still stuck on this. The partitions (partitionedModel.Partition.test(n) and .training(n) for each fold n) are the same on R2021b as on R2018b, and I've confirmed that the data is exactly the same, so the only thing I can see that could differ is the behaviour of kfoldLoss.
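For reference, the partition comparison was along these lines (a sketch; the .mat file name is illustrative):
% Sketch: extract the test mask for each fold so the partitions can be
% compared across releases
k = obj.crossValidationFolds;
testMask = false(size(predictors, 1), k);
for n = 1:k
    testMask(:, n) = partitionedModel.Partition.test(n);
end
% On R2018b: save('partition2018b.mat', 'testMask')
% On R2021b:
% s = load('partition2018b.mat');
% isequal(testMask, s.testMask)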
I spotted that the Classification Learner app-generated code (originally created under R2016b) specifies 'LossFun', 'ClassifError', whereas the help for R2018b and R2021b gives this option as 'classiferror' (all lower case), but changing it makes no difference under either release.


Answers (2)

Tom Hawkins on 10 Nov 2022
Here is a minimal example using a built-in dataset. On my R2018b machine classifError is 0.06000; on R2021b it comes out at 0.073333.
% Demonstration of different kfoldLoss results vs MATLAB versions
% Adapted from code generated by the R2018b Classification Learner app
load('fisheriris.mat');
iris = array2table(meas);
predictorNames = iris.Properties.VariableNames; % needed by the predict wrapper below
seed = 19;
k = 4; % number of folds
% Build classification tree using iris data
rng(seed, 'twister')
classificationTree = fitctree( ...
    iris, species, 'SplitCriterion', 'gdi', ...
    'MaxNumSplits', 4, 'Surrogate', 'off', ...
    'ClassNames', {'setosa'; 'versicolor'; 'virginica'});
% Create the result struct with predict function
predictorExtrFcn = @(t) t(:, predictorNames);
treePredictFcn = @(x) predict(classificationTree, x);
trainedClassifier.predictFcn = @(x) treePredictFcn(predictorExtrFcn(x));
trainedClassifier.ClassificationTree = classificationTree;
% Perform cross-validation
pModel = crossval(trainedClassifier.ClassificationTree, 'KFold', k);
% Compute validation predictions and classification error
[validationPredictions, validationScores] = kfoldPredict(pModel);
classifError = kfoldLoss(pModel, 'LossFun', 'ClassifError');
  3 Comments
Tom Hawkins on 11 Nov 2022
With the help of a colleague I've now run this test on a number of PCs with the following results:
PC 1, R2018b: 0.06000
PC 2, R2018b: 0.06000
PC 2, R2021b: 0.07333
VM hosted on PC 2, R2021b: 0.07333
PC 3, R2018b: 0.06000
PC 3, R2020b: 0.07333
PC 4, R2021b: 0.07333
PC 4, R2022b: 0.07333
The partitioning is confirmed to be the same in each case. The R2018b installations all report Math Kernel Library version 2018.0.1; the others have more recent versions.
I'll try and get results from R2019a - R2020a and report back.
Tom Hawkins on 18 Nov 2022
Edited: Tom Hawkins on 18 Nov 2022
I've tested some more MATLAB versions and got:
PC2, R2019a: 0.060000
PC2, R2019b: 0.060000
PC2, R2020a: 0.060000
PC2, R2020b: 0.073333
The math kernel library reported under R2020a and R2020b is the same: Version 2019.0.3 Product Build 20190125.
So this looks to me like a change in the behaviour of kfoldLoss introduced in R2020b. The test mentioned in my initial post confirms this too: under R2020a it behaves like R2018b, and under R2020b it behaves like R2021b. I've reviewed the release notes and can't see any obvious cause. Can anyone else reproduce this (or not)? Any thoughts as to an explanation?
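One way to localise it further would be to recompute the error by hand from the cross-validated predictions, bypassing kfoldLoss entirely (a sketch, assuming the uniform observation weights of the fisheriris example above):
% Sketch: recompute the k-fold classification error directly from
% kfoldPredict output, bypassing kfoldLoss (uniform weights assumed)
validationPredictions = kfoldPredict(pModel);
manualError = mean(~strcmp(validationPredictions, species))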



Drew on 22 Nov 2022
Differences in kfoldLoss are generally caused by differences in the k-fold partition, which result in different k-fold models because each fold trains on different data. When the seed changes, the k-fold partition is expected to be different. When the machine changes, with the same seed, the k-fold partition may be different. When the MATLAB version changes, with the same seed, the k-fold partition may also be different (apparently less often). There is no single correct k-fold partition. Statistically, by testing a variety of k-fold partitions, one may get a more accurate estimate of expected accuracy on unseen data.
The code below is based on the code you provided, but it runs experiments using seeds from 1 to 20. Two experiments are run with seed=1, then two with seed=2, etc. This code also looks at the loss on the full model, calculated on the full training set (that is not a good predictor of accuracy on future unseen data), and examines the depth and number of leaves in the full model. These are added to show that these quantities are stable as the seed changes.
As a solution for your goal of calculating the same kfoldLoss even when the machine or release changes, you could take greater control over the k-fold partition by explicitly specifying the cvpartition when calling the crossval method on the ClassificationTree object, using the "CVPartition" name-value pair. See https://www.mathworks.com/help/stats/classificationtree.crossval.html
That is, you could save the cvpartition object from an earlier MATLAB release (and/or a different machine), load it in a later release (and/or on a different machine), and then pass that cvpartition object to the crossval method.
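Before the experiment code, here is a minimal sketch of that save/load pattern (the file and variable names are illustrative):
% Sketch: create a cvpartition once, save it, and reuse it elsewhere
% On the reference release/machine:
c = cvpartition(species, 'KFold', 4); % stratified 4-fold partition
save('referencePartition.mat', 'c')
% On the release/machine under test:
s = load('referencePartition.mat');
pModel = crossval(classificationTree, 'CVPartition', s.c);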
% Demonstration of different k-fold partitions
load('fisheriris.mat');
iris = array2table(meas);
% n is the number of observations
% d is the number of predictors
[n, d] = size(iris);
num_e = 40; % Number of experiments
k = 4;      % number of folds
% seed = [1:num_e]'; % If running one experiment for each seed
% Two experiments with each seed
seed = reshape(repmat([1:num_e/2], 2, 1), 1, [])'; % 1 1 2 2 3 3 ...
[v, dateofrelease] = version; % Get the MATLAB version and date
machine = "machinename";
Toolkit = [repmat(v, num_e, 1)];
Machine = [repmat(machine, num_e, 1)];
% Create an experiment table to hold results
et = table(seed, Toolkit, Machine);
cvtestmatrix = zeros(k, n, num_e); % This is used to compare k-fold partitions
for e = 1:num_e
    % Build classification tree using iris data
    % seed = 19;
    rng(et.seed(e), 'twister')
    classificationTree = fitctree( ...
        iris, species, 'SplitCriterion', 'gdi', ...
        'MaxNumSplits', 4, 'Surrogate', 'off', ...
        'ClassNames', {'setosa'; 'versicolor'; 'virginica'});
    et.depth(e) = treedepth(classificationTree);              % Depth of the tree
    et.leaves(e) = sum(classificationTree.IsBranchNode == 0); % Number of leaves
    et.Mdl(e) = {classificationTree}; % Store the model in the experiment table
    % view(classificationTree, 'mode', 'graph'); % View the tree
    % Perform cross-validation
    pModel = crossval(classificationTree, 'KFold', k);
    et.cvpartition(e) = {pModel.Partition};
    % Determine if the cvpartition is the same as or different from the
    % previous one
    for kIdx = 1:k
        % Create a k x n matrix where each kth row indicates observations in
        % test fold k, using the "test" method of the cvpartition object.
        % Have one of these matrices for each e
        cvtestmatrix(kIdx, :, e) = test(et.cvpartition{e}, kIdx);
    end
    if (e > 1)
        % Compare the previous and current cvtestmatrices using matrix
        % multiplication of the previous and the transpose of the current.
        % If the partitions are the same, the resulting
        % k x k matrix will be non-zero only on the diagonal.
        et.testdiff(e) = {cvtestmatrix(:, :, e-1) * cvtestmatrix(:, :, e)'};
        % If the sum along the diagonal is equal to n, then the current
        % and previous partitions are the same, so set testdiffbool to
        % "true" for this experiment. Otherwise, set it to "false".
        et.testdiffbool(e) = (sum(diag(et.testdiff{e})) == n);
    end
    % Compute validation predictions and classification error
    [validationPredictions, validationScores] = kfoldPredict(pModel);
    classifError = kfoldLoss(pModel, 'LossFun', 'ClassifError');
    et.kfoldLoss(e) = classifError; % store k-fold loss in experiment table
    % Compute loss over the full training set, for the model trained on all
    % data
    et.loss(e) = loss(classificationTree, iris, species);
end
% To write results to file:
filenamecsv = sprintf("Ver %s %s.csv", v, machine);
writetable(et, filenamecsv)
% To view results
et
et = 40×11 table (first 16 of 40 rows shown)

    seed               Toolkit                   Machine       depth    leaves              Mdl                     cvpartition      kfoldLoss      loss        testdiff      testdiffbool
    ____    ________________________________    _____________    _____    ______    ________________________    _________________    _________    ________    ____________    ____________
     1      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {0×0 double}       false
     1      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {4×4 double}       true
     2      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}         0.06    0.026667    {4×4 double}       false
     2      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}         0.06    0.026667    {4×4 double}       true
     3      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}         0.06    0.026667    {4×4 double}       false
     3      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}         0.06    0.026667    {4×4 double}       true
     4      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.046667    0.026667    {4×4 double}       false
     4      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.046667    0.026667    {4×4 double}       true
     5      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {4×4 double}       false
     5      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {4×4 double}       true
     6      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {4×4 double}       false
     6      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.073333    0.026667    {4×4 double}       true
     7      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.053333    0.026667    {4×4 double}       false
     7      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.053333    0.026667    {4×4 double}       true
     8      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.053333    0.026667    {4×4 double}       false
     8      9.13.0.2114483 (R2022b) Update 2    "machinename"      3        4       {1×1 ClassificationTree}    {1×1 cvpartition}     0.053333    0.026667    {4×4 double}       true
% view the differences in k-fold partitions
% Experiment 2 has the same k-fold partiion as Experiment 1
et.testdiff{2}
ans = 4×4
    37     0     0     0
     0    38     0     0
     0     0    38     0
     0     0     0    37
% Experiment 3 has different k-fold partition than Experiment 2
et.testdiff{3}
ans = 4×4
    10    11    10     6
     4    10    11    13
    11    10     8     9
    12     7     9     9
% helper function for observing tree depth
function depth = treedepth(tree)
    parent = tree.Parent;
    depth = 0;
    node = parent(end);
    while node ~= 0
        depth = depth + 1;
        node = parent(node);
    end
end
  1 Comment
Tom Hawkins on 23 Nov 2022
Thank you for the further detailed response and the suggestion to save and load the partition object. I have updated my example to do this.
% Demonstration of different kfoldLoss results vs MATLAB versions
% Adapted from code generated by the R2018b Classification Learner app
load('fisheriris.mat');
iris = array2table(meas);
predictorNames = iris.Properties.VariableNames; % needed by the predict wrapper below
seed = 19;
% Build classification tree using iris data
rng(seed, 'twister')
classificationTree = fitctree( ...
    iris, species, 'SplitCriterion', 'gdi', ...
    'MaxNumSplits', 4, 'Surrogate', 'off', ...
    'ClassNames', {'setosa'; 'versicolor'; 'virginica'});
% Create the result struct with predict function
predictorExtrFcn = @(t) t(:, predictorNames);
treePredictFcn = @(x) predict(classificationTree, x);
trainedClassifier.predictFcn = @(x) treePredictFcn(predictorExtrFcn(x));
trainedClassifier.ClassificationTree = classificationTree;
% Perform cross-validation
if contains(version, 'R2018b')
    disp('Creating and saving partition object')
    pModel = crossval(trainedClassifier.ClassificationTree, 'KFold', 4);
    part = pModel.Partition;
    save('part.mat', 'part')
else
    disp('Using saved partition object')
    p = load('part.mat');
    pModel = crossval(trainedClassifier.ClassificationTree, 'CVPartition', p.part);
end
% Compute validation predictions and classification error
[validationPredictions, validationScores] = kfoldPredict(pModel);
classifError = kfoldLoss(pModel, 'LossFun', 'ClassifError');
When run under R2018b, this gives classifError 0.060000. When run under R2020b or later, loading the partition saved by R2018b, it gives classifError 0.073333. In other words, as far as I can see, kfoldLoss gives different output for identical inputs depending on the MATLAB version.
Please let me know if you think I have missed something. As I said, I'm just trying to understand which steps of this process I should expect to be reproducible and which not, and I found it unexpected that the loss calculation for identical models, partitions and datasets is not the same to several significant figures.
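If it helps anyone dig further, the per-fold losses can also be compared across releases (a sketch; 'Mode', 'individual' returns one loss value per fold):
% Sketch: per-fold classification errors; if only some folds differ
% between releases, that narrows down where the computation changed
foldLosses = kfoldLoss(pModel, 'LossFun', 'ClassifError', 'Mode', 'individual')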
