Why is variance high for high K value in this KNN code?

Hello,
This is a long post, so please bear with me.
I have a MATLAB dataset (dataset.mat) of size 280*3. The last column holds the labels, and there are 3 classes in total (1, 2 and 3). I am implementing KNN on this dataset. I want to calculate the classification error, and the mean and variance of that error, over multiple random but even (50/50) splits. From the plots I want to determine how the k value affects the mean and the variance of the classification error. I understand the concept of bias and variance, and I know that as k increases, the bias should increase and the variance should decrease. When k = 1 the bias is 0, but the model is more likely to misclassify new data (in the test set), which causes high variance. However, the variance isn't decreasing in my plot (please see the attachment).
My code looks like this:
%% Loading the dataset
clear all
clc
load('dataset.mat');
%% Calculating the mean, variance and classification error for multiple splits
m = []; % empty list to store the mean of the classification error
variance = []; % empty list to store the variance of the classification error
error = []; % empty list to store the classification error
for k= 1:20 % different k values
error = [];
for j= 1:10 % 10 random splits
% each time, the dataset is split evenly (50/50) but randomly into a training set and a test set
N = size(knn_samples,1);
idx = randperm(N);
train = knn_samples(idx(1:round(N*0.5)),:);
test = knn_samples(idx(round(N*0.5)+1:end),:);
X_train = train(:,1:2); % size 140*2
y_train = train(:,3); % size 140*1
X_test = test(:,1:2); % size 140*2
y_test = test(:,3); % size 140*1
Model = fitcknn(X_train,y_train,'NumNeighbors',k,'Standardize',1); % KNN model
rloss = resubLoss(Model); % the classification loss by resubstitution
[label_test,score_test,cost_test] = predict(Model,X_test);
L = loss(Model,X_test,y_test); %how well the model classifies the data
C_test = confusionmat(y_test,label_test); % confusion matrix
off_diag = sum(C_test(:)) - trace(C_test); % total of the off-diagonal entries of the confusion matrix, i.e. the number of misclassified test points
accuracy = sum(diag(C_test)/sum(C_test(:)));
errorClass = sum(label_test ~= y_test)/length(y_test);
error = [error, errorClass]; % classification error
end
m = [m, mean(error)]; %mean of the classification error
variance = [variance, var(error)]; % variance of the classification error
end
figure(1)
hold on
colormat1 = y_test;
scatter(X_test(:, 1), X_test(:, 2), [], colormat1);
l = (label_test ~= y_test); % specify wrong predictions
colormat2 = label_test(l);
mkr = 'x';
scatter(X_test(l, 1), X_test(l, 2), [], colormat2, mkr); % mark the wrong predictions
k = 1:20;
figure(2)
plot(k, m, 'b')
xlabel('K values')
ylabel('Mean')
title('Mean of the classification error') % over multiple splits
figure(3)
plot(k, variance, 'k')
xlabel('K values')
ylabel('Variance')
title('Variance of the classification error')
Maybe there is a more compact way of writing this code, but I am a beginner. This could be a very basic question, but I am unable to figure it out. I looked online for a solution and didn't find anything. Almost every site talks about the bias-variance trade-off, but I found no code example and no explanation of why the variance could increase as k increases. Maybe there is a small glitch in the code that I cannot spot. I have given up on finding a solution on my own, so I am asking the MATLAB community. You can also suggest a better way to write this code, or any link that could point me to a solution.
Note: Please also have a look at the variance values. Are they too small (they are in the 10^-3 range)?
Thank you very much
  2 Comments
Ganesh Regoti on 24 Jul 2019
Can you provide a section of dataset to test on the model?
Vanditha Rao on 28 Jul 2019
@Ganesh Regoti: What do you mean by the section of dataset? Do you want me to attach the dataset? I have attached the dataset.


Answers (2)

llueg on 24 Jul 2019
I agree more information on the data would be helpful. Also, since your data set is fairly small, you can probably do more than 10 (maybe a hundred) different splits for each k, just to get a more accurate average. If the current trend is still there, it's probably due to properties specific to your data.
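The suggestion above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual setup: the synthetic `data` matrix (280 rows, two features, labels 1 to 3 in the last column), the class centres, the seed, and `nSplits = 100` are all illustrative assumptions. It also assumes Statistics and Machine Learning Toolbox is available for `fitcknn` and `predict`.

```matlab
% Illustrative stand-in for dataset.mat: 280 rows, 2 features, labels 1..3.
% (Synthetic data is an assumption; replace it with the real knn_samples.)
rng(0);                                   % fixed seed, for repeatability only
labels  = repelem((1:3)', [94 93 93]);    % 280 class labels
centers = [0 0; 2 2; 4 0];                % hypothetical class centres
data    = [centers(labels,:) + randn(280,2), labels];

kValues = 1:20;
nSplits = 100;                            % more splits than the original 10
N       = size(data,1);
nTrain  = round(N*0.5);
err     = zeros(nSplits, numel(kValues)); % preallocate: rows = splits, columns = k

for s = 1:nSplits
    idx       = randperm(N);              % one random 50/50 split
    trainRows = idx(1:nTrain);
    testRows  = idx(nTrain+1:end);
    for ki = 1:numel(kValues)
        mdl = fitcknn(data(trainRows,1:2), data(trainRows,3), ...
            'NumNeighbors', kValues(ki), 'Standardize', true);
        pred = predict(mdl, data(testRows,1:2));
        err(s,ki) = mean(pred ~= data(testRows,3));  % test error for this split and k
    end
end

meanErr = mean(err, 1);                   % mean error per k
varErr  = var(err, 0, 1);                 % variance across splits per k
```

Note one design choice that differs from the code in the question: each random split is reused for every k, so differences between k values are not confounded with differences between splits. Plotting `kValues` against `meanErr` and `varErr` then gives the same two figures, with less noise in each point.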

Ganesh Regoti on 29 Jul 2019
Edited: Ganesh Regoti on 29 Jul 2019
In KNN classification, the variance need not decrease steadily as K increases. The curve is usually 'U'-shaped, and the aim is to find its optimal point.
Some predictors may contribute more to the classification than others. Depending on how those highly contributing predictors behave:
Constant: The variance graph will not change much over the entire dataset.
Values vary and reach an optimum at a certain point: The variance varies accordingly (probably decreasing as K increases), but once the optimal point is reached it may start to increase.
So I think that in your case the optimum was reached partway through, and increasing K beyond it led to the increase in variance.
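One rough way to judge whether such a rise is meaningful or just sampling noise is to compare the observed variance with the binomial scale p(1-p)/n for an error rate p measured on n test points. The sketch below takes n = 140 from the question's 50/50 split and assumes a hypothetical mean error of 0.10. It ignores the extra variability from re-drawing the training set, so it is an order-of-magnitude check only.

```matlab
nTest    = 140;                 % test-set size from the question's 50/50 split
p        = 0.10;                % hypothetical mean error rate (assumption)
binomVar = p*(1-p)/nTest;       % variance of a proportion over nTest trials
fprintf('Reference variance: %.2e\n', binomVar);

% The variance estimate is itself noisy when it is based on only 10 splits.
% For roughly normal values, the relative standard error of a sample
% variance computed from m values is about sqrt(2/(m-1)).
m     = 10;                     % number of splits used in the question
relSE = sqrt(2/(m-1));
fprintf('Relative std. error of var() from %d splits: %.0f%%\n', m, 100*relSE);
```

With these assumed numbers the reference variance is about 6.4e-4, so values in the 10^-3 range are not suspiciously small. The relative standard error of roughly 47% for 10 splits also means that wiggles in the variance curve can be noise rather than a real trend.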
