kmeans with centroids from previous analysis
Hans van der Horn
on 8 Dec 2023
Commented: Hans van der Horn
on 9 Dec 2023
Hello everyone,
I wanted to confirm whether my approach is right. I have centroids from a previous kmeans analysis, and now I'd like to obtain the cluster membership indices for new data using those centroids. Is it correct to use:
SubjectMembershipIndex = kmeans(Data, [], 'Distance','cityblock', 'Start', PreviousCentroids);
Thanks!
Best
Hans
Accepted Answer
Walter Roberson
on 8 Dec 2023
Yes, that looks good.
I notice that you pass [] for the k value. That will probably not be immediately obvious to readers, but it exactly matches the documented behavior of the 'Start' option: when you pass a numeric array of starting centroids, k is deduced from the first dimension of that array, and any k value you passed is ignored. So the code should work fine.
... but since a lot of people don't know about that, as a matter of style it would be easier to read if you passed an explicit numeric k value (even though that value is going to be ignored). Readers can glance at an explicit k and move on, rather than wondering what [] means for a k value and having to dig into the documentation to find out that the meaning is well defined.
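For example, a minimal sketch (using the Data and PreviousCentroids names from the question) that derives the explicit k from the centroid matrix itself:
k = size(PreviousCentroids, 1); % number of clusters = number of rows of starting centroids
SubjectMembershipIndex = kmeans(Data, k, 'Distance', 'cityblock', 'Start', PreviousCentroids);
Here k always agrees with the 'Start' array, and readers see immediately how many clusters are being fit.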
More Answers (1)
Image Analyst
on 8 Dec 2023
I think some explanation is needed here:
The second time you call kmeans, it runs the full kmeans algorithm all over again, just starting from those centroids as seeds. So the centroids returned the second time will differ from those of the first time, though they should be close. However, if you then use those same cluster centroids as starting seeds for a totally new set of data, the new centroids may or may not be close to the first ones: the second set of centroids depends on what data you actually feed into kmeans the second time. See this demo, where I run kmeans twice with k = 2 (two clusters). First I run it on two widely separated clusters, then I run it on the two inner clusters using the centroids from the outer clusters.
% Initialization steps.
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 8;
%======================================================================================================
% FIRST CREATE SAMPLE DATA.
% Make up 4 clusters with 150 points each.
pointsPerCluster = 150;
spread = 0.03;
offsets = [0.3, 0.5, 0.7, 0.9];
% offsets = [0.62, 0.73, 0.84, 0.95];
xa = spread * randn(pointsPerCluster, 1) + offsets(1);
ya = spread * randn(pointsPerCluster, 1) + offsets(1);
xb = spread * randn(pointsPerCluster, 1) + offsets(2);
yb = spread * randn(pointsPerCluster, 1) + offsets(2);
xc = spread * randn(pointsPerCluster, 1) + offsets(3);
yc = spread * randn(pointsPerCluster, 1) + offsets(3);
xd = spread * randn(pointsPerCluster, 1) + offsets(4);
yd = spread * randn(pointsPerCluster, 1) + offsets(4);
%-------------------------------------------------------------------------------------------------------------------------------------------
% First let's run kmeans with 2 clusters a & d
x = [xa; xd];
y = [ya; yd];
xy = [x, y];
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the initial kmeans clustering.
% Determine what the best k is:
% evaluationObject = evalclusters(xy, 'kmeans', 'DaviesBouldin', 'klist', [2:10])
% Do the kmeans with that k (evaluationObject.OptimalK should be 2).
evaluationObject.OptimalK = 2;
[assignedClass, clusterCenters] = kmeans(xy, evaluationObject.OptimalK);
clusterCenters % Echo to command window
% Do a scatter plot with the original class numbers assigned by kmeans.
hfig1 = figure;
subplot(1, 2, 1);
gscatter(x, y, assignedClass);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Original Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(clusterCenters, 1)
text(clusterCenters(row, 1), clusterCenters(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
hfig1.WindowState = 'maximized'; % Maximize the figure window so that it takes up the full screen.
% IMPORTANT NOTE: BECAUSE OF RANDOMNESS, SOMETIMES THE LOWER LEFT CLUSTER
% IS LABELED 1 AND SOMETIMES IT'S LABELED 2.
%-------------------------------------------------------------------------------------------------------------------------------------------
% K-MEANS CLUSTERING.
% Now do the kmeans clustering again, using the same data and the centroids from before.
PreviousCentroids = clusterCenters;
[SubjectMembershipIndex, newCentroids1] = kmeans(xy, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using the same data (will be close but not exact):\n')
newCentroids1
% Note the new centroids are close to, but not exactly the same as the previous centroids.
% Now do the kmeans clustering again, using the centroids from before,
% but with new data -- the b and c clusters instead of the a and d clusters.
x2 = [xb; xc];
y2 = [yb; yc];
xy2 = [x2, y2];
[SubjectMembershipIndex, newCentroids2] = kmeans(xy2, [], 'Distance','cityblock', 'Start', PreviousCentroids);
fprintf('Using different data (could be very different depending on the new data):\n')
newCentroids2
% Do a scatter plot with the original class numbers assigned by kmeans.
subplot(1, 2, 2);
gscatter(x, y, assignedClass);
hold on;
gscatter(x2, y2, SubjectMembershipIndex);
legend('FontSize', fontSize, 'Location', 'northwest');
grid on;
xlabel('x', 'fontSize', fontSize);
ylabel('y', 'fontSize', fontSize);
title('Class Numbers Assigned by kmeans()', 'fontSize', fontSize);
% Plot the class number labels on top of the cluster.
hold on;
for row = 1 : size(newCentroids2, 1)
text(newCentroids2(row, 1), newCentroids2(row, 2), num2str(row), 'FontSize', 25, 'FontWeight', 'bold', 'HorizontalAlignment', 'center', 'VerticalAlignment', 'middle');
end
hold off;
Things to note here: the cluster centroids from the second run are very different from those of the first run, and are centered on the two inner clusters, because that's the data I told kmeans to classify. Note that for these very widely separated clusters the class labels come out the same (I ran it dozens of times to check). HOWEVER, for mixed/overlapping clusters, some points in the overlap region might be assigned to class #1 during one run and class #2 during the next. Of course, that can happen even if you don't reuse centroids: because of the randomness inherent in the algorithm, points in the overlap region may end up with different class (cluster) numbers from run to run.
(Hope this wasn't too confusing - reread it several times if it is.)
5 Comments
Image Analyst
on 8 Dec 2023
Yep, I agree with your last paragraph and with Walter. If you just want to know which centroid is closest, then just compute the distance of your new points from those centroids rather than doing kmeans again. Like he said, you can use pdist2. For each point, whichever centroid has the smallest distance is the class that point should be assigned to.
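For example, a minimal sketch (assuming Data is an N-by-p matrix of new points and PreviousCentroids is a k-by-p matrix of old centroids, matching the names in the question):
distances = pdist2(Data, PreviousCentroids, 'cityblock'); % N-by-k matrix of point-to-centroid distances
[~, SubjectMembershipIndex] = min(distances, [], 2); % index of the nearest centroid for each point
Unlike re-running kmeans, this keeps the centroids fixed, so each point's assignment depends only on the previous centroids.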