Clear Filters
Clear Filters

How to determine number of clusters automatically for each image to be used in K Means Algorithm

37 views (last 30 days)
I am having a dataset of 888 images and each having a different color specifications. I have used Dr.Bezdek algorithm for finding the total number of clusters automatically however the results are not the ones i expected. Can anyone suggest a simpler method to determine k automatically or provide reference code for the algorithm i used .
  2 Comments
Bashar Saad
Bashar Saad on 12 Jul 2019
Sir, did you solve this issue i have same the problem in K-selection
Please if you solved share me via bbesho90288@gmail.com just the idea of K-selection
Walter Roberson
Walter Roberson on 12 Jul 2019
I already explained below : the only objective criteria possible is to put each unique point in its own cluster.
The algorithms that I have seen to try to determine the number of clusters automatically, all have had a tuning parameter with a value that there is no rigid justification for, just "this is what people thought worked best".
Suppose that through research it is found that for every student in elementary school class beyond 22, that the average score on tests slips by one grade letter. You might naively say that that means that it would be recognized that classes should have at most 22 students, but that is not what the mathematics of automatic clustering would say. Automatic clustering would look and say that the school board can save $50 million in teacher salaries for each additional student per class and that saving $50 million is worth having the grades slip by one letter, and automatic clustering would proceed from there to say that all of the students should be put into the same class because the incremental change between students scoring 1% on major tests vs 1.1% is negligible compared to not having to pay a second teacher.
Automatic clustering like this is all about "slippery slopes". If you can only afford to hire 100 teachers or can only afford to build 3 bridges then just say so ahead of time instead of trying to justify it mathematically.

Sign in to comment.

Accepted Answer

kira
kira on 2 May 2019
old question, but I just found a way myself looking at matlab documentation:
klist=2:n;%the number of clusters you want to try
myfunc = @(X,K)(kmeans(X, K));
eva = evalclusters(net.IW{1},myfunc,'CalinskiHarabasz','klist',klist)
classes=kmeans(net.IW{1},eva.OptimalK);

More Answers (2)

Walter Roberson
Walter Roberson on 30 Oct 2012
The simplest way to determine k automatically is to make every point into its own cluster.
If that is not appropriate for your situation then you need to define why not, and the method for choosing clusters has to reflect whatever criteria is appropriate for your situation.
  5 Comments
Tom Lane
Tom Lane on 1 Nov 2012
I think Walter is funnin' you because you have posed a very difficult problem without explaining, for example, why the results are not as you expect. There have been attempts to do this, such as
There is also a silhouette plot in the Statistics Toolbox that could help.
Walter Roberson
Walter Roberson on 3 May 2019
When the criteria is the shortest average or total distance from each point to the centroid it is associated with, then by having a centroid located at each unique point, the distance from the point its centroid is 0, and the average and total distances are 0, giving you a perfect fit.
If you think that 3 or 4 clusters is more appropriate than (say) 593 clusters, even though 593 clusters gives you a smaller average or total distance to centroids, then you are somehow weighting a smaller number of clusters as being more important than lower residue. You are doing some trade-off between number of clusters and residue, but how do you justify the trade-off when you know you could get zero residue by increasing the number of clusters?
There are situations in which you can come up with a particular number of clusters. For example you could define a situation in which each cluster represented a construction cost (so you would naturally prefer smaller) to be balanced out against distance -- but how do you define the balance and justify if? If it represents a driving distance, then how do you justify the maximum driving distance as being (say) 32 miles, and not 31.7 miles or 38.1 miles?
Suppose you are trying to optimize number of teachers in elementary school against class size, then certainly teachers cost salary so you would tend to want to reduce the number of teachers, but how do you justify that the maximum class size must be 22 and not 23 or 28 or 37 or (like some of my lectures in university) 435 or even 600?
The only inherent balances possible in choosing the number of clusters is to have no clusters, or to have one cluster per point. Everything is based upon someone making a judgement call as to what is acceptable degradation of service. Those judgement calls are nearly always political, and subject to political revision, most often in the direction of reducing short-term costs, but not infrequently in the direction of satisfying some interest group that direct spending on something has been reduced or eliminated even if that forces you to pay more in indirect spending.

Sign in to comment.


Pamudu Ranasinghe
Pamudu Ranasinghe on 19 Jun 2022
Refer "evalclusters" function
https://in.mathworks.com/help/stats/evalclusters.html#d123e303261
  1 Comment
Walter Roberson
Walter Roberson on 19 Jun 2022
Edited: Walter Roberson on 19 Jun 2022
evalclusters is one of the algorithms I referred to in 2019 as having a tuning parameter that there is no inherent justification for.
The optimal number of clusters is always the number of unique points — unless you have a penalty term (for example a cost to construct an electrical distribution system.)
And if you do have a cost per cluster then in order to prevent the algorithm from clustering everything into one cluster, you need one of:
  • a counter-cost per member that can make it less expensive to build two clusters than have everything served by one; Or
  • a maximum capacity per cluster; Or
  • the distance cost has to be carefully scaled so as to be comparable in bulk to the cost of building more clusters.
None of these are compatible with the k-means implementation.
If you use the classic example of a cluster being an electrical generator then you run across the limitation that electrical distribution is rarely point-to-centre and is instead run along branching lines and therefore the cost of adding an additional endpoint close to another is normally incrementally small, whereas k-means deals strictly with point-to-centre costs.
If you consider each cluster as an electrical distribution point rather than a generator, you still have the wires branching issue, and you have the issue of needing to have the distribution system connect to a generator (and generators have maximum capacity.) If your generator is hydro or solar or wind then you might not be able to place generators wherever you want.
You can see why as soon as you start putting costs on, you either start making costs arbitrary, or you end up in complicated real-world situations beyond the capacity of k-means or evalclusters.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!