# How to Cluster Dataset and remove outlier in MATLAB

6 views (last 30 days)
Med Future on 16 Jan 2023
Answered: Sai Pavan ungefär 22 timmar ago
Hello, I have the following dataset, In which i have four features in each column.
I want to cluster Dataset. I have go through K-means it required Number of clusters as input.
What have you tried?
Med Future on 17 Jan 2023
@Constantino Carlos Reyes-Aldasoro I have tried K_means but it required Number of clusters

Sai Pavan ungefär 22 timmar ago
Hello,
I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:
• Determine the optimal number of clusters: The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters and looking for the "elbow" point where the rate of decrease sharply changes. This point is often considered a good choice for the number of clusters.
• Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
• Removing outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points that are farthest from the centroid beyond a certain threshold.
Please refer to the below code snippet that illustrates the above workflow:
data = Dataset;
wcss = [];
for k = 1:10 % Test up to 10 clusters
[idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = % the optimal number of clusters you determined
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
centroid = C(i, :);
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
Hope it helps!

### Categories

Find more on k-Means and k-Medoids Clustering in Help Center and File Exchange

R2022a

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!