How to cluster a dataset and remove outliers in MATLAB

Hello, I have the following dataset, which has four features, one per column.
I want to cluster the dataset. I have looked at k-means, but it requires the number of clusters as input.

Answers (1)

Sai Pavan on 17 Apr 2024 at 10:45
Hello,
I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:
  • Determine the optimal number of clusters: The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters. Look for the "elbow" point where the curve flattens, that is, where adding another cluster no longer reduces the WCSS by much. This point is often considered a good choice for the number of clusters.
  • Perform K-means clustering: After determining the optimal number of clusters, perform k-means clustering.
  • Remove outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points whose distance from the centroid exceeds a chosen threshold.
The following code snippet illustrates this workflow:
data = Dataset;               % N-by-4 numeric matrix, one feature per column
maxK = 10;                    % Test up to 10 clusters
wcss = zeros(1, maxK);
for k = 1:maxK
    % sumd holds the within-cluster sums of point-to-centroid distances
    [~, ~, sumd] = kmeans(data, k, 'Replicates', 10);
    wcss(k) = sum(sumd);
end
plot(1:maxK, wcss, '-o');
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = 3; % Placeholder value: replace with the number of clusters read off the elbow plot
[idx, C] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate the Euclidean distance of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
% Define a threshold for outlier removal, e.g., the 95th percentile of distances
threshold = prctile(distances, 95);
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
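As an alternative to reading the elbow off the plot by eye, the number of clusters can also be estimated programmatically. The sketch below uses evalclusters with the silhouette criterion on synthetic stand-in data. The variable X, the random seed, and the candidate range 2:6 are assumptions for illustration, not values from the original dataset. The range starts at 2 because a silhouette value compares a point's own cluster with the nearest other cluster, so it is not informative for a single cluster.

```matlab
% Requires Statistics and Machine Learning Toolbox.
rng(0);                                   % fixed seed so the sketch is repeatable

% Synthetic stand-in for the 4-feature dataset (assumption): three separated groups
X = [randn(100, 4); randn(100, 4) + 8; randn(100, 4) - 8];

% Score k = 2..6 with the silhouette criterion
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:6);
optimalK = eva.OptimalK;                  % k with the best silhouette score

% Cluster with the suggested k
[idx, C] = kmeans(X, optimalK, 'Replicates', 10);
fprintf('Suggested number of clusters: %d\n', optimalK);
```

The suggested value is only a starting point; it is worth comparing it against the elbow plot before committing to it.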
Hope it helps!
  1 Comment
Med Future about 4 hours ago
I have combined the code you shared with my code, but I still get the error "Arrays have incompatible sizes for this operation." I have attached the dataset and the code below. Please modify the code to fix this. From the ground truth, I know there should be only one cluster; the remaining points are noise, based on the distance calculation.
dataset1=data(:,[2 4]);
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
    clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
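The "Arrays have incompatible sizes" error most likely comes from the line that subtracts centroid from clusterPoints. kmeans was run on the two-column dataset1, so each row of C has 2 elements, while clusterPoints is taken from data, which has at least five columns. Note also that K was estimated on dataset1_no_outliers, but kmeans was then run on dataset1. A minimal sketch of a consistent version follows, using synthetic stand-in data. The variable X, the seed, K = 1 (taken from the stated ground truth of a single cluster), and the 95th-percentile cut-off are assumptions for illustration.

```matlab
% Requires Statistics and Machine Learning Toolbox.
rng(1);                                        % fixed seed so the sketch is repeatable

% Synthetic stand-in (assumption): a dense group plus scattered noise, 5 columns
data = [randn(200, 5); 10 * randn(20, 5)];
X = data(:, [2 4]);                            % the features actually passed to kmeans

K = 1;                                         % assumption: one true cluster
[idx, C] = kmeans(X, K);

% Distance of each point to its own centroid, computed on the SAME columns as C
distances = vecnorm(X - C(idx, :), 2, 2);

threshold = prctile(distances, 95);            % cut-off for outlier removal
outliers = distances > threshold;

dataCleaned = data(~outliers, :);              % keep all original columns
idxCleaned  = idx(~outliers);
```

A percentile threshold always discards roughly the same fraction of points (about 5% here), whether or not they are truly noise, so the cut-off should be tuned to the expected amount of noise in the data.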

Release

R2022a
