Draw a curve defining the bounds of a scatter plot to detect anomalous (sparse or low density) points
25 views (last 30 days)
Show older comments
Dear all, I have thousands of x,y data paris of observed and theoretical temperatures in a given region, where I need to identify some anomalous (for the purpose of my work) data points. I have tried PCA analysys (i.e., DBSCAN), statistical thresholds, curve fittings, etc., but none of them did work. I then tried a KDE-based approach, which seems to 'better' work for my data. To achieve the results shown below I used https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note, I modified the code to export hAxes,col,ctrs1,ctrs2 and plot
% the contour line as:
contour(ctrs1,ctrs2,F,0.0015,'k-'); % or 0.015 ---> See figure below
By using this apporach I was able to draw a contour line around the main cluster of my data, and I was mostly able to draw the line between 'main cluster' and 'anomalous' points. However, if I have too many 'anomalous' data points (see xy4 in the figure below), the method, using a fixed threshold, fails. Beside, I have to adjust the threshold based on the region of each x,y pair and I have no idea on how to find the correct threshold level (see the figure below - For the same region, a threshold level of 0.0015 seems to work for situations with a few anomalous points, but a threshold of 0.015 is needed when more spreaded points occurred).
What I would really like to do is to draw a curve around the main cluster of my scatterplot, so to devide the anomalous data points. I fully understand this may be a challenging task, but I hope you may provide some good alternatives and/or solutions.
Another solution, as it seems to work, may be defining automatically the density threshold level, but I don't really know where to start from.
Below, you can see 4 examples (xy 1 to 4 are also attached as .mat files). To the left is the simple x,y scatterplot. In the middle, the ideal curve (in red) defining the boundary of my scatterplot (manually sketched). To the right, the KDE-based approach discussed above. Please note the last figure where a threshold of 0.0015 fails, and a threshld of 0.015 is needed instead.
Any help is grately appreciated!
Any other approach to identify the points outside the red boundaries is more than wellcome!
Thanks a lot in advance
0 Comments
Accepted Answer
Matt J
on 13 Sep 2023
Edited: Matt J
on 13 Sep 2023
Sets=["xy1" "xy2" "xy3" "xy4"];
Factors=[20,20,5,5];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
p=envelope(xy,Factors(i));
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
function p=envelope(xy,spreadFactor)
[D,I]=pdist2(xy,xy,'euclidean','Smallest',5);
Dmax=max(D,[],1);
d=median(D(:))*spreadFactor;
XY=xy(Dmax<=d,:);
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
11 Comments
Matt J
on 16 Sep 2023
Edited: Matt J
on 16 Sep 2023
As a last thing, could you kindly provide a brief summary on the rationale and on what this code (thus pdist2 function) is actually doing, or better yet, how it determines 'where to draw a boundary'?
Well, pdist2 finds the 8 nearest neighbors of every point (x0,y0) and if any of the neighbors is at a distance of >1 than (x0,y0) is classified as belonging to the sparse region, and is discarded from the list. Meanwhile, points well-nested in the dense region will survive this elimination process. Ultimately, the idea is that all the sparse points will be stripped away. Some points within the dense central cloud may also be stripped away, but the expectation is that enough points there will survive so that the boundaries of the central cloud can be well-delineated by the boundary command.
To determine the threshold of >1, I had to read your mind :-). It seemed to me from your initial post and where you were drawing the boundaries, that your criterion for classifying points as an outlier was that they were spaced apart by >1 on average.
Especially, why is D2=max(__); running 3 times,
In the cloud of sparse outlying points, you may have a small, isolated clump of dense points, separate from the main lobe. By iterating a few times, it will eat away at the edges of those clumps, eventually consuming them.
Also, do you have any suggestions for further reading so that I can better understand the approach?
I don't. This is a very improvised clustering strategy, and one that is very customized to your specific data sets. I'm sure you will find much better algorithms if you really dive into the literature on clustering. If you do come across this one in the literature, though, I would be interested to know!
More Answers (1)
Mathieu NOE
on 13 Sep 2023
hello
maybe this ?
I simply decided that I wanted the contour line corresponding to a level 5% of F range above the min value of F
Result :
[hAxes,col,ctrs1,ctrs2,F] = dscatter(xy4(:,1),xy4(:,2),'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
threshold = lb+0.05*(ub-lb); % threshold is 5% of data range above min value
contour(ctrs1,ctrs2,F,[threshold threshold],'k-'); %
hold off
9 Comments
Mathieu NOE
on 18 Sep 2023
No problem
I just need now to find an alternative to pdist2 as I don't have the relevant toolbox
Mathieu NOE
on 18 Sep 2023
so finally (and for my own fun) I tried some variants to replace pdist2
so for those like me that lack the right toolbox , here two codes that do the same process .
the differences are minor, one keeps valid points and the other one remove invalid points like Matt did in his code.
One thing I removed is the 3 iterations in the envelope function ; I didn't see any benefit of doing this , but maybe it's specific to my own code
solution 1
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% find cols where we can have n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
f = all(D2<=1); % keep those points
xy=xy(f,:);
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
solution 2
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% remove rows where we haven't n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
D2=max(D2,[],1);
xy(D2>1,:)=[];
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!