How can I detect and remove outliers from a large dataset?
I am trying to process a large dataset (n = 5000000) and am struggling to write code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work. I am now trying to apply the modified z-score method but still can't make any headway with the MATLAB code.
Attached is the plot of the signal with peaks and dips for better understanding. I also want to fill the deleted outlier points with an interpolation and would appreciate a suggestion.
I would appreciate any further help with getting rid of the peaks and dips in the signal and with filling the removed outlier points by interpolation.
I would also appreciate suggestions for other methods of removing the outliers and, if possible, code for them.
Thank you.
2 Comments
Star Strider
on 12 Mar 2014
Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.
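One way to read the suggestion above is to fit the trend first and then look for outliers in the residuals. The following is only a sketch under stated assumptions: the signal is synthetic, a straight-line polyfit stands in for nlinfit or whatever regression suits the real data, and the cutoff of 3 scaled median absolute deviations is a common convention rather than anything prescribed in this thread.

```matlab
% Illustrative signal (an assumption): a linear trend with two injected outliers.
t = (1:100)';
y = 0.5 * t + 3;
y(30) = y(30) + 40;
y(70) = y(70) - 35;

% Model the trend with a first-order polynomial.
p = polyfit(t, y, 1);
residual = y - polyval(p, t);

% Look for outliers in the residuals rather than in the raw values.
medValue = median(residual);
scaledMAD = 1.4826 * median(abs(residual - medValue));
isOut = abs(residual - medValue) > 3 * scaledMAD;

outlierPositions = find(isOut);
```

Note that the dip at t = 70 has a raw value of 3, which lies inside the normal range of the signal, so it only stands out once the trend has been removed.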
Answers (4)
Shahab B
on 30 Sep 2016
How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
Note that the outlier is 347.666506871168.
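For a short vector like this one, the mean-centred median-deviation rule discussed in this thread can be applied directly. This is a minimal sketch, assuming the sensitivity factor of 6 used in the thread's code; the name madValue is illustrative and is chosen so that it does not shadow the toolbox function mad.

```matlab
main = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

% Absolute deviation of each point from the mean of the data.
absoluteDeviation = abs(main - mean(main));

% Median of those deviations.
madValue = median(absoluteDeviation);

% Flag points whose deviation exceeds 6 times that median.
sensitivityFactor = 6;   % assumed value; tune it for your data
outlierIndexes = absoluteDeviation > sensitivityFactor * madValue;

outliers = main(outlierIndexes);       % flagged values
nonOutliers = main(~outlierIndexes);   % remaining values
```

With this factor only 347.666506871168 is flagged. Centring the deviations on the median instead of the mean would also flag the 0, so the choice of definition matters even for five points.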
4 Comments
Image Analyst
on 21 Nov 2016
There are several definitions of MAD. My code above does definition 1.2.1 as listed on this page https://en.wikipedia.org/wiki/Average_absolute_deviation which gives 4 definitions using all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
Nivodi
on 14 Aug 2018
Image Analyst, how can I apply this part of your code to several columns?
% Compute the median absolute deviation (here measured around the mean)
meanValue = mean(vector)
% Compute the absolute differences. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute differences
madValue = median(absoluteDeviation) % Named madValue so it does not shadow the mad() function.
% Find outliers. They're outliers if the absolute difference
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * madValue;
outlierIndexes = absoluteDeviation > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
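One possible way to run the quoted steps on several columns at once is to pass a dimension argument to mean and median and let bsxfun expand the per-column values. This is a sketch, not Image Analyst's own code: the matrix data is hypothetical (rows are observations, columns are variables), and marking outliers as NaN rather than deleting them is a design choice made here so that the matrix stays rectangular.

```matlab
% Hypothetical data: rows are observations, columns are separate variables.
data = [10 5; 11 6; 9 5; 10 300; 200 5; 10 4; 11 5; 9 6; 10 5; 10 4];

% Column means (1-by-nCols); the dimension argument 1 works down each column.
meanValue = mean(data, 1);

% Absolute deviation of every element from its own column mean.
% bsxfun expands the row of means across all rows (works in older releases).
absoluteDeviation = abs(bsxfun(@minus, data, meanValue));

% Median of the deviations, one value per column.
madValue = median(absoluteDeviation, 1);

% Logical matrix: true where an element exceeds its column's threshold.
sensitivityFactor = 6;   % same illustrative factor as in the quoted code
outlierIndexes = bsxfun(@gt, absoluteDeviation, sensitivityFactor * madValue);

% Columns can lose different numbers of points, so mark outliers as NaN
% instead of deleting them.
cleaned = data;
cleaned(outlierIndexes) = NaN;
```

Here the 200 in the first column and the 300 in the second are the only elements flagged. Each column gets its own threshold, so a value that is ordinary in one column can still be an outlier in another.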
Image Analyst
on 12 Mar 2014
That's not large. It's just a fraction of the size of a typical digital image. You can use "deleteoutliers" from Brett Shoelson of the Mathworks: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers Or you could try the Median Absolute Deviation (a popular statistical method for detecting outliers) as demonstrated on an image in the file I attached.
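The attached file is not reproduced here, so the following is only a rough sketch of the Median Absolute Deviation idea applied to a 1-D signal, combined with the interpolation fill asked for in the question. The synthetic signal, the scale constant 1.4826 (which makes the MAD comparable to a standard deviation for Gaussian data), and the cutoff of 3 are assumptions, not values taken from this thread.

```matlab
% Synthetic signal standing in for the real data (an assumption).
t = (1:200)';
signal = sin(2*pi*t/50);
signal([20 75 140]) = [8 -6 9];   % inject two peaks and a dip

% Scaled median absolute deviation around the median.
medValue = median(signal);
scaledMAD = 1.4826 * median(abs(signal - medValue));

% Flag points more than 3 scaled MADs away from the median.
isOut = abs(signal - medValue) > 3 * scaledMAD;

% Replace flagged points by linear interpolation over the good neighbours.
cleaned = signal;
cleaned(isOut) = interp1(t(~isOut), signal(~isOut), t(isOut), 'linear', 'extrap');
```

Everything here is vectorized, so it should scale to millions of samples without a loop. Newer MATLAB releases also ship isoutlier and filloutliers, which may cover the same ground with less code.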
7 Comments
Tim leonard
on 12 Mar 2014
Trimming your values based on percentiles is quick and powerful:
vector = randi(100,100,1);
percntiles = prctile(vector,[5 95]); %5th and 95th percentile
outlierIndex = vector < percntiles(1) | vector > percntiles(2);
%remove outlier values
vector(outlierIndex) = [];
1 Comment
Image Analyst
on 12 Mar 2014
But something at the 1% or 99% or 100% percentile is not necessarily an outlier so you could be getting rid of good data. It's quick but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that you have a certain specific amount of noise present.
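The risk described above can be illustrated with data that contains no outliers at all. In this sketch the evenly spaced vector, the scale constant 1.4826, and the cutoff of 3 are illustrative assumptions; prctile requires the Statistics and Machine Learning Toolbox.

```matlab
% Clean, evenly spread data with no outliers at all (illustrative).
vector = (1:1000)';

% Percentile trimming as in the answer above.
limits = prctile(vector, [5 95]);
trimmed = vector < limits(1) | vector > limits(2);
fractionTrimmed = nnz(trimmed) / numel(vector);   % about 0.10 by construction

% Scaled-MAD rule on the same data for comparison.
medValue = median(vector);
scaledMAD = 1.4826 * median(abs(vector - medValue));
flagged = abs(vector - medValue) > 3 * scaledMAD;
fractionFlagged = nnz(flagged) / numel(vector);   % nothing here is extreme
```

Percentile trimming discards roughly 10% of the points regardless of whether any of them are bad, while a deviation-based rule flags nothing on this data.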
Amir H. Souri
on 26 Jun 2017
Hi, I may be late, but I just want to point out that the definition of an outlier is totally subjective. To find outliers, you need to estimate the probability distribution of your data: fit a distribution (for example, a Gaussian) and check whether the fit is statistically significant (you may use the Kolmogorov–Smirnov test or a bootstrap method). Then you can identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and that the rest can be ignored as outliers. As I mentioned, there is no absolute answer; it depends on the nature of the data and on how strict you want to be about the confidence interval.
Good luck!
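A minimal sketch of the interval step described above, assuming a Gaussian fit is appropriate. The sample values are made up for illustration, and the goodness-of-fit check is not shown; kstest or lillietest from the Statistics and Machine Learning Toolbox could serve for that.

```matlab
% Illustrative measurements with one suspicious value at the end.
x = [9.8 10.1 10.0 9.9 10.2 10.0 9.7 10.3 10.1 9.9 10.0 25.0]';

% Fit a Gaussian by estimating its mean and standard deviation.
mu = mean(x);
sigma = std(x);

% Central 95% interval of the fitted Gaussian: mu +/- 1.96*sigma.
z = 1.96;
lowerBound = mu - z * sigma;
upperBound = mu + z * sigma;

% Points outside the interval are treated as outliers.
isOut = x < lowerBound | x > upperBound;
kept = x(~isOut);
```

Two caveats apply. About 5% of genuinely Gaussian points fall outside a 95% interval by design, and the mean and standard deviation are themselves inflated by the outliers they are meant to detect, which is why median-based estimates are often preferred.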
0 Comments