removing outliers

Question

1 vote

Hi,

I have event-based data for n companies (not time series data). Visually, I can see that there are outliers, but I don't know which method to use to remove them in MATLAB. Any help is appreciated.

Answer 1

Richard Willey on 1 Apr 2011

4 votes

Automatically detecting outliers is tricky stuff.

You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.

Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistic called "Cook's Distance" that you can extract from regstats.

Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.

Here's a simple example illustrating how this works

% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to 
% which a regression model would change if this data point 
% were excluded from the regression. Cook's Distance is 
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n (n = number of observations) is a typical
% threshold that is used to suggest the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')
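
For readers without regstats, the same quantity can be sketched in base MATLAB. This is only an illustration, not the regstats implementation: it assumes a straight-line model with an intercept, uses the textbook formula D_i = (r_i^2 / (p*MSE)) * h_i / (1 - h_i)^2, and runs on made-up data in which observation 10 is shifted deliberately. The variable names (A, h, cookd, flagged) are chosen here for illustration.

```matlab
% Toy data: an exact straight line with one deliberately shifted point.
x = (1:20)';
y = 3*x + 2;
y(10) = y(10) + 15;            % hypothetical contaminated observation

n = numel(x);
A = [ones(n,1) x];             % design matrix with an intercept column
p = size(A,2);                 % number of fitted coefficients
b = A \ y;                     % least-squares coefficients
r = y - A*b;                   % residuals
[Q,~] = qr(A,0);               % economy-size QR factorization of A
h = sum(Q.^2,2);               % leverage values (diagonal of the hat matrix)
mse = sum(r.^2)/(n - p);       % residual mean squared error

% Cook's Distance for every observation
cookd = (r.^2 ./ (p*mse)) .* (h ./ (1 - h).^2);

% Same 4/n rule of thumb as in the example above
flagged = find(cookd > 4/n);
```

As with the regstats version, the 4/n cutoff only suggests candidates for a closer look; whether a flagged point should be removed still depends on the data and the model.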

2 Comments

Mark Shore on 1 Apr 2011

Looks interesting but unfortunately requires the Statistics Toolbox.

Anirudh Thatipelli on 24 May 2018

Thanks for referring to Cook's distance @Richard Willey. It has been a great help for me in removing outliers.

Answer 2

Matt Fig on 26 Mar 2011

0 votes

What form is the data? You might be able to use logical indexing. For example:

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10);  % Take those values less than 10
x = x(x>0);  % Take those values greater than zero.

You could also do this in one shot, as below.

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)

1 Comment

joseph Frank on 27 Mar 2011

The data is in % stock returns. It will be difficult to set a subjective cutoff point. I am wondering if there is another way to determine what is an outlier and what is not.

Answer 3

Walter Roberson on 27 Mar 2011

0 votes

"outlier" is mathematically a matter of interpretation.

What is the outlier in this data?

1 2 3 1 2 3

Answer: 2, because the underlying process is believed to create 2 only 1 time in 1000 compared to 1 or 3, so for 2 to show up twice is unusual for this data.

But if you only had the data, how would you know that?

Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.

5 Comments (3 older comments not shown)

joseph Frank on 1 Apr 2011

I have used 3 standard deviations away from the mean to remove outliers and I still have some. I have no clue how to compute the 1st derivative. If you have any instructions, I will follow them to compute the 1st derivative.

Walter Roberson on 1 Apr 2011

Sometimes it is more effective to compute deviations with a "leave one out" method: if this point was not already part of the dataset, how many deviations away from the mean would it be of the (smaller) dataset?

For normally distributed data, about 99.7% of values fall within three standard deviations of the mean; possibly for your purposes, a looser test such as 2.5 standard deviations is warranted.
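
A minimal sketch of the "leave one out" idea in base MATLAB, under some assumptions: the sample values are borrowed from Matt Fig's answer rather than real stock returns, the 2.5 cutoff is the looser value suggested above, and the names z, k, is_outlier and x_clean are made up for illustration. Each point is scored against the mean and standard deviation of all the other points, so an extreme value does not inflate the spread it is being compared to.

```matlab
% Sample values borrowed from Matt Fig's answer (not real return data)
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
n = numel(x);

z = zeros(size(x));
for i = 1:n
    rest = x([1:i-1, i+1:n]);                 % every point except the i-th
    z(i) = (x(i) - mean(rest)) / std(rest);   % deviations from the reduced set
end

k = 2.5;                        % cutoff in standard deviations (a judgment call)
is_outlier = abs(z) > k;
x_clean = x(~is_outlier);       % data with the flagged points removed
```

The cutoff k remains a subjective choice, and the score still relies on the mean and standard deviation being reasonable summaries of "typical" data, which is the modelling assumption discussed in this answer.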
