How to perform linear curve fitting based on distance from line instead of residuals

I already know how to perform a classic 1st order curve fit using polyfit/polyval. However, I find that the fit does not accurately follow my current data, which have a high slope and a rounded trend. Based on the sum-of-squared-errors approach used by polyfit, it seems to be over-weighting certain points. Is there a way to do a 1st order curve fit that is guided by the total distance of the points from the line, instead of by the vertical residuals?

Accepted Answer

John D'Errico on 5 Jul 2022
Edited: John D'Errico on 5 Jul 2022
This is the classic total least squares problem. It is sometimes called the errors-in-variables problem, and sometimes orthogonal regression.
Lacking your data, I can't show how to do it on your numbers. But it is not difficult to do. Assume you have data in the form of vectors x and y. I'll make some up here:
t = rand(100,1);
x = t/10 + randn(size(t))/10;
y = 5*t + randn(size(t))/10;
plot(x,y,'o')
axis equal
The actual slope of the curve should have been 50. What happens when we use polyfit?
P1 = polyfit(x,y,1)
P1 = 1×2
2.9081 2.3460
The problem is that polyfit produces biased estimates in problems like this, where there is noise in x as well as in y; the slope is biased towards zero. Do you see that the polyfit estimated slope is far too small? Instead of 50, we got a number around 2.9081 from polyfit.
The trick is to use total least squares. We can get the coefficients of the line from:
mux = mean(x);
muy = mean(y);
[~,~,V] = svd([x(:)-mux,y(:)-muy],0);
coef = V(:,end)
coef = 2×1
-0.9998 0.0176
AEst = -coef(1)/coef(2)
AEst = 56.7846
So despite the considerable noise in our data, we get a slope estimate of 56.7846, instead of the 2.9 value that polyfit predicted. That seems pretty reasonable.
The constant term of the line (the intercept) will be:
BEst = muy - AEst*mux
BEst = 0.0527
The line of best orthogonal fit is:
y = 0.0527 + 56.7846*x
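For reuse, the steps above can be wrapped in a small helper function (the name tlsfit is my own, not part of the answer):

```matlab
function [a,b] = tlsfit(x,y)
% Total least squares (orthogonal regression) fit of the line y = a*x + b.
mux = mean(x);
muy = mean(y);
[~,~,V] = svd([x(:)-mux, y(:)-muy],0);  % economy-size SVD of centered data
coef = V(:,end);              % right singular vector for the smallest
                              % singular value = normal to the best-fit line
a = -coef(1)/coef(2);         % slope
b = muy - a*mux;              % the line passes through the centroid
end
```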
  1 Comment
Damon Aboud on 18 Jul 2022
Wow, thank you very much for your thorough help on this task!! This is super helpful, and it totally solves my problem.


More Answers (5)

Sulaymon Eshkabilov on 5 Jul 2022
You can try a scaling factor here. Your data seem to lie within [355 365]; subtract 355 from x and then perform the linear fit.
You can also try fitlm(), which fits the same linear model as polyfit() but also returns diagnostics.
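A minimal sketch of both suggestions (assuming, as stated, that the x-data lie near [355 365]; fitlm needs the Statistics and Machine Learning Toolbox):

```matlab
xs = x - 355;                % shift x so the fit is better conditioned
P = polyfit(xs, y, 1);       % slope is unchanged; P(2) is now the value at x = 355
yhat = polyval(P, x - 355);  % remember to apply the same shift when evaluating

mdl = fitlm(x, y);           % same line, plus R^2 and coefficient standard errors
```

Note the shift does not change the least-squares slope at all; it only improves numerical conditioning and moves the intercept's reference point.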

Torsten on 5 Jul 2022
Edited: Torsten on 5 Jul 2022
The residuals are the vertical distances of the points from the line (polyfit minimizes the sum of their squares).
If you want to consider the orthogonal distances of the points from the line, you can proceed as follows:
Calculate the distance d_i of a point P = (x_i,y_i) from a line y = a*x+b. If you don't know how to do this: google is your friend.
Once you have this distance, you can formulate the optimization problem as
min_{a,b} sum_i d_i
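For reference, the perpendicular distance from a point (x_i, y_i) to the line y = a*x + b works out to

d_i = |a*x_i - y_i + b| / sqrt(a^2 + 1)

A least-squares variant minimizes sum_i d_i^2 instead of sum_i d_i; the squared form is smooth and easier to optimize.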

Torsten on 6 Jul 2022
Edited: Torsten on 6 Jul 2022
rng("default")
t = rand(100,1);
x = t/10 + randn(size(t))/10;
y = 5*t + randn(size(t))/10;
plot(x,y,'o')
sol = fminsearch(@(p)fun(p,x,y),[1 1])
sol = 1×2
54.4774 0.2832
fun(sol,x,y)
ans = 1.0309
hold on
plot(x,sol(1)*x+sol(2))
function res = fun(p,x,y)
a = p(1);
b = p(2);
dsquared = zeros(size(x));   % preallocate
for i = 1:numel(x)
    % foot of the perpendicular from (x(i),y(i)) onto the line y = a*x + b
    xproj = (x(i)+a*(y(i)-b))/(1+a^2);
    yproj = a*xproj+b;
    dsquared(i) = (xproj-x(i))^2+(yproj-y(i))^2;
end
res = sum(dsquared);         % sum of squared orthogonal distances
end
  1 Comment
Torsten on 6 Jul 2022
Edited: Torsten on 6 Jul 2022
rng("default")
t = rand(100,1);
x = t/10 + randn(size(t))/10;
y = 5*t + randn(size(t))/10;
mux = mean(x);
muy = mean(y);
[~,~,V] = svd([x(:)-mux,y(:)-muy],0);
coef = V(:,end)
coef = 2×1
-0.9998 0.0184
AEst = -coef(1)/coef(2)
AEst = 54.4774
BEst = muy - AEst*mux
BEst = 0.2832



Matt J on 18 Jul 2022

Image Analyst
Image Analyst on 18 Jul 2022
In cases like this, with mostly vertical data, I just swap x and y, and that seems to provide a good fit. The residuals are then horizontal rather than vertical, but the fitted line looks reasonable. Of course you would only do this where it physically makes sense, like fitting a line to the side of some blob in a binary image; you wouldn't do it for something like time-series measurements. Then again, your "orthogonal" residuals would not have any theoretical justification in that case either.
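A sketch of that swap on the synthetic data from the answers above (the inversion step is my own bookkeeping, not from the answer):

```matlab
rng("default")
t = rand(100,1);
x = t/10 + randn(size(t))/10;
y = 5*t + randn(size(t))/10;

Pswap = polyfit(y, x, 1);   % fit x as a function of y (horizontal residuals)
aEst = 1/Pswap(1);          % invert x = Pswap(1)*y + Pswap(2)
bEst = -Pswap(2)/Pswap(1);  % back to y = aEst*x + bEst
```

With noise in x this biases the slope away from zero, the mirror image of the ordinary polyfit bias, so it is a heuristic rather than a substitute for total least squares.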
