How to calculate R^2 based on external linear equation?
21 views (last 30 days)
I have a scatter and I calculated the slope and intercept of the best fitting line (using orthogonal regression).
So now I have scattered datapoints and an equation in the shape y=mx+c.
How can I calculate R^2 based on this new equation I found?
John D'Errico on 15 Mar 2022
Edited: John D'Errico on 15 Mar 2022
Unfortunately, R^2 is not a completely simple concept when working with an orthogonal regression. To compute a valid R^2, we need to understand what it means. So start there. (Sigh, this may get long. Sadly, people use R^2, in fact, rely on it for curve fitting, without really understanding it far too often.)
What does R^2 mean? In the context of fitting a line (or a curve) to data, where the noise is assume to be on only y, R^2 asks the question of how well does the model fit the data, when compared to a constant model? For example, you might have some simple data, with the noise on only y.
x = (-5:5)';
y = 2*x + randn(size(x));
[mdl,goodness] = fit(x,y,'poly1')
So we see that R^2 was a number close to 1. In fact, R^2 will always be a number between 0 and 1 as long as there is a constant term in your model (worth a long explanation even there.)
Why did I include a green line there at the mean of y? Because effectively, that is what R^2 compares your model to. Effectively, how well does the line in blue fit the data, when compared to a model where the function is just a constant?
Now we can compute R^2 for this problem ourselves. I hope we get the same result. :)
constantModel = mean(y)
totalSumOfSquaresForConstantModel = sum((y - constantModel).^2)
sumOfSquaresForLinearModel = sum((y - mdl(x)).^2)
Rsq = 1 - sumOfSquaresForLinearModel/totalSumOfSquaresForConstantModel
In fact, it just happens to be exactly the same value as we fit gave us. WHEW, did I get lucky, or what?
That data was actually pretty good, with a fairly high signal to noise ratio, so we expect R^2 to be close to 1. For data that is just completely random, with no signal at all, we will see R^2 values that are near zero. So people have come to effectively rely on R^2 as telling them something important. How good is their model? (If I'm not careful here, I might start to rant and rave about why R^2 is not as useful or important as they think. GRRRR. set('rantMode','off')). Ok. Better now.)
Now let me create some total least squares data as an example, where there is some noise in both variables. (Even that has issues in terms of how we would compute R^2. UGH.)
x = (-10:10)';
y = 2*x;
x = x + randn(size(x))*2;
y = y + randn(size(x))*2;
Compute the total least squares model. Some call it an errors in variables model. Some will call it an orthogonal regression. And some might call it a PCA model, because of how it might be computed. I'll use svd to do the work here.
A = [x - mean(x),y - mean(y)];
[~,~,V] = svd(A,0);
coef = V(:,end)
So coeff contains the coefficients in the model
a2*(y - mean(y)) + a1*mean(x - mean(x)) = 0
If we want to write the model in the form y = m*x+b, be careful, in case the coefficient on y was zero!!!!!!
m = -coef(1)/coef(2)
b = mean(x)*coef(1)/coef(2) + mean(y)
xp = linspace(min(x),max(x));
yp = m*xp + b;
Now, what does it mean to compare the orthogonal regression to a default constant model? That default constant model is just the point at (mean(x),mean(y)). I'll plot that point in green again.
Again, R^2 for an orthogonal regression should mean how well does the straight line model compare to no model at all? If the data were perfectly spherical noise, we might hope to see an R^2 near zero.
So here, I'll compute R^2 much as we did before.
totalSumOfSquaresForConstantModel = sum((x - mean(x)).^2) + sum((y - mean(y)).^2)
For the linear model, we need the sum of squares of orthogonal distances to the curve for each point. There are many ways we might do that of course. Being too lazy to derive it here, I'll just give you a reference for one way:
sumOfSquaresForlinearModel = sum((-y + m*x + b).^2/(1 + m^2))
OrthRegR2 = 1 - sumOfSquaresForlinearModel/totalSumOfSquaresForConstantModel
So a reasonable estimate of R^2 for this orthogonal regression model (based on the same intuitive idea of what R^2 means in the classical sense) is 0.9790.
Again, R^2 tells us how much better we fit the data using a linear model, when compared to essentially no fit at all, so compared to the total noise in the data itself as if the data were entirely noise. A perfect fit would have R^2 as 1. Complete garbage for data would expect to see R^2 near zero.
KSSV on 15 Mar 2022
Edited: KSSV on 15 Mar 2022
Substitute your x in th equation y = mx+c, get (x1,y1) and let (x0,y0) be your old existing values. Then you can use
regression or plotregression.
% Demo example:
% Demo line
m = rand; c = randi(10) ;
x = 1:10 ;
y = m*x+c ;
% Make scattered data along the line
x0 = x ;
y0 = y+2*rand(size(x0)) ;
% Fit a line
p = polyfit(x0,y0,1) ;
x1 = x0 ;
y1 = polyval(p,x1) ;
R = regression(y0,y1)