I get different regression coefficients, depending on how I do the binning. Which regression should I use?
I have some doubts about a linear regression model. To begin with, I have a table with 2 columns and a lot of rows. Column 1 is plotted on the y-axis and column 2 on the x-axis (see the images below). Using these data, I created an algorithm which finds the upper and lower values, as you can see in the images below, and also draws the regression lines through them.
The algorithm has a search range. For example, if I would like to get the lower values from 0.0 to 1.0 on the x-axis, I set the range from 0.0 to 1.0 with a step of 0.05. The algorithm then searches from 0.00 to 0.05, 0.05 to 0.10, ..., 0.95 to 1.00, takes the lowest value in each interval, and draws the bottom regression line through those values (see the images below).
My problem is that I can't understand how to choose the right range. Below you can see images with different ranges, which give different linear regression equations. The first image has a range of 0.0 to 1.0 with a step of 0.05, the second a range of 0.0 to 0.95 with a step of 0.05, and the third a range of 0.0 to 0.85 with a step of 0.05. Should I pick the one with the biggest R-squared and the lowest RMSE?



7 Comments
the cyclist
on 10 Jan 2023
Edited: the cyclist
on 10 Jan 2023
I had a difficult time understanding your explanation, but I think I figured it out. I am going to write the algorithm here, for the benefit of others.
- Collect the data points, which are the blue circles
- Define bins of the data along the x-axis, based on equal-width spacing. (For example, data points with 0 < x < 0.05 are in the first bin, those with 0.05 < x < 0.10 are in the second bin, and so on.)
- For each bin, find the data point with the highest y-value. (These are the points labeled with red crosses.)
- Perform a regression on those red-cross data points.
- [Repeat for the lowest points, the yellow crosses.]
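The steps listed above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the original poster's code: the helper names `fit_line` and `bin_extremes`, the half-open bin convention, and the synthetic data are all assumptions made for the example. It fits a line through the per-bin maxima and minima for the three search ranges mentioned in the question, so the dependence of the coefficients on the range can be inspected directly.

```python
import math
import random

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bin_extremes(points, start, stop, step):
    """Return the highest-y and lowest-y point of each equal-width x bin.

    Bins are half-open, [lo, hi); this convention is an assumption.
    Points lying exactly on a bin edge may shift bins due to rounding.
    """
    n_bins = round((stop - start) / step)
    bins = [[] for _ in range(n_bins)]
    for x, y in points:
        i = math.floor((x - start) / step)
        if 0 <= i < n_bins:
            bins[i].append((x, y))
    upper = [max(b, key=lambda p: p[1]) for b in bins if b]  # red crosses
    lower = [min(b, key=lambda p: p[1]) for b in bins if b]  # yellow crosses
    return upper, lower

# Synthetic stand-in for the real table (hypothetical data).
random.seed(0)
data = []
for _ in range(2000):
    x = random.random()
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.2)))

# Same step, three different search ranges, as in the question.
for stop in (1.0, 0.95, 0.85):
    upper, lower = bin_extremes(data, 0.0, stop, 0.05)
    print(stop, "upper:", fit_line(upper), "lower:", fit_line(lower))
```

Each fit here uses only one point per bin, so dropping a bin or two at the end of the range removes a noticeable fraction of the points being regressed, and the coefficients move accordingly.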
And it seems your question is, "I get different regression coefficients, depending on how I do the binning. Which regression should I use?"
Is all that correct? If so, my question back to you is, What are you actually trying to calculate or estimate?
Christos Tsallis
on 10 Jan 2023
the cyclist
on 10 Jan 2023
So, you are trying to somehow deal with outliers, and your goal of doing all of this is to make your overall regression more accurate?
I would not take your approach. First of all, it looks like you have thousands of data points. And even though they stand out visually, you really have very few that seem "misplaced", and even these are pretty close to the overall pattern of your data.
More importantly, one should never really remove "outliers" without a specific reason for believing that they do not belong in the sample (e.g. some kind of known anomaly with the data collection). If you remove points only because they are statistical outliers, then you are fudging your data.
In my opinion, you should just calculate your fit parameters on all the data. The confidence intervals will account for the fact that some of the data don't fit as well. Anything else you do probably makes your data less meaningful, not more.
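To make the suggestion above concrete, here is a minimal sketch of fitting a line to all the data and reporting confidence intervals for the coefficients. It is written in Python with the standard library only; the function name `ols_with_ci` and the synthetic data are assumptions for illustration. It also assumes a large sample, so it uses a normal quantile in place of the exact Student-t quantile. In MATLAB, `fitlm` together with `coefCI` reports the same kind of information.

```python
import math
import random
from statistics import NormalDist

def ols_with_ci(xs, ys, level=0.95):
    """Least-squares line through all points, with approximate CIs.

    Assumption: n is large, so a normal quantile stands in for the
    Student-t quantile with n - 2 degrees of freedom.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and the standard errors of both coefficients.
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sigma2 = sse / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))

    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return {
        "slope": slope,
        "slope_ci": (slope - z * se_slope, slope + z * se_slope),
        "intercept": intercept,
        "intercept_ci": (intercept - z * se_intercept,
                         intercept + z * se_intercept),
    }

# Synthetic stand-in for the real table (hypothetical data).
random.seed(1)
xs = [random.random() for _ in range(2000)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.2) for x in xs]

result = ols_with_ci(xs, ys)
print(result)
```

With thousands of points, a handful of poorly fitting ones mainly show up as slightly wider intervals rather than as a large shift in the estimates.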
J. Alex Lee
on 10 Jan 2023
after you bin up your data, can you look at a histogram for each bin? to cyclist's point, looking at the scatterplot with a filled blue middle doesn't tell you much about how the dots are actually distributed. if there's a LOT of data, you could be hiding the fact that you have a much denser "core", even if the "corona" is already dense.
but to support cyclist's final recommendation: not being an expert in stats, i was eventually led to understand that the various regression diagnostics and confidence intervals are a rigorous "enough" explanation.
as another avenue, some of the features in your image lead me to speculate that you might be showing many data sets on this domain of 0-1, each of which may have tighter distributions...?
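The per-bin histogram check suggested above could look something like this. It is only a sketch in Python with the standard library: the function name `per_bin_histograms`, the y-bin width of 0.1, and the synthetic data are assumptions, not anything taken from the original data set. It counts how many points fall into each y-interval within every x-bin and prints a text histogram for one bin.

```python
import math
import random
from collections import Counter

def per_bin_histograms(points, x_start, x_stop, x_step, y_step):
    """Count points per y-interval inside each equal-width x bin.

    Returns one Counter per x bin; the key k stands for the y-interval
    [k * y_step, (k + 1) * y_step).
    """
    n_bins = round((x_stop - x_start) / x_step)
    hists = [Counter() for _ in range(n_bins)]
    for x, y in points:
        i = math.floor((x - x_start) / x_step)
        if 0 <= i < n_bins:
            hists[i][math.floor(y / y_step)] += 1
    return hists

# Synthetic stand-in for the real table (hypothetical data).
random.seed(2)
points = []
for _ in range(2000):
    x = random.random()
    points.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.2)))

hists = per_bin_histograms(points, 0.0, 1.0, 0.05, 0.1)

# Text histogram of the first x bin: one row per occupied y-interval.
for k in sorted(hists[0]):
    print(f"y in [{k * 0.1:.1f}, {(k + 1) * 0.1:.1f}): " + "#" * hists[0][k])
```

If most of the counts in a bin sit in a narrow band of y-intervals, the scatterplot's solid blue region is hiding a dense core, and the per-bin maximum and minimum say little about where the bulk of the data lies.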
Christos Tsallis
on 10 Jan 2023
the cyclist
on 10 Jan 2023
You never really answered the question of what you are trying to do. What is the purpose of doing this regression?
I am not sure why you think you need to remove any data. Normally, the purpose of collecting data is to estimate an effect of some kind. Collect the appropriate data, do the regression (which results in an estimate of the parameter), and you are done.
I am very confused about why you are trying to regress the top and bottom. Why are you not just regressing all the data?
Christos Tsallis
on 10 Jan 2023
Edited: Christos Tsallis
on 10 Jan 2023
Answers (0)