I get different regression coefficients, depending on how I do the binning. Which regression should I use?

I have some doubts about a linear regression model. To begin with, I have a table with two columns and many rows. Column 1 is plotted on the y-axis and column 2 on the x-axis (see the images below). Using these data, I wrote an algorithm that finds the upper and lower values, as shown in the images below, and also draws the regression lines. The algorithm has a search range: for example, if I want the lower values from 0.0 to 1.0 on the x-axis, I set the range from 0.0 to 1.0 with a step of 0.05. The algorithm then searches 0.00 to 0.05, 0.05 to 0.10, ..., 0.95 to 1.00 and draws the bottom line (see the images below).

My problem is that I can't figure out how to choose the right range. Below you can see images with different ranges, which give different linear regression equations: the first image uses a range of 0.0 to 1.0 with a step of 0.05, the second uses 0.0 to 0.95 with a step of 0.05, and the third uses 0.0 to 0.85 with a step of 0.05. Should I pick the one with the largest R-squared and lowest RMSE?

7 Comments

I had a difficult time understanding your explanation, but I think I figured it out. I am going to write the algorithm here, for the benefit of others.
  • Collect the data points, which are the blue circles
  • Define bins of the data along the x-axis, based on equal-width spacing. (For example, data points with 0 < x < 0.05 form the first bin, those with 0.05 < x < 0.10 the second bin, and so on.)
  • For each bin, find the data point with the highest y-value. (These are the points labeled with red crosses.)
  • Perform a regression on those red-cross data points.
  • [Repeat for the lowest points, the yellow crosses.]
And it seems your question is, "I get different regression coefficients, depending on how I do the binning. Which regression should I use?"
Is all that correct? If so, my question back to you is, What are you actually trying to calculate or estimate?
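For concreteness, the binning-and-regression procedure summarized above can be sketched as follows. This is a Python sketch (the thread is on MATLAB Answers, but the idea translates directly); the function name and the default range/step are illustrative, matching the 0.0 to 1.0 range with step 0.05 discussed in the question.

```python
import numpy as np

def edge_regression(x, y, x_min=0.0, x_max=1.0, step=0.05):
    """Bin x into equal-width bins and fit lines to the per-bin extremes.

    Returns two (slope, intercept) pairs: one for the upper edge
    (per-bin maximum y, the "red crosses") and one for the lower edge
    (per-bin minimum y, the "yellow crosses").
    """
    edges = np.arange(x_min, x_max + step / 2, step)
    upper_x, upper_y, lower_x, lower_y = [], [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x < hi)
        if not in_bin.any():
            continue  # skip bins that contain no data points
        xb, yb = x[in_bin], y[in_bin]
        i_max, i_min = np.argmax(yb), np.argmin(yb)
        upper_x.append(xb[i_max]); upper_y.append(yb[i_max])
        lower_x.append(xb[i_min]); lower_y.append(yb[i_min])
    # Degree-1 polyfit = ordinary least squares line on the edge points
    upper = np.polyfit(upper_x, upper_y, 1)  # (slope, intercept)
    lower = np.polyfit(lower_x, lower_y, 1)
    return upper, lower
```

Changing `x_min`, `x_max`, or `step` changes which points become the per-bin extremes, which is exactly why the fitted coefficients differ between the three images.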
Yes, everything is correct; sorry for my poor explanation. I think there are outliers, which is why I reduced the range and asked the question. I think the values in the bottom-left corner (you can see there are not many pixels there) add noise to my linear regression line. Do you think I should exclude the values below them?
So, you are trying to somehow deal with outliers, and your goal of doing all of this is to make your overall regression more accurate?
I would not take your approach. First of all, it looks like you have thousands of data points. And even though they stand out visually, you really have very few that seem "misplaced", and even these are pretty close to the overall pattern of your data.
More importantly, one should never really remove "outliers" without a specific reason for believing that they do not belong in the sample (e.g. some kind of known anomaly with the data collection). If you remove points only because they are statistical outliers, then you are fudging your data.
In my opinion, you should just calculate your fit parameters on all the data. The confidence intervals will account for the fact that some of the data don't fit as well. Anything else you do probably makes your data less meaningful, not more.
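The recommendation above, fit on all the data and let the confidence intervals describe the scatter, can be sketched like this (a Python illustration on synthetic data; the large-sample 1.96 critical value is an assumption standing in for the exact t quantile):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
y = 300 - 10 * x + rng.normal(0, 2, 500)   # synthetic stand-in data

# Ordinary least squares on ALL the data -- no outlier removal.
slope, intercept = np.polyfit(x, y, 1)

# Standard error of the slope from the residuals.
resid = y - (slope * x + intercept)
s = np.sqrt(resid @ resid / (len(x) - 2))        # residual std. error
se_slope = s / np.sqrt(np.sum((x - x.mean())**2))

# ~95% confidence interval (1.96 is the normal approximation, fine for n=500)
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)
```

The width of `ci` already reflects how well (or poorly) the points fit the line, which is the sense in which the confidence intervals "account for" the data that don't fit as well.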
After you bin up your data, can you look at a histogram for each bin? To cyclist's point, looking at a scatterplot with a filled blue middle doesn't tell you much about how the dots are actually distributed. If there's a LOT of data, you could be hiding the fact that you have a really dense "core", even if the "corona" is already dense.
But to support cyclist's final recommendation: not being an expert in stats, I was eventually led to understand that regressions come with various kinds of meta-information, and confidence intervals are a rigorous enough account of the scatter.
As another avenue, some of the features in your image lead me to speculate that you might be showing many data sets on this domain of 0 to 1, each of which may have tighter distributions...?
I see. I have one more question about the last thing you mentioned: "The confidence intervals will account for the fact that some of the data don't fit as well. Anything else you do probably makes your data less meaningful, not more." So, anything outside the confidence intervals can be removed? Below is an image showing some values outside the confidence intervals.
You never really answered the question of what you are trying to do. What is the purpose of doing this regression?
I am not sure why you think you need to remove any data. Normally, the purpose of collecting data is to estimate an effect of some kind. Collect the appropriate data, do the regression (which results in an estimate of the parameter), and you are done.
I am very confused about why you are trying to regress the top and bottom. Why are you not just regressing all the data?
You are right. First of all, the axes are labeled as follows: the y-axis has Land Surface Temperature (LST) values in Kelvin, and the x-axis has values of the Normalized Difference Vegetation Index (NDVI). NDVI values range from -1.0 to +1.0. Areas of barren rock, sand, or snow usually show very low NDVI values (for example, 0.1 or less). Sparse vegetation such as shrubs and grasslands or senescing crops may result in moderate NDVI values (approximately 0.2 to 0.5). I am attaching below the article that is the reason for the linear regression I am using.

You see, there is a way to find soil moisture using Land Surface Temperature (LST) in Kelvin and NDVI values. The first step is to extract NDVI and LST values from a multispectral image. Then you merge the LST and NDVI values into a table with 2 columns (column 1: LST, column 2: NDVI) and many rows (each row is a pixel). To find soil moisture, two edges must be computed: the dry edge (upper) and the wet edge (bottom) (see all the above images). I adopted a simple algorithm that searches within a range of the x-axis and finds the maximum LST values for the dry edge and the minimum LST values for the wet edge. The values located at the top and at the bottom represent those edges. Linear regression must then be applied to the upper and bottom edges to get the equation of each edge: the upper edge's equation is LSTmax = a1*NDVI + b1, and the bottom edge's equation is LSTmin = a2*NDVI + b2. Soil moisture is then computed with the formula SMI = (LSTmax - LST) / (LSTmax - LSTmin), where LST is a given matrix of land surface temperature values in Kelvin. My doubt arises because if I apply the linear regression over a range like 0 to 1 with a step of 0.05, the soil moisture index has some values above 1 and below 0 (which is not permitted, as the SMI range should be 0 to 1).

If I reapply the linear regression with a different range, like 0.1 to 0.8 with a step of 0.05, the number of values above 1 and below 0 may be smaller (or larger) than in the previous try. That's why I asked: I am trying to find the best way to keep the values inside the valid range.
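The SMI formula from the comment above can be sketched as follows (Python, with hypothetical slope/intercept pairs for the two edges). Note that pixels above the fitted dry edge or below the fitted wet edge will always produce SMI values outside [0, 1], because regression lines pass through the edge points rather than enveloping them; clipping to the valid range afterwards is one common pragmatic remedy, shown here as an assumption rather than as the article's prescribed method.

```python
import numpy as np

def soil_moisture_index(lst, ndvi, dry, wet):
    """SMI = (LSTmax - LST) / (LSTmax - LSTmin).

    `dry` and `wet` are (slope, intercept) pairs for the dry (upper)
    and wet (lower) edge regression lines. The result is clipped to
    [0, 1], since pixels outside the fitted edges would otherwise
    yield SMI < 0 or SMI > 1.
    """
    lst_max = dry[0] * ndvi + dry[1]   # dry (upper) edge in Kelvin
    lst_min = wet[0] * ndvi + wet[1]   # wet (lower) edge in Kelvin
    smi = (lst_max - lst) / (lst_max - lst_min)
    return np.clip(smi, 0.0, 1.0)
```

A pixel exactly on the dry edge gives SMI = 0, one on the wet edge gives SMI = 1, and anything hotter than the dry edge is clipped to 0 rather than going negative.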


Answers (0)


Asked: on 10 Jan 2023
Edited: on 10 Jan 2023
