stepwiselm does not respect 'Upper' 'linear' limit during multiple iterations

Hi
I am trying to run a regression analysis for a public company by trying to figure out which variables are important to determine the overall sales (using stepwiselm for this), but I don't want interaction terms. To test time lags on the different factors, I take the original raw data and then run multiple calls to stepwiselm with various time lags for each of the factors (the data table is generated in another function and result is stored in variable tab). My ultimate goal is to find the regression equation with the highest adjusted R2.
What I noticed is that when stepwiselm is called multiple times (for example in excess of 400 runs) in succession, it ends up bringing in interaction terms in the final regression equation. This is the call to stepwiselm in my for loop. (Note that I get the same result whether I use a "for" or a "parfor" loop for exection.)
mdl=stepwiselm(tab,'Upper','linear','Verbose',0)
I compile the results from each stepwiselm iteration in a cell array called models (first column holds the model, the third column has the equation). This is one of the results which contains the interaction terms:
>> models{445,1}
ans =
Linear regression model:
HomeSales_4 ~ [Linear formula with 6 terms in 3 predictors]
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ _________
(Intercept) 480.89 135.42 3.5511 0.0007463
HousingStartsTotal_4 -0.0035197 0.0013978 -2.5179 0.014445
NewHomeOrders_4 -0.70127 0.32902 -2.1314 0.037094
HomeBacklog -0.1529 0.083674 -1.8273 0.07255
HousingStartsTotal_4:NewHomeOrders_4 8.9666e-06 3.2585e-06 2.7518 0.0077939
HousingStartsTotal_4:HomeBacklog 2.0849e-06 8.3874e-07 2.4858 0.015682
Number of observations: 67, Error degrees of freedom: 61
Root Mean Squared Error: 41.8
R-squared: 0.652, Adjusted R-Squared: 0.624
F-statistic vs. constant model: 22.9, p-value = 7.53e-13
>> models{445,3}
ans =
"HomeSales_4 ~ 1 + HousingStartsTotal_4*NewHomeOrders_4 + HousingStartsTotal_4*HomeBacklog"
However, if I run the same regression manually (i.e. just one iteration with the same input X and y), stepwiselm does not generate the interaction terms.
>> mdl=stepwiselm(tab2,'Upper','linear','Verbose',0)
mdl =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ __________
(Intercept) -82.309 43.518 -1.8914 0.06317
HousingStartsTotal_4 0.0023333 0.00040806 5.718 3.1831e-07
NewHomeOrders_4 0.16748 0.061989 2.7018 0.0088514
HomeBacklog 0.048105 0.014275 3.3698 0.0012884
Number of observations: 67, Error degrees of freedom: 63
Root Mean Squared Error: 47.1
R-squared: 0.544, Adjusted R-Squared: 0.522
F-statistic vs. constant model: 25, p-value = 8.93e-11
I am at a loss as to what is going on. I tried manually defining the equation in the Wilkinson format (instead of using 'Upper', 'linear'), but I still end up with the same results. I appreciate any inputs you may have in to the matter.
Thanks!

7 Comments

That does seem strange!
How reliably reproducible is the behavior? Would you be able to upload a small dataset and code that allows us to reproduce the issue?
Hi - I tried to dig in to the problem more and it appears that the issue is related to the Wilkinson notation and not the Upper keyword. The keyword actually works fine, but for some reason the Wilkinson notation isn't being picked up properly. Or (more likely) I am doing something very dumb and I am noticing it.
I have attached the specific case where it happens reliably. The matlab file contains a table for the regression. I've attached my output running stepwiselm in the following order:
  1. stepwiselm(tab,'HomeSales_4 ~ -1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog')
  2. stepwiselm(tab,'HomeSales_4 ~ 1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog')
  3. stepwiselm(tab,'Upper','linear','Intercept',false)
  4. stepwiselm(tab,'Upper','linear')
It fails reliably in the first two cases, but works properly in the last two cases. I have produced the output below.
>> stepwiselm(tab,'HomeSales_4 ~ -1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog')
1. Adding (Intercept), FStat = 4.121, pValue = 0.046651
2. Adding HousingStartsTotal_4:NewHomeOrders_4, FStat = 10.9375, pValue = 0.00158304
3. Adding HousingStartsTotal_4:HomeBacklog, FStat = 6.0657, pValue = 0.016671
4. Removing Inventory_1, FStat = 0.070707, pValue = 0.79122
ans =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4*NewHomeOrders_4 + HousingStartsTotal_4*HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ _________
(Intercept) 480.89 135.42 3.5511 0.0007463
HousingStartsTotal_4 -0.0035197 0.0013978 -2.5179 0.014445
NewHomeOrders_4 -0.70127 0.32902 -2.1314 0.037094
HomeBacklog -0.1529 0.083674 -1.8273 0.07255
HousingStartsTotal_4:NewHomeOrders_4 8.9666e-06 3.2585e-06 2.7518 0.0077939
HousingStartsTotal_4:HomeBacklog 2.0849e-06 8.3874e-07 2.4858 0.015682
Number of observations: 67, Error degrees of freedom: 61
Root Mean Squared Error: 41.8
R-squared: 0.652, Adjusted R-Squared: 0.624
F-statistic vs. constant model: 22.9, p-value = 7.53e-13
>> stepwiselm(tab,'HomeSales_4 ~ 1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog')
1. Adding HousingStartsTotal_4:NewHomeOrders_4, FStat = 10.9375, pValue = 0.00158304
2. Adding HousingStartsTotal_4:HomeBacklog, FStat = 6.0657, pValue = 0.016671
3. Removing Inventory_1, FStat = 0.070707, pValue = 0.79122
ans =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4*NewHomeOrders_4 + HousingStartsTotal_4*HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ _________
(Intercept) 480.89 135.42 3.5511 0.0007463
HousingStartsTotal_4 -0.0035197 0.0013978 -2.5179 0.014445
NewHomeOrders_4 -0.70127 0.32902 -2.1314 0.037094
HomeBacklog -0.1529 0.083674 -1.8273 0.07255
HousingStartsTotal_4:NewHomeOrders_4 8.9666e-06 3.2585e-06 2.7518 0.0077939
HousingStartsTotal_4:HomeBacklog 2.0849e-06 8.3874e-07 2.4858 0.015682
Number of observations: 67, Error degrees of freedom: 61
Root Mean Squared Error: 41.8
R-squared: 0.652, Adjusted R-Squared: 0.624
F-statistic vs. constant model: 22.9, p-value = 7.53e-13
>> stepwiselm(tab,'Upper','linear','Intercept',false)
1. Adding HousingStartsTotal_4, FStat = 1660.0886, pValue = 1.6618961e-48
2. Adding HomeBacklog, FStat = 11.4564, pValue = 0.001212
3. Adding NewHomeOrders_4, FStat = 7.3516, pValue = 0.008593
ans =
Linear regression model:
HomeSales_4 ~ HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ ______ __________
HousingStartsTotal_4 0.0016962 0.00023496 7.2192 7.6631e-10
NewHomeOrders_4 0.17134 0.063191 2.7114 0.008593
HomeBacklog 0.032623 0.011929 2.7348 0.0080647
Number of observations: 67, Error degrees of freedom: 64
Root Mean Squared Error: 48.1
>> stepwiselm(tab,'Upper','linear')
1. Adding HousingStartsTotal_4, FStat = 38.0391, pValue = 4.97561e-08
2. Adding HomeBacklog, FStat = 15.2639, pValue = 0.000228008
3. Adding NewHomeOrders_4, FStat = 7.2995, pValue = 0.0088514
ans =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ __________
(Intercept) -82.309 43.518 -1.8914 0.06317
HousingStartsTotal_4 0.0023333 0.00040806 5.718 3.1831e-07
NewHomeOrders_4 0.16748 0.061989 2.7018 0.0088514
HomeBacklog 0.048105 0.014275 3.3698 0.0012884
Number of observations: 67, Error degrees of freedom: 63
Root Mean Squared Error: 47.1
R-squared: 0.544, Adjusted R-Squared: 0.522
F-statistic vs. constant model: 25, p-value = 8.93e-11
>>
I was able to reproduce this on my machine. I used the debugger to step into the MATLAB routines, and there is definitely something odd going on. I can't say that I've reached a complete understanding, but it looks to me as if the code is ignoring the input formula completely, and instead allowing the "upper" terms to include interactions, just the same as if you had simply called it as
stepwiselm(tab)
This definitely seems like a bug to me. I don't see any reported bugs in stepwiselm in R2021b on the bug report page. I would recommend you submit a bug report. (You can do it from that same page.) If you do, please keep us updated here!
Okay will do. Thanks for looking in to it so far! I really appreciate it.
I can't see how this is a bug, and your examples are totally consistent with stepwiselm doc. For the first two commands in your comment above, you still need to set the upper to linear, because by default stepwiselm uses 'interactions':
stepwiselm(tab,'HomeSales_4 ~ -1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear') % intercept -1?
stepwiselm(tab,'HomeSales_4 ~ 1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear')
Note that modelspec is only used as the starting model, and it doesn't mean the algorithm is bound to use only linear terms because the starting model contains linear terms!
Ahhh, of course.
Sorry for the misinformation!
Ah I see. That is my error then. Thank you very much for your clarification and help!

Sign in to comment.

Answers (0)

Products

Release

R2021a

Tags

Asked:

on 17 Oct 2021

Commented:

on 26 Oct 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!