stepwiselm does not respect 'Upper' 'linear' limit during multiple iterations
Show older comments
Hi
I am trying to run a regression analysis for a public company by trying to figure out which variables are important to determine the overall sales (using stepwiselm for this), but I don't want interaction terms. To test time lags on the different factors, I take the original raw data and then run multiple calls to stepwiselm with various time lags for each of the factors (the data table is generated in another function and result is stored in variable tab). My ultimate goal is to find the regression equation with the highest adjusted R2.
What I noticed is that when stepwiselm is called multiple times (for example in excess of 400 runs) in succession, it ends up bringing in interaction terms in the final regression equation. This is the call to stepwiselm in my for loop. (Note that I get the same result whether I use a "for" or a "parfor" loop for exection.)
mdl=stepwiselm(tab,'Upper','linear','Verbose',0)
I compile the results from each stepwiselm iteration in a cell array called models (first column holds the model, the third column has the equation). This is one of the results which contains the interaction terms:
>> models{445,1}
ans =
Linear regression model:
HomeSales_4 ~ [Linear formula with 6 terms in 3 predictors]
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ _________
(Intercept) 480.89 135.42 3.5511 0.0007463
HousingStartsTotal_4 -0.0035197 0.0013978 -2.5179 0.014445
NewHomeOrders_4 -0.70127 0.32902 -2.1314 0.037094
HomeBacklog -0.1529 0.083674 -1.8273 0.07255
HousingStartsTotal_4:NewHomeOrders_4 8.9666e-06 3.2585e-06 2.7518 0.0077939
HousingStartsTotal_4:HomeBacklog 2.0849e-06 8.3874e-07 2.4858 0.015682
Number of observations: 67, Error degrees of freedom: 61
Root Mean Squared Error: 41.8
R-squared: 0.652, Adjusted R-Squared: 0.624
F-statistic vs. constant model: 22.9, p-value = 7.53e-13
>> models{445,3}
ans =
"HomeSales_4 ~ 1 + HousingStartsTotal_4*NewHomeOrders_4 + HousingStartsTotal_4*HomeBacklog"
However, if I run the same regression manually (i.e. just one iteration with the same input X and y), stepwiselm does not generate the interaction terms.
>> mdl=stepwiselm(tab2,'Upper','linear','Verbose',0)
mdl =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ __________
(Intercept) -82.309 43.518 -1.8914 0.06317
HousingStartsTotal_4 0.0023333 0.00040806 5.718 3.1831e-07
NewHomeOrders_4 0.16748 0.061989 2.7018 0.0088514
HomeBacklog 0.048105 0.014275 3.3698 0.0012884
Number of observations: 67, Error degrees of freedom: 63
Root Mean Squared Error: 47.1
R-squared: 0.544, Adjusted R-Squared: 0.522
F-statistic vs. constant model: 25, p-value = 8.93e-11
I am at a loss as to what is going on. I tried manually defining the equation in the Wilkinson format (instead of using 'Upper', 'linear'), but I still end up with the same results. I appreciate any inputs you may have in to the matter.
Thanks!
7 Comments
the cyclist
on 17 Oct 2021
That does seem strange!
How reliably reproducible is the behavior? Would you be able to upload a small dataset and code that allows us to reproduce the issue?
James Craig
on 19 Oct 2021
the cyclist
on 20 Oct 2021
I was able to reproduce this on my machine. I used the debugger to step into the MATLAB routines, and there is definitely something odd going on. I can't say that I've reached a complete understanding, but it looks to me as if the code is ignoring the input formula completely, and instead allowing the "upper" terms to include interactions, just the same as if you had simply called it as
stepwiselm(tab)
This definitely seems like a bug to me. I don't see any reported bugs in stepwiselm in R2021b on the bug report page. I would recommend you submit a bug report. (You can do it from that same page.) If you do, please keep us updated here!
James Craig
on 22 Oct 2021
I can't see how this is a bug, and your examples are totally consistent with stepwiselm doc. For the first two commands in your comment above, you still need to set the upper to linear, because by default stepwiselm uses 'interactions':
stepwiselm(tab,'HomeSales_4 ~ -1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear') % intercept -1?
stepwiselm(tab,'HomeSales_4 ~ 1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear')
Note that modelspec is only used as the starting model, and it doesn't mean the algorithm is bound to use only linear terms because the starting model contains linear terms!
the cyclist
on 24 Oct 2021
Ahhh, of course.
Sorry for the misinformation!
James Craig
on 26 Oct 2021
Answers (0)
Categories
Find more on Model Building and Assessment in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!