find the best linear regression model using stepwiselm
sp
I have a table whose response variable is 'VNAF'. I have 9 other predictors, and I'm trying to use stepwiselm to find the best linear regression model according to the highest R-squared. I want to consider all possible interaction terms and also quadratic terms. This is my code:
fileName = 'Aim1Data_CMC5_noRRA_individuals.xlsx';
T = readtable(fileName,'ReadRowNames',true);
mdl = stepwiselm(T,'quadratic','ResponseVar','VNAF','Criterion','rsquared')
This gives me an R-squared of 0.705, while the simple linear regression model gives me an R-squared of 0.67.
My first question: am I using stepwiselm correctly? That is, does my code consider all the interaction terms and quadratic terms for all the predictors?
My second question: how can I improve my R-squared? Does the step function help? If so, what should I specify in it?
Thank you all
the cyclist on 10 Jun 2022
First, I feel obligated to mention that stepwise regression is often criticized as a technique. The Wikipedia article on stepwise regression discusses the criticisms.
Second: Yes, I think you are calling the function correctly, to get the terms you want to try.
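If you would rather not rely on defaults, one way to make the candidate term set explicit is to pass the starting model and the 'Upper' bound separately. The sketch below is only an illustration: it uses a synthetic three-predictor table as a stand-in for your spreadsheet (the variable names x1, x2, x3 and the data-generating formula are made up), and it assumes, as I understand the documentation, that the second positional argument is the starting model while 'Upper' limits the largest model the search may reach.

```matlab
% Synthetic stand-in for the asker's table (the real spreadsheet is not available here).
rng(0);                                   % reproducible random numbers
n = 181;                                  % sample size mentioned in this thread
T = array2table(randn(n,3),'VariableNames',{'x1','x2','x3'});
% Hypothetical response with one interaction and one squared term plus noise.
T.VNAF = 1 + 2*T.x1 - T.x2 + 0.5*T.x1.*T.x2 + 0.8*T.x3.^2 + randn(n,1);

% Start from an intercept-only model and allow terms up to the full
% quadratic model (linear, pairwise interaction, and squared terms).
mdl = stepwiselm(T,'constant', ...
    'Upper','quadratic', ...
    'ResponseVar','VNAF', ...
    'Criterion','rsquared', ...
    'Verbose',0);                         % suppress the step-by-step log

disp(mdl.Formula)                         % terms that survived the search
disp(mdl.Rsquared.Ordinary)               % ordinary R-squared of the final model
```

Inspecting mdl.Formula on your own data is a quick check of which interaction and squared terms were actually kept.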
Third: You might want to get a better understanding of the 'Criterion' you are using. If your goal is for this model to generalize beyond your dataset, I think something like AIC might be a better choice. (I am also a little confused why the function did not just return the full model with all possible terms, since that model gives the max R^2. But that is probably just me not researching the documentation fully.)
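On the full-model puzzle, a likely explanation (if I read the documentation correctly) is that with 'Criterion','rsquared' a term is added only when it raises R^2 by more than 'PEnter' and is removed when dropping it lowers R^2 by less than 'PRemove', so the search stops well short of the full model. A minimal sketch for comparing the two criteria, again on made-up data with hypothetical predictor names, might look like this:

```matlab
% Synthetic data standing in for the real table (names and formula are assumptions).
rng(1);
n = 181;
T = array2table(randn(n,3),'VariableNames',{'x1','x2','x3'});
T.VNAF = 1 + 2*T.x1 - T.x2 + 0.5*T.x1.*T.x2 + randn(n,1);

% Same search space, two different selection criteria.
mdlR2  = stepwiselm(T,'constant','Upper','quadratic','ResponseVar','VNAF', ...
    'Criterion','rsquared','Verbose',0);
mdlAIC = stepwiselm(T,'constant','Upper','quadratic','ResponseVar','VNAF', ...
    'Criterion','aic','Verbose',0);

% Report model size and AIC for each result (lower AIC is better).
fprintf('rsquared criterion: %d coefficients, AIC = %.1f\n', ...
    mdlR2.NumCoefficients, mdlR2.ModelCriterion.AIC);
fprintf('aic criterion:      %d coefficients, AIC = %.1f\n', ...
    mdlAIC.NumCoefficients, mdlAIC.ModelCriterion.AIC);
```

The 'PEnter' and 'PRemove' name-value arguments can be passed to stepwiselm to change the thresholds, though loosening them mostly makes it easier to overfit.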
Last: Empirically, a model with R^2=0.67 and one with R^2=0.71 are so close to the same (in terms of explained variation) that I would consider them to be basically identical. I would not choose a model based on that difference.
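One way to check whether a gap like that means anything is to compare out-of-sample error rather than in-sample R^2. The following is a rough sketch under stated assumptions: the data are synthetic, the predictor names are hypothetical, and it compares fixed 'linear' and 'quadratic' specifications with 5-fold cross-validation instead of re-running the stepwise search inside each fold.

```matlab
% Synthetic data standing in for the real table (names and formula are assumptions).
rng(2);
n = 181;
T = array2table(randn(n,3),'VariableNames',{'x1','x2','x3'});
T.VNAF = 1 + 2*T.x1 - T.x2 + 0.3*T.x1.*T.x2 + randn(n,1);

cvp   = cvpartition(n,'KFold',5);          % 5-fold split of the rows
specs = {'linear','quadratic'};            % model specifications to compare
rmse  = zeros(cvp.NumTestSets,numel(specs));

for k = 1:cvp.NumTestSets
    trainT = T(training(cvp,k),:);         % rows used for fitting in fold k
    testT  = T(test(cvp,k),:);             % held-out rows for fold k
    for j = 1:numel(specs)
        mdl  = fitlm(trainT,specs{j},'ResponseVar','VNAF');
        yhat = predict(mdl,testT);
        rmse(k,j) = sqrt(mean((testT.VNAF - yhat).^2));
    end
end

% Average held-out RMSE per specification.
disp(array2table(mean(rmse,1),'VariableNames',specs))
```

If the two columns come out about the same on your data, that supports treating the models as equivalent and picking the simpler one.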
Since this is a physical system, I feel like there should be some rationale about why each term might come in a particular power, and that could help drive the decision.
If you are truly going for only predictive power, you might want to use a machine learning model instead ... although with only 181 data points, I'm not entirely sure that is the best way to go.