Main Content

This example shows how to perform a regression with categorical
covariates using categorical arrays and `fitlm`

.

`load carsmall`

The variable `MPG`

contains measurements on the miles per
gallon of 100 sample cars. The model year of each car is in the variable
`Model_Year`

, and `Weight`

contains the
weight of each car.

Draw a scatter plot of `MPG`

against
`Weight`

, grouped by model year.

figure() gscatter(Weight,MPG,Model_Year,'bgr','x.o') title('MPG vs. Weight, Grouped by Model Year')

The grouping variable, `Model_Year`

, has three
unique values, `70`

, `76`

, and
`82`

, corresponding to model years 1970, 1976, and
1982.

Create a table that contains the variables `MPG`

,
`Weight`

, and `Model_Year`

. Convert the
variable `Model_Year`

to a categorical array.

cars = table(MPG,Weight,Model_Year); cars.Model_Year = categorical(cars.Model_Year);

Fit a regression model using `fitlm`

with
`MPG`

as the dependent variable, and
`Weight`

and `Model_Year`

as the
independent variables. Because `Model_Year`

is a categorical
covariate with three levels, it should enter the model as two indicator
variables.

The scatter plot suggests that the slope of `MPG`

against
`Weight`

might differ for each model year. To assess this,
include weight-year interaction terms.

The proposed model is

$$E(MPG)={\beta}_{0}+{\beta}_{1}Weight+{\beta}_{2}I[1976]+{\beta}_{3}I[1982]+{\beta}_{4}Weight\times I[1976]+{\beta}_{5}Weight\times I[1982],$$

where *I*[1976] and
*I*[1982] are dummy variables indicating the model years 1976
and 1982, respectively. *I*[1976] takes the value 1 if model
year is 1976 and takes the value 0 if it is not. *I*[1982]
takes the value 1 if model year is 1982 and takes the value 0 if it is not. In
this model, 1970 is the reference year.

`fit = fitlm(cars,'MPG~Weight*Model_Year')`

fit = Linear regression model: MPG ~ 1 + Weight*Model_Year Estimated Coefficients: Estimate SE ___________ __________ (Intercept) 37.399 2.1466 Weight -0.0058437 0.00061765 Model_Year_76 4.6903 2.8538 Model_Year_82 21.051 4.157 Weight:Model_Year_76 -0.00082009 0.00085468 Weight:Model_Year_82 -0.0050551 0.0015636 tStat pValue ________ __________ (Intercept) 17.423 2.8607e-30 Weight -9.4612 4.6077e-15 Model_Year_76 1.6435 0.10384 Model_Year_82 5.0641 2.2364e-06 Weight:Model_Year_76 -0.95953 0.33992 Weight:Model_Year_82 -3.2329 0.0017256 Number of observations: 94, Error degrees of freedom: 88 Root Mean Squared Error: 2.79 R-squared: 0.886, Adjusted R-Squared: 0.88 F-statistic vs. constant model: 137, p-value = 5.79e-40

The regression output shows:

`fitlm`

recognizes`Model_Year`

as a categorical variable, and constructs the required indicator (dummy) variables. By default, the first level,`70`

, is the reference group (use`reordercats`

to change the reference group).The model specification,

`MPG~Weight*Model_Year`

, specifies the first-order terms for`Weight`

and`Model_Year`

, and all interactions.The model

*R*= 0.886, meaning the variation in miles per gallon is reduced by 88.6% when you consider weight, model year, and their interactions.^{2}The fitted model is

$$M\widehat{P}G=37.4-0.006Weight+4.7I[1976]+21.1I[1982]-0.0008Weight\times I[1976]-0.005Weight\times I[1982].$$

Thus, the estimated regression equations for the model years are as follows.

Model Year Predicted MPG Against Weight 1970 $$M\widehat{P}G=37.4-0.006Weight$$

1976 $$M\widehat{P}G=(37.4+4.7)-(0.006+0.0008)Weight$$

1982 $$M\widehat{P}G=(37.4+21.1)-(0.006+0.005)Weight$$

The relationship between `MPG`

and `Weight`

has an increasingly negative slope as the model year increases.

Plot the data and fitted regression lines.

w = linspace(min(Weight),max(Weight)); figure() gscatter(Weight,MPG,Model_Year,'bgr','x.o') line(w,feval(fit,w,'70'),'Color','b','LineWidth',2) line(w,feval(fit,w,'76'),'Color','g','LineWidth',2) line(w,feval(fit,w,'82'),'Color','r','LineWidth',2) title('Fitted Regression Lines by Model Year')

Test for significant differences between the slopes. This is equivalent to testing the hypothesis

$$\begin{array}{l}{H}_{0}:{\beta}_{4}={\beta}_{5}=0\\ {H}_{A}:\text{\hspace{0.17em}}{\beta}_{i}\ne 0\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{for}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{at}\text{\hspace{0.17em}}\text{least}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{one}\text{\hspace{0.17em}}\text{\hspace{0.17em}}i.\end{array}$$

anova(fit)

ans = SumSq DF MeanSq F pValue Weight 2050.2 1 2050.2 263.87 3.2055e-28 Model_Year 807.69 2 403.84 51.976 1.2494e-15 Weight:Model_Year 81.219 2 40.609 5.2266 0.0071637 Error 683.74 88 7.7698

`0.0072`

(from the interaction row,
`Weight:Model_Year`

), so the null hypothesis is rejected at
the 0.05 significance level. The value of the test statistic is
`5.2266`

. The numerator degrees of freedom for the test is
`2`

, which is the number of coefficients in the null
hypothesis.There is sufficient evidence that the slopes are not equal for all three model years.

`anova`

| `categorical`

| `fitlm`

| `reordercats`

- Test Differences Between Category Means
- Linear Regression
- Linear Regression Workflow
- Interpret Linear Regression Results