- A model with a categorical predictor that has L levels (categories) includes L – 1 indicator variables. The model uses the first category as a reference level, so it does not include the indicator variable for the reference level. If the data type of the categorical predictor is categorical, then you can check the order of categories by using categories and reorder the categories by using reordercats to customize the reference level. For more details about creating indicator variables, see Automatic Creation of Dummy Variables.
- fitlm treats the group of L – 1 indicator variables as a single variable. If you want to treat the indicator variables as distinct predictor variables, create indicator variables manually by using dummyvar. Then use the indicator variables, except the one corresponding to the reference level of the categorical variable, when you fit a model. For the categorical predictor X, if you specify all columns of dummyvar(X) and an intercept term as predictors, then the design matrix becomes rank deficient.
- Interaction terms between a continuous predictor and a categorical predictor with L levels consist of the element-wise product of the L – 1 indicator variables with the continuous predictor.
- Interaction terms between two categorical predictors with L and M levels consist of the (L – 1)*(M – 1) indicator variables to include all possible combinations of the two categorical predictor levels.
- You cannot specify higher-order terms for a categorical predictor because the square of an indicator is equal to itself.
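As a rough illustration of the rank-deficiency point in the list above, the following sketch builds a made-up three-level predictor and compares the design matrix with and without the reference-level column. The data and the variable names `Afull` and `Adrop` are hypothetical and do not come from the thread.

```matlab
% Hypothetical toy predictor with L = 3 levels (not from the thread)
X = categorical(["a";"b";"c";"a";"b";"c"]);

% One indicator column per category (requires Statistics and Machine Learning Toolbox)
D = dummyvar(X);

% Intercept plus all L indicator columns: the indicators sum to the intercept column
Afull = [ones(6,1) D];

% Intercept plus L - 1 indicator columns: the first (reference) level is dropped
Adrop = [ones(6,1) D(:,2:end)];

rankFull = rank(Afull);   % smaller than the number of columns in Afull
rankDrop = rank(Adrop);   % equal to the number of columns in Adrop
```

Dropping the first column of `dummyvar(X)` mirrors the default behavior described above, where the first category serves as the reference level.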
How does fitlm set reference level with categorical variables?
I am running linear regression using fitlm with categorical datasets:
model = fitlm(DataTable ,'Score ~ Industry + Rating + Liquid')
The regression sets the reference levels for Industry and Rating to the values in the first row, but for the "Liquid" variable it sets "Q1" as the reference level. I am a little confused by this selection. I thought the regression would always use the first row as the reference for all three variables. Could you please explain why it chooses a different reference level for the "Liquid" variable?
Answers (1)
Cris LaPierre
on 11 Oct 2024
See this example: Linear Regression with Categorical Predictor
fitlm treats a categorical predictor as follows:
9 Comments
Guohua
on 11 Oct 2024
Thank you.
As the note says, "The model uses the first category as a reference level", which is why it selects "METAL" as the reference level for "Industry" and "BA1" as the reference for "Rating". But for "Liquid", it doesn't set the first category "Q2" as the reference; it uses "Q1". This is what I don't understand. Could you please clarify?
Cris LaPierre
on 11 Oct 2024
Can you attach your data table file to your post using the paperclip icon?
Walter Roberson
on 11 Oct 2024
"the first category" does not refer to the entry that is encountered first:
"the first category" refers to the category that sorts first.
Cris LaPierre
on 14 Oct 2024
The attached data does not contain the Liquid column, so I can't reproduce the issue. But as Walter stated, the categories of a categorical array are sorted by default (alphabetically for text), so the first value listed in the data is not necessarily the first category.
Liquid = categorical(["Q2";"Q2";"Q1";"Q1";"Q2";"Q4";"Q1";"Q5";"Q2";"Q1";"Q2";"Q3"])
Liquid = 12x1 categorical array
Q2
Q2
Q1
Q1
Q2
Q4
Q1
Q5
Q2
Q1
Q2
Q3
C = categories(Liquid)
C = 5x1 cell array
{'Q1'}
{'Q2'}
{'Q3'}
{'Q4'}
{'Q5'}
C{1}
ans = 'Q1'
If you want to change the order of your categories, use reordercats.
Liquid_V2 = reordercats(Liquid,["Q2";"Q1";"Q4";"Q5";"Q3"]);
C2 = categories(Liquid_V2)
C2 = 5x1 cell array
{'Q2'}
{'Q1'}
{'Q4'}
{'Q5'}
{'Q3'}
C2{1}
ans = 'Q2'
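If you only want to promote one level to the reference position and keep the remaining categories in their existing order, one possible sketch is shown below. The choice of "Q2" as the new reference and the names `refLevel`, `newOrder`, `Liquid_ref`, and `C3` are assumptions for illustration.

```matlab
Liquid = categorical(["Q2";"Q2";"Q1";"Q1";"Q2";"Q4";"Q1";"Q5";"Q2";"Q1";"Q2";"Q3"]);

% Default category order is sorted: Q1, Q2, Q3, Q4, Q5
cats = categories(Liquid);

% Hypothetical choice of the level to use as the reference
refLevel = 'Q2';

% Put the chosen level first and keep the others in their current order
newOrder = [{refLevel}; setdiff(cats, {refLevel}, 'stable')];
Liquid_ref = reordercats(Liquid, newOrder);

% The first entry is the category that a fit would treat as the reference
C3 = categories(Liquid_ref);
```

This avoids typing out the full category list by hand, which helps when a variable has many levels.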
Guohua
on 14 Oct 2024
Please use this data set. The model output I got is below. It looks like the model picks the first-row values of Industry and Rating as the reference, but for Liquid it selects "1" as the reference. What confuses me is that the model doesn't select the Industry and Rating reference levels in alphabetical order, which would be "AEROSPACE/DEFENSE" and "AAA", yet it does select Liquid "1" in numerical order. To be consistent, if the solver selects the first row as the reference for Industry and Rating, why doesn't it stick to this rule for "Liquid"? Thank you.
------------------------------------------ Output -------------------------------------------------
md2 =
Linear regression model:
Score ~ 1 + Industry + Rating + Liquid
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ __________ ___________
(Intercept) -30.661 31.957 -0.95945 0.33751
Industry_AIRLINES 74.075 48.623 1.5235 0.12788
Industry_TECHNOLOGY 46.33 28.915 1.6023 0.10933
Industry_RETAIL_&_SUPERMARKETS 58.743 32.97 1.7817 0.075031
Industry_OTHER_REITS 3.1752 50.35 0.063062 0.94973
Industry_PHARMACEUTICALS 23.979 39.108 0.61313 0.5399
Industry_MEDIA_ENTERTAINMENT -73.778 33.416 -2.2079 0.027427
Industry_AUTOMOTIVE_AUTO_SUPPLIERS 26.169 38.719 0.67587 0.49924
Industry_HEALTHCARE 27.374 31.002 0.88296 0.37742
Industry_RETAILERS -16.989 69.658 -0.24389 0.80736
Industry_INDUSTRIAL_OTHER -6.4796 38.917 -0.1665 0.86779
Industry_CONSUMER_PRODUCTS 23.455 36.496 0.64267 0.52055
Industry_P&C 58.788 32.565 1.8052 0.071269
Industry_REIT 85.54 32.967 2.5947 0.0095745
Industry_PKGED_FOOD_FOODSVCS_REST 30.374 33.14 0.91654 0.35955
Industry_CONSUMER_CYCLICAL_SERVICES 10.205 40.119 0.25436 0.79925
Industry_Utilities_OpCo_FMB 84.52 34.752 2.4321 0.015146
Industry_Utilities_Holdco 86.896 31.922 2.7222 0.0065725
Industry_Utilities_OpCo_Uns 71.542 35.941 1.9905 0.046742
Industry_LIFE 68.912 37.124 1.8563 0.063641
Industry_CONSTRUCTION_MACHINERY 64.929 43.476 1.4934 0.13556
Industry_AEROSPACE/DEFENSE 25.573 37.21 0.68726 0.49204
Industry_AIRCRAFT_LEASE 26.407 78.714 0.33549 0.73731
Industry_CHEMICALS 79.924 33.1 2.4146 0.015888
Industry_BANKING_US_SUB 67.288 35.675 1.8862 0.059495
Industry_BANKING_US_SR 78.517 33.661 2.3326 0.019822
Industry_MIDSTREAM 57.774 31.627 1.8267 0.067968
Industry_BROKERAGE_ASSETMANAGERS_EXCHANGES 61.685 36.962 1.6689 0.095383
Industry_CABLE_TELCO -15.638 32.765 -0.47727 0.63325
Industry_INDEPENDENT 21.359 34.483 0.61941 0.53576
Industry_OIL_FIELD_SERVICES -35.168 36.037 -0.97587 0.32931
Industry_FINANCE_COMPANIES 23.547 34.786 0.67691 0.49858
Industry_BANKING 39.035 44.918 0.86904 0.38498
Industry_LIFE_FA_BACKED_NOTES 70.792 95.979 0.73757 0.46091
Industry_DIVERSIFIED_MANUFACTURING 37.67 32.393 1.1629 0.24508
Industry_PACKAGING 24.327 42.538 0.5719 0.56749
Industry_TRANSPORTATION_SERVICES 26.77 42.432 0.63089 0.52822
Industry_ELECTRIC 79.585 59.145 1.3456 0.17867
Industry_GAMING 31.264 39.646 0.78857 0.43051
Industry_PAPER 35.469 48.39 0.73299 0.4637
Industry_BUILDING_MATERIALS 17.926 37.124 0.48287 0.62927
Industry_BEVERAGE 64.112 52.879 1.2124 0.22557
Industry_NO_INDUSTRY -61.964 71.257 -0.86958 0.38469
Industry_RAILROADS_ENVIRONMENTAL 42.524 44.653 0.95232 0.34111
Industry_HOME_CONSTRUCTION 7.4579 40.197 0.18553 0.85284
Industry_FINANCIAL_OTHER 5.8455 63.208 0.092481 0.92633
Industry_CABLE_SATELLITE -1.1275 131.6 -0.0085676 0.99317
Industry_LODGING_LEISURE 4.4847 35.959 0.12472 0.90077
Industry_Utilities_Genco 20.092 48.759 0.41207 0.68036
Industry_REFINING -14.143 50.302 -0.28117 0.77862
Industry_REITS_HEALTHCARE 63.194 50.417 1.2534 0.21027
Industry_INTEGRATED 79.255 71.394 1.1101 0.26716
Industry_HEALTHCARE_REITS -597.22 100.19 -5.9606 3.2334e-09
Industry_RETAIL_REITS 33.916 94.718 0.35808 0.72034
Industry_AUTOMOTIVE_AUTO_FINCO 53.97 69.608 0.77535 0.43828
Industry_HIGHER_ED_TXCRP 19.481 131.76 0.14785 0.88248
Industry_ENVIRONMENTAL -58.522 131.98 -0.44342 0.65753
Industry_BANKING_US_PFD 32.933 78.824 0.4178 0.67616
Industry_AIRLINES_EETC_A 33.805 95.25 0.3549 0.72272
Industry_INSURANCE_US_SUBORDINATED 53.352 78.697 0.67795 0.49793
Industry_TOBACCO 41.967 69.422 0.60451 0.54561
Industry_BANKING_GLOBAL_TLAC_SR 60.837 131.21 0.46367 0.64296
Industry_UTILITY_OTHER 5.1819 131.24 0.039483 0.96851
Rating_BA2 30.788 20.998 1.4662 0.14282
Rating_AA1 -72.611 129.89 -0.55901 0.57625
Rating_BAA3 -6.6247 19.214 -0.34478 0.73032
Rating_A3 -46.588 22.728 -2.0498 0.040584
Rating_B3 78.003 24.971 3.1238 0.0018249
Rating_AA3 -42.946 38.267 -1.1223 0.26196
Rating_BAA1 -30.175 20.288 -1.4874 0.13716
Rating_B1 33.131 21.115 1.569 0.11688
Rating_BA3 14.299 20.243 0.70638 0.48008
Rating_B2 20.609 22.265 0.92559 0.35483
Rating_A1 -71.461 28.924 -2.4707 0.013613
Rating_A2 -54.844 24.399 -2.2478 0.024757
Rating_BAA2 -24.753 19.031 -1.3007 0.1936
Rating_CAA3 534.91 43.266 12.363 2.8422e-33
Rating_AA2 -113.84 52.126 -2.1839 0.029147
Rating_CAA1 169.91 27.988 6.0709 1.667e-09
Rating_CAA2 413.13 39.372 10.493 8.8261e-25
Rating_CA 788.56 57.97 13.603 1.7066e-39
Rating_NR -98.999 131.24 -0.75431 0.4508
Rating_AAA -48.279 93.631 -0.51563 0.6062
Rating_C 3773.2 94.223 40.045 5.4623e-229
Liquid_2 -7.5509 10.665 -0.70804 0.47905
Liquid_3 23.177 11.96 1.9378 0.052866
Liquid_4 11.18 12.751 0.87681 0.38075
Liquid_5 20.929 14.331 1.4604 0.14442
Number of observations: 1386, Error degrees of freedom: 1298
Root Mean Squared Error: 128
R-squared: 0.643, Adjusted R-Squared: 0.619
F-statistic vs. constant model: 26.9, p-value = 7.06e-231
Cris LaPierre
on 15 Oct 2024
You are seeing this behavior because Industry and Rating are not categorical variables.
load matlab_datasample2.mat
varfun(@class,DataSample)
ans = 1x4 table
class_Score class_Industry class_Rating class_Liquid
___________ ______________ ____________ ____________
double cell cell categorical
If you want row 1 to be the reference values, then either don't use categorical data types, or use reordercats to ensure the row 1 categorical values are the first category.
Here, I'm converting Liquid to string.
DataSample = convertvars(DataSample, "Liquid","string")
DataSample = 1406x4 table
Score Industry Rating Liquid
_______ _____________________________ ________ ______
-92.102 {'METALS_AND_MINING' } {'BA1' } "2"
-125.94 {'AIRLINES' } {'BA2' } "2"
-90.965 {'AIRLINES' } {'BA1' } "1"
-56.942 {'TECHNOLOGY' } {'AA1' } "1"
-127.78 {'RETAIL_&_SUPERMARKETS' } {'BA1' } "2"
9.7511 {'OTHER_REITS' } {'BAA3'} "4"
4.5882 {'PHARMACEUTICALS' } {'A3' } "1"
-112.25 {'MEDIA_ENTERTAINMENT' } {'B3' } "5"
-84.497 {'AUTOMOTIVE_AUTO_SUPPLIERS'} {'BA2' } "2"
-53.485 {'HEALTHCARE' } {'AA3' } "1"
0.51723 {'METALS_AND_MINING' } {'BAA1'} "2"
-3.3194 {'AIRLINES' } {'BA1' } "3"
-62.494 {'RETAILERS' } {'BA2' } "5"
32.613 {'INDUSTRIAL_OTHER' } {'B1' } "3"
8.5647 {'CONSUMER_PRODUCTS' } {'BA3' } "4"
-4.5917 {'P&C' } {'BAA1'} "2"
varfun(@class,DataSample)
ans = 1x4 table
class_Score class_Industry class_Rating class_Liquid
___________ ______________ ____________ ____________
double cell cell string
model = fitlm(DataSample,'Score ~ Industry + Rating + Liquid')
model =
Linear regression model:
Score ~ 1 + Industry + Rating + Liquid
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ __________ ___________
(Intercept) -38.212 31.515 -1.2125 0.22554
Industry_AIRLINES 74.075 48.623 1.5235 0.12788
Industry_TECHNOLOGY 46.33 28.915 1.6023 0.10933
Industry_RETAIL_&_SUPERMARKETS 58.743 32.97 1.7817 0.075031
Industry_OTHER_REITS 3.1752 50.35 0.063062 0.94973
Industry_PHARMACEUTICALS 23.979 39.108 0.61313 0.5399
Industry_MEDIA_ENTERTAINMENT -73.778 33.416 -2.2079 0.027427
Industry_AUTOMOTIVE_AUTO_SUPPLIERS 26.169 38.719 0.67587 0.49924
Industry_HEALTHCARE 27.374 31.002 0.88296 0.37742
Industry_RETAILERS -16.989 69.658 -0.24389 0.80736
Industry_INDUSTRIAL_OTHER -6.4796 38.917 -0.1665 0.86779
Industry_CONSUMER_PRODUCTS 23.455 36.496 0.64267 0.52055
Industry_P&C 58.788 32.565 1.8052 0.071269
Industry_REIT 85.54 32.967 2.5947 0.0095745
Industry_PKGED_FOOD_FOODSVCS_REST 30.374 33.14 0.91654 0.35955
Industry_CONSUMER_CYCLICAL_SERVICES 10.205 40.119 0.25436 0.79925
Industry_Utilities_OpCo_FMB 84.52 34.752 2.4321 0.015146
Industry_Utilities_Holdco 86.896 31.922 2.7222 0.0065725
Industry_Utilities_OpCo_Uns 71.542 35.941 1.9905 0.046742
Industry_LIFE 68.912 37.124 1.8563 0.063641
Industry_CONSTRUCTION_MACHINERY 64.929 43.476 1.4934 0.13556
Industry_AEROSPACE/DEFENSE 25.573 37.21 0.68726 0.49204
Industry_AIRCRAFT_LEASE 26.407 78.714 0.33549 0.73731
Industry_CHEMICALS 79.924 33.1 2.4146 0.015888
Industry_BANKING_US_SUB 67.288 35.675 1.8862 0.059495
Industry_BANKING_US_SR 78.517 33.661 2.3326 0.019822
Industry_MIDSTREAM 57.774 31.627 1.8267 0.067968
Industry_BROKERAGE_ASSETMANAGERS_EXCHANGES 61.685 36.962 1.6689 0.095383
Industry_CABLE_TELCO -15.638 32.765 -0.47727 0.63325
Industry_INDEPENDENT 21.359 34.483 0.61941 0.53576
Industry_OIL_FIELD_SERVICES -35.168 36.037 -0.97587 0.32931
Industry_FINANCE_COMPANIES 23.547 34.786 0.67691 0.49858
Industry_BANKING 39.035 44.918 0.86904 0.38498
Industry_LIFE_FA_BACKED_NOTES 70.792 95.979 0.73757 0.46091
Industry_DIVERSIFIED_MANUFACTURING 37.67 32.393 1.1629 0.24508
Industry_PACKAGING 24.327 42.538 0.5719 0.56749
Industry_TRANSPORTATION_SERVICES 26.77 42.432 0.63089 0.52822
Industry_ELECTRIC 79.585 59.145 1.3456 0.17867
Industry_GAMING 31.264 39.646 0.78857 0.43051
Industry_PAPER 35.469 48.39 0.73299 0.4637
Industry_BUILDING_MATERIALS 17.926 37.124 0.48287 0.62927
Industry_BEVERAGE 64.112 52.879 1.2124 0.22557
Industry_NO_INDUSTRY -61.964 71.257 -0.86958 0.38469
Industry_RAILROADS_ENVIRONMENTAL 42.524 44.653 0.95232 0.34111
Industry_HOME_CONSTRUCTION 7.4579 40.197 0.18553 0.85284
Industry_FINANCIAL_OTHER 5.8455 63.208 0.092481 0.92633
Industry_CABLE_SATELLITE -1.1275 131.6 -0.0085676 0.99317
Industry_LODGING_LEISURE 4.4847 35.959 0.12472 0.90077
Industry_Utilities_Genco 20.092 48.759 0.41207 0.68036
Industry_REFINING -14.143 50.302 -0.28117 0.77862
Industry_REITS_HEALTHCARE 63.194 50.417 1.2534 0.21027
Industry_INTEGRATED 79.255 71.394 1.1101 0.26716
Industry_HEALTHCARE_REITS -597.22 100.19 -5.9606 3.2334e-09
Industry_RETAIL_REITS 33.916 94.718 0.35808 0.72034
Industry_AUTOMOTIVE_AUTO_FINCO 53.97 69.608 0.77535 0.43828
Industry_HIGHER_ED_TXCRP 19.481 131.76 0.14785 0.88248
Industry_ENVIRONMENTAL -58.522 131.98 -0.44342 0.65753
Industry_BANKING_US_PFD 32.933 78.824 0.4178 0.67616
Industry_AIRLINES_EETC_A 33.805 95.25 0.3549 0.72272
Industry_INSURANCE_US_SUBORDINATED 53.352 78.697 0.67795 0.49793
Industry_TOBACCO 41.967 69.422 0.60451 0.54561
Industry_BANKING_GLOBAL_TLAC_SR 60.837 131.21 0.46367 0.64296
Industry_UTILITY_OTHER 5.1819 131.24 0.039483 0.96851
Rating_BA2 30.788 20.998 1.4662 0.14282
Rating_AA1 -72.611 129.89 -0.55901 0.57625
Rating_BAA3 -6.6247 19.214 -0.34478 0.73032
Rating_A3 -46.588 22.728 -2.0498 0.040584
Rating_B3 78.003 24.971 3.1238 0.0018249
Rating_AA3 -42.946 38.267 -1.1223 0.26196
Rating_BAA1 -30.175 20.288 -1.4874 0.13716
Rating_B1 33.131 21.115 1.569 0.11688
Rating_BA3 14.299 20.243 0.70638 0.48008
Rating_B2 20.609 22.265 0.92559 0.35483
Rating_A1 -71.461 28.924 -2.4707 0.013613
Rating_A2 -54.844 24.399 -2.2478 0.024757
Rating_BAA2 -24.753 19.031 -1.3007 0.1936
Rating_CAA3 534.91 43.266 12.363 2.8422e-33
Rating_AA2 -113.84 52.126 -2.1839 0.029147
Rating_CAA1 169.91 27.988 6.0709 1.667e-09
Rating_CAA2 413.13 39.372 10.493 8.8261e-25
Rating_CA 788.56 57.97 13.603 1.7066e-39
Rating_NR -98.999 131.24 -0.75431 0.4508
Rating_AAA -48.279 93.631 -0.51563 0.6062
Rating_C 3773.2 94.223 40.045 5.4623e-229
Liquid_1 7.5509 10.665 0.70804 0.47905
Liquid_4 18.731 11.687 1.6027 0.10925
Liquid_5 28.48 13.186 2.1599 0.030965
Liquid_3 30.728 11.029 2.7862 0.0054111
Number of observations: 1386, Error degrees of freedom: 1298
Root Mean Squared Error: 128
R-squared: 0.643, Adjusted R-Squared: 0.619
F-statistic vs. constant model: 26.9, p-value = 7.06e-231
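The other option mentioned above, keeping a categorical type while making the row 1 value the first category, might look like the following sketch. The toy table, its values, and the variable names are hypothetical and only loosely mimic the columns in this thread.

```matlab
% Hypothetical toy data (not the attached data set)
Industry = {'METALS_AND_MINING';'AIRLINES';'AIRLINES';'TECHNOLOGY'; ...
            'METALS_AND_MINING';'TECHNOLOGY';'AIRLINES';'METALS_AND_MINING'};
Score = [1.0; 2.5; 2.0; 4.0; 1.5; 3.5; 2.2; 0.8];

% Default conversion sorts the categories, so AIRLINES comes first
sortedCats = categories(categorical(Industry));

% Passing an explicit value set keeps the order of first appearance
firstSeen = unique(Industry, 'stable');
IndustryCat = categorical(Industry, firstSeen);
orderedCats = categories(IndustryCat);   % METALS_AND_MINING is now first

% Fit on the toy table; the first category gets no indicator variable
T = table(Score, IndustryCat);
mdl = fitlm(T, 'Score ~ IndustryCat');
names = mdl.CoefficientNames;
```

Here `unique(...,'stable')` supplies the categories in the order they first occur, so the value in row 1 becomes the reference level without a separate call to reordercats.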