Credit Scorecard Modeling with Missing Values

This example shows how to handle missing values when working with creditscorecard objects. First, the example shows how to use the creditscorecard functionality to create an explicit bin for missing data with corresponding points. Then, the example shows how to "treat" the missing data to get a final scorecard with no explicit bins for missing values. To develop a scorecard without explicit bins for missing data, you must treat the training data, and you must apply the same treatment to any new data set before scoring it.

Develop a Credit Scorecard with Explicit Bins for Missing Values

When creating a creditscorecard object, the data may contain missing values. To handle them explicitly, set the name-value pair argument 'BinMissingData' to true when you call creditscorecard. In this case, the missing data for numeric predictors (NaN values) and for categorical predictors (<undefined> values) is placed in a separate bin labeled <missing> that appears at the end of the bins. Predictors with no missing values in the training data have no <missing> bin. If you do not specify the 'BinMissingData' argument or if you set 'BinMissingData' to false, the creditscorecard function discards missing observations when computing frequencies of Good and Bad, and neither the bininfo nor the plotbins function reports such observations.

The <missing> bin remains in place throughout the scorecard modeling process. The final scorecard explicitly indicates the points to be assigned to missing values for predictors that have a <missing> bin. These points are determined from the Weight-of-Evidence (WOE) of the <missing> bin and the predictor's coefficient in the logistic model. For predictors without an explicit <missing> bin, you can still assign points to missing values by using the 'Missing' name-value argument of formatpoints.

The dataMissing table in the CreditCardData.mat file has two predictors, CustAge and ResStatus, with missing values.

load CreditCardData.mat
head(dataMissing,5)
ans=5×11 table
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
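
Before building the scorecard, it can help to confirm which variables actually contain missing entries. The following is a minimal sketch, assuming CreditCardData.mat is on the path; nMissing and missingVars are illustrative names that do not appear elsewhere in this example.

```matlab
load CreditCardData.mat

% ismissing flags NaN in numeric variables and <undefined> in categoricals
nMissing = sum(ismissing(dataMissing));   % one count per table variable

% Names of the variables that have at least one missing entry
missingVars = dataMissing.Properties.VariableNames(nMissing > 0);
disp(missingVars)
disp(nMissing(nMissing > 0))
```

The counts should agree with the Good plus Bad totals that bininfo reports for the <missing> bin of each predictor.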

Create a creditscorecard object from the dataMissing table, which contains missing values. Set the 'BinMissingData' argument to true. Apply automatic binning.

sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);

The bin information and bin plots for the predictors that have missing data both show a <missing> bin at the end.

bi = bininfo(sc,'CustAge');
disp(bi)
        Bin        Good    Bad     Odds       WOE       InfoValue 
    ___________    ____    ___    ______    ________    __________

    '[-Inf,33)'     69      52    1.3269    -0.42156      0.018993
    '[33,37)'       63      45       1.4    -0.36795      0.012839
    '[37,40)'       72      47    1.5319     -0.2779     0.0079824
    '[40,46)'      172      89    1.9326    -0.04556     0.0004549
    '[46,48)'       59      25      2.36     0.15424     0.0016199
    '[48,51)'       99      41    2.4146     0.17713     0.0035449
    '[51,58)'      157      62    2.5323     0.22469     0.0088407
    '[58,Inf]'      93      25      3.72     0.60931      0.032198
    '<missing>'     19      11    1.7273    -0.15787    0.00063885
    'Totals'       803     397    2.0227         NaN      0.087112
plotbins(sc,'CustAge')

bi = bininfo(sc,'ResStatus');
disp(bi)
        Bin         Good    Bad     Odds        WOE       InfoValue 
    ____________    ____    ___    ______    _________    __________

    'Tenant'        296     161    1.8385    -0.095463     0.0035249
    'Home Owner'    352     171    2.0585     0.017549    0.00013382
    'Other'         128      52    2.4615      0.19637     0.0055808
    '<missing>'      27      13    2.0769     0.026469    2.3248e-05
    'Totals'        803     397    2.0227          NaN     0.0092627
plotbins(sc,'ResStatus')

The training data for the 'CustAge' and 'ResStatus' predictors has missing data (NaNs and <undefined>). The binning process estimates WOE values of -0.15787 and 0.026469, respectively, for the missing data in these predictors.
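
The WOE and InfoValue columns can be reproduced from the Good and Bad counts alone. The sketch below assumes the usual definitions (WOE as the log of the ratio of a bin's share of Goods to its share of Bads, and Information Value as the sum of the share difference times WOE) and uses counts copied from the CustAge table above; the variable names are illustrative.

```matlab
% Good and Bad counts copied from the bininfo table for CustAge
% (eight numeric bins followed by the <missing> bin)
good = [69 63 72 172 59 99 157 93 19];
bad  = [52 45 47  89 25 41  62 25 11];

distGood = good ./ sum(good);   % share of all Goods in each bin
distBad  = bad  ./ sum(bad);    % share of all Bads in each bin

woe = log(distGood ./ distBad);             % Weight of Evidence per bin
iv  = sum((distGood - distBad) .* woe);     % total Information Value

fprintf('WOE of <missing> bin: %.5f\n', woe(end));
fprintf('Total InfoValue:      %.6f\n', iv);
```

Up to rounding, the two printed values match the <missing> row and the Totals row of the table (-0.15787 and 0.087112).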

EmpStatus and CustIncome have no explicit <missing> bin because the training data contains no missing values for these predictors.

bi = bininfo(sc,'EmpStatus');
disp(bi)
       Bin        Good    Bad     Odds       WOE       InfoValue
    __________    ____    ___    ______    ________    _________

    'Unknown'     396     239    1.6569    -0.19947    0.021715 
    'Employed'    407     158    2.5759      0.2418    0.026323 
    'Totals'      803     397    2.0227         NaN    0.048038 
bi = bininfo(sc,'CustIncome');
disp(bi)
          Bin          Good    Bad     Odds         WOE       InfoValue 
    _______________    ____    ___    _______    _________    __________

    '[-Inf,29000)'      53      58    0.91379     -0.79457       0.06364
    '[29000,33000)'     74      49     1.5102     -0.29217     0.0091366
    '[33000,35000)'     68      36     1.8889     -0.06843    0.00041042
    '[35000,40000)'    193      98     1.9694    -0.026696    0.00017359
    '[40000,42000)'     68      34          2    -0.011271    1.0819e-05
    '[42000,47000)'    164      66     2.4848      0.20579     0.0078175
    '[47000,Inf]'      183      56     3.2679      0.47972      0.041657
    'Totals'           803     397     2.0227          NaN       0.12285

Use fitmodel to fit a logistic regression model using Weight of Evidence (WOE) values. fitmodel internally transforms all the predictor variables into WOE values, using the bins found during the automatic binning process. By default, fitmodel then fits a logistic regression model using a stepwise method. For predictors that have missing data, there is an explicit <missing> bin with a corresponding WOE value computed from the data. When using fitmodel, the corresponding WOE value for the <missing> bin is applied when performing the WOE transformation.

[sc,mdl] = fitmodel(sc);
1. Adding CustIncome, Deviance = 1490.8527, Chi2Stat = 32.588614, PValue = 1.1387992e-08
2. Adding TmWBank, Deviance = 1467.1415, Chi2Stat = 23.711203, PValue = 1.1192909e-06
3. Adding AMBalance, Deviance = 1455.5715, Chi2Stat = 11.569967, PValue = 0.00067025601
4. Adding EmpStatus, Deviance = 1447.3451, Chi2Stat = 8.2264038, PValue = 0.0041285257
5. Adding CustAge, Deviance = 1442.8477, Chi2Stat = 4.4974731, PValue = 0.033944979
6. Adding ResStatus, Deviance = 1438.9783, Chi2Stat = 3.86941, PValue = 0.049173805
7. Adding OtherCC, Deviance = 1434.9751, Chi2Stat = 4.0031966, PValue = 0.045414057

Generalized linear regression model:
    status ~ [Linear formula with 8 terms in 7 predictors]
    Distribution = Binomial

Estimated Coefficients:
                   Estimate       SE       tStat       pValue  
                   ________    ________    ______    __________

    (Intercept)    0.70229     0.063959     10.98    4.7498e-28
    CustAge        0.57421      0.25708    2.2335      0.025513
    ResStatus       1.3629      0.66952    2.0356       0.04179
    EmpStatus      0.88373       0.2929    3.0172      0.002551
    CustIncome     0.73535       0.2159     3.406    0.00065929
    TmWBank         1.1065      0.23267    4.7556    1.9783e-06
    OtherCC         1.0648      0.52826    2.0156      0.043841
    AMBalance       1.0446      0.32197    3.2443     0.0011775


1200 observations, 1192 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 88.5, p-value = 2.55e-16
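
To make the WOE transformation concrete, the following sketch maps a few hypothetical CustAge values to WOE values, sending NaN to the <missing> bin. It illustrates the idea only and is not the toolbox's internal code; the edges and WOE values are copied from the CustAge table above, and the ages are made up.

```matlab
% Bin edges and WOE values copied from the bininfo table for CustAge
edges = [-Inf 33 37 40 46 48 51 58 Inf];
woe   = [-0.42156 -0.36795 -0.2779 -0.04556 0.15424 0.17713 0.22469 0.60931];
woeMissing = -0.15787;              % WOE of the <missing> bin

ages = [53 NaN 44 61];              % hypothetical CustAge values

binIdx = discretize(ages, edges);   % NaN inputs return NaN bin indices
isMiss = isnan(binIdx);

woeAges = zeros(size(ages));
woeAges(~isMiss) = woe(binIdx(~isMiss));   % look up WOE by bin index
woeAges(isMiss)  = woeMissing;             % missing values take the <missing> WOE

disp(woeAges)
```

With 'BinMissingData' set to false there is no <missing> WOE to fall back on, which is why rows with missing values are dropped from the fit in that case.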

Scale the scorecard points by the points-to-double-the-odds (PDO) method using the 'PointsOddsAndPDO' argument of formatpoints. Suppose that you want a score of 500 points to have odds of 2 (twice as likely to be good as to be bad) and that the odds double every 50 points (so that 550 points would have odds of 4).

Display the scorecard showing the scaled points for predictors retained in the fitting model.

sc = formatpoints(sc,'PointsOddsAndPDO',[500 2 50]);
PointsInfo = displaypoints(sc)
PointsInfo=33×3 table
     Predictors          Bin          Points
    ____________    ______________    ______

    'CustAge'       '[-Inf,33)'       54.062
    'CustAge'       '[33,37)'         56.282
    'CustAge'       '[37,40)'         60.012
    'CustAge'       '[40,46)'         69.636
    'CustAge'       '[46,48)'         77.912
    'CustAge'       '[48,51)'          78.86
    'CustAge'       '[51,58)'          80.83
    'CustAge'       '[58,Inf]'         96.76
    'CustAge'       '<missing>'       64.984
    'ResStatus'     'Tenant'          62.138
    'ResStatus'     'Home Owner'      73.248
    'ResStatus'     'Other'           90.828
    'ResStatus'     '<missing>'       74.125
    'EmpStatus'     'Unknown'         58.807
    'EmpStatus'     'Employed'        86.937
    'CustIncome'    '[-Inf,29000)'    29.375
      ⋮

Notice that points for the <missing> bin for CustAge and ResStatus are explicitly shown (as 64.9836 and 74.1250, respectively). These points are computed from the WOE value for the <missing> bin and the logistic model coefficients.
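
The points for the <missing> bins can be checked by hand. The sketch below assumes the common scorecard scaling in which each predictor receives an equal share of the shifted intercept plus its scaled coefficient-times-WOE term; that form is an assumption here, although it reproduces the displayed points. The coefficients and WOE values are copied from the output above.

```matlab
% Scaling targets: 500 points at odds of 2, odds double every 50 points
targetPoints = 500; targetOdds = 2; pdo = 50;
slope = pdo / log(2);
shift = targetPoints - slope * log(targetOdds);

% Values copied from the fitmodel and bininfo output above
b0    = 0.70229;               % intercept
nPred = 7;                     % number of predictors retained in the model
coef  = [0.57421 1.3629];      % CustAge, ResStatus
woeMissing = [-0.15787 0.026469];

% Assumed form: equal share of the shifted intercept per predictor,
% plus the scaled coefficient-times-WOE term
pointsMissing = (shift + slope*b0)/nPred + slope .* coef .* woeMissing;
fprintf('%.3f\n', pointsMissing);
```

The two results agree with the displayed 64.9836 and 74.1250 to within the rounding of the copied inputs.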

Predictors that have no missing data in the training set have no explicit <missing> bin. By default, the points are set to NaN for missing data and they lead to a score of NaN when running score. For predictors that have no explicit <missing> bin, use the name-value argument 'Missing' in formatpoints to indicate how missing data should be treated for scoring purposes.

The scorecard is ready for scoring new data sets. You can also use the scorecard to compute probabilities of default or perform model validation. For details, see score, probdefault, and validatemodel. To further explore the handling of missing data, take a few rows from the original data as test data and introduce some missing data.

tdata = dataMissing(11:14,mdl.PredictorNames); % Keep only the predictors retained in the model
% Set some missing values
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
tdata.EmpStatus(3) = '<undefined>';
tdata.CustIncome(4) = NaN;
disp(tdata)
    CustAge     ResStatus      EmpStatus     CustIncome    TmWBank    OtherCC    AMBalance
    _______    ___________    ___________    __________    _______    _______    _________

      NaN      Tenant         Unknown          34000         44         Yes        119.8  
       48      <undefined>    Unknown          44000         14         Yes       403.62  
       65      Home Owner     <undefined>      48000          6         No        111.88  
       44      Other          Unknown            NaN         35         No        436.41  

Score the new data and compare how points are assigned to missing data for CustAge and ResStatus versus EmpStatus and CustIncome. CustAge and ResStatus have an explicit <missing> bin, so their missing values receive the points of that bin. For EmpStatus and CustIncome, which have no <missing> bin, the score function sets the points to NaN.

[Scores,Points] = score(sc,tdata);
disp(Scores)
  481.2231
  520.8353
       NaN
       NaN
disp(Points)
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    64.984      62.138       58.807        67.893      61.858     75.622      89.922  
     78.86      74.125       58.807        82.439      61.061     75.622      89.922  
     96.76      73.248          NaN        96.969      51.132     50.914      89.922  
    69.636      90.828       58.807           NaN      61.858     50.914      89.922  
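
Because the points were scaled with 'PointsOddsAndPDO', a score can be converted back to odds and to a probability of default by inverting that scaling. This is a hand calculation under that assumption, meant only to show why a NaN score carries no usable information; probdefault remains the toolbox function for this purpose.

```matlab
% Invert the scaling: score = shift + slope*log(odds)
slope = 50 / log(2);
shift = 500 - slope * log(2);

scores = [481.2231 520.8353 NaN NaN];    % scores displayed above

odds = exp((scores - shift) ./ slope);   % odds of being good
pd   = 1 ./ (1 + odds);                  % implied probability of default

disp([odds; pd])
```

The first two rows give probabilities of default of roughly 0.39 and 0.27, while the NaN scores propagate to NaN odds and NaN probabilities.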

Use the name-value argument 'Missing' in formatpoints to choose how to assign points to missing values for predictors that do not have an explicit <missing> bin. For this example, use the 'MinPoints' option for the 'Missing' argument. For EmpStatus and CustIncome, the minimum numbers of points in the scorecard are 58.8072 and 29.3753, respectively.

sc = formatpoints(sc,'Missing','MinPoints');
[Scores,Points] = score(sc,tdata);
disp(Scores)
  481.2231
  520.8353
  517.7532
  451.3405
disp(Points)
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    64.984      62.138       58.807        67.893      61.858     75.622      89.922  
     78.86      74.125       58.807        82.439      61.061     75.622      89.922  
     96.76      73.248       58.807        96.969      51.132     50.914      89.922  
    69.636      90.828       58.807        29.375      61.858     50.914      89.922  
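
To see how the choice of 'Missing' option changes the scores, the options can be compared side by side. The sketch below rebuilds the scorecard so that it runs on its own. It assumes that 'NoScore', 'ZeroWOE', 'MinPoints', and 'MaxPoints' are the values accepted by the 'Missing' argument; confirm this against the formatpoints reference page for your release. The names opts, scK, and allScores are illustrative.

```matlab
load CreditCardData.mat
sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);
[sc,mdl] = fitmodel(sc,'Display','off');
sc = formatpoints(sc,'PointsOddsAndPDO',[500 2 50]);

% Same test rows as above, with missing values introduced
tdata = dataMissing(11:14,mdl.PredictorNames);
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
tdata.EmpStatus(3) = '<undefined>';
tdata.CustIncome(4) = NaN;

% Assumed option names for the 'Missing' argument
opts = {'NoScore','ZeroWOE','MinPoints','MaxPoints'};
allScores = zeros(height(tdata),numel(opts));
for k = 1:numel(opts)
    scK = formatpoints(sc,'Missing',opts{k});   % only the missing-data rule changes
    allScores(:,k) = score(scK,tdata);
end
disp(array2table(allScores,'VariableNames',opts))
```

Only rows 3 and 4 should differ across columns, because rows 1 and 2 are missing values only in predictors that have an explicit <missing> bin.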

Treat Missing Data and Develop a New Credit Scorecard Without Bins for Missing Values

You can use one of two alternative workflows to develop a scorecard without explicit bins for missing data.

The first alternative is to discard the missing data during the analysis. If the creditscorecard is created with the 'BinMissingData' argument set to false (by default, it is set to false if not specified), the missing observations are discarded when computing frequencies of Good and Bad and are not reported by bininfo or plotbins. For the fitting of the logistic model, rows with missing values are also discarded. With this approach, the missing data indirectly influences the results because the total number of observations used to compute bin statistics such as Weight-of-Evidence (WOE), or the total number of rows used to fit a logistic model, is reduced by the number of missing observations. For more information on this workflow, see Credit Scorecard Modeling Workflow.
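
A quick way to see the effect of discarding is to build the scorecard with the default setting and inspect the Totals row. This is a minimal sketch, assuming CreditCardData.mat is on the path; scDiscard, biDiscard, and nUsed are illustrative names.

```matlab
load CreditCardData.mat

% 'BinMissingData' is not specified, so it takes its default value of false
scDiscard = creditscorecard(dataMissing,'IDVar','CustID');
scDiscard = autobinning(scDiscard);

biDiscard = bininfo(scDiscard,'CustAge');
disp(biDiscard(end,:))                       % 'Totals' row

nUsed = biDiscard.Good(end) + biDiscard.Bad(end);
fprintf('Rows counted for CustAge: %d of %d\n', nUsed, height(dataMissing));
```

Given the 30 missing CustAge observations reported earlier, the total should come to 1170 of the 1200 rows.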

The second alternative is to first gather information about the missing values, then treat or replace the missing values so that the training data has no missing observations, and then create a creditscorecard object with the treated data set. This approach modifies the training data, allowing the reporting of missing observations in the bin counts and the inclusion of missing observations for fitting the logistic model. However, in this approach, the treatment of the training data and the treatment of any new data set that requires scoring must be the same.

The following example walks through the second alternative workflow: it gathers information about the missing data, treats the training data, develops a new creditscorecard object, and treats new data before scoring.

The dataMissing table in the CreditCardData.mat file has two predictors, CustAge and ResStatus, with missing values.

load CreditCardData.mat
head(dataMissing,5)
ans=5×11 table
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   

First, use the untreated training data to analyze the missing data information.

Create a creditscorecard object from the dataMissing table, which contains missing values, and set the 'BinMissingData' argument to true to explicitly report information on missing values. Apply automatic binning.

sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);

The bin information and bin plots for predictors that have missing data both show a <missing> bin at the end. The two predictors with missing values in this data set are CustAge and ResStatus.

bi = bininfo(sc,'CustAge');
disp(bi)
        Bin        Good    Bad     Odds       WOE       InfoValue 
    ___________    ____    ___    ______    ________    __________

    '[-Inf,33)'     69      52    1.3269    -0.42156      0.018993
    '[33,37)'       63      45       1.4    -0.36795      0.012839
    '[37,40)'       72      47    1.5319     -0.2779     0.0079824
    '[40,46)'      172      89    1.9326    -0.04556     0.0004549
    '[46,48)'       59      25      2.36     0.15424     0.0016199
    '[48,51)'       99      41    2.4146     0.17713     0.0035449
    '[51,58)'      157      62    2.5323     0.22469     0.0088407
    '[58,Inf]'      93      25      3.72     0.60931      0.032198
    '<missing>'     19      11    1.7273    -0.15787    0.00063885
    'Totals'       803     397    2.0227         NaN      0.087112
plotbins(sc,'CustAge')

bi = bininfo(sc,'ResStatus');
disp(bi)
        Bin         Good    Bad     Odds        WOE       InfoValue 
    ____________    ____    ___    ______    _________    __________

    'Tenant'        296     161    1.8385    -0.095463     0.0035249
    'Home Owner'    352     171    2.0585     0.017549    0.00013382
    'Other'         128      52    2.4615      0.19637     0.0055808
    '<missing>'      27      13    2.0769     0.026469    2.3248e-05
    'Totals'        803     397    2.0227          NaN     0.0092627
plotbins(sc,'ResStatus')

To treat missing values, you can apply different criteria. This example follows a straightforward approach and replaces missing observations with the most common value in the data distribution, which is the mode of the data. For this example, the bin containing the mode happens to have a WOE value similar to that of the original <missing> bin. The similarity is favorable because similar WOE values mean similar points in a scorecard.

For CustAge, bin 4 is the bin with the most observations and the mode value of the original data is 43.

modeCustAge = mode(dataMissing.CustAge);
disp(modeCustAge)
    43

The WOE value of the <missing> bin is similar to the WOE value of bin 4. Therefore, replacing the missing values in CustAge with the mode is reasonable.

To treat the data, create a copy of the data and fill the missing values.

dataTreated = dataMissing;
dataTreated.CustAge = fillmissing(dataTreated.CustAge,'constant',modeCustAge);

For ResStatus, the value of 'Home Owner' is the value of the mode of the data, and the WOE value of the <missing> bin is closest to that of the 'Home Owner' bin.

modeResStatus = mode(dataMissing.ResStatus);
disp(modeResStatus)
     Home Owner 

Replace the missing data with 'Home Owner'. Replacing the missing values preserves both the observed WOE values and the typical characteristics observed in the data set.

dataTreated.ResStatus = fillmissing(dataTreated.ResStatus,'constant',string(modeResStatus));
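
The observation that the mode falls in the bin whose WOE is closest to the <missing> WOE can be checked directly. The sketch below uses WOE values copied from the bininfo tables above; the variable names are illustrative.

```matlab
% WOE values copied from the bininfo tables above
woeAge     = [-0.42156 -0.36795 -0.2779 -0.04556 0.15424 0.17713 0.22469 0.60931];
woeAgeMiss = -0.15787;

resLabels  = {'Tenant','Home Owner','Other'};
woeRes     = [-0.095463 0.017549 0.19637];
woeResMiss = 0.026469;

% Find the regular bin whose WOE is nearest to the <missing> WOE
[gapAge,idxAge] = min(abs(woeAge - woeAgeMiss));
[gapRes,idxRes] = min(abs(woeRes - woeResMiss));

fprintf('CustAge:   closest bin is %d (WOE gap %.4f)\n', idxAge, gapAge);
fprintf('ResStatus: closest bin is %s (WOE gap %.4f)\n', resLabels{idxRes}, gapRes);
```

For CustAge the nearest bin is bin 4, although bin 3 is almost as close, so the match is less tight than for ResStatus. On other data sets the mode and the nearest-WOE bin may not coincide, and a different replacement value may be preferable.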

The treated data set now has no missing values.

disp(any(any(ismissing(dataTreated))))
   0

Using the treated data set, apply the typical creditscorecard workflow. Create a creditscorecard object with the treated data and apply automatic binning.

scTreated = creditscorecard(dataTreated,'IDVar','CustID');
scTreated = autobinning(scTreated);

Compare the bin information of the untreated data for CustAge with the bin information of the treated data for CustAge.

bi = bininfo(sc,'CustAge');
disp(bi)
        Bin        Good    Bad     Odds       WOE       InfoValue 
    ___________    ____    ___    ______    ________    __________

    '[-Inf,33)'     69      52    1.3269    -0.42156      0.018993
    '[33,37)'       63      45       1.4    -0.36795      0.012839
    '[37,40)'       72      47    1.5319     -0.2779     0.0079824
    '[40,46)'      172      89    1.9326    -0.04556     0.0004549
    '[46,48)'       59      25      2.36     0.15424     0.0016199
    '[48,51)'       99      41    2.4146     0.17713     0.0035449
    '[51,58)'      157      62    2.5323     0.22469     0.0088407
    '[58,Inf]'      93      25      3.72     0.60931      0.032198
    '<missing>'     19      11    1.7273    -0.15787    0.00063885
    'Totals'       803     397    2.0227         NaN      0.087112
biTreated = bininfo(scTreated,'CustAge');
disp(biTreated)
        Bin        Good    Bad     Odds       WOE       InfoValue
    ___________    ____    ___    ______    ________    _________

    '[-Inf,33)'     69      52    1.3269    -0.42156     0.018993
    '[33,37)'       63      45       1.4    -0.36795     0.012839
    '[37,40)'       72      47    1.5319     -0.2779    0.0079824
    '[40,45)'      156      86     1.814    -0.10891    0.0024345
    '[45,48)'       94      39    2.4103     0.17531    0.0033002
    '[48,58)'      256     103    2.4854     0.20603      0.01223
    '[58,Inf]'      93      25      3.72     0.60931     0.032198
    'Totals'       803     397    2.0227         NaN     0.089977

The first few bins are the same, but the treatment of missing values influences the binning results, starting with the bin where the missing data is placed. You can further explore your binning results using autobinning with a different algorithm or you can manually modify the bins using modifybins.
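
As one way to explore this, the cut points found for the untreated data can be imposed on the treated data with modifybins, which isolates the effect of the filled-in values on the counts. This is a sketch under the assumption that CreditCardData.mat is on the path; scAlt and biAlt are illustrative names, and only CustAge is treated here to keep the example short.

```matlab
load CreditCardData.mat

% Treat CustAge only, using the mode of the training data
dataTreated = dataMissing;
dataTreated.CustAge = fillmissing(dataTreated.CustAge,'constant',mode(dataMissing.CustAge));

scAlt = creditscorecard(dataTreated,'IDVar','CustID');
scAlt = autobinning(scAlt);

% Reuse the cut points of the untreated binning (copied from the table above)
scAlt = modifybins(scAlt,'CustAge','CutPoints',[33 37 40 46 48 51 58]);

biAlt = bininfo(scAlt,'CustAge');
disp(biAlt)
```

With these cut points, the 30 filled-in observations should all land in the [40,46) bin, raising its counts to 191 Good and 100 Bad while leaving the other bins unchanged.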

For ResStatus, the results for the treated data look similar to the initial results, except for the higher counts in the 'Home Owner' bin due to the treatment. For a categorical variable with more categories (or levels), an automatic algorithm may find category groups, and the results may show more differences before and after the treatment. You can further explore your binning results using autobinning with a different algorithm or you can manually modify the bins using modifybins.

bi = bininfo(sc,'ResStatus');
disp(bi)
        Bin         Good    Bad     Odds        WOE       InfoValue 
    ____________    ____    ___    ______    _________    __________

    'Tenant'        296     161    1.8385    -0.095463     0.0035249
    'Home Owner'    352     171    2.0585     0.017549    0.00013382
    'Other'         128      52    2.4615      0.19637     0.0055808
    '<missing>'      27      13    2.0769     0.026469    2.3248e-05
    'Totals'        803     397    2.0227          NaN     0.0092627
biTreated = bininfo(scTreated,'ResStatus');
disp(biTreated)
        Bin         Good    Bad     Odds        WOE       InfoValue 
    ____________    ____    ___    ______    _________    __________

    'Tenant'        296     161    1.8385    -0.095463     0.0035249
    'Home Owner'    379     184    2.0598     0.018182    0.00015462
    'Other'         128      52    2.4615      0.19637     0.0055808
    'Totals'        803     397    2.0227          NaN     0.0092603

Fit the logistic model, scale the points, and display the final scorecard.

scTreated = fitmodel(scTreated,'Display','off');
scTreated = formatpoints(scTreated,'PointsOddsAndPDO',[500 2 50]);
ScPoints = displaypoints(scTreated);
disp(ScPoints)
     Predictors             Bin            Points
    ____________    ___________________    ______

    'CustAge'       '[-Inf,33)'            53.507
    'CustAge'       '[33,37)'              55.798
    'CustAge'       '[37,40)'              59.646
    'CustAge'       '[40,45)'              66.868
    'CustAge'       '[45,48)'              79.013
    'CustAge'       '[48,58)'              80.326
    'CustAge'       '[58,Inf]'             97.559
    'ResStatus'     'Tenant'               62.161
    'ResStatus'     'Home Owner'           73.305
    'ResStatus'     'Other'                90.777
    'EmpStatus'     'Unknown'              58.846
    'EmpStatus'     'Employed'             86.887
    'CustIncome'    '[-Inf,29000)'         29.906
    'CustIncome'    '[29000,33000)'        56.219
    'CustIncome'    '[33000,35000)'        67.938
    'CustIncome'    '[35000,40000)'        70.123
    'CustIncome'    '[40000,42000)'        70.931
    'CustIncome'    '[42000,47000)'          82.3
    'CustIncome'    '[47000,Inf]'          96.647
    'TmWBank'       '[-Inf,12)'             51.05
    'TmWBank'       '[12,23)'              61.018
    'TmWBank'       '[23,45)'              61.818
    'TmWBank'       '[45,71)'              92.921
    'TmWBank'       '[71,Inf]'             133.14
    'OtherCC'       'No'                   50.806
    'OtherCC'       'Yes'                  75.642
    'AMBalance'     '[-Inf,558.88)'        89.788
    'AMBalance'     '[558.88,1254.28)'     63.088
    'AMBalance'     '[1254.28,1597.44)'    59.711
    'AMBalance'     '[1597.44,Inf]'        49.157

There are no explicit <missing> bins in the final scorecard. If you need to score a new data set and it contains missing data, by default the score function sets the points to NaN. To further explore the handling of missing data, take a few rows from the original data as test data and introduce some missing data.

tdata = dataTreated(11:14,mdl.PredictorNames); % Keep only the predictors retained in the model (mdl comes from the first fitmodel call; both models retain the same predictors)
% Set some missing values
tdata.CustAge(1) = NaN;
tdata.ResStatus(2) = '<undefined>';
tdata.EmpStatus(3) = '<undefined>';
tdata.CustIncome(4) = NaN;
disp(tdata)
    CustAge     ResStatus      EmpStatus     CustIncome    TmWBank    OtherCC    AMBalance
    _______    ___________    ___________    __________    _______    _______    _________

      NaN      Tenant         Unknown          34000         44         Yes        119.8  
       48      <undefined>    Unknown          44000         14         Yes       403.62  
       65      Home Owner     <undefined>      48000          6         No        111.88  
       44      Other          Unknown            NaN         35         No        436.41  

Score the new data and see how points are set to NaN, which leads to NaN scores.

[Scores,Points] = score(scTreated,tdata);
disp(Scores)
   NaN
   NaN
   NaN
   NaN
disp(Points)
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

       NaN      62.161       58.846        67.938      61.818     75.642      89.788  
    80.326         NaN       58.846          82.3      61.018     75.642      89.788  
    97.559      73.305          NaN        96.647       51.05     50.806      89.788  
    66.868      90.777       58.846           NaN      61.818     50.806      89.788  

For untreated predictors, such as EmpStatus or CustIncome, you can use the name-value argument 'Missing' in formatpoints to choose how to assign points to missing values.

Use the 'MinPoints' option for the 'Missing' argument. This assigns the minimum number of possible points in the scorecard to the missing data. In this example, the minimum number of possible points for CustIncome is 29.906, so the last row in the table gets 29.906 points for the missing CustIncome value.

scTreated = formatpoints(scTreated,'Missing','MinPoints');
[Scores,Points] = score(scTreated,tdata);
disp(Scores)
  469.7003
  510.0812
  518.0013
  448.8099
disp(Points)
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    53.507      62.161       58.846        67.938      61.818     75.642      89.788  
    80.326      62.161       58.846          82.3      61.018     75.642      89.788  
    97.559      73.305       58.846        96.647       51.05     50.806      89.788  
    66.868      90.777       58.846        29.906      61.818     50.806      89.788  

However, for predictors that were treated in the training data, such as CustAge, the effect of the 'Missing' argument is inconsistent with the treatment of the training data. For example, for CustAge, the first observation gets 53.507 points for the missing value. If the new data were instead "treated," and the missing value for CustAge were replaced with the mode of the training data (age of 43), this observation would fall in the [40,45) bin and receive 66.868 points.

Therefore, before scoring, data sets must be treated the same way the training data was treated. The 'Missing' argument is still needed to assign points for untreated predictors, while the treated predictors receive points in a way that is consistent with how the model was developed.

tdataTreated = tdata;
tdataTreated.CustAge = fillmissing(tdataTreated.CustAge,'constant',modeCustAge);
tdataTreated.ResStatus = fillmissing(tdataTreated.ResStatus,'constant',string(modeResStatus));
disp(tdataTreated)
    CustAge    ResStatus      EmpStatus     CustIncome    TmWBank    OtherCC    AMBalance
    _______    __________    ___________    __________    _______    _______    _________

      43       Tenant        Unknown          34000         44         Yes        119.8  
      48       Home Owner    Unknown          44000         14         Yes       403.62  
      65       Home Owner    <undefined>      48000          6         No        111.88  
      44       Other         Unknown            NaN         35         No        436.41  
[Scores,Points] = score(scTreated,tdataTreated);
disp(Scores)
  483.0606
  521.2249
  518.0013
  448.8099
disp(Points)
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________

    66.868      62.161       58.846        67.938      61.818     75.642      89.788  
    80.326      73.305       58.846          82.3      61.018     75.642      89.788  
    97.559      73.305       58.846        96.647       51.05     50.806      89.788  
    66.868      90.777       58.846        29.906      61.818     50.806      89.788  
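
One way to keep the treatment consistent is to store the fill values learned from the training data in a single structure and apply that structure to every data set before scoring. The sketch below uses a small made-up table; fillValues, newData, and newTreated are hypothetical names, and the fill values are the modes found earlier in this example.

```matlab
% Fill values learned from the TRAINING data (modes found in this example)
fillValues = struct('CustAge',43,'ResStatus',"Home Owner");

% Small hypothetical data set with one missing value per treated predictor
newData = table([NaN;48;65], categorical({'Tenant';'';'Home Owner'}), ...
    'VariableNames',{'CustAge','ResStatus'});

% Apply each stored constant to the matching table variable
newTreated = newData;
names = fieldnames(fillValues);
for k = 1:numel(names)
    v = names{k};
    newTreated.(v) = fillmissing(newTreated.(v),'constant',fillValues.(v));
end

disp(newTreated)
```

Saving fillValues alongside the scorecard object means the same constants are reused at scoring time instead of being recomputed from the new data, where the mode could differ.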
