fillmissing
Replace missing values for credit scorecard predictors
Description
sc = fillmissing(sc,PredictorNames,Statistics)PredictorNames
                    with values defined by Statistics and returns an updated
                    credit scorecard object (sc). Standard missing data is
                    defined as follows:
- NaNfor numeric arrays
- <undefined>for categorical arrays
Note
If you run fillmissing after binning a predictor,
                            the existing cutpoints and bin edges are preserved and the "Good" and
                            "Bad" counts from the <missing> bin are added to
                            the corresponding bin.
sc = fillmissing(___,ConstantValue)ConstantValue to replace missing values.
Examples
This example shows how to use fillmissing to replace missing values in the CustAge and ResStatus predictors with user-defined values. For additional information on alternative approaches for "treating" missing data, see Credit Scorecard Modeling with Missing Values.
Load the credit scorecard data and use dataMissing for the training data.  
load CreditCardData.mat
disp(head(dataMissing));    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______
      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
      6          65          13         Home Owner     Employed       48000         59         Yes       968.18        0.15        0   
      7          34          32         Home Owner     Unknown        32000         26         Yes       717.82        0.02        1   
      8          50          57         Other          Employed       51000         33         No        3041.2        0.13        0   
Create a creditscorecard object with 'BinMissingData' set to true.
sc = creditscorecard(dataMissing,'BinMissingData',true);
sc = autobinning(sc);Use bininfo and plotbins to display the CustAge and ResStatus predictors with missing data. 
bininfo(sc,'CustAge')ans=10×6 table
         Bin         Good    Bad     Odds       WOE       InfoValue 
    _____________    ____    ___    ______    ________    __________
    {'[-Inf,33)'}     69      52    1.3269    -0.42156      0.018993
    {'[33,37)'  }     63      45       1.4    -0.36795      0.012839
    {'[37,40)'  }     72      47    1.5319     -0.2779     0.0079824
    {'[40,46)'  }    172      89    1.9326    -0.04556     0.0004549
    {'[46,48)'  }     59      25      2.36     0.15424     0.0016199
    {'[48,51)'  }     99      41    2.4146     0.17713     0.0035449
    {'[51,58)'  }    157      62    2.5323     0.22469     0.0088407
    {'[58,Inf]' }     93      25      3.72     0.60931      0.032198
    {'<missing>'}     19      11    1.7273    -0.15787    0.00063885
    {'Totals'   }    803     397    2.0227         NaN      0.087112
plotbins(sc,'CustAge');
bininfo(sc,'ResStatus')ans=5×6 table
         Bin          Good    Bad     Odds        WOE       InfoValue 
    ______________    ____    ___    ______    _________    __________
    {'Tenant'    }    296     161    1.8385    -0.095463     0.0035249
    {'Home Owner'}    352     171    2.0585     0.017549    0.00013382
    {'Other'     }    128      52    2.4615      0.19637     0.0055808
    {'<missing>' }     27      13    2.0769     0.026469    2.3248e-05
    {'Totals'    }    803     397    2.0227          NaN     0.0092627
plotbins(sc,'ResStatus');
Use fillmissing to replace NaN values in CustAge with the median value and to replace the <missing> values in ResStatus with 'Tenant'. Use predictorinfo to verify the filled values.
sc = fillmissing(sc,{'CustAge'},'median');
sc = fillmissing(sc,{'ResStatus'},'constant','Tenant');
predictorinfo(sc,'CustAge')ans=1×4 table
               PredictorType         LatestBinning          LatestFillMissingType    LatestFillMissingValue
               _____________    ________________________    _____________________    ______________________
    CustAge     {'Numeric'}     {'Automatic / Monotone'}         {'Median'}                  {[45]}        
predictorinfo(sc,'ResStatus')ans=1×5 table
                  PredictorType     Ordinal         LatestBinning          LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    ________________________    _____________________    ______________________
    ResStatus    {'Categorical'}     false     {'Automatic / Monotone'}        {'Constant'}               {'Tenant'}      
Use bininfo and plotbins to display the CustAge and ResStatus predictors to verify that the missing data has been replaced with the values defined by fillmissing.
bininfo(sc,'CustAge')ans=9×6 table
         Bin         Good    Bad     Odds        WOE       InfoValue
    _____________    ____    ___    ______    _________    _________
    {'[-Inf,33)'}     69      52    1.3269     -0.42156     0.018993
    {'[33,37)'  }     63      45       1.4     -0.36795     0.012839
    {'[37,40)'  }     72      47    1.5319      -0.2779    0.0079824
    {'[40,46)'  }    191     100      1.91    -0.057315    0.0008042
    {'[46,48)'  }     59      25      2.36      0.15424    0.0016199
    {'[48,51)'  }     99      41    2.4146      0.17713    0.0035449
    {'[51,58)'  }    157      62    2.5323      0.22469    0.0088407
    {'[58,Inf]' }     93      25      3.72      0.60931     0.032198
    {'Totals'   }    803     397    2.0227          NaN     0.086822
plotbins(sc,'CustAge');
bininfo(sc,'ResStatus')ans=4×6 table
         Bin          Good    Bad     Odds        WOE       InfoValue 
    ______________    ____    ___    ______    _________    __________
    {'Tenant'    }    323     174    1.8563    -0.085821     0.0030935
    {'Home Owner'}    352     171    2.0585     0.017549    0.00013382
    {'Other'     }    128      52    2.4615      0.19637     0.0055808
    {'Totals'    }    803     397    2.0227          NaN     0.0088081
plotbins(sc,'ResStatus');
Use fitmodel and then run formatpoints, displaypoints, and score.
sc = fitmodel(sc,'Display','off'); sc = formatpoints(sc,'WorstAndBest',[300 800]); t = displaypoints(sc)
t=31×3 table
      Predictors             Bin           Points
    ______________    _________________    ______
    {'CustAge'   }    {'[-Inf,33)'    }    72.565
    {'CustAge'   }    {'[33,37)'      }    76.588
    {'CustAge'   }    {'[37,40)'      }    83.346
    {'CustAge'   }    {'[40,46)'      }    99.902
    {'CustAge'   }    {'[46,48)'      }    115.78
    {'CustAge'   }    {'[48,51)'      }     117.5
    {'CustAge'   }    {'[51,58)'      }    121.07
    {'CustAge'   }    {'[58,Inf]'     }    149.93
    {'CustAge'   }    {'<missing>'    }    99.902
    {'EmpStatus' }    {'Unknown'      }     79.64
    {'EmpStatus' }    {'Employed'     }    133.98
    {'EmpStatus' }    {'<missing>'    }       NaN
    {'CustIncome'}    {'[-Inf,29000)' }    21.926
    {'CustIncome'}    {'[29000,33000)'}    73.949
    {'CustIncome'}    {'[33000,35000)'}    97.117
    {'CustIncome'}    {'[35000,40000)'}    101.44
      ⋮
When a validation data set has missing values and you use fillmissing with the training dataset, the missing values in the validation data set are assigned the same points as the corresponding bins containing the filled values.
As the table shows, the '<missing>' bin for the CustAge predictor is assigned the same points as the '[40,46)' bin because the missing data is filled with the median value 45.
The points assigned to the '<missing>' bin for the EmpStatus predictor are NaN because fillmissing is not used for that predictor. The assigned points are decided by the default 'NoScore' for the 'Missing' name-value pair argument in formatpoints.
Create a test validation data set (tdata) and add missing values.
tdata = data(1:10,:); tdata.CustAge(1) = NaN; tdata.ResStatus(2) = missing; [scr,pts] = score(sc,tdata)
scr = 10×1
  566.7335
  611.2547
  584.5130
  628.7876
  609.7148
  671.1048
  403.6413
  551.9461
  575.9874
  524.4789
pts=10×5 table
    CustAge    EmpStatus    CustIncome    TmWBank    AMBalance
    _______    _________    __________    _______    _________
    99.902       79.64        153.88      145.38      87.933  
    149.93      133.98        153.88      85.531      87.933  
    115.78      133.98        101.44      145.38      87.933  
     117.5      133.98        153.88      83.991      139.44  
    149.93      133.98        153.88      83.991      87.933  
    149.93      133.98        153.88      145.38      87.933  
    76.588       79.64        73.949      85.531      87.933  
     117.5      133.98        153.88      85.531       61.06  
     117.5       79.64        153.88      85.531      139.44  
     117.5       79.64        153.88      85.531      87.933  
This example shows different possibilities for handling missing data in validation data.
When scoring data from a validation data set, you have several options. If you choose to do nothing, the points assigned to the missing data are NaN, which comes from the default 'NoScore' for the 'Missing' name-value pair argument in formatpoints.
If you want to score missing values of all the predictors with one consistent metric, you can use the options 'ZeroWOE', 'MinPoints', or 'MaxPoints' for the 'Missing' name-value pair argument in formatpoints.
load CreditCardData.mat sc = creditscorecard(data); predictorinfo(sc,'CustAge')
ans=1×4 table
               PredictorType      LatestBinning      LatestFillMissingType    LatestFillMissingValue
               _____________    _________________    _____________________    ______________________
    CustAge     {'Numeric'}     {'Original Data'}        {'Original'}              {0×0 double}     
predictorinfo(sc,'ResStatus')ans=1×5 table
                  PredictorType     Ordinal      LatestBinning      LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    _________________    _____________________    ______________________
    ResStatus    {'Categorical'}     false     {'Original Data'}        {'Original'}              {0×0 double}     
sc = autobinning(sc); sc = fitmodel(sc,'display','off'); displaypoints(sc)
ans=37×3 table
      Predictors            Bin            Points  
    ______________    ________________    _________
    {'CustAge'   }    {'[-Inf,33)'   }     -0.15894
    {'CustAge'   }    {'[33,37)'     }     -0.14036
    {'CustAge'   }    {'[37,40)'     }    -0.060323
    {'CustAge'   }    {'[40,46)'     }     0.046408
    {'CustAge'   }    {'[46,48)'     }      0.21445
    {'CustAge'   }    {'[48,58)'     }      0.23039
    {'CustAge'   }    {'[58,Inf]'    }        0.479
    {'CustAge'   }    {'<missing>'   }          NaN
    {'ResStatus' }    {'Tenant'      }    -0.031252
    {'ResStatus' }    {'Home Owner'  }      0.12696
    {'ResStatus' }    {'Other'       }      0.37641
    {'ResStatus' }    {'<missing>'   }          NaN
    {'EmpStatus' }    {'Unknown'     }    -0.076317
    {'EmpStatus' }    {'Employed'    }      0.31449
    {'EmpStatus' }    {'<missing>'   }          NaN
    {'CustIncome'}    {'[-Inf,29000)'}     -0.45716
      ⋮
sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]); displaypoints(sc)
ans=37×3 table
      Predictors            Bin           Points
    ______________    ________________    ______
    {'CustAge'   }    {'[-Inf,33)'   }    46.396
    {'CustAge'   }    {'[33,37)'     }    48.727
    {'CustAge'   }    {'[37,40)'     }    58.772
    {'CustAge'   }    {'[40,46)'     }    72.167
    {'CustAge'   }    {'[46,48)'     }    93.256
    {'CustAge'   }    {'[48,58)'     }    95.256
    {'CustAge'   }    {'[58,Inf]'    }    126.46
    {'CustAge'   }    {'<missing>'   }    46.396
    {'ResStatus' }    {'Tenant'      }    62.421
    {'ResStatus' }    {'Home Owner'  }    82.276
    {'ResStatus' }    {'Other'       }    113.58
    {'ResStatus' }    {'<missing>'   }    62.421
    {'EmpStatus' }    {'Unknown'     }    56.765
    {'EmpStatus' }    {'Employed'    }    105.81
    {'EmpStatus' }    {'<missing>'   }    56.765
    {'CustIncome'}    {'[-Inf,29000)'}    8.9706
      ⋮
The value of -32.5389 for the <missing> bin of 'CustAge' comes from the 'minPoints' argument for formatpoints.
[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1
  602.0394
  648.1988
  560.5569
  613.5595
  646.8109
pts=5×7 table
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________
    95.256      62.421       56.765        121.18      116.05     86.224       64.15  
    126.46      82.276       105.81        121.18      62.107     86.224       64.15  
    93.256      62.421       105.81        76.585      116.05     42.287       64.15  
    46.396      82.276       105.81        121.18      60.719     86.224      110.96  
    126.46      82.276       105.81        121.18      60.719     86.224       64.15  
Alternatively, you can score missing data for each individual predictor with a different statistic based on that predictor's information. To do so, use fillmissing for a creditscorecard object.
load CreditCardData.mat sc = creditscorecard(data); sc = fillmissing(sc,'CustAge','constant',35); predictorinfo(sc,'CustAge')
ans=1×4 table
               PredictorType      LatestBinning      LatestFillMissingType    LatestFillMissingValue
               _____________    _________________    _____________________    ______________________
    CustAge     {'Numeric'}     {'Original Data'}        {'Constant'}                 {[35]}        
sc = fillmissing(sc,'ResStatus','Mode'); predictorinfo(sc,'ResStatus')
ans=1×5 table
                  PredictorType     Ordinal      LatestBinning      LatestFillMissingType    LatestFillMissingValue
                 _______________    _______    _________________    _____________________    ______________________
    ResStatus    {'Categorical'}     false     {'Original Data'}          {'Mode'}               {'Home Owner'}    
sc = autobinning(sc); sc = fitmodel(sc,'display','off'); sc = formatpoints(sc,'Missing','minpoints','WorstAndBestScores',[300 850]); displaypoints(sc)
ans=37×3 table
      Predictors            Bin           Points
    ______________    ________________    ______
    {'CustAge'   }    {'[-Inf,33)'   }    46.396
    {'CustAge'   }    {'[33,37)'     }    48.727
    {'CustAge'   }    {'[37,40)'     }    58.772
    {'CustAge'   }    {'[40,46)'     }    72.167
    {'CustAge'   }    {'[46,48)'     }    93.256
    {'CustAge'   }    {'[48,58)'     }    95.256
    {'CustAge'   }    {'[58,Inf]'    }    126.46
    {'CustAge'   }    {'<missing>'   }    48.727
    {'ResStatus' }    {'Tenant'      }    62.421
    {'ResStatus' }    {'Home Owner'  }    82.276
    {'ResStatus' }    {'Other'       }    113.58
    {'ResStatus' }    {'<missing>'   }    82.276
    {'EmpStatus' }    {'Unknown'     }    56.765
    {'EmpStatus' }    {'Employed'    }    105.81
    {'EmpStatus' }    {'<missing>'   }    56.765
    {'CustIncome'}    {'[-Inf,29000)'}    8.9706
      ⋮
The value of <missing> for 'CustAge' comes from the fill value of 35 even though the training data has no missing values.
disp(dataMissing(1:5,:));
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______
      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
[scr,pts] = score(sc,dataMissing(1:5,:))
scr = 5×1
  621.8943
  648.1988
  560.5569
  615.8904
  646.8109
pts=5×7 table
    CustAge    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance
    _______    _________    _________    __________    _______    _______    _________
    95.256      82.276       56.765        121.18      116.05     86.224       64.15  
    126.46      82.276       105.81        121.18      62.107     86.224       64.15  
    93.256      62.421       105.81        76.585      116.05     42.287       64.15  
    48.727      82.276       105.81        121.18      60.719     86.224      110.96  
    126.46      82.276       105.81        121.18      60.719     86.224       64.15  
Input Arguments
Credit scorecard model, specified as a creditscorecard
                        object.
Name of creditscorecard predictor to fill missing data
                        for, specified as a scalar character vector, scalar string, cell array of
                        character vectors, or string array.
Data Types: char | string | cell
Statistic to use to fill missing data for the predictors, specified as a character vector or string with one of the following values.
- 'mean'— Replace missing data with the average or mean value. The option is valid only for numeric data. The- 'mean'calculates the weighted mean of the predictor by referring to the predictor column and the- Weightscolumn from the- creditscorecardobject. For more information, see Weighted Mean.
- 'median'— Replace missing data with the median value. Valid for numeric and ordinal data. The- 'median'calculates the weighted median of the predictor by referring to the predictor column and the- Weightscolumn from the- creditscorecardobject. For more information, see Weighted Median.
- 'mode'— Replace missing data with the mode. Valid for numeric and both nominal and ordinal categorical data. The- 'mode'calculates the weighted- modeof the predictor by referring to the predictor column and the- Weightscolumn from the- creditscorecardobject. For more information, see Weighted Mode.
- 'original'— Set the missing data for numeric and categorical predictors back to its original value:- NaNif numeric,- <undefined>or- <missing>if categorical.
- 'constant'— Set the missing data for numeric and categorical predictors to a constant value that you specify in the optional argument for- ConstantValue.
Data Types: char | string
(Optional) Value to fill missing entries in predictors specified in
                            PredictorNames, specified as a numeric value,
                        character vector, string, or cell array of character vectors. 
Note
You can use ConstantValue only if you set the
                                    Statistics argument to
                                    'constant'.
Data Types: char | double | string | cell
Output Arguments
Updated creditscorecard object, returned as an
                        object.
More About
The weighted mean is similar to an ordinary mean except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.
The weighted mean for a nonempty finite multiset of data (x) with corresponding nonnegative weights (w) is
The weighted median is the 50% weighted percentile, where the percentage in the total weight is counted instead of the total number.
For n distinct ordered elements (x) positive weights (w) such that , the weighted median is the element xk:
In the case where the respective weights of both elements border the midpoint of the set of weights without encapsulating it, each element defines a partition equal to 1/2. These elements are referred to as the lower weighted median and upper weighted median. The weighted median is chosen based on which element keeps the partitions most equal. This median is always the weighted median with the lowest weight. In the event that the upper and lower weighted medians are equal, the lower weighted median is accepted.
The weighted mode of a set of weighted data values is the value that appears most often.
The mode of a sample is the element that occurs most often in the collection. For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
References
[1] “Basel Committee on Banking Supervision: Studies on the Validation of Internal Rating Systems.” Working Paper No. 14, February 2005.
[2] Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS. lulu.com, 2011.
[3] Loeffler, G. and Posch, P. N. Credit Risk Modeling Using Excel and VBA. Wiley Finance, 2007.
Version History
Introduced in R2020a
See Also
creditscorecard | bininfo | predictorinfo | modifypredictor | modifybins | bindata | plotbins | fitmodel | displaypoints | formatpoints | score | setmodel | probdefault | validatemodel | table
MATLAB Command
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)