screenpredictors

Screen credit scorecard predictors for predictive value

Description

metric_table = screenpredictors(data) returns metric_table, a MATLAB® table containing the calculated values of several measures of predictive power for each predictor variable in data.

Use the screenpredictors function as a preprocessing step in the Credit Scorecard Modeling Workflow to reduce the number of predictor variables before you create the credit scorecard using the creditscorecard function from Financial Toolbox™. In addition, you can use Threshold Predictors from Risk Management Toolbox™ to interactively set credit scorecard predictor thresholds using the output from screenpredictors before you create the credit scorecard using creditscorecard.

metric_table = screenpredictors(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in the previous syntax.

Examples

Reduce the number of predictor variables by screening predictors before you create a credit scorecard.

Use the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData.mat

Define 'IDVar' and 'ResponseVar'.

idvar = 'CustID';
responsevar = 'status';

Use screenpredictors to calculate the predictor screening metrics. The function returns a table containing the metric values. Each table row corresponds to a predictor from the input table data.

metric_table = screenpredictors(data,'IDVar', idvar,'ResponseVar', responsevar)
metric_table=9×7 table
                   InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                   _________    _____________    _______    _______    _______    __________    ______________

    CustAge          0.18863       0.17095       0.58547    0.88729    0.42626    0.00074524          0       
    TmWBank          0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591          0       
    CustIncome       0.15572       0.17758       0.58879      0.891    0.42731     0.0018428          0       
    TmAtAddress     0.094574      0.010421       0.50521    0.90089    0.43377         0.182          0       
    UtilRate        0.075086      0.035914       0.51796    0.90405    0.43575       0.45546          0       
    AMBalance        0.07159      0.087142       0.54357    0.90446    0.43592       0.48528          0       
    EmpStatus       0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823          0       
    OtherCC         0.014301      0.044459       0.52223    0.91347    0.44132      0.047616          0       
    ResStatus      0.0097738       0.05039        0.5252    0.91422    0.44182       0.27875          0       

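Sort the metrics table by AccuracyRatio in descending order so that the strongest predictors by this metric appear first.

metric_table = sortrows(metric_table,'AccuracyRatio','descend')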
metric_table=9×7 table
                   InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                   _________    _____________    _______    _______    _______    __________    ______________

    CustIncome       0.15572       0.17758       0.58879      0.891    0.42731     0.0018428          0       
    CustAge          0.18863       0.17095       0.58547    0.88729    0.42626    0.00074524          0       
    TmWBank          0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591          0       
    EmpStatus       0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823          0       
    AMBalance        0.07159      0.087142       0.54357    0.90446    0.43592       0.48528          0       
    ResStatus      0.0097738       0.05039        0.5252    0.91422    0.44182       0.27875          0       
    OtherCC         0.014301      0.044459       0.52223    0.91347    0.44132      0.047616          0       
    UtilRate        0.075086      0.035914       0.51796    0.90405    0.43575       0.45546          0       
    TmAtAddress     0.094574      0.010421       0.50521    0.90089    0.43377         0.182          0       
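
The AccuracyRatio and AUROC columns in this output are consistent with the standard identity AR = 2*AUROC - 1, so sorting by either metric produces the same ranking. The following is a minimal sketch that checks this with values copied from the table. It does not call screenpredictors, and the variable names are illustrative only.

```matlab
% Values copied from the first four rows of the sorted metric_table output.
predictors    = {'CustIncome'; 'CustAge'; 'TmWBank'; 'EmpStatus'};
accuracyRatio = [0.17758; 0.17095; 0.13612; 0.10886];
auroc         = [0.58879; 0.58547; 0.56806; 0.55443];

% Standard identity between the two metrics: AR = 2*AUROC - 1.
arFromAuroc = 2*auroc - 1;

% Show the reported and the recomputed values side by side.
T = table(accuracyRatio, arFromAuroc, 'RowNames', predictors);
disp(T)

% The displayed values are rounded, so compare with a small tolerance.
maxDiff = max(abs(arFromAuroc - accuracyRatio));
fprintf('Largest difference: %.2g\n', maxDiff);
```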

Based on the AccuracyRatio metric, select the top predictors to use when you create the creditscorecard object.

varlist = metric_table.Row(metric_table.AccuracyRatio > 0.09)
varlist = 4x1 cell
    {'CustIncome'}
    {'CustAge'   }
    {'TmWBank'   }
    {'EmpStatus' }
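
Thresholding on a single metric is one option. As an alternative, the sketch below combines two metrics with logical indexing. The table is rebuilt by hand from part of the output shown earlier so that the snippet runs without Risk Management Toolbox. The cutoffs 0.1 and 0.05 and the name varlist2 are illustrative assumptions, not recommended values.

```matlab
% Rebuild part of metric_table from the output shown earlier.
names      = {'CustAge'; 'TmWBank'; 'CustIncome'; 'TmAtAddress'; 'EmpStatus'; 'OtherCC'};
InfoValue  = [0.18863; 0.15719; 0.15572; 0.094574; 0.048038; 0.014301];
Chi2PValue = [0.00074524; 0.0054591; 0.0018428; 0.182; 0.00037823; 0.047616];
T = table(InfoValue, Chi2PValue, 'RowNames', names);

% Keep predictors that pass both illustrative cutoffs.
keep = T.InfoValue > 0.1 & T.Chi2PValue < 0.05;
varlist2 = T.Row(keep);
disp(varlist2)
```

With these cutoffs, EmpStatus is dropped despite its small p-value because its information value is low, which shows how the choice of metric changes the screened set.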

Use creditscorecard to create a creditscorecard object based on only the "screened" predictors.

sc = creditscorecard(data,'IDVar', idvar,'ResponseVar', responsevar, 'PredictorVars', varlist)
sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {'CustID'  'CustAge'  'TmAtAddress'  'ResStatus'  'EmpStatus'  'CustIncome'  'TmWBank'  'OtherCC'  'AMBalance'  'UtilRate'  'status'}
        NumericPredictors: {'CustAge'  'CustIncome'  'TmWBank'}
    CategoricalPredictors: {'EmpStatus'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {'CustAge'  'EmpStatus'  'CustIncome'  'TmWBank'}
                     Data: [1200x11 table]

Input Arguments

Data for the creditscorecard object, specified as the input argument data: a MATLAB table, tall table, or tall timetable. Each column of data can be any one of the following data types:

  • Numeric

  • Logical

  • Cell array of character vectors

  • Character array

  • Categorical

  • String

Data Types: table

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: metric_table = screenpredictors(data,'IDVar','CustID','ResponseVar','status','PredictorVars',{'CustAge','CustIncome'})

Name of identifier variable, specified as the comma-separated pair consisting of 'IDVar' and a case-sensitive character vector. The 'IDVar' data can be ordinal numbers or Social Security numbers. Specifying 'IDVar' excludes the identifier variable from the predictor variables.

Data Types: char

Response variable name, specified as the comma-separated pair consisting of 'ResponseVar' and a case-sensitive character vector. The response variable data must be binary, indicating "Good" or "Bad".

If not specified, 'ResponseVar' is set to the last column of the input data by default.

Data Types: char
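
Because the default response variable is the last column of the input table, it can help to confirm what that column is before relying on the default. The following is a minimal sketch using a small hand-made table. The table toy, its column names, and its values are hypothetical.

```matlab
% Hypothetical toy data; 'status' is deliberately the last column.
toy = table([1; 2; 3; 4], [35; 52; 41; 29], [0; 1; 0; 0], ...
    'VariableNames', {'CustID', 'CustAge', 'status'});

% Name of the column that would be used when 'ResponseVar' is omitted.
defaultResponse = toy.Properties.VariableNames{end};

% The response must be binary: check that it has exactly two distinct values.
isBinary = numel(unique(toy.(defaultResponse))) == 2;
fprintf('Default response: %s (binary: %d)\n', defaultResponse, isBinary);
```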

Names of predictor variables, specified as the comma-separated pair consisting of 'PredictorVars' and a case-sensitive cell array of character vectors or string array. By default, when you create a creditscorecard object, all variables are predictors except for IDVar and ResponseVar. Any name you specify using 'PredictorVars' must differ from the IDVar and ResponseVar names.

Data Types: cell | string

Name of weights variable, specified as the comma-separated pair consisting of 'WeightsVar' and a case-sensitive character vector to indicate which column name in the data table contains the row weights.

If you do not specify 'WeightsVar' when you create a creditscorecard object, then the function uses unit weights as the observation weights.

Data Types: char
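
To illustrate what row weights mean for a frequency table, the sketch below computes weighted counts of goods and bads per bin and compares them with the unit-weight case. This is a generic illustration and is not taken from the function's implementation. The data and variable names are hypothetical.

```matlab
% Hypothetical toy data: bin index, response (0 = good, 1 = bad), row weights.
binIdx   = [1; 1; 2; 2; 2; 3];
response = [0; 1; 0; 0; 1; 0];
weights  = [1; 2; 1; 0.5; 1; 3];
numBins  = max(binIdx);

% Weighted counts of goods and bads per bin.
goodW = accumarray(binIdx, weights .* (response == 0), [numBins 1]);
badW  = accumarray(binIdx, weights .* (response == 1), [numBins 1]);

% With unit weights, the same code reduces to plain observation counts.
unitW   = ones(size(weights));
goodCnt = accumarray(binIdx, unitW .* (response == 0), [numBins 1]);
badCnt  = accumarray(binIdx, unitW .* (response == 1), [numBins 1]);

disp(table(goodW, badW, goodCnt, badCnt))
```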

Number of (equal frequency) bins for numeric predictors, specified as the comma-separated pair consisting of 'NumBins' and a scalar numeric.

Data Types: double
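
Equal-frequency binning places roughly the same number of observations in each bin. The rank-based sketch below shows the idea in base MATLAB. It is an assumption-laden illustration, not the algorithm that screenpredictors uses internally, and it does not treat tied values specially. The data are hypothetical.

```matlab
% Hypothetical numeric predictor with 12 observations.
x = [23; 45; 31; 67; 52; 38; 29; 74; 41; 58; 36; 49];
numBins = 4;
n = numel(x);

% Rank the observations, then map ranks 1..n onto bins 1..numBins so that
% each bin receives about n/numBins observations.
[~, order] = sort(x);
binIdx = zeros(n, 1);
binIdx(order) = ceil((1:n)' * numBins / n);

% Count the observations that land in each bin.
counts = accumarray(binIdx, 1, [numBins 1]);
disp(counts')
```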

Small shift in frequency tables that contain zero entries, specified as the comma-separated pair consisting of 'FrequencyShift' and a scalar numeric with a value between 0 and 1.

If the frequency table of a predictor contains any "pure" bins (containing all goods or all bads) after you bin the data using autobinning, then the function adds the 'FrequencyShift' value to all bins in the table. To avoid any perturbation, set 'FrequencyShift' to 0.

Data Types: double
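
The reason for the shift is that a pure bin makes the log-ratio of good and bad proportions infinite. The sketch below shows this with the textbook information value formula on a hypothetical frequency table, assuming that the shift is added to every cell as described above. The value 0.5 is illustrative only.

```matlab
% Hypothetical frequency table: one row per bin, columns are [goods bads].
% The last bin is "pure" (no bads).
freq = [40 10; 30 20; 25 0];
frequencyShift = 0.5;   % illustrative value between 0 and 1

% Information value without any shift: the pure bin makes it infinite.
pGood = freq(:,1) / sum(freq(:,1));
pBad  = freq(:,2) / sum(freq(:,2));
ivRaw = sum((pGood - pBad) .* log(pGood ./ pBad));

% Apply the shift to every cell only when some bin is pure.
if any(freq(:) == 0)
    freqShifted = freq + frequencyShift;
else
    freqShifted = freq;
end

% Information value from the shifted table is finite.
pGoodS = freqShifted(:,1) / sum(freqShifted(:,1));
pBadS  = freqShifted(:,2) / sum(freqShifted(:,2));
ivShifted = sum((pGoodS - pBadS) .* log(pGoodS ./ pBadS));

fprintf('IV without shift: %g, with shift: %g\n', ivRaw, ivShifted);
```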

Output Arguments

Calculated values for the predictor screening metrics, returned as a table. Each table row corresponds to a predictor from the input table data. The table columns contain calculated values for the following metrics:

  • 'InfoValue' — Information value. This metric measures the strength of a predictor in the fitting model by determining the deviation between the distributions of "Goods" and "Bads".

  • 'AccuracyRatio' — Accuracy ratio.

  • 'AUROC' — Area under the ROC curve.

  • 'Entropy' — Entropy. This metric measures the level of unpredictability in the bins. You can use the entropy metric to validate a risk model.

  • 'Gini' — Gini. This metric measures the statistical dispersion or inequality within a sample of data.

  • 'Chi2PValue' — Chi-square p-value. This metric is computed from the chi-square metric and is a measure of the statistical difference and independence between groups.

  • 'PercentMissing' — Percentage of missing values in the predictor. This metric is expressed in decimal form.
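
As a rough illustration of two of these metrics, the sketch below computes an information value with the textbook weight-of-evidence formula and a missing-value fraction in decimal form. The frequency table and the vector x are hypothetical, and the exact binning that screenpredictors applies before computing its metrics is not reproduced here.

```matlab
% Hypothetical frequency table: one row per bin, columns are [goods bads].
freq = [80 20; 60 40; 40 60];

% Share of all goods and of all bads that falls in each bin.
pGood = freq(:,1) / sum(freq(:,1));
pBad  = freq(:,2) / sum(freq(:,2));

% Weight of evidence per bin and the resulting information value.
woe = log(pGood ./ pBad);
iv  = sum((pGood - pBad) .* woe);

% Fraction of missing entries in a predictor, in decimal form.
x = [1.2; NaN; 3.4; NaN; 5.6];
fracMissing = mean(ismissing(x));

fprintf('IV = %.4f, fraction missing = %.2f\n', iv, fracMissing);
```

A bin whose good and bad shares are equal, such as the middle bin here, has zero weight of evidence and contributes nothing to the information value.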

Version History

Introduced in R2019a