Feature Screening with screenpredictors

This example shows how to perform predictor screening using screenpredictors. Predictor screening is a type of univariate analysis performed as an early step in the Credit Scorecard Modeling Workflow (Financial Toolbox). Predictor screening is an important preprocessing step when you work with credit scorecards, as data sets can be prohibitively large and have dozens or hundreds of potential predictors.

The goal of screening predictors is to pare down the set of predictors to a subset that is more useful in predicting the response variable based on the calculated metrics. Screening enables you to select the top predictors as ranked by a given metric to train your credit scorecards.

Load Data

The credit card data table contains a customer ID (CustID), nine predictors, and the response variable (status). Some of the risk factors are more useful in predicting the probability of a loan default, whereas others are less useful. The screening process helps you select the best subset of predictors.

Although the data set in this example contains only a few predictors, credit scorecard data sets in practice can be much larger, which is when predictor screening matters most.

% Load credit card data tables.
load CreditCardData

% Use the dataMissing data set, which contains some missing values.
data = dataMissing;

% Identify the ID and response variables.
idvar = 'CustID';
responsevar = 'status';

% Examine the structure of the table.
disp(head(data));
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
      6          65          13         Home Owner     Employed       48000         59         Yes       968.18        0.15        0   
      7          34          32         Home Owner     Unknown        32000         26         Yes       717.82        0.02        1   
      8          50          57         Other          Employed       51000         33         No        3041.2        0.13        0   

Add Additional Derived Predictors

Derived predictors, such as the ratio of two predictors or a transformation of a predictor x (for example, x^2 or log(x)), can often capture additional information or produce better metric values. To demonstrate this, create a few derived predictors and add them to the data set.

data.BalanceUtilRatio = data.AMBalance ./ data.UtilRate;
data.BalanceIncomeRatio = data.AMBalance ./ data.CustIncome;
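
Element-wise ratios like the ones above can produce Inf or NaN when a denominator is zero. The following is a minimal sketch of one way to guard against that, along with a log transform of the kind mentioned earlier. The vectors balance and utilRate are hypothetical stand-ins for data.AMBalance and data.UtilRate, and replacing undefined ratios with NaN is an assumption for illustration, not something this example does.

```matlab
% Toy vectors standing in for data.AMBalance and data.UtilRate
% (hypothetical values chosen to include zero denominators).
balance  = [1055.9; 157.37; 0; 561.84];
utilRate = [0.22; 0; 0; 0.11];

% Plain element-wise ratio: x/0 gives Inf and 0/0 gives NaN.
rawRatio = balance ./ utilRate;

% One possible guard: mark every undefined ratio as missing (NaN).
safeRatio = rawRatio;
safeRatio(~isfinite(rawRatio)) = NaN;

% A log-type transform; log1p(x) = log(1 + x) stays finite at x = 0.
logBalance = log1p(balance);
```

Whether to treat such values as missing or to cap them depends on the data; either way, it helps to make the choice explicit before screening.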

Compute Metrics

Use screenpredictors to compute several measures of how predictive each risk factor is. Each row of the output table corresponds to a predictor, and each column contains the values of one metric. The table is sorted by information value in descending order.

T = screenpredictors(data,'IDVar',idvar,'ResponseVar',responsevar)
T=11×7 table
                          InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                          _________    _____________    _______    _______    _______    __________    ______________

    CustAge                 0.17698        0.1672        0.5836    0.88795    0.42645     0.0020599          0.025   
    TmWBank                 0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591              0   
    CustIncome              0.15572       0.17758       0.58879      0.891    0.42731     0.0018428              0   
    BalanceIncomeRatio     0.097073        0.1278        0.5639    0.90024    0.43303       0.11966              0   
    TmAtAddress            0.094574      0.010421       0.50521    0.90089    0.43377         0.182              0   
    UtilRate               0.075086      0.035914       0.51796    0.90405    0.43575       0.45546              0   
    AMBalance               0.07159      0.087142       0.54357    0.90446    0.43592       0.48528              0   
    BalanceUtilRatio       0.068955      0.026538       0.51327    0.90486    0.43614       0.52517              0   
    EmpStatus              0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823              0   
    OtherCC                0.014301      0.044459       0.52223    0.91347    0.44132      0.047616              0   
    ResStatus             0.0095558      0.049855       0.52493    0.91446    0.44198       0.29879       0.033333   
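
To give a sense of what the InfoValue column measures, here is a minimal sketch of the usual information value calculation for a predictor that has already been binned. The bin counts are hypothetical, and how screenpredictors bins each predictor internally is not shown in this example, so treat this as an illustration of the formula rather than a reproduction of the function.

```matlab
% Hypothetical counts of "good" and "bad" customers in three bins.
nGood = [100; 200; 100];
nBad  = [ 50;  50; 100];

% Share of all goods and of all bads that falls into each bin.
pGood = nGood / sum(nGood);
pBad  = nBad  / sum(nBad);

% Weight of evidence per bin.
woe = log(pGood ./ pBad);

% Information value: sum over bins of (pGood - pBad) * WOE.
iv = sum((pGood - pBad) .* woe);
```

For these counts the first bin contributes nothing because its good and bad shares are equal, and the total works out to 0.5*log(2). A bin with zero goods or zero bads would make the logarithm undefined, so real implementations need some rule for that case.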

Threshold Metrics

Set thresholds for the predictors based on several metrics. For each metric, choose a threshold value that defines the range of passing values. In the plot, green bars indicate predictors that pass the threshold, and red bars indicate predictors that do not. You can omit predictors that do not pass the threshold from the final data set.

First, select predictors based on their information value.

infovalueThresh = 0.08;

Visualize the thresholds on the metric values for each predictor using the local function thresholdPlot, defined at the end of this example.

thresholdPlot(T, infovalueThresh, 'InfoValue')

Select predictors based on their accuracy ratio.

arThresh = 0.08;
thresholdPlot(T, arThresh, 'AccuracyRatio')
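
The AccuracyRatio and AUROC columns of the metrics table are consistent with the standard identity AUROC = (AR + 1)/2, although this example does not state it. Under that assumption, a threshold of 0.08 on the accuracy ratio corresponds to a threshold of 0.54 on AUROC. The sketch below checks this using values copied from the table; the variable names are illustrative.

```matlab
% AccuracyRatio and AUROC values copied from the metrics table,
% in the same row order (CustAge first, ResStatus last).
ar = [0.1672; 0.13612; 0.17758; 0.1278; 0.010421; 0.035914; ...
      0.087142; 0.026538; 0.10886; 0.044459; 0.049855];
auroc = [0.5836; 0.56806; 0.58879; 0.5639; 0.50521; 0.51796; ...
         0.54357; 0.51327; 0.55443; 0.52223; 0.52493];

% Largest deviation from AUROC = (AR + 1)/2; any gap comes from
% the rounding of the displayed values.
maxErr = max(abs((ar + 1)/2 - auroc));

% Map the accuracy ratio threshold to the equivalent AUROC threshold.
arThresh = 0.08;
aurocThresh = (arThresh + 1)/2;

% Both thresholds select the same predictors.
passedAR = ar >= arThresh;
passedAUROC = auroc >= aurocThresh;
sameSelection = isequal(passedAR, passedAUROC);
```

Because the two metrics carry the same information, thresholding both adds nothing; pairing the accuracy ratio with a different kind of metric, such as information value, is more informative.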

Screening Summary

Summarize the thresholding results in table form. The last column indicates which predictors passed both threshold tests and can be included in the final data set used to create the credit scorecard. summaryTable and displaySummaryTable are local functions, defined at the end of this example.

metrics = {'InfoValue', 'AccuracyRatio'};
thresholds = [infovalueThresh arThresh];
S = summaryTable(T, metrics, thresholds);
displaySummaryTable(S)
                          InfoValue    AccuracyRatio    PassedAll
                          _________    _____________    _________

    CustAge                   ✔              ✔              ✔    
    TmWBank                   ✔              ✔              ✔    
    CustIncome                ✔              ✔              ✔    
    BalanceIncomeRatio        ✔              ✔              ✔    
    TmAtAddress               ✔              ✘              ✘    
    UtilRate                  ✘              ✘              ✘    
    AMBalance                 ✘              ✔              ✘    
    BalanceUtilRatio          ✘              ✘              ✘    
    EmpStatus                 ✘              ✔              ✘    
    OtherCC                   ✘              ✘              ✘    
    ResStatus                 ✘              ✘              ✘    

Reduce Table

Select only the predictors that pass both threshold tests and create a reduced data set from them. A credit scorecard created from the reduced data set requires less memory.

% Get a list of all passing predictors.
predictor_list = T.Row;
top_predictors = predictor_list(S.PassedAll);

% Trim the data table to contain only the ID, passing predictors, and
% response.
top_predictor_table = data(:,[idvar; top_predictors; responsevar]);

% Create the credit scorecard using the screened predictors.
sc = creditscorecard(top_predictor_table,'IDVar',idvar,'ResponseVar',responsevar,...
    'BinMissingData', true)
sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x6 cell}
        NumericPredictors: {1x4 cell}
    CategoricalPredictors: {1x0 cell}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {1x4 cell}
                     Data: [1200x6 table]

Local Functions

function passed = thresholdPredictor(T, threshold, metric)
% Threshold a predictor and return a logical vector to indicate passing
% predictors.

% Check which predictors pass the threshold.
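switch metric
    case {'InfoValue', 'AccuracyRatio', 'AUROC'}
        % Higher values are better for these metrics.
        passed = T.(metric) >= threshold;
    case {'Entropy', 'Gini', 'Chi2PValue', 'PercentMissing'}
        % Lower values are better for these metrics.
        passed = T.(metric) <= threshold;
    otherwise
        error('Unrecognized metric: %s', metric);
end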
end


function thresholdPlot(T, threshold, metric)
% Plot a bar chart that summarizes predictor selection based on a metric threshold.

% Threshold the predictors.
passed = thresholdPredictor(T, threshold, metric);

% Get all predictors.
predictorNames = T.Row;
nPredictors = length(predictorNames);

% Create the bar charts.
f = figure;
ax = axes('parent',f);
bAR = bar(ax, 1:nPredictors, T.(metric), 'FaceColor', 'flat');
bAR.CData(passed,:) = repmat([0,1,0],sum(passed),1);
bAR.CData(~passed, :) = repmat([1,0,0],sum(~passed),1);
ax.TickLabelInterpreter = 'none';
xticks(ax, 1:nPredictors)
xticklabels(ax, predictorNames)
xtickangle(ax, 45)

% Scale the YLim.
delta = max(T.(metric)) - min(T.(metric));
d10 = 0.1 * delta;
% Use a name other than ylim to avoid shadowing the built-in ylim function.
yLimits = [min(T.(metric)) - d10 max(T.(metric)) + d10];
set(ax,'YLim',yLimits);

% Add threshold lines.
hold on
plot(xlim, [threshold threshold],'k--');
xlabel('Predictor')
ylabel(metric)
title(sprintf('Predictor Performance by %s',metric));
hold off
end


function S = summaryTable(T, metrics, thresholds)
% Create table summarizing all thresholds.
S = T;

% Remove metrics that are not thresholded.
unthresholded = setdiff(S.Properties.VariableNames, metrics);
S(:,unthresholded) = [];

% Show thresholding summary.
passed_all = true(numel(T.Row),1);
for i = 1:numel(metrics)
    metrici = metrics{i};
    thresholdi = thresholds(i);
    passed = thresholdPredictor(T, thresholdi, metrici);
    S.(metrici) = passed;
    passed_all = passed_all & passed;
end

% Add summary column.
S.PassedAll = passed_all;
end


function displaySummaryTable(S)
% Display a summary table with check marks for passed thresholds.

cols = S.Properties.VariableNames;

% Convert each column to check marks and X marks.
for i = 1:numel(cols)
    coli = cols{i};
    charvec = repmat(char(10008),size(S,1),1); % Initialize every row with an X mark.
    charvec(S.(coli)) = char(10004); % Use a check mark where the predictor passes.
    S.(coli) = charvec;
end

disp(S);
end