Remove Risk Factors
This example shows how to remove or include variables from a table and record the corresponding reasons using the Modelscape™ Remove Risk Factors task.
The example also shows how to include the results of this analysis in model documents using the Modelscape reporting feature.
Not every column in a table of input data is relevant or usable when developing a statistical model. For example, randomized user identifiers (IDs) are often irrelevant, legally sensitive data such as ethnic origin or religious beliefs cannot be used, and some data can be of poor quality. This example shows you how to select relevant variables in such a table and record your reasons.
This example uses the Credit Scorecard data set, which contains three tables of customer information such as age, income, and employment status. One such table, dataMissing, deliberately has a few blank entries. The data could be used for developing a statistical model such as a MATLAB® credit scorecard model. The example loads the data set in the Remove Risk Factors task, marks some variables for exclusion, and documents the results using Modelscape reporting.
Load Data and Launch the Tool
Load the input data from CreditCardData.mat.
load CreditCardData
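Before opening the task, it can help to see where the blank entries in dataMissing are. The following is a minimal sketch that uses only standard MATLAB table functions and is not part of the Modelscape workflow. The names missingCounts and missingSummary are illustrative, and the code assumes CreditCardData.mat is on the MATLAB path.

```matlab
% Assumes CreditCardData.mat is on the MATLAB path, as in the load step above.
load CreditCardData

% ismissing returns a logical array with one column per table variable,
% so summing down the rows counts the missing entries in each variable.
missingCounts = sum(ismissing(dataMissing), 1);

% Collect the counts in a small table for easier reading.
% missingSummary is an illustrative name, not a Modelscape output.
missingSummary = table(string(dataMissing.Properties.VariableNames)', missingCounts', ...
    'VariableNames', {'Variable', 'NumMissing'});
disp(missingSummary)
```

Variables with many missing entries are natural candidates for closer inspection in the task.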
Open a new live script. There are two ways to open the Remove Risk Factors task:
1. Type remove and select Remove Risk Factors in the drop-down selection.
2. Search for the tool under Task in the Live Editor gallery.
In the task, select your input data, for example the dataMissing variable.
Inspect and Filter Variables
The task shows the summary statistics and the histogram for the first variable in the table (in this case CustID).
To inspect other variables, click the corresponding variable name in the Analyze data variables section. This section contains three columns that you can sort. The Variable Names column is read-only. The Exclude column allows you to exclude variables from the table. To do this, check the Exclude box to mark the corresponding variable for removal. The Comment column lets you add reasons for the exclusion (or inclusion) by double-clicking the box.
When you exclude variables and add comments, the task dynamically produces two outputs:
filteredTable: This is a subtable of the input table without the excluded risk factors. Use this subtable in the next step of the model development process, for example feature selection.
exclusionTable: This table includes all the data of the input table together with the exclusion flags and comments in the task. To view this information, tick the 'Preview summary tables' box in the 'Display results' section. This information is stored in the exclusionTable.Properties.CustomProperties metadata.
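To illustrate how per-variable flags and comments can travel with a table as custom properties, here is a minimal sketch using the standard MATLAB addprop function on a small stand-in table. The property names Excluded and Comment are hypothetical. The names that the task actually stores in exclusionTable.Properties.CustomProperties are not given in this example, so inspect that structure directly to see them.

```matlab
% Small stand-in table; the values are made up for illustration.
T = table([1; 2; 3], [34; 51; 28], [52000; 61000; 38000], ...
    'VariableNames', {'CustID', 'CustAge', 'CustIncome'});

% Attach two per-variable custom properties.
% 'Excluded' and 'Comment' are hypothetical names, not the Modelscape ones.
T = addprop(T, {'Excluded', 'Comment'}, {'variable', 'variable'});
T.Properties.CustomProperties.Excluded = [true false false];
T.Properties.CustomProperties.Comment = ["Random identifier", "", "Inspected, keep"];

% Keep only the variables that are not flagged for exclusion.
keep = ~T.Properties.CustomProperties.Excluded;
filtered = T(:, keep);

disp(filtered.Properties.VariableNames)
```

Because the properties are declared per variable, MATLAB subsets them together with the columns, so filtered carries only the flags and comments of the variables it retains.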
progressSummaryPreview lists the total number of variables, the excluded variables, the included variables, and the number of variables with comments. You can use the last of these to check whether the removal process is complete: in the end, every variable must have either a reason for exclusion or some indication that it has been inspected.
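The completeness check described above can be sketched in a few lines. This is an illustration only, not how Modelscape computes its summary. It assumes the summary reports counts, and the excluded and comments vectors are hypothetical stand-ins for the flags and comments entered in the task.

```matlab
% Hypothetical flags and comments for four variables.
excluded = [true false false true];
comments = ["Random identifier", "", "Inspected, keep", "Legally sensitive"];

% Tally the quantities that a progress summary would report.
numVariables = numel(excluded);
numExcluded  = nnz(excluded);
numIncluded  = numVariables - numExcluded;
numCommented = nnz(strlength(comments) > 0);

progress = table(numVariables, numExcluded, numIncluded, numCommented);
disp(progress)

% The review is complete only when every variable has a comment.
isComplete = (numCommented == numVariables);
```

Here one variable still lacks a comment, so isComplete is false and the review is not yet finished.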
Document with Modelscape Reporting
Use Modelscape Reporting to document the findings of the analysis described above. Use the metadata stored in exclusionTable for this purpose. To include the tables shown above as exclusionSummaryPreview and progressSummaryPreview in a Word document, create document holes with titles ExclusionSummary and ProgressSummary in the Word document.
import mrm.data.filter.*
[ExclusionSummary, ProgressSummary] = summarizeExclusionTable(exclusionTable)
To create document holes in a Word document, view the Developer tab, and click the 'Rich Text Content Control' symbol Aa in the Controls area. Then click 'Properties' and fill in the Title field.
Running fillReportFromWorkspace will then pick up these new variables from the MATLAB workspace and insert them into the model document.
For more information on fillReportFromWorkspace, see Model Documentation in Modelscape.