gencfeatures

Perform automated feature engineering for classification

    Description

    The gencfeatures function enables you to automate the feature engineering process in the context of a machine learning workflow. Before passing tabular training data to a classifier, you can create new features from the predictors in the data by using gencfeatures. Use the returned data to train the classifier.

    To better understand the generated features, use the describe function of the returned FeatureTransformer object. To apply the same training set feature transformations to a test set, use the transform function of the FeatureTransformer object.
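
    For example, a minimal sketch of this workflow, assuming a training table trainTbl with a response variable named "Y" and a separate test table testTbl (hypothetical names for illustration):

    [T,newTrainTbl] = gencfeatures(trainTbl,"Y",20); % engineer 20 features
    describe(T)                                      % inspect the generated transformations
    Mdl = fitclinear(newTrainTbl,"Y");               % train on the new features
    newTestTbl = transform(T,testTbl);               % apply the same transformations to the test set
    labels = predict(Mdl,newTestTbl);                % predict on the transformed test set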


    [Transformer,NewTbl] = gencfeatures(Tbl,ResponseVarName,q) uses automated feature engineering to create q features from the predictors in Tbl. The software assumes that the ResponseVarName variable in Tbl is the response and does not create new features from this variable. gencfeatures returns a FeatureTransformer object (Transformer) and a new table (NewTbl) that contains the transformed features.

    By default, gencfeatures assumes that generated features are used to train an interpretable linear model with a binary response variable. If you have a multiclass response variable and you want to generate features to improve the accuracy of a bagged ensemble, specify 'TargetLearner','bag'.


    [Transformer,NewTbl] = gencfeatures(Tbl,Y,q) assumes that the vector Y is the response variable and creates new features from the variables in Tbl.

    [Transformer,NewTbl] = gencfeatures(Tbl,formula,q) uses the explanatory model formula to determine the response variable in Tbl and the subset of Tbl predictors from which to create new features.
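
    For example, with the patients variables used later on this page, a call like the following (illustrative only) creates features from only the Age, Systolic, and Diastolic predictors:

    [Transformer,NewTbl] = gencfeatures(Tbl, ...
        "Smoker ~ Age + Systolic + Diastolic",5);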


    [Transformer,NewTbl] = gencfeatures(___,Name,Value) specifies options using one or more name-value arguments in addition to any of the input argument combinations in previous syntaxes. For example, you can change the expected learner type, the method for selecting new features, and the standardization method for transformed data.
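
    For example, the following call, a sketch assuming a response variable named "Y", targets a bagged ensemble, selects features by MRMR, and skips standardization of the transformed data:

    [Transformer,NewTbl] = gencfeatures(Tbl,"Y",20, ...
        "TargetLearner","bag", ...
        "FeatureSelectionMethod","mrmr", ...
        "TransformedDataStandardization","none");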

    Examples


    Interpret Linear Model with Generated Features

    Use automated feature engineering to generate new features. Train a linear classifier using the generated features. Interpret the relationship between the generated features and the trained model.

    Load the patients data set. Create a table from a subset of the variables.

    load patients
    Tbl = table(Age,Diastolic,Gender,Height,SelfAssessedHealthStatus, ...
        Systolic,Weight,Smoker);

    Generate 10 new features from the variables in Tbl. Specify the Smoker variable as the response. By default, gencfeatures assumes that the new features will be used to train a binary linear classifier.

    rng("default") % For reproducibility
    [T,NewTbl] = gencfeatures(Tbl,"Smoker",10)
    T = 
      FeatureTransformer with properties:
    
                         Type: 'classification'
                TargetLearner: 'linear'
        NumEngineeredFeatures: 10
          NumOriginalFeatures: 0
             TotalNumFeatures: 10
    
    
    NewTbl=100×11 table
        zsc(Systolic.^2)    eb8(Diastolic)    q8(Systolic)    eb8(Systolic)    q8(Diastolic)    zsc(kmd9)    zsc(sin(Age))    zsc(sin(Weight))    zsc(Height-Systolic)    zsc(kmc1)    Smoker
        ________________    ______________    ____________    _____________    _____________    _________    _____________    ________________    ____________________    _________    ______
    
             0.15379              8                6                4                8           -1.7207        0.50027            0.19202               0.40418            0.76177    true  
             -1.9421              2                1                1                2          -0.22056        -1.1319            -0.4009                2.3431             1.1617    false 
             0.30311              4                6                5                5           0.57695        0.50027             -1.037              -0.78898            -1.4456    false 
            -0.85785              2                2                2                2           0.83391         1.1495             1.3039               0.85162          -0.010294    false 
            -0.14125              3                5                4                4             1.779        -1.3083           -0.42387              -0.34154            0.99368    false 
            -0.28697              1                4                3                1           0.67326         1.3761           -0.72529               0.40418             1.3755    false 
              1.0677              6                8                6                6          -0.42521         1.5181           -0.72529               -1.5347            -1.4456    true  
             -1.1361              4                2                2                5          -0.79995         1.1495            -1.0225                1.2991             1.1617    false 
             -1.1361              3                2                2                3          -0.80136        0.46343             1.0806                1.2991             -1.208    false 
            -0.71693              5                3                3                6           0.37961       -0.51304            0.16741               0.55333            -1.4456    false 
             -1.2734              2                1                1                2            1.2572         1.3025             1.0978                1.4482          -0.010294    false 
             -1.1361              1                2                2                1             1.001        -1.2545            -1.2194                1.0008          -0.010294    false 
             0.60534              1                6                5                1          -0.98493       -0.11998             -1.211             -0.043252             -1.208    false 
              1.0677              8                8                6                8          -0.27307         1.4659             1.2168              -0.34154            0.24706    true  
             -1.2734              3                1                1                4           0.93395        -1.3633           -0.17603                1.0008          -0.010294    false 
              1.0677              7                8                6                8          -0.91396          -1.04            -1.2109              -0.49069            0.24706    true  
          ⋮
    
    

    T is a FeatureTransformer object that you can use to transform new data, and NewTbl contains the new features generated from the Tbl data.

    To better understand the generated features, use the describe object function of the FeatureTransformer object. For example, inspect the first two generated features.

    describe(T,1:2)
                               Type        IsOriginal    InputVariables                            Transformations
                            ___________    __________    ______________    _______________________________________________________________
    
        zsc(Systolic.^2)    Numeric          false         Systolic        power(  ,2)
                                                                           Standardization with z-score (mean = 15119.54, std = 1667.5858)
        eb8(Diastolic)      Categorical      false         Diastolic       Equal-width binning (number of bins = 8)
    

    The first feature in NewTbl is a numeric variable, created by first squaring the values of the Systolic variable and then converting the results to z-scores. The second feature in NewTbl is a categorical variable, created by binning the values of the Diastolic variable into 8 equal-width bins.
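
    As a rough sketch (not the exact internal computation), you can reproduce these two transformations manually from the patients variables:

    s2 = Systolic.^2;                                  % square the Systolic values
    zscSystolic2 = (s2 - mean(s2))/std(s2);            % convert to z-scores
    edges = linspace(min(Diastolic),max(Diastolic),9); % 9 edges define 8 equal-width bins
    eb8Diastolic = discretize(Diastolic,edges);        % bin index for each observation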

    Use the generated features to fit a linear classifier without any regularization.

    Mdl = fitclinear(NewTbl,"Smoker","Lambda",0);

    Plot the coefficients of the predictors used to train Mdl. Note that fitclinear expands categorical predictors before fitting a model.

    p = length(Mdl.Beta);
    [sortedCoefs,expandedIndex] = sort(Mdl.Beta,"ComparisonMethod","abs");
    sortedExpandedPreds = Mdl.ExpandedPredictorNames(expandedIndex);
    bar(sortedCoefs,"Horizontal","on")
    yticks(1:2:p)
    yticklabels(sortedExpandedPreds(1:2:end))
    xlabel("Coefficient")
    ylabel("Expanded Predictors")
    title("Coefficients for Expanded Predictors")

    Identify the predictors whose coefficients have larger absolute values.

    bigCoefs = abs(sortedCoefs) >= 4;
    flip(sortedExpandedPreds(bigCoefs))
    ans = 1×7 cell
        {'zsc(Systolic.^2)'}    {'eb8(Systolic) >= 5'}    {'q8(Diastolic) >= 3'}    {'eb8(Diastolic) >= 3'}    {'q8(Systolic) >= 6'}    {'q8(Diastolic) >= 6'}    {'zsc(Height-Systolic)'}
    
    

    You can use partial dependence plots to analyze the categorical features whose levels have large coefficients in terms of absolute value. For example, inspect the partial dependence plot for the q8(Diastolic) variable, whose levels q8(Diastolic) >= 3 and q8(Diastolic) >= 6 have coefficients with large absolute values. These two levels correspond to noticeable changes in the predicted scores.

    plotPartialDependence(Mdl,"q8(Diastolic)",Mdl.ClassNames,NewTbl);

    Improve Accuracy of Interpretable Linear Model Using Generated Features

    Generate new features to improve the model accuracy for an interpretable linear model. Compare the test set accuracy of a linear model trained on the original data to the test set accuracy of a linear model trained on the transformed features.

    Load the ionosphere data set. Convert the matrix of predictors X to a table.

    load ionosphere
    tbl = array2table(X);

    Partition the data into training and test sets. Use approximately 70% of the observations as training data, and 30% of the observations as test data. Partition the data using cvpartition.

    rng("default") % For reproducibility of the partition
    cvp = cvpartition(Y,"Holdout",0.3);
    
    trainIdx = training(cvp);
    trainTbl = tbl(trainIdx,:);
    trainY = Y(trainIdx);
    
    testIdx = test(cvp);
    testTbl = tbl(testIdx,:);
    testY = Y(testIdx);

    Use the training data to generate 45 new features. Inspect the returned FeatureTransformer object.

    [T,newTrainTbl] = gencfeatures(trainTbl,trainY,45);
    T
    T = 
      FeatureTransformer with properties:
    
                         Type: 'classification'
                TargetLearner: 'linear'
        NumEngineeredFeatures: 45
          NumOriginalFeatures: 0
             TotalNumFeatures: 45
    
    

    All the generated features are engineered features rather than original features in trainTbl.

    Apply the transformations stored in the object T to the test data.

    newTestTbl = transform(T,testTbl);

    Compare the test set performances of a linear classifier trained on the original features and a linear classifier trained on the new features.

    Fit a linear model without transforming the data. Check the test set performance of the model using a confusion matrix.

    originalMdl = fitclinear(trainTbl,trainY);
    originalPredictedLabels = predict(originalMdl,testTbl);
    cm = confusionchart(testY,originalPredictedLabels);

    confusionMatrix = cm.NormalizedValues;
    originalTestAccuracy = sum(diag(confusionMatrix))/sum(confusionMatrix,"all")
    originalTestAccuracy = 0.8952
    

    Fit a linear model with the transformed data. Check the test set performance of the model using a confusion matrix.

    newMdl = fitclinear(newTrainTbl,trainY);
    newPredictedLabels = predict(newMdl,newTestTbl);
    newcm = confusionchart(testY,newPredictedLabels);

    newConfusionMatrix = newcm.NormalizedValues;
    newTestAccuracy = sum(diag(newConfusionMatrix))/sum(newConfusionMatrix,"all")
    newTestAccuracy = 0.9238
    

    The linear classifier trained on the transformed data seems to outperform the linear classifier trained on the original data.

    Generate Features to Train Bagged Ensemble Classifier

    Use gencfeatures to engineer new features before training a bagged ensemble classifier. Before making predictions on new data, apply the same feature transformations to the new data set. Compare the test set performance of the ensemble that uses the engineered features to the test set performance of the ensemble that uses the original features.

    Read the sample file CreditRating_Historical.dat into a table. The predictor data consists of financial ratios and industry sector information for a list of corporate customers. The response variable consists of credit ratings assigned by a rating agency. Preview the first few rows of the data set.

    creditrating = readtable("CreditRating_Historical.dat");
    head(creditrating)
    ans=8×8 table
         ID      WC_TA     RE_TA     EBIT_TA    MVE_BVTD    S_TA     Industry    Rating 
        _____    ______    ______    _______    ________    _____    ________    _______
    
        62394     0.013     0.104     0.036      0.447      0.142        3       {'BB' }
        48608     0.232     0.335     0.062      1.969      0.281        8       {'A'  }
        42444     0.311     0.367     0.074      1.935      0.366        1       {'A'  }
        48631     0.194     0.263     0.062      1.017      0.228        4       {'BBB'}
        43768     0.121     0.413     0.057      3.647      0.466       12       {'AAA'}
        39255    -0.117    -0.799      0.01      0.179      0.082        4       {'CCC'}
        62236     0.087     0.158     0.049      0.816      0.324        2       {'BBB'}
        39354     0.005     0.181     0.034      2.597      0.388        7       {'AA' }
    
    

    Because each value in the ID variable is a unique customer ID, that is, length(unique(creditrating.ID)) is equal to the number of observations in creditrating, the ID variable is a poor predictor. Remove the ID variable from the table, and convert the Industry variable to a categorical variable.

    creditrating = removevars(creditrating,"ID");
    creditrating.Industry = categorical(creditrating.Industry);

    Convert the Rating response variable to an ordinal categorical variable.

    creditrating.Rating = categorical(creditrating.Rating, ...
        ["AAA","AA","A","BBB","BB","B","CCC"],"Ordinal",true);

    Partition the data into training and test sets. Use approximately 75% of the observations as training data, and 25% of the observations as test data. Partition the data using cvpartition.

    rng("default") % For reproducibility of the partition
    c = cvpartition(creditrating.Rating,"Holdout",0.25);
    trainingIndices = training(c); % Indices for the training set
    testIndices = test(c); % Indices for the test set
    creditTrain = creditrating(trainingIndices,:);
    creditTest = creditrating(testIndices,:);

    Use the training data to generate 40 new features to fit a bagged ensemble. By default, the 40 features can include original features if the software considers them to be important variables.

    [T,newCreditTrain] = gencfeatures(creditTrain,"Rating",40, ...
        "TargetLearner","bag");
    T
    T = 
      FeatureTransformer with properties:
    
                         Type: 'classification'
                TargetLearner: 'bag'
        NumEngineeredFeatures: 34
          NumOriginalFeatures: 6
             TotalNumFeatures: 40
    
    

    Because T.NumOriginalFeatures is 6, the function keeps all the original predictors.

    Create newCreditTest by applying the transformations stored in the object T to the test data.

    newCreditTest = transform(T,creditTest);

    Compare the test set performances of a bagged ensemble trained on the original features and a bagged ensemble trained on the new features.

    Train a bagged ensemble using the original training set creditTrain. Compute the accuracy of the model on the original test set creditTest. Visualize the results using a confusion matrix.

    originalMdl = fitcensemble(creditTrain,"Rating","Method","Bag");
    originalTestAccuracy = 1 - loss(originalMdl,creditTest, ...
        "Rating","LossFun","classiferror")
    originalTestAccuracy = 0.7481
    
    predictedTestLabels = predict(originalMdl,creditTest);
    confusionchart(creditTest.Rating,predictedTestLabels);

    Train a bagged ensemble using the transformed training set newCreditTrain. Compute the accuracy of the model on the transformed test set newCreditTest. Visualize the results using a confusion matrix.

    newMdl = fitcensemble(newCreditTrain,"Rating","Method","Bag");
    newTestAccuracy = 1 - loss(newMdl,newCreditTest, ...
        "Rating","LossFun","classiferror")
    newTestAccuracy = 0.7543
    
    newPredictedTestLabels = predict(newMdl,newCreditTest);
    confusionchart(newCreditTest.Rating,newPredictedTestLabels)

    The bagged ensemble trained on the transformed data seems to outperform the bagged ensemble trained on the original data.

    Compute Cross-Validation Classification Loss Using Generated Features

    Generate features to train a linear classifier. Compute the cross-validation classification error of the model by using the crossval function.

    Load the ionosphere data set, and create a table containing the predictor data.

    load ionosphere
    Tbl = array2table(X);

    Create a random partition for stratified 5-fold cross-validation.

    rng("default") % For reproducibility of the partition
    cvp = cvpartition(Y,"KFold",5);

    Compute the cross-validation classification loss for a linear model trained on the original features in Tbl.

    CVMdl = fitclinear(Tbl,Y,"CVPartition",cvp);
    cvloss = kfoldLoss(CVMdl)
    cvloss = 0.1339
    

    Create the custom function myloss (shown at the end of this example). This function generates 20 features from the training data, and then applies the same training set transformations to the test data. The function then fits a linear classifier to the training data and computes the test set loss.

    Note: If you use the live script file for this example, the myloss function is already included at the end of the file. Otherwise, you need to create this function at the end of your .m file or add it as a file on the MATLAB® path.

    Compute the cross-validation classification loss for a linear model trained on features generated from the predictors in Tbl.

    newcvloss = mean(crossval(@myloss,Tbl,Y,"Partition",cvp))
    newcvloss = 0.0770
    
    function testloss = myloss(TrainTbl,trainY,TestTbl,testY)
    [Transformer,NewTrainTbl] = gencfeatures(TrainTbl,trainY,20);
    NewTestTbl = transform(Transformer,TestTbl);
    Mdl = fitclinear(NewTrainTbl,trainY);
    testloss = loss(Mdl,NewTestTbl,testY, ...
        "LossFun","classiferror");
    end

    Input Arguments


    Tbl — Original features

    Original features, specified as a table. Each row of Tbl corresponds to one observation, and each column corresponds to one predictor variable. Optionally, Tbl can contain one additional column for the response variable. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

    If Tbl contains the response variable, and you want to create new features from any of the remaining variables in Tbl, then specify the response variable by using ResponseVarName.

    If Tbl contains the response variable, and you want to create new features from only a subset of the remaining variables in Tbl, then specify a formula by using formula.

    If Tbl does not contain the response variable, then specify a response variable by using Y. The length of the response variable and the number of rows in Tbl must be equal.

    Data Types: table

    ResponseVarName — Response variable name

    Response variable name, specified as the name of a variable in Tbl.

    You must specify ResponseVarName as a character vector or string scalar. For example, if the response variable Y is stored as Tbl.Y, then specify it as 'Y'. Otherwise, the software treats all columns of Tbl as predictors, and might create new features from Y.

    Data Types: char | string

    q — Number of features

    Number of features, specified as a positive integer scalar. For example, you can set q to approximately 1.5*size(Tbl,2), which is about 1.5 times the number of original features.
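
    For example, a sketch of this sizing heuristic, assuming a response variable named "Y":

    q = ceil(1.5*size(Tbl,2)); % roughly 1.5 times the number of original features
    [Transformer,NewTbl] = gencfeatures(Tbl,"Y",q);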

    Data Types: single | double

    Y — Response variable

    Response variable, specified as a numeric, logical, or categorical vector, a character or string array, or a cell array of character vectors. The length of Y must be equal to the number of rows in Tbl.

    formula — Explanatory model of response variable and subset of predictor variables

    Explanatory model of the response variable and a subset of the predictor variables, specified as a character vector or string scalar in the form 'Y~X1+X2+X3'. In this form, Y represents the response variable, and X1, X2, and X3 represent the predictor variables.

    To create new features from only a subset of the predictor variables in Tbl, use a formula. If you specify a formula, then the software does not create new features from any variables in Tbl that do not appear in formula.

    The variable names in the formula must be both variable names in Tbl (Tbl.Properties.VariableNames) and valid MATLAB® identifiers. You can verify the variable names in Tbl by using the isvarname function. If the variable names are not valid, then you can convert them by using the matlab.lang.makeValidName function.
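
    For example, a sketch that repairs invalid variable names before you specify a formula:

    names = Tbl.Properties.VariableNames;
    valid = cellfun(@isvarname,names);            % check each name
    Tbl.Properties.VariableNames(~valid) = ...
        matlab.lang.makeValidName(names(~valid)); % fix the invalid ones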

    Data Types: char | string

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: gencfeatures(Tbl,'Y',10,'TargetLearner','bag','FeatureSelectionMethod','oob') specifies that the expected learner type is a bagged ensemble classifier and the method for selecting features is an out-of-bag, predictor importance technique.

    TargetLearner — Expected learner type

    Expected learner type, specified as 'linear' or 'bag'. The software creates and selects new features assuming that they will be used to train this type of model.

    Value       Expected Model
    'linear'    ClassificationLinear — Appropriate for binary classification only
    'bag'       ClassificationBaggedEnsemble — Appropriate for binary and multiclass classification

    By default, TargetLearner is 'linear', which supports binary response variables only. If you have a multiclass response variable and you want to generate new features, you must set TargetLearner to 'bag'.

    Example: 'TargetLearner','bag'

    IncludeInputVariables — Method for including original features in NewTbl

    Method for including the original features of Tbl in the new table NewTbl, specified as one of the values in this table.

    Value        Description
    'auto'       Equivalent to 'select' when TargetLearner is 'linear', and to 'include' when TargetLearner is 'bag'
    'include'    The software includes all original features that can be used as predictors by the target learner, and excludes unsupported features, such as datetime and duration variables.
    'select'     The software includes original features that are supported by the target learner and considered to be important by the specified feature selection method (FeatureSelectionMethod).
    'omit'       The software omits the original features.

    Example: 'IncludeInputVariables','include'

    Data Types: char | string

    FeatureSelectionMethod — Method for selecting new features

    Method for selecting new features, specified as one of the values in this table. The software generates many features and uses this method to select the important features to include in NewTbl.

    Value      Description
    'auto'     Equivalent to 'lasso' when TargetLearner is 'linear', and to 'oob' when TargetLearner is 'bag'
    'oob'      Out-of-bag, predictor importance estimates by permutation — Available when TargetLearner is 'bag'
    'mrmr'     Minimum redundancy maximum relevance (MRMR) — Available when TargetLearner is 'linear' or 'bag'
    'lasso'    Lasso regularization — Available when TargetLearner is 'linear'

    Example: 'FeatureSelectionMethod','mrmr'

    TransformedDataStandardization — Standardization method for transformed data

    Standardization method for the transformed data, specified as one of the values in this table.

    Value       Description
    'auto'      Equivalent to 'zscore' when TargetLearner is 'linear', and to 'none' when TargetLearner is 'bag'
    'none'      Use the raw data
    'zscore'    Center and scale to have mean 0 and standard deviation 1
    'mad'       Center and scale to have median 0 and median absolute deviation 1
    'range'     Scale the range of the data to [0,1]

    Example: 'TransformedDataStandardization','range'

    CategoricalEncodingLimit — Maximum number of categories allowed in categorical predictor

    Maximum number of categories allowed in a categorical predictor, specified as a nonnegative integer scalar. If a categorical predictor has more than the specified number of categories, then gencfeatures does not create new features from the predictor. The default value is 50 when TargetLearner is 'linear' and Inf when TargetLearner is 'bag'.

    Example: 'CategoricalEncodingLimit',20

    Data Types: single | double

    Output Arguments


    Transformer — Engineered feature transformer

    Engineered feature transformer, returned as a FeatureTransformer object. To better understand the engineered features, use the describe object function of Transformer. To apply the same feature transformations to a new data set, use the transform object function of Transformer.

    NewTbl — Generated features

    Generated features, returned as a table. Each row corresponds to an observation, and each column corresponds to a generated feature. If the response variable is included in Tbl, then NewTbl also includes the response variable. Use this table to train a classification model of type TargetLearner.

    NewTbl contains generated features in the following order: original features, engineered features as ranked by the feature selection method, and the response variable.

    Tips

    • By default, when TargetLearner is 'linear', the software generates new features from numeric predictors by using z-scores (see TransformedDataStandardization). You can change the type of standardization for the transformed features; however, using some method of standardization (that is, avoiding the 'none' specification) is strongly recommended, because linear model fitting works best with standardized data.

    Introduced in R2021a