What Is Overfitting?
Overfitting is a machine learning behavior that occurs when a model is so closely aligned to its training data that it fails to generalize to new data. Overfitting can happen because:
- The machine learning model is too complex; it memorizes very subtle patterns in the training data that don’t generalize well.
- The training data size is too small for the model complexity and/or contains large amounts of irrelevant information.
You can prevent overfitting by managing model complexity and improving the training data set.
Overfitting vs. Underfitting
Underfitting is the opposite of overfitting: the model neither aligns well with the training data nor generalizes well to new data. Overfitting and underfitting can be present in both classification and regression models. The following figure illustrates how the classification decision boundary and regression line follow the training data too closely for an overfitted model and not closely enough for an underfitted model.
When you look only at the computed error of a machine learning model for the training data, overfitting is harder to detect than underfitting, because an overfitted model has a low training error. So, to avoid overfitting, it is important to validate a machine learning model before using it on test data.
Error | Overfitting | Right Fit | Underfitting
Training | Low | Low | High
Test | High | Low | High
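The pattern of training and test error can be reproduced with a small regression experiment. The following is a minimal sketch, not taken from the original article: the synthetic sine data, the noise level, the alternating train/test split, and the polynomial degrees are all assumptions chosen for illustration. It uses only the base MATLAB functions polyfit and polyval.

```matlab
% Illustrative sketch: compare training and test error as model complexity grows.
% The data, the split, and the degrees are assumptions chosen for illustration.
rng(0);                                    % fix the seed so the run is repeatable
x = linspace(0, 1, 30)';
y = sin(2*pi*x) + 0.2*randn(size(x));      % noisy samples of a smooth curve

% Alternate points between a training set and a test set
xTrain = x(1:2:end);  yTrain = y(1:2:end);
xTest  = x(2:2:end);  yTest  = y(2:2:end);

degrees  = [1 3 9];                        % low, moderate, and high complexity
trainMSE = zeros(size(degrees));
testMSE  = zeros(size(degrees));

for i = 1:numel(degrees)
    p = polyfit(xTrain, yTrain, degrees(i));                  % least-squares polynomial fit
    trainMSE(i) = mean((polyval(p, xTrain) - yTrain).^2);     % error on data the fit has seen
    testMSE(i)  = mean((polyval(p, xTest)  - yTest ).^2);     % error on data the fit has not seen
    fprintf('degree %d: training MSE = %.4f, test MSE = %.4f\n', ...
        degrees(i), trainMSE(i), testMSE(i));
end
```

Training error can only stay the same or decrease as the degree grows, so it cannot reveal overfitting by itself. A gap between a low training error and a noticeably higher test error is the signal to look for.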
Using MATLAB® with Statistics and Machine Learning Toolbox™ and Deep Learning Toolbox™, you can prevent overfitting of machine learning and deep learning models. MATLAB provides functions and methods designed to avoid overfitting, which you can use when you train or tune your model.
How to Avoid Overfitting by Reducing Model Complexity
With MATLAB, you can train machine learning models and deep learning models (such as CNNs) from scratch or take advantage of pretrained deep learning models. To prevent overfitting, perform model validation to ensure that you choose a model with the right level of complexity for your data or use regularization to reduce the complexity of the model.
Model Validation
Because the error of an overfitted model is low when computed for the training data, it is good practice to validate your model on a separate data set (the validation data set) before introducing new data. For MATLAB machine learning models, you can use the cvpartition function to randomly partition a data set into training and validation sets. For deep learning models, you can monitor the validation accuracy during training. Improving a properly validated accuracy measure through model selection and hyperparameter tuning should translate into improved accuracy when the model sees new data.
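As a minimal sketch of a holdout split with cvpartition, the example below trains a classification tree on 70% of the data and compares its error on the training and validation sets. The fisheriris sample data set, the 30% holdout fraction, and the choice of a decision tree (fitctree) are assumptions made for illustration; the example requires Statistics and Machine Learning Toolbox.

```matlab
% Illustrative sketch: holdout validation with cvpartition.
% Data set, split ratio, and model type are assumptions chosen for illustration.
load fisheriris                              % meas (150x4 features), species (150x1 labels)
rng(1);                                      % make the random partition repeatable

c = cvpartition(species, 'HoldOut', 0.3);    % hold out 30% of the observations
idxTrain = training(c);                      % logical index of training observations
idxVal   = test(c);                          % logical index of validation observations

mdl = fitctree(meas(idxTrain,:), species(idxTrain));

trainErr = loss(mdl, meas(idxTrain,:), species(idxTrain));   % error on data the model has seen
valErr   = loss(mdl, meas(idxVal,:),   species(idxVal));     % error on held-out data

fprintf('training error = %.3f, validation error = %.3f\n', trainErr, valErr);
```

A validation error that is much higher than the training error suggests the model is more complex than the data supports.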
Cross-validation is a model assessment technique used to evaluate a machine learning algorithm’s performance in making predictions on data sets that it has not been trained on. Cross-validation helps you choose an algorithm that is not so complex that it causes overfitting. Use the crossval function to compute the cross-validation error estimate for machine learning models by using common cross-validation techniques, such as k-fold (partitions data into k randomly chosen subsets of roughly equal size) and holdout (partitions data randomly into exactly two subsets of specified ratio).
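The sketch below shows one way to compare two models of different complexity by their k-fold error. It uses the crossval object function of a trained classification tree together with kfoldLoss; the fisheriris data set, the two tree settings, 5 folds, and the 20% holdout fraction are assumptions made for illustration.

```matlab
% Illustrative sketch: k-fold and holdout cross-validation of two decision trees.
% Data set, tree settings, and fold count are assumptions chosen for illustration.
load fisheriris
rng(1);                                                  % make the random folds repeatable

simpleTree  = fitctree(meas, species, 'MaxNumSplits', 3); % deliberately limited complexity
defaultTree = fitctree(meas, species);                    % grown with default settings

cvSimple  = crossval(simpleTree,  'KFold', 5);            % 5 folds of roughly equal size
cvDefault = crossval(defaultTree, 'KFold', 5);

fprintf('simple tree:  resubstitution = %.3f, 5-fold = %.3f\n', ...
    resubLoss(simpleTree), kfoldLoss(cvSimple));
fprintf('default tree: resubstitution = %.3f, 5-fold = %.3f\n', ...
    resubLoss(defaultTree), kfoldLoss(cvDefault));

% Holdout variant: a single random split with 20% of the data held out
cvHold  = crossval(defaultTree, 'Holdout', 0.2);
holdErr = kfoldLoss(cvHold);
fprintf('default tree: holdout error = %.3f\n', holdErr);
```

The resubstitution error is computed on the training data, so it tends to flatter the more complex model. The cross-validated error is the fairer basis for choosing between the two.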
Regularization
Regularization is a technique used to prevent statistical overfitting in a machine learning model. Regularization algorithms typically work by applying a penalty for either complexity or roughness. By introducing additional information into the model, regularization algorithms can handle multicollinearity and redundant predictors, making the model more parsimonious and accurate.
For machine learning, you can choose between three popular regularization techniques: lasso (L1 norm), ridge (L2 norm), and elastic net, with several types of linear machine learning models. For deep learning, you can increase the L2 regularization factor in the specified training options or use dropout layers in your network to avoid overfitting.
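The following sketch illustrates these options under stated assumptions. The carsmall sample data, the chosen predictors, the penalty values, and the small network layout are all illustrative choices, not recommendations from the original article. The first part requires Statistics and Machine Learning Toolbox; the last part only builds layer and option objects and requires Deep Learning Toolbox.

```matlab
% Illustrative sketch: lasso, elastic net, and ridge on a linear regression problem.
% Data set, predictors, and penalty values are assumptions chosen for illustration.
load carsmall                                   % sample data with MPG and car attributes
X = [Acceleration Displacement Horsepower Weight];
y = MPG;
ok = all(~isnan([X y]), 2);                     % keep only rows without missing values
X = X(ok,:);  y = y(ok);
rng(1);                                         % make the cross-validation folds repeatable

% Lasso (L1 penalty); 10-fold cross-validation selects the penalty strength
[B, FitInfo] = lasso(X, y, 'CV', 10);
bLasso = B(:, FitInfo.Index1SE);                % sparsest fit within one standard error of the minimum

% Elastic net: Alpha between 0 and 1 mixes the L1 and L2 penalties
[BEN, FitInfoEN] = lasso(X, y, 'Alpha', 0.5, 'CV', 10);
bElasticNet = BEN(:, FitInfoEN.Index1SE);

% Ridge (L2 penalty) with ridge parameter 1; the final 0 restores the original scale
bRidge = ridge(y, X, 1, 0);                     % first element is the intercept

fprintf('nonzero coefficients: lasso = %d, elastic net = %d\n', ...
    nnz(bLasso), nnz(bElasticNet));

% Deep learning: a dropout layer plus a larger L2 regularization factor.
% The layer sizes and the factor 1e-3 are hypothetical values.
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 8, 'Padding', 'same')
    reluLayer
    dropoutLayer(0.5)                           % randomly zeroes 50% of activations during training
    fullyConnectedLayer(10)
    softmaxLayer];
options = trainingOptions('sgdm', 'L2Regularization', 1e-3);
```

Lasso and elastic net can drive some coefficients exactly to zero, which removes redundant predictors from the model, while ridge shrinks coefficients without eliminating them.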
How to Avoid Overfitting by Enhancing the Training Data Set
Cross-validation and regularization prevent overfitting by managing model complexity. Another approach is to improve the data set. Deep learning models, especially, require large amounts of data to avoid overfitting.
Data Augmentation
When data availability is limited, data augmentation is a method to artificially expand the data points of the training data set by adding randomized versions of the existing data to the data set. With MATLAB, you can augment image, audio, and other types of data. For example, augment image data by randomizing the scale and rotation of existing images.
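As a minimal sketch of image augmentation, the example below creates randomized copies of one image by using an imageDataAugmenter object. The peppers.png sample image, the rotation and scale ranges, and the number of copies are assumptions made for illustration; the example requires Deep Learning Toolbox.

```matlab
% Illustrative sketch: randomized rotation and scaling of an existing image.
% The image, the ranges, and the number of copies are assumptions chosen for illustration.
img = imread('peppers.png');                     % sample image that ships with MATLAB

augmenter = imageDataAugmenter( ...
    'RandRotation', [-20 20], ...                % random rotation angle, in degrees
    'RandScale',    [0.8 1.2]);                  % random uniform scale factor

rng(0);                                          % make the random transformations repeatable
numCopies = 4;
augmented = cell(1, numCopies);
for k = 1:numCopies
    augmented{k} = augment(augmenter, img);      % each call draws a new random transformation
end
```

For training, the same augmenter can be passed to an augmentedImageDatastore through its DataAugmentation option, so that new random variants are generated during training instead of being stored in advance.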
Data Generation
Synthetic data generation is another method to expand a data set. With MATLAB, you can generate synthetic data by using generative adversarial networks (GANs) or digital twins (data generation through simulation).
Data Clean Up
Data noisiness contributes to overfitting. One common approach to reduce undesired data points is to remove outliers from the data by using the rmoutliers function.
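A minimal sketch of rmoutliers on a small numeric vector follows. The sample values are made up for illustration. With its default settings, rmoutliers treats a value as an outlier when it is more than three scaled median absolute deviations from the median.

```matlab
% Illustrative sketch: remove outliers from a vector with the default detection method.
% The sample values are made up for illustration.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

[B, removed] = rmoutliers(A);        % B: cleaned data, removed: logical mask of dropped elements

fprintf('removed %d of %d values: %s\n', nnz(removed), numel(A), mat2str(A(removed)));
```

Here the values 100 and 300 lie far from the median of 59 and are removed, leaving 13 values. When the input is a matrix or table, rmoutliers by default removes every row that contains an outlier, so check how many observations remain before training.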