What Is Unsupervised Learning?

Unsupervised learning is a machine learning technique that draws inferences from unlabeled data. It aims to identify hidden patterns and relationships within the data without supervision or prior knowledge of the outcomes.

How Unsupervised Learning Works

Unsupervised learning algorithms discover hidden patterns, structures, and groupings within data, without any prior knowledge of the outcomes. These algorithms rely on unlabeled data, that is, data with no predefined labels.

A typical unsupervised learning process involves preparing the data, applying a suitable unsupervised learning algorithm, and, finally, interpreting and evaluating the results. This approach is particularly useful for tasks such as clustering, where the goal is to group similar data points together, and dimensionality reduction, which simplifies data by reducing the number of features (dimensions). By analyzing the inherent structure of the data, unsupervised learning can provide a better understanding of your data sets.
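As a minimal sketch of this workflow in MATLAB, the following code standardizes Fisher's iris measurements (which ship with Statistics and Machine Learning Toolbox), clusters them with k-means, and evaluates the result with a silhouette plot. The choice of three clusters is an assumption made for illustration.

    % Minimal unsupervised learning workflow: prepare, cluster, evaluate.
    load fisheriris            % iris measurements; no labels are used below
    X = normalize(meas);       % prepare: standardize the four features

    rng(1)                     % reproducible k-means initialization
    idx = kmeans(X,3);         % apply: partition into 3 clusters (assumed k)

    silhouette(X,idx)          % evaluate: values near 1 indicate tight clusters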

Unsupervised learning can also be applied before supervised learning to identify features in exploratory data analysis and establish classes based on groupings. This is part of feature engineering, a process for transforming raw data into features suitable for supervised machine learning.

Organizing unlabeled data into groups using unsupervised learning.

Types of Unsupervised Learning Methods

Clustering

Clustering is the most common unsupervised learning method and helps you understand the natural grouping or inherent structure of a data set. It is used for exploratory data analysis, pattern recognition, anomaly detection, image segmentation, and more. Clustering algorithms, such as k-means or hierarchical clustering, group data points such that data points in the same group (or cluster) are more similar to each other than to data points in other groups.

For example, a cell phone company that wants to optimize the locations where it builds towers can use machine learning to estimate the number of clusters of people relying on those towers. Because a phone can talk to only one tower at a time, the company can use clustering algorithms to design tower placement that optimizes signal reception for groups, or clusters, of customers.
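A sketch of this scenario: assuming customer locations are available as x-y coordinates, k-means returns cluster assignments along with the centroids, and the centroids can serve as candidate tower sites. The synthetic locations and the choice of three towers are illustrative assumptions.

    % Hypothetical customer locations (x, y) drawn from three population centers.
    rng(0)
    locations = [randn(100,2); randn(100,2)+5; randn(100,2)+[10 0]];

    % Partition customers into 3 clusters; the centroids C are candidate tower sites.
    [idx,C] = kmeans(locations,3);

    gscatter(locations(:,1),locations(:,2),idx)              % customers by cluster
    hold on
    plot(C(:,1),C(:,2),'kx','MarkerSize',12,'LineWidth',2)   % proposed tower sites
    hold off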

Using clustering to find hidden patterns in your data.

Clustering is divided into two main categories:

  • Hard or exclusive clustering, where each data point belongs to only one cluster, such as the popular k-means method.
  • Soft or overlapping clustering, where each data point can belong to more than one cluster, such as in Gaussian mixture models.
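The two behaviors are easy to contrast in code. Below is a small sketch comparing hard k-means assignments with the soft cluster memberships (posterior probabilities) returned by a Gaussian mixture model; the two-group data is synthetic.

    rng(2)
    X = [randn(100,2); randn(100,2)+3];   % synthetic data with two groups

    % Hard clustering: each point receives exactly one cluster index.
    hardIdx = kmeans(X,2);

    % Soft clustering: each point receives a degree of membership in every cluster.
    gm = fitgmdist(X,2);
    P = posterior(gm,X);   % N-by-2 matrix of membership probabilities; rows sum to 1
    disp(P(1:5,:))         % a point may be, say, 0.9 in one cluster and 0.1 in the other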

Popular clustering algorithms include:

  • Hierarchical clustering builds a multilevel hierarchy of clusters by creating a cluster tree.
  • k-means partitions data into k distinct clusters based on the distance to the centroid of a cluster.
  • Gaussian mixture models form clusters as a mixture of multivariate normal density components.
  • Density-based spatial clustering of applications with noise (DBSCAN) groups points that are close to each other in areas of high density, keeping track of outliers in low-density regions. It can handle clusters of arbitrary, non-convex shape (see the sketch after this list).
  • Self-organizing maps use neural networks that learn the topology and distribution of the data.
  • Spectral clustering transforms input data into a graph-based representation where the clusters are better separated than in the original feature space. The number of clusters can be estimated by studying the eigenvalues of the graph.
  • Hidden Markov models can be used to discover patterns in sequences, such as genes and proteins in bioinformatics.
  • Fuzzy c-means (FCM) groups data into N clusters, with every data point in the data set belonging to every cluster to a certain degree.
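For instance, DBSCAN can separate two concentric rings that k-means cannot, because it groups by density rather than by distance to a centroid. A sketch, with synthetic ring data and epsilon and minpts values chosen by eye for this data:

    rng(3)
    theta = 2*pi*rand(200,1);
    inner = 1.0*[cos(theta) sin(theta)] + 0.05*randn(200,2);   % inner ring
    outer = 2.5*[cos(theta) sin(theta)] + 0.05*randn(200,2);   % outer ring
    X = [inner; outer];

    % epsilon = 0.3 (neighborhood radius), minpts = 5; DBSCAN labels noise as -1.
    idx = dbscan(X,0.3,5);

    gscatter(X(:,1),X(:,2),idx)   % the two rings emerge as separate clusters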

Clustering is used in various applications, such as image segmentation, anomaly detection, and pattern recognition.

Left: MATLAB scatter plot of petal measurements from several specimens of three iris species. Right: Petal measurements segmented into three clusters using the Gaussian mixture model (GMM) clustering technique.
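A minimal sketch reproducing the approach in this figure: columns 3 and 4 of the fisheriris data set hold petal length and width, and a three-component GMM segments them. The regularization value is an assumption added to keep the fit stable.

    load fisheriris
    X = meas(:,3:4);        % petal length and petal width

    rng(4)                  % reproducible GMM fit
    gm = fitgmdist(X,3,'RegularizationValue',0.01);   % 3-component Gaussian mixture
    idx = cluster(gm,X);    % assign each flower to its most likely component

    gscatter(X(:,1),X(:,2),idx)
    xlabel('Petal length (cm)'), ylabel('Petal width (cm)')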

Dimensionality Reduction

Multivariate data often includes a large number of variables or features, which increases run time and memory requirements. Dimensionality reduction techniques reduce the number of features (dimensions) while preserving the essential information in the original data. Using dimensionality reduction with unsupervised learning can lower the computational load and increase the speed and efficiency of machine learning algorithms.

Another difficulty inherent in data with many variables is the problem of visualizing it. By simplifying the data without losing significant information, dimensionality reduction techniques make it easier to visualize and analyze.

Consider human activity data with 60 dimensions, collected using smartphone accelerometer sensors during five different activities (sitting, standing, walking, running, and dancing). The high dimensionality makes this data difficult to visualize and analyze. Using dimensionality reduction, you can reduce the data to just two or three dimensions without losing significant information.
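A sketch of this reduction, assuming the humanactivity data set that ships with Statistics and Machine Learning Toolbox (feat holds 60 features per observation; actid holds the activity IDs, used here only to color the plot):

    load humanactivity             % feat: observations-by-60 features; actid: activity IDs
    rng(5)
    sample = randsample(size(feat,1),2000);   % subsample to keep t-SNE fast

    Y = tsne(feat(sample,:));      % embed the 60 dimensions into 2

    gscatter(Y(:,1),Y(:,2),actid(sample))     % color by activity for inspection only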

Some popular unsupervised learning methods for reducing dimensionality are:

  • Principal component analysis (PCA) transforms data into a set of orthogonal components that capture the maximum variance with fewer variables. The new variables are called principal components. Each principal component is a linear combination of the original variables. The first principal component is a single axis in space. When you project each observation on that axis, the resulting values form a new variable, and the variance of this variable is the maximum among all possible choices of the first axis. The second principal component is another axis in space, perpendicular to the first. Projecting the observations on this axis generates another new variable. The variance of this variable is the maximum among all possible choices of this second axis. The full set of principal components is as large as the original set of variables, but often the first few components capture over 80% of the total variance of the original data.
  • t-distributed stochastic neighbor embedding (t-SNE) is well suited for visualizing high-dimensional data. It embeds high-dimensional data points in low dimensions in a way that respects similarities between the points. Typically, you can visualize the low-dimensional points to see natural clusters in the original high-dimensional data.
  • Factor analysis is a way to fit a model to multivariate data to estimate interdependence between the variables by identifying underlying factors that explain the observed correlations among the variables. In this unsupervised learning technique, the measured variables depend on a smaller number of unobserved (latent) factors. Because each factor might affect several variables in common, they are known as common factors. Each variable is assumed to be dependent on a linear combination of the common factors, and the coefficients are known as loadings. Each measured variable also includes a component due to independent random variability, known as specific variance, because it is specific to one variable.
  • Autoencoders are neural networks trained to replicate their input data. Autoencoders can be used for different data types, including images, time series, and text. They are useful in many applications, such as anomaly detection, text generation, image generation, image denoising, and digital communications. Autoencoders are often used for dimensionality reduction. The autoencoder consists of two smaller networks: an encoder and a decoder. During training, the encoder learns a set of features, known as a latent representation, from input data. At the same time, the decoder is trained to reconstruct the data based on these features (see the sketch after the figure below).
Image-based anomaly detection using an autoencoder.
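A minimal sketch of reconstruction-based anomaly detection with the trainAutoencoder function from Deep Learning Toolbox. The synthetic signals, the training options, and the error threshold idea are illustrative assumptions; the figure above uses images instead of signals.

    % Train on normal data only: smooth sinusoids, one signal per column.
    t = linspace(0,2*pi,50)';
    normalData = sin(t + rand(1,500)*2*pi);            % 50-by-500 matrix

    autoenc = trainAutoencoder(normalData,10,'MaxEpochs',100);  % 10-unit latent space

    % An anomalous signal reconstructs poorly, so its error stands out.
    anomaly = sin(t) + [zeros(20,1); 2; zeros(29,1)];  % spike injected at sample 21
    errNormal  = mean((normalData(:,1) - predict(autoenc,normalData(:,1))).^2);
    errAnomaly = mean((anomaly - predict(autoenc,anomaly)).^2);
    fprintf('normal: %.4f   anomaly: %.4f\n',errNormal,errAnomaly)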

Association Rules

Association rule learning identifies interesting relationships between variables in large databases. For example, in transactional data, association rules can identify which items are most likely to be bought together by customers. Algorithms used in association rule mining include:

  • Apriori algorithms identify frequent item sets in data by performing a breadth-first search and then derive association rules from these item sets (see the sketch after this list).
  • Equivalence class clustering and bottom-up lattice traversal (ECLAT) algorithms use a depth-first search strategy to find frequent item sets.
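A toy sketch of the support-counting step at the heart of Apriori, written in plain MATLAB over a small, hypothetical set of transactions (the item names and baskets are made up for illustration):

    % Hypothetical transactions: each row is a basket over 4 items (logical flags).
    %            bread  milk  eggs  butter
    T = logical([  1     1     0     1
                   1     1     1     0
                   0     1     1     0
                   1     1     0     1
                   1     0     0     1 ]);
    items = ["bread" "milk" "eggs" "butter"];

    disp(table(items',mean(T)','VariableNames',{'Item','Support'}))  % single-item support

    % Support of every item pair: the fraction of baskets containing both items.
    for i = 1:3
        for j = i+1:4
            fprintf('{%s, %s}: support %.2f\n',items(i),items(j),mean(T(:,i) & T(:,j)))
        end
    end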

Association rules find their most common use cases in market basket analysis, but they can also be used for predictive maintenance. For instance, based on different sensors’ data, algorithms can be used to identify a failure pattern and create rules to predict component failure.

Other methods that apply unsupervised learning include semi-supervised learning and unsupervised feature ranking. Semi-supervised learning reduces the need for labeled data in supervised learning. Clustering applied to the whole data set establishes similarity between labeled and unlabeled data, and labels are propagated to previously unlabeled and similar cluster members. Unsupervised feature ranking assigns scores to features without a given prediction target or response.
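A sketch of the label propagation idea described above, using k-means and a handful of known labels on the fisheriris data. Propagating the most common known label within each cluster is a simplification for illustration, and it assumes every cluster contains at least one labeled point.

    load fisheriris
    X = normalize(meas);
    y = grp2idx(species);                    % numeric labels 1..3; pretend most are unknown

    rng(6)
    known = false(numel(y),1);
    known(randsample(numel(y),15)) = true;   % only 15 labeled observations

    idx = kmeans(X,3);                       % cluster labeled and unlabeled data together

    % Propagate: each cluster takes the most common label among its known members.
    yhat = zeros(size(y));
    for c = 1:3
        yhat(idx == c) = mode(y(known & idx == c));
    end
    fprintf('accuracy on unlabeled points: %.2f\n',mean(yhat(~known) == y(~known)))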

Why Unsupervised Learning Is Important

Unsupervised learning is a major area of machine learning and artificial intelligence that plays a crucial role in exploring and understanding data. Unlike supervised learning, which relies on labeled data to train models, unsupervised learning works with unlabeled data, making it particularly valuable in real-world scenarios where labeling data is often expensive, time consuming, or impractical.

By uncovering hidden patterns, structures, and relationships within data, unsupervised learning enables businesses and researchers to gain meaningful insights that were previously inaccessible. Common tasks in unsupervised learning include pattern recognition, exploratory data analysis, segmentation, anomaly detection, and feature reduction.

The Difference Between Supervised and Unsupervised Learning

Supervised learning involves training a model on a labeled data set to perform classification or regression, meaning that each training example is paired with an output label. The model is trained using a known data set (called the training data set) with a known set of input data (called features) and known responses to make predictions. An example of supervised learning is predicting house prices based on features such as size and number of rooms. Popular supervised machine learning models include linear regression, logistic regression, k-nearest neighbors (KNN), and support vector machines. Deep learning models are also trained using large sets of labeled data and can often learn features directly from the data without the need for manual feature extraction.

In contrast, unsupervised learning deals with unlabeled data. The unsupervised learning algorithm tries to learn the underlying structure of the data without any prior knowledge. The main objective in unsupervised learning is to find hidden patterns or intrinsic structures in the input data. An example of unsupervised learning is grouping fruits based on similarity in color, size, and taste, without knowing what the fruits are. Common unsupervised learning algorithms include clustering methods such as k-means, hierarchical clustering, and dimensionality reduction techniques such as principal component analysis (PCA).
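The contrast is visible in the function calls themselves. In the sketch below, the supervised model consumes both the iris measurements and the species labels, while k-means sees only the measurements; the species labels are used afterward just to inspect the discovered grouping.

    load fisheriris

    % Supervised: features AND labels go into training.
    knnModel = fitcknn(meas,species);    % can then classify new flowers by species

    % Unsupervised: only the features go in; the algorithm finds its own groups.
    rng(7)
    idx = kmeans(meas,3);

    crosstab(idx,species)   % compare discovered clusters against the true species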

Unsupervised learning results are generally less accurate than supervised learning results due to the absence of labeled data. However, acquiring labeled data requires human intervention and can be time consuming and even impossible in some cases, such as for biological data. Ground truth labeling might also require domain knowledge, especially when labeling complex signals rather than images of commonly encountered objects.

Supervised and unsupervised learning are types of machine learning: unsupervised learning includes clustering, while supervised learning includes classification and regression.

Examples of Unsupervised Learning

The ability of unsupervised learning to identify hidden patterns and relationships without the need for predefined labels makes it an indispensable tool in various applications, including:

  • Exploratory data analysis: Unsupervised learning techniques are widely used to explore data to uncover hidden inherent structures and draw insights from them. For example, factor analysis can be used to analyze if companies within the same sector experience similar week-to-week changes in stock price.
  • Anomaly detection: Unsupervised learning methods such as isolation forests and Gaussian mixture models (GMMs) are used to detect anomalies.
  • Medical imaging: Clustering, an unsupervised learning technique, is extremely useful for image segmentation. Clustering algorithms can be applied to medical images and segment them based on pixel density, color, or other features. Doctors can use this information to identify areas of interest, such as differentiating between healthy tissue and tumors or segmenting the brain into white matter, gray matter, and cerebrospinal fluid.
  • Genomics and bioinformatics: Genetic clustering and sequence analysis are used in bioinformatics. For example, clustering can be used to identify relationships between gene expression profiles. 
  • Recommendation systems: Unsupervised learning techniques, such as singular value decomposition (SVD), are used in collaborative filtering to decompose the user-item interaction matrix. This approach is used by popular video streaming platforms to recommend content to individual users (see the sketch after this list).
  • Natural language processing (NLP): In natural language processing, unsupervised learning techniques are used for tasks such as topic modeling, document clustering, and building AI language models.
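A sketch of the SVD idea from the recommendation systems item above: a small, hypothetical user-by-item ratings matrix is factored with svds, and the low-rank reconstruction supplies scores for unseen items. Treating unrated entries as zeros is a simplification; real systems handle missing ratings more carefully.

    % Hypothetical ratings: 5 users x 6 items, 0 means "not rated".
    R = [5 4 0 1 0 0
         4 5 4 0 0 1
         0 4 5 0 1 0
         1 0 0 4 5 4
         0 1 0 5 4 5];

    [U,S,V] = svds(R,2);    % rank-2 truncated SVD of the ratings matrix
    Rhat = U*S*V';          % low-rank reconstruction: predicted affinity scores

    % Recommend user 1 the unrated item with the highest predicted score.
    scores = Rhat(1,:);
    scores(R(1,:) > 0) = -Inf;      % mask items the user already rated
    [~,best] = max(scores);
    fprintf('recommend item %d to user 1\n',best)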

Unsupervised learning has diverse applications in various domains. By revealing hidden patterns and relationships, unsupervised learning enables engineers and researchers to make informed decisions. As data continues to grow exponentially, the importance and impact of unsupervised learning will only continue to expand.

Unsupervised Learning with MATLAB

MATLAB® enables you to create unsupervised learning pipelines from data preparation to model evaluation and deployment:

  • With Statistics and Machine Learning Toolbox™, you can apply unsupervised learning methods, such as clustering and dimensionality reduction, to your data and evaluate model performance.
  • With Deep Learning Toolbox™, you can perform unsupervised learning with autoencoder neural networks.
  • With MATLAB Coder™, you can generate C/C++ code for deploying unsupervised learning methods to a variety of hardware platforms.
Extended unsupervised learning workflow in MATLAB: access and explore data, preprocess it, apply an unsupervised learning algorithm, evaluate the results to draw insights, and share those insights.

Data Preparation

You can clean your data programmatically, or you can use the low-code Data Cleaner app and the Preprocess Text Data Live Editor task for interactive data preparation and automatic code generation.
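For the programmatic route, a typical minimal sketch handles missing values and puts features on a common scale before clustering; the table and its variable names are hypothetical.

    % Hypothetical raw table with gaps and mixed scales.
    T = table([1.2; NaN; 0.9; 1.1],[200; 180; NaN; 210], ...
              'VariableNames',{'Sensor1','Sensor2'});

    T = rmmissing(T);        % drop rows with missing values (or use fillmissing)
    X = normalize(T{:,:});   % z-score the features so no variable dominates distances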

Clustering

MATLAB supports all popular clustering algorithms, such as k-means, hierarchical, DBSCAN, and GMM. Using Fuzzy Logic Toolbox™, you can also perform fuzzy c-means clustering on your data set.
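As a small sketch of the fuzzy c-means case, the fcm function from Fuzzy Logic Toolbox returns both cluster centers and a membership matrix; the two-group data here is synthetic.

    rng(8)
    X = [randn(100,2); randn(100,2)+4];   % synthetic data with two groups

    [centers,U] = fcm(X,2);   % U(i,j) is the degree to which point j belongs
                              % to cluster i; each column of U sums to 1
    disp(U(:,1:5))            % fuzzy memberships of the first five points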

You can also perform k-means and hierarchical clustering interactively using the Cluster Data Live Editor task. Specify the clustering algorithm, the number of clusters, and the distance metric; the task computes the cluster indices and displays a visualization of the clustered data.

k-means clustering using the Cluster Data Live Editor task. (See MATLAB documentation.)

Dimensionality Reduction

MATLAB supports all popular dimensionality reduction techniques, including PCA, t-SNE, and factor analysis. You can use built-in functions to apply these techniques to your data. For PCA, you can also use the Reduce Dimensionality Live Editor task to perform the steps interactively.

Reducing dimensionality using the Reduce Dimensionality Live Editor task. (See MATLAB documentation.)

With MATLAB, you can also rank features for unsupervised learning using Laplacian scores.
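A sketch with fsulaplacian, which ranks features by their Laplacian scores without using any labels; the ionosphere data set ships with Statistics and Machine Learning Toolbox.

    load ionosphere                  % X: 351-by-34 numeric feature matrix
    [idx,scores] = fsulaplacian(X);

    bar(scores(idx))                 % feature importance, best-ranked first
    disp(idx(1:5))                   % indices of the five top-ranked features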

Result Evaluation

You can visualize clusters to evaluate clustering results using scatter, dendrogram, and silhouette plots. You can also use the evalclusters function to estimate the optimal number of clusters: for each candidate number of clusters, it computes an index value based on an evaluation criterion such as the gap statistic or silhouette value.
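A sketch of choosing the number of clusters with evalclusters, sweeping candidate counts from 2 to 6 with the silhouette criterion on the fisheriris measurements (the range of candidates is an assumption):

    load fisheriris
    rng(9)
    eva = evalclusters(meas,'kmeans','silhouette','KList',2:6);

    disp(eva.OptimalK)   % suggested number of clusters
    plot(eva)            % criterion value for each candidate number of clusters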

For dimensionality reduction, you can use scatter plots, scree plots, and biplots to inspect the results. Using the Reduce Dimensionality Live Editor task, you can determine the number of components required to explain a fixed percentage of the total variance, such as 95% or 99%.
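Programmatically, the same component count can be read from the explained output of the pca function, which gives the percentage of total variance captured by each principal component; the sketch below assumes the humanactivity data set mentioned earlier.

    load humanactivity                    % feat: observations-by-60 feature matrix
    [coeff,score,~,~,explained] = pca(feat);

    % Smallest number of components whose cumulative variance reaches 95%.
    nComp = find(cumsum(explained) >= 95,1)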

Scatter plot of human activity data (running, walking, dancing, sitting, standing) with 60 original dimensions reduced to two using t-distributed stochastic neighbor embedding (t-SNE). (See MATLAB code.)