Image Category Classification

Classify images using bag-of-features, CNNs, vision transformers, and vision-language models

Image category classification tools in Computer Vision Toolbox™ enable you to classify images into predefined categories using either deep learning models, such as vision transformers and CLIP networks, or traditional bag-of-visual-words techniques. This capability is essential for applications such as scene recognition, content filtering, and automated tagging. You can start by creating labeled data sets using the Image Labeler and Video Labeler apps, which support interactive and AI-assisted annotation of scene-level labels for images and video frames, respectively. These labels serve as ground truth for training and evaluating image classification models.

For deep learning-based classification, the toolbox provides access to pretrained vision transformer (ViT) models through the visionTransformer function. These models use self-attention mechanisms to capture global image context, and can be fine-tuned for custom data sets. Supporting layers such as patchEmbeddingLayer enable you to design and extend ViT architectures. Additionally, the toolbox includes support for CLIP networks, which combine vision and language understanding to perform image classification. Use the clipNetwork object and the classify object function to perform image classification tasks that align visual content with textual descriptions, enabling multimodal applications.
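As a minimal sketch of preparing a pretrained ViT for fine-tuning, the following code loads the default network and swaps its classification head for one sized to a custom data set. It assumes that Deep Learning Toolbox and the vision transformer support package are installed, that the final fully connected layer is named "head" (inspect net.Layers to confirm), and that numClasses is a hypothetical value standing in for the number of categories in your data.

```matlab
% Load the default pretrained ViT network. This call assumes the required
% support package for vision transformer models is already installed.
net = visionTransformer;

% Query the image size that the network expects at its input layer.
inputSize = net.Layers(1).InputSize;

% Hypothetical number of categories in the custom data set.
numClasses = 5;

% Replace the classification head with a new fully connected layer.
% The layer name "head" is an assumption; check net.Layers for the actual name.
newHead = fullyConnectedLayer(numClasses,Name="head");
net = replaceLayer(net,"head",newHead);
```

Before training, resize the training images to inputSize(1:2). The modified network can then be trained like any other dlnetwork object, for example with the trainnet function.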

For traditional approaches, the toolbox supports the bag-of-features (BoF) framework, which represents images as histograms of visual word occurrences. You can use the bagOfFeatures object to extract features and build a visual vocabulary, train a classifier using the trainImageCategoryClassifier function, and then make predictions with the predict object function of the resulting imageCategoryClassifier object. This method is particularly useful for lightweight applications or when interpretability is a priority. For more information, see Image Classification with Bag of Visual Words.
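The bag-of-features workflow described above can be sketched as follows. This example assumes the small sample image sets that ship with Computer Vision Toolbox are available under the visiondata folder. The 30% training split is an arbitrary choice for illustration, and the variable names are placeholders.

```matlab
% Load the sample image sets that ship with Computer Vision Toolbox.
% Each subfolder name becomes a category label.
setDir = fullfile(toolboxdir("vision"),"visiondata","imageSets");
imds = imageDatastore(setDir,IncludeSubfolders=true,LabelSource="foldernames");

% Use 30% of each category for training and hold out the rest for testing.
[trainingSet,testSet] = splitEachLabel(imds,0.3,"randomized");

% Build the visual vocabulary from features extracted from the training images.
bag = bagOfFeatures(trainingSet);

% Train a classifier on the visual-word histograms of the training images.
categoryClassifier = trainImageCategoryClassifier(trainingSet,bag);

% Evaluate on the held-out images. The diagonal of the normalized
% confusion matrix holds the per-category accuracy.
confMatrix = evaluate(categoryClassifier,testSet);
meanAccuracy = mean(diag(confMatrix));

% Predict the category of a single held-out image.
img = readimage(testSet,1);
[labelIdx,scores] = predict(categoryClassifier,img);
predictedLabel = categoryClassifier.Labels(labelIdx);
```

For your own data, point imageDatastore at a folder whose subfolders are named after the categories.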

Apps

Image Labeler	Label images for computer vision applications
Video Labeler	Label video for computer vision applications

Functions


Vision Transformer (ViT)

`visionTransformer`	Pretrained vision transformer (ViT) neural network (Since R2023b)
`patchEmbeddingLayer`	Patch embedding layer (Since R2023b)

CLIP Network

`clipNetwork`	Create pretrained CLIP deep learning neural network for vision-language tasks (Since R2026a)
`classify`	Classify image using CLIP network (Since R2026a)

Bag of Features

`bagOfFeatures`	Bag of visual words object
`trainImageCategoryClassifier`	Train an image category classifier
`imageCategoryClassifier`	Predict image category
`imageDatastore`	Datastore for image data
`splitlabels`	Find indices to split labels according to specified proportions
`countlabels`	Count number of unique labels
`folders2labels`	Get list of labels from folder names

Topics

Create Ground Truth for Image Classification

Get Started with the Image Labeler
Interactively label rectangular ROIs for object detection, pixels for semantic segmentation, polygons for instance segmentation, and scenes for image classification.
Get Started with the Video Labeler
Interactively label rectangular ROIs for object detection, pixels for semantic segmentation, polygons for instance segmentation, and scenes for image classification in a video or image sequence.

Classify Images Using Deep Learning Models

Train Vision Transformer Network for Image Classification
This example shows how to fine-tune a pretrained vision transformer (ViT) neural network to perform classification on a new collection of images.
Create Simple Image Classification Network (Deep Learning Toolbox)
This example shows how to create and train a simple convolutional neural network for deep learning classification.
Get Started with Image Classification (Deep Learning Toolbox)
This example shows how to create a simple convolutional neural network for deep learning classification using the Deep Network Designer app.

Classify Images Using the Bag-of-Features Approach

Create a Custom Feature Extractor
You can use the bag-of-features (BoF) framework with many different types of image features.
Image Classification with Bag of Visual Words
Use the Computer Vision Toolbox functions for image category classification by creating a bag of visual words.

Featured Examples

Train Vision Transformer Network for Image Classification

Fine-tune a pretrained vision transformer (ViT) neural network to perform classification on a new collection of images.


Image Category Classification Using Bag of Features

Use a bag-of-features approach for image category classification. This technique is also often referred to as bag of words. Visual image categorization is the process of assigning a category label to an image under test. Categories can contain images of almost anything, such as dogs, cats, trains, or boats.


Image Category Classification Using Deep Learning

Use a pretrained Convolutional Neural Network (CNN) as a feature extractor for training an image category classifier.
