
Image Category Classification

Classify images using bag-of-features, CNNs, vision transformers, and vision-language models

Image category classification tools in Computer Vision Toolbox™ enable you to classify images into predefined categories using either deep learning models, such as vision transformers and vision-language networks, or traditional bag-of-visual-words techniques. This capability is essential for applications such as scene recognition, content filtering, and automated tagging. You can start by creating labeled data sets using the Image Labeler and Video Labeler apps, which support interactive and AI-assisted annotation of scene-level labels for images and video frames, respectively. These labels serve as ground truth for training and evaluating image classification models.

For deep learning-based classification, the toolbox provides access to pretrained vision transformer (ViT) models through the visionTransformer function. These models use self-attention mechanisms to capture global image context, and can be fine-tuned for custom data sets. Supporting layers such as patchEmbeddingLayer enable you to design and extend ViT architectures. Additionally, the toolbox includes support for CLIP networks, which combine vision and language understanding to perform image classification. Use the clipNetwork object and the classify object function to perform image classification tasks that align visual content with textual descriptions, enabling multimodal applications.

For traditional approaches, the toolbox supports the bag-of-features (BoF) framework, which represents each image as a histogram of visual word occurrences. You can use the bagOfFeatures object to extract features and build a visual vocabulary, then train a classifier using the trainImageCategoryClassifier function, which returns an imageCategoryClassifier object, and make predictions with the predict object function of that object. This method is particularly useful for lightweight applications, or when interpretability is a priority. For more information, see Image Classification with Bag of Visual Words.
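
The bag-of-features workflow described above can be sketched as follows. This is a minimal illustration, not a reference implementation. It assumes that the sample image sets under toolboxdir('vision') are present in your installation, and it uses them in place of your own labeled folders. The 30/70 split ratio and the variable names are arbitrary choices made for this sketch.

```matlab
% Sketch of a bag-of-features classification workflow.
% Assumption: the sample image sets that ship with Computer Vision Toolbox
% are available; replace setDir with your own folder of labeled images.
setDir = fullfile(toolboxdir('vision'), 'visiondata', 'imageSets');

% One subfolder per category; the folder names become the labels.
imds = imageDatastore(setDir, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Use 30% of each category for training and hold out the rest for testing
% (ratio chosen arbitrarily for this sketch).
[trainingSet, testSet] = splitEachLabel(imds, 0.3, 'randomized');

% Extract features from the training images and build the visual vocabulary.
bag = bagOfFeatures(trainingSet);

% Train a classifier on the visual-word histograms of the training images.
categoryClassifier = trainImageCategoryClassifier(trainingSet, bag);

% Evaluate on the held-out images. The diagonal of the confusion matrix
% holds the per-category accuracy.
confMatrix = evaluate(categoryClassifier, testSet);
meanAccuracy = mean(diag(confMatrix));

% Predict the category of a single held-out image.
img = readimage(testSet, 1);
[labelIdx, scores] = predict(categoryClassifier, img);
predictedLabel = categoryClassifier.Labels{labelIdx};
```

With your own data, the accuracy depends on how many images each category has and how visually distinct the categories are, so treat the result from the small sample sets only as a check that the pipeline runs.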

Apps

Image Labeler - Label images for computer vision applications
Video Labeler - Label video for computer vision applications

Functions

visionTransformer - Pretrained vision transformer (ViT) neural network (Since R2023b)
patchEmbeddingLayer - Patch embedding layer (Since R2023b)
clipNetwork - Create pretrained CLIP deep learning neural network for vision-language tasks (Since R2026a)
classify - Classify image using CLIP network (Since R2026a)
bagOfFeatures - Bag of visual words object
trainImageCategoryClassifier - Train an image category classifier
imageCategoryClassifier - Predict image category
imageDatastore - Datastore for image data
splitlabels - Find indices to split labels according to specified proportions
countlabels - Count number of unique labels
folders2labels - Get list of labels from folder names

Topics

Create Ground Truth for Image Classification

  • Get Started with the Image Labeler
    Interactively label rectangular ROIs for object detection, pixels for semantic segmentation, polygons for instance segmentation, and scenes for image classification.
  • Get Started with the Video Labeler
    Interactively label rectangular ROIs for object detection, pixels for semantic segmentation, polygons for instance segmentation, and scenes for image classification in a video or image sequence.

Classify Images Using Deep Learning Models

Classify Images Using Bag of Features Approach

Featured Examples