
clipNetwork

Create pretrained CLIP deep learning neural network for vision-language tasks

Since R2026a

    Description

    Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.

    The clipNetwork object configures a pretrained Contrastive Language-Image Pre-Training (CLIP) network. Use the CLIP network to connect and compare images and text for tasks like image classification, retrieval, and zero-shot learning.

    Use the CLIP network for image retrieval and image classification.

    • Image retrieval — To search for images that best match a text query, first extract embeddings for each of the images in the data set using the extractImageEmbeddings object function. Then, extract the embeddings for the text search terms using the extractTextEmbeddings object function, compute the similarity between the text embeddings and the image embeddings, and select the images with the closest match.

    • Zero-shot classification without retraining — To classify images by comparing extracted image features to the features of candidate class names or descriptions, use the classify object function and select the classes with the closest match.
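    A zero-shot classification call can be sketched as follows. This is a minimal sketch: the image file and candidate class names are illustrative, and it assumes the classify object function accepts the network, an image, and the candidate class names; see the classify reference page for the exact signature.

```matlab
% Minimal zero-shot classification sketch (assumed classify signature;
% check the classify reference page for the exact interface).
net = clipNetwork("vit-b-16");
I = imread("peppers.png");                   % example image shipped with MATLAB
classNames = ["a photo of peppers" ...
              "a photo of a dog" ...
              "a photo of a car"];
[label,scores] = classify(net,I,classNames); % label is the best-matching class
```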

    To perform a forward pass on the image and text encoders prior to training the CLIP network, use the forward object function.

    Note

    This functionality requires Deep Learning Toolbox™.

    Creation

    Description

    net = clipNetwork(backbone) creates a pretrained CLIP network with a pretrained image encoder, backbone.


    net = clipNetwork(backbone,Name=Value) sets writable properties using one or more name-value arguments. For example, ModelName="TrainedCLIP" specifies the CLIP network name as "TrainedCLIP".

    Input Arguments


    backbone

    Pretrained image encoder backbone of the CLIP network, specified as one of these options:

    • "vit-b-16" — ViT-B/16 (Vision Transformer Base with 16-by-16 input patches).

    • "vit-l-14" — ViT-L/14 (Vision Transformer Large with 14-by-14 input patches).

    • "resnet50" — ResNet-50 deep convolutional neural network, scaled by 16 times.

    The clipNetwork function uses the backbone image encoder to process input images and convert them into high-dimensional feature vectors.

    Data Types: char | string

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: clipNetwork("resnet50",ModelName="trainedCLIP") specifies the CLIP network model name as "trainedCLIP".

    ModelName

    Name of the pretrained CLIP network model, specified as a string scalar or a character vector.

    ImageNormalizationStatistics

    Z-score normalization statistics, specified as a structure with these fields:

    Field                Description                                         Default Value
    Mean                 1-by-C vector of means per channel                  [122.7709 116.7460 104.0937]
    StandardDeviation    1-by-C vector of standard deviations per channel    [68.5005 66.6321 70.3232]

    C is the number of channels in each image.
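    Z-score normalization subtracts the per-channel mean and divides by the per-channel standard deviation; the network applies these statistics to input images internally. As a standalone illustration of the operation (not a step you need to perform yourself):

```matlab
% Apply z-score normalization with the default CLIP statistics.
stats.Mean = [122.7709 116.7460 104.0937];
stats.StandardDeviation = [68.5005 66.6321 70.3232];

I = single(imread("peppers.png"));                    % H-by-W-by-3, values in [0,255]
normalized = (I - reshape(stats.Mean,1,1,[])) ./ ...
             reshape(stats.StandardDeviation,1,1,[]); % roughly zero mean, unit variance
```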

    Properties


    ModelName

    This property is read-only after object creation. To set this property, use the ModelName argument during object creation.

    Name of the pretrained CLIP network model, represented as a string scalar or a character vector.

    ImageNormalizationStatistics

    This property is read-only after object creation. To set this property, use the ImageNormalizationStatistics argument during object creation.

    Z-score normalization statistics, represented as a structure with these fields:

    Field                Description                                         Default Value
    Mean                 1-by-C vector of means per channel                  [122.7709 116.7460 104.0937]
    StandardDeviation    1-by-C vector of standard deviations per channel    [68.5005 66.6321 70.3232]

    The number of channels C must match the number of channels in each image.

    TextEncoderNetwork

    This property is read-only.

    Network containing the CLIP text encoder parameters, represented as a dlnetwork (Deep Learning Toolbox) object.

    ImageEncoderNetwork

    This property is read-only.

    Network containing the CLIP image encoder parameters, represented as a dlnetwork object.

    Object Functions

    classify                  Classify image using CLIP network
    extractImageEmbeddings    Extract feature embeddings from image using CLIP network image encoder
    extractTextEmbeddings     Extract text embeddings from search text using CLIP network text encoder
    forward                   Run forward pass on CLIP network

    Examples


    Create a pretrained CLIP network.

    clip = clipNetwork("resnet50")
    clip = 
      clipNetwork with properties:
    
                           ModelName: "resnet50"
                 ImageEncoderNetwork: [1×1 dlnetwork]
                  TextEncoderNetwork: [1×1 dlnetwork]
        ImageNormalizationStatistics: [1×1 struct]
    
    

    Create a pretrained CLIP network with the ViT-B/16 backbone.

    clip = clipNetwork("vit-b-16");

    Create a datastore of images, imds, and display a montage of the images.

    pathToImages = fullfile(toolboxdir("vision"),"visiondata","imageSets");
    imds = imageDatastore(pathToImages,IncludeSubfolders=true);
    montage(imds)


    Extract Image Embeddings

    Extract the feature embeddings for each image in the datastore.

    imageEmbeddings = extractImageEmbeddings(clip,imds);

    Extract Text Embeddings

    Define a text search term, and extract the text embeddings using the CLIP network text encoder. You can modify the search term to empirically test image retrieval.

    search = "A photo of a children's book.";
    textEmbeddings = extractTextEmbeddings(clip,search);

    Retrieve Images Related to Search Text

    Compute the cosine similarity between the text embeddings and all image embeddings.

    % Cosine similarity between each image embedding and the text embedding
    cosineSim = @(x,y) (x'*y)'./(vecnorm(x)*norm(y));
    simScores = cosineSim(imageEmbeddings,textEmbeddings)
    simScores = 1×12 single row vector
    
        0.3019    0.3202    0.2359    0.2261    0.1954    0.2838    0.1628    0.1735    0.2029    0.1854    0.1826    0.1993
    
    

    Identify the three highest similarity scores.

    [~,topIdx] = maxk(simScores,3);

    Display the images with the highest similarity scores related to the search text.

    imageFiles = imds.Files(topIdx);
    montage(imageFiles)


    References

    [1] Radford, Alec, et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv, 2021. DOI.org (Datacite), https://doi.org/10.48550/ARXIV.2103.00020.

    Version History

    Introduced in R2026a