
clipNetwork

Create pretrained CLIP deep learning neural network for vision-language tasks

Since R2026a

    Description

    Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.

    The clipNetwork object configures a pretrained Contrastive Language-Image Pre-Training (CLIP) network. Use the CLIP network to connect and compare images and text for tasks like image classification, retrieval, and zero-shot learning.

    Use the CLIP network for image retrieval and image classification.

    • Image retrieval — To search for images that best match a text query, first extract embeddings for each of the images in the data set using the extractImageEmbeddings object function. Then, extract the embeddings for the text search terms using the extractTextEmbeddings object function, compute the similarity between the text embeddings and the image embeddings, and select the images with the closest match.

    • Zero-shot classification without retraining — To classify images by comparing extracted image features to the features of candidate class names or descriptions, use the classify object function and select the classes with the closest match.
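    A zero-shot classification call can be sketched as follows. This is a minimal sketch: the image file and candidate class names are illustrative, and it assumes the classify object function accepts the network, an image, and the candidate class names; see the classify reference page for the exact signature.

```matlab
% Minimal zero-shot classification sketch (assumed classify signature;
% check the classify reference page for the exact interface).
net = clipNetwork("vit-b-16");
I = imread("peppers.png");                   % example image shipped with MATLAB
classNames = ["a photo of peppers" ...
              "a photo of a dog" ...
              "a photo of a car"];
[label,scores] = classify(net,I,classNames); % label is the best-matching class
```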

    To perform a forward pass on the image and text encoders prior to training the CLIP network, use the forward object function.

    Note

    This functionality requires Deep Learning Toolbox™.

    Creation

    Description

    net = clipNetwork(backbone) creates a pretrained CLIP network with a pretrained image encoder, backbone.


    net = clipNetwork(backbone,Name=Value) sets writable properties using one or more name-value arguments. For example, ModelName="TrainedCLIP" specifies the CLIP network name as "TrainedCLIP".

    Input Arguments


    backbone

    Pretrained image encoder backbone of the CLIP network, specified as one of these options:

    • "vit-b-16" — ViT-B/16 (Vision Transformer Base with 16-by-16 input patches).

    • "vit-l-14" — ViT-L/14 (Vision Transformer Large with 14-by-14 input patches).

    • "resnet50" — ResNet-50 deep convolutional neural network, scaled by 16 times.

    The clipNetwork function uses the backbone image encoder to process input images and convert them into high-dimensional feature vectors.

    Data Types: char | string

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: clipNetwork("resnet50",ModelName="trainedCLIP") specifies the CLIP network model name as "trainedCLIP".

    ModelName

    Name of the pretrained CLIP network model, specified as a string scalar or a character vector.

    ImageNormalizationStatistics

    Z-score normalization statistics, specified as a structure with these fields:

    Field                Description                                         Default Value
    Mean                 1-by-C vector of means per channel                  [122.7709 116.7460 104.0937]
    StandardDeviation    1-by-C vector of standard deviations per channel    [68.5005 66.6321 70.3232]

    C is the number of channels in each image.
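    Z-score normalization subtracts the per-channel mean and divides by the per-channel standard deviation; the network applies these statistics to input images internally. As a standalone illustration of the operation (not a step you need to perform yourself):

```matlab
% Apply z-score normalization with the default CLIP statistics.
stats.Mean = [122.7709 116.7460 104.0937];
stats.StandardDeviation = [68.5005 66.6321 70.3232];

I = single(imread("peppers.png"));                    % H-by-W-by-3, values in [0,255]
normalized = (I - reshape(stats.Mean,1,1,[])) ./ ...
             reshape(stats.StandardDeviation,1,1,[]); % roughly zero mean, unit variance
```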

    Properties


    ModelName

    This property is read-only after object creation. To set this property, use the ModelName argument during object creation.

    Name of the pretrained CLIP network model, represented as a string scalar or a character vector.

    ImageNormalizationStatistics

    This property is read-only after object creation. To set this property, use the ImageNormalizationStatistics argument during object creation.

    Z-score normalization statistics, represented as a structure with these fields:

    Field                Description                                         Default Value
    Mean                 1-by-C vector of means per channel                  [122.7709 116.7460 104.0937]
    StandardDeviation    1-by-C vector of standard deviations per channel    [68.5005 66.6321 70.3232]

    The number of channels C must match the number of channels in each image.

    TextEncoderNetwork

    This property is read-only.

    Network containing the CLIP text encoder parameters, represented as a dlnetwork (Deep Learning Toolbox) object.

    ImageEncoderNetwork

    This property is read-only.

    Network containing the CLIP image encoder parameters, represented as a dlnetwork object.

    Object Functions

    classify                  Classify image using CLIP network
    extractImageEmbeddings    Extract feature embeddings from image using CLIP network image encoder
    extractTextEmbeddings     Extract text embeddings from search text using CLIP network text encoder
    forward                   Run forward pass on CLIP network

    Examples


    Create a pretrained CLIP network.

    clip = clipNetwork("resnet50")
    clip = 
      clipNetwork with properties:
    
                           ModelName: "resnet50"
                 ImageEncoderNetwork: [1×1 dlnetwork]
                  TextEncoderNetwork: [1×1 dlnetwork]
        ImageNormalizationStatistics: [1×1 struct]
    
    

    Create a pretrained CLIP network with the ViT-B/16 backbone.

    clip = clipNetwork("vit-b-16");

    Create a datastore of images, imds, and display a montage of the images.

    pathToImages = fullfile(toolboxdir("vision"),"visiondata","imageSets");
    imds = imageDatastore(pathToImages,IncludeSubfolders=true);
    montage(imds)


    Extract Image Embeddings

    Extract the feature embeddings for each image in the datastore.

    imageEmbeddings = extractImageEmbeddings(clip,imds);

    Extract Text Embeddings

    Define a text search term, and extract the text embeddings using the CLIP network text encoder. You can modify the search term to empirically test image retrieval.

    search = "A photo of a children's book.";
    textEmbeddings = extractTextEmbeddings(clip,search);

    Retrieve Images Related to Search Text

    Compute the cosine similarity between the text embeddings and all image embeddings.

    % Cosine similarity between each image embedding and the text embedding
    cosineSim = @(x,y) (x'*y)'./(vecnorm(x)*norm(y));
    simScores = cosineSim(imageEmbeddings,textEmbeddings)
    simScores = 1×12 single row vector
    
        0.3019    0.3202    0.2359    0.2261    0.1954    0.2838    0.1628    0.1735    0.2029    0.1854    0.1826    0.1993
    
    

    Identify the three highest similarity scores.

    [~,topIdx] = maxk(simScores,3);

    Display the images with the highest similarity scores related to the search text.

    imageFiles = imds.Files(topIdx);
    montage(imageFiles)


    References

    [1] Radford, Alec, et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv, 2021. DOI.org (Datacite), https://doi.org/10.48550/ARXIV.2103.00020.

    Version History

    Introduced in R2026a