clipNetwork
Create pretrained CLIP deep learning neural network for vision-language tasks
Since R2026a
Description
Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.
The clipNetwork object configures a pretrained Contrastive
Language-Image Pre-Training (CLIP) network. Use the CLIP network to connect and compare
images and text for tasks like image classification, retrieval, and zero-shot
learning.
Use the CLIP network for image retrieval and image classification.
Image retrieval — To search for images that best match a text query, first extract embeddings for each of the images in the data set using the extractImageEmbeddings object function. Then, extract the embeddings for the text search terms using the extractTextEmbeddings object function, compute the similarity between the text embeddings and the image embeddings, and select the images with the closest match.
Zero-shot classification without retraining — To classify images by comparing extracted image features to the features of candidate class names or descriptions, use the classify object function and select the classes with the closest match.
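The retrieval workflow above can be sketched as follows. This is a minimal illustration, not the reference page's own example: the backbone name "ViT-B-32", the image folder, the query string, and the embedding orientation (one row per input) are assumptions.

```matlab
% Create a pretrained CLIP network. The backbone name is an assumed value.
net = clipNetwork("ViT-B-32");

% Embed every image in a hypothetical data set folder.
imds = imageDatastore("myImages");
imageEmbeddings = extractImageEmbeddings(net,imds);

% Embed the text search term.
textEmbeddings = extractTextEmbeddings(net,"a photo of a dog");

% Compute cosine similarity between each image embedding and the text
% embedding (normalizing explicitly in case the embeddings are not
% already unit length), then select the closest match.
similarity = (imageEmbeddings ./ vecnorm(imageEmbeddings,2,2)) * ...
    (textEmbeddings ./ vecnorm(textEmbeddings,2,2))';
[~,bestMatch] = max(similarity);
```

Embedding the image data set once and reusing the stored embeddings for every new text query is what makes this approach practical for large collections.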
To perform a forward pass on the image and text encoders prior to training
the CLIP network, use the forward
object function.
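A training-time forward pass might look like the sketch below. The input preprocessing and the exact forward signature shown here are assumptions for illustration — consult the forward reference page for the actual interface.

```matlab
% Create the network; the backbone name is an assumed value.
net = clipNetwork("ViT-B-32");

% Assumed preprocessing: one 224-by-224 RGB image as a formatted
% dlarray batch, paired with a text caption.
X = dlarray(single(rand(224,224,3,1)),"SSCB");
[imageFeatures,textFeatures] = forward(net,X,"a photo of a dog");

% The paired image and text features can then drive a contrastive
% loss when fine-tuning the encoders.
```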
Note
This functionality requires Deep Learning Toolbox™.
Creation
Description
net = clipNetwork(backbone) creates a pretrained CLIP network with a pretrained image encoder, backbone.
net = clipNetwork(backbone,Name=Value) sets writable properties using one or more name-value arguments. For example, ModelName="TrainedCLIP" specifies the CLIP network name as "TrainedCLIP".
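For example, both creation syntaxes might be used as follows. The backbone name "ViT-B-32" is an assumed value — see the backbone input argument for the supported names.

```matlab
% Create a pretrained CLIP network with an assumed backbone name.
net = clipNetwork("ViT-B-32");

% Create the network and set a writable property with a
% name-value argument.
net = clipNetwork("ViT-B-32",ModelName="TrainedCLIP");
```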
Input Arguments
Name-Value Arguments
Properties
Object Functions
classify | Classify image using CLIP network
extractImageEmbeddings | Extract feature embeddings from image using CLIP network image encoder
extractTextEmbeddings | Extract text embeddings from search text using CLIP network text encoder
forward | Run forward pass on CLIP network
Examples
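A zero-shot classification sketch, assembled from the object functions listed above. The backbone name, the class names, and the exact classify signature and outputs are assumptions for illustration; peppers.png is an image that ships with MATLAB.

```matlab
% Create a pretrained CLIP network; the backbone name is assumed.
net = clipNetwork("ViT-B-32");

% Read a test image and define candidate class names. No retraining
% is needed — the class names are supplied at inference time.
I = imread("peppers.png");
classNames = ["bell pepper" "dog" "car"];

% Classify by comparing the image features to the features of the
% candidate class names (assumed signature and outputs).
[label,score] = classify(net,I,classNames);
```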
References
[1] Radford, Alec, et al. "Learning Transferable Visual Models from Natural Language Supervision." arXiv, 2021. https://doi.org/10.48550/ARXIV.2103.00020.
Version History
Introduced in R2026a

