The Embeddings for Multimodal (`multimodalembedding`) model generates vectors of 128, 256, 512, or 1408 dimensions based on the input that you provide. This input can include any combination of text, image, or video. The embedding vectors can then be used for subsequent tasks like image classification or content moderation.
The text, image, and video embedding vectors are in the same semantic space with the same dimensionality. Therefore, these vectors can be used interchangeably for use cases like searching images by text, or searching video by image.
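Because all modalities share one embedding space, cross-modal retrieval reduces to a nearest-neighbor search over vectors. The following is a minimal sketch in plain Python; the toy 4-dimensional vectors stand in for real model output, which has 128 to 1408 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, candidates):
    """Rank candidate (id, embedding) pairs by similarity to the query.

    The query can come from text and the candidates from images or video
    segments: all modalities live in the same space, so the comparison is
    valid across types.
    """
    scored = [(cid, cosine_similarity(query_embedding, emb))
              for cid, emb in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical 4-dimensional stand-ins for real embeddings.
text_query = [0.9, 0.1, 0.0, 0.1]
images = [
    ("cat.png", [0.8, 0.2, 0.1, 0.0]),
    ("car.png", [0.0, 0.1, 0.9, 0.3]),
]
ranked = search(text_query, images)
print(ranked[0][0])  # the image most similar to the text query
```

In practice you would precompute and index the candidate embeddings (for example, in a vector database) rather than scanning them linearly.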
Use cases
Some common use cases for multimodal embeddings are:
- Image or video classification: takes an image or video as input and predicts one or more classes (labels).
- Image search: search for relevant or similar images.
- Video content search:
  - Semantic search: take text as input, and return a set of ranked frames matching the query.
  - Similarity search:
    - Take a video as input, and return a set of videos matching the query.
    - Take an image as input, and return a set of videos matching the query.
- Recommendations: generate product or advertisement recommendations based on images or videos (similarity search).
To explore this model in the console, see the Embeddings for Multimodal model card in the Model Garden.
HTTP request
POST https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/us-central1/publishers/google/models/multimodalembedding:predict
Request body
{
"instances": [
{
"text": string,
"image": {
// Union field can be only one of the following:
"bytesBase64Encoded": string,
"gcsUri": string,
// End of list of possible types for union field.
"mimeType": string
},
"video": {
// Union field can be only one of the following:
"bytesBase64Encoded": string,
"gcsUri": string,
// End of list of possible types for union field.
"videoSegmentConfig": {
"startOffsetSec": integer,
"endOffsetSec": integer,
"intervalSec": integer
}
},
"parameters": {
"dimension": integer
}
}
]
}
Use the following parameters for the multimodal embeddings model (`multimodalembedding`). For more information, see Get multimodal embeddings.
Parameter | Description | Acceptable values |
---|---|---|
`instances` | An array that contains the object with the data (text, image, and video) to get embeddings for. | array (1 object allowed) |
`text` | The input text that you want to create an embedding for. | String (32 tokens max) |
`image.bytesBase64Encoded` | The image to get embeddings for. If you specify `image.bytesBase64Encoded`, you can't set `image.gcsUri`. | Base64-encoded image string (BMP, GIF, JPG, or PNG file, 20 MB max) |
`image.gcsUri` | The Cloud Storage URI of the image to get embeddings for. If you specify `image.gcsUri`, you can't set `image.bytesBase64Encoded`. | String URI of the image file in Cloud Storage (BMP, GIF, JPG, or PNG file, 20 MB max) |
`image.mimeType` | Optional. The MIME type of the image you specify. | string (`image/bmp`, `image/gif`, `image/jpeg`, or `image/png`) |
`video.bytesBase64Encoded` | The video to get embeddings for. If you specify `video.bytesBase64Encoded`, you can't set `video.gcsUri`. | Base64-encoded video string (AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, or WMV file) |
`video.gcsUri` | The Cloud Storage URI of the video to get embeddings for. If you specify `video.gcsUri`, you can't set `video.bytesBase64Encoded`. | String URI of the video file in Cloud Storage (AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, or WMV file) |
`videoSegmentConfig.startOffsetSec` | Optional. The time (in seconds) at which the model starts generating embeddings. Default: 0 | integer |
`videoSegmentConfig.endOffsetSec` | Optional. The time (in seconds) at which the model stops generating embeddings. Default: 120 | integer |
`videoSegmentConfig.intervalSec` | Optional. The length (in seconds) of the video segments that embeddings are generated for. This value corresponds to the video embedding mode (Essential, Standard, or Plus), which affects feature pricing.<br><br>Essential mode (`intervalSec` >= 15): the fewest video segments that embeddings are generated for; the lowest-cost option.<br>Standard mode (8 <= `intervalSec` < 15): more segments than Essential mode, but fewer than Plus mode; an intermediate-cost option.<br>Plus mode (4 <= `intervalSec` < 8): the most video segments that embeddings are generated for; the highest-cost option.<br><br>Default: 16 (Essential mode) | integer (minimum value: 4) |
`parameters.dimension` | Optional. The vector dimension to generate embeddings for (text or image only). If not set, the default value of 1408 is used. | integer (`128`, `256`, `512`, or `1408` [default]) |
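As a concrete illustration of the schema above, the following sketch assembles a request body in Python, base64-encoding raw image bytes for the `bytesBase64Encoded` field and enforcing the union-field constraint (only one of `bytesBase64Encoded` or `gcsUri`). The helper function name and sample bytes are hypothetical, not part of the API:

```python
import base64
import json

def build_request_body(text=None, image_bytes=None, image_gcs_uri=None,
                       dimension=None):
    """Assemble a predict request body matching the schema above.

    Exactly one of image_bytes / image_gcs_uri may be set, mirroring the
    API's union field.
    """
    if image_bytes is not None and image_gcs_uri is not None:
        raise ValueError("set only one of image_bytes or image_gcs_uri")

    instance = {}
    if text is not None:
        instance["text"] = text
    if image_bytes is not None:
        # The API expects raw bytes encoded as a base64 string.
        instance["image"] = {
            "bytesBase64Encoded": base64.b64encode(image_bytes).decode("utf-8")
        }
    elif image_gcs_uri is not None:
        instance["image"] = {"gcsUri": image_gcs_uri}
    if dimension is not None:
        instance["parameters"] = {"dimension": dimension}
    return {"instances": [instance]}

body = build_request_body(text="a cat",
                          image_gcs_uri="gs://my-bucket/embeddings/supermarket-img.png",
                          dimension=512)
print(json.dumps(body, indent=2))
```

The same pattern extends to the `video` field and its `videoSegmentConfig` object.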
Sample request
REST
The following example uses image, text, and video data. You can use any combination of these data types in your request body. Additionally, this sample uses a video located in Cloud Storage. You can also use the `video.bytesBase64Encoded` field to provide a base64-encoded string representation of the video.
Before using any of the request data, make the following replacements:
- LOCATION: Your project's region. For example, `us-central1`, `europe-west2`, or `asia-northeast3`. For a list of available regions, see Generative AI on Vertex AI locations.
- PROJECT_ID: Your Google Cloud project ID.
- TEXT: The target text to get embeddings for. For example, `a cat`.
- IMAGE_URI: The Cloud Storage URI of the target image to get embeddings for. For example, `gs://my-bucket/embeddings/supermarket-img.png`.

  You can also provide the image as a base64-encoded byte string:

  `[...] "image": { "bytesBase64Encoded": "B64_ENCODED_IMAGE" } [...]`
- VIDEO_URI: The Cloud Storage URI of the target video to get embeddings for. For example, `gs://my-bucket/embeddings/supermarket-video.mp4`.

  You can also provide the video as a base64-encoded byte string:

  `[...] "video": { "bytesBase64Encoded": "B64_ENCODED_VIDEO" } [...]`
- videoSegmentConfig (START_SECOND, END_SECOND, INTERVAL_SECONDS): Optional. The specific video segments (in seconds) that the embeddings are generated for. For example:

  `[...] "videoSegmentConfig": { "startOffsetSec": 10, "endOffsetSec": 60, "intervalSec": 10 } [...]`

  Using this config specifies video data from 10 seconds to 60 seconds and generates embeddings for the following 10-second video intervals: [10, 20), [20, 30), [30, 40), [40, 50), [50, 60). This video interval (`"intervalSec": 10`) falls in the Standard video embedding mode, and the user is charged at the Standard mode pricing rate.

  If you omit `videoSegmentConfig`, the service uses the following default values: `"videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }`. This video interval (`"intervalSec": 16`) falls in the Essential video embedding mode, and the user is charged at the Essential mode pricing rate.
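The mapping from `intervalSec` to embedding mode described above can be sketched as a small helper; the function name is hypothetical, and the thresholds are taken from the parameter table:

```python
def video_embedding_mode(interval_sec: int) -> str:
    """Map videoSegmentConfig.intervalSec to its video embedding mode,
    which determines the pricing tier."""
    if interval_sec < 4:
        raise ValueError("intervalSec must be at least 4")
    if interval_sec < 8:
        return "Plus"       # 4 <= intervalSec < 8: most segments, highest cost
    if interval_sec < 15:
        return "Standard"   # 8 <= intervalSec < 15: intermediate cost
    return "Essential"      # intervalSec >= 15: fewest segments, lowest cost

print(video_embedding_mode(10))  # the example above: Standard
print(video_embedding_mode(16))  # the default: Essential
```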
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict
Request JSON body:
{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "gcsUri": "IMAGE_URI"
      },
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}
To send your request, choose one of these options:
curl
Save the request body in a file named `request.json`, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"
PowerShell
Save the request body in a file named `request.json`, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content
You should receive a JSON response similar to the following:

{
  "predictions": [
    {
      "textEmbedding": [
        0.0105433334,
        -0.00302835181,
        0.00656806398,
        0.00603460241,
        [...]
        0.00445805816,
        0.0139605571,
        -0.00170318608,
        -0.00490092579
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 7,
          "embedding": [
            -0.00673126569,
            0.0248149596,
            0.0128901172,
            0.0107588246,
            [...]
            -0.00180952181,
            -0.0054573305,
            0.0117037306,
            0.0169312079
          ]
        }
      ],
      "imageEmbedding": [
        -0.00728622358,
        0.031021487,
        -0.00206603738,
        0.0273937676,
        [...]
        -0.00204976718,
        0.00321615417,
        0.0121978866,
        0.0193375275
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
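The following sketch uses the Vertex AI SDK for Python's `MultiModalEmbeddingModel` class, assuming its `get_embeddings` method accepts `image`, `contextual_text`, and `dimension` arguments as in current SDK releases; the project ID and Cloud Storage path are placeholders, and running it requires Application Default Credentials:

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project and region.
vertexai.init(project="PROJECT_ID", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file("gs://my-bucket/embeddings/supermarket-img.png")

embeddings = model.get_embeddings(
    image=image,
    contextual_text="a cat",
    dimension=1408,  # 128, 256, 512, or 1408
)
print(len(embeddings.image_embedding))
print(len(embeddings.text_embedding))
```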
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Response body
{
"predictions": [
{
"textEmbedding": [
float,
// array of 128, 256, 512, or 1408 float values
float
],
"imageEmbedding": [
float,
// array of 128, 256, 512, or 1408 float values
float
],
"videoEmbeddings": [
{
"startOffsetSec": integer,
"endOffsetSec": integer,
"embedding": [
float,
// array of 1408 float values
float
]
}
]
}
],
"deployedModelId": string
}
Response element | Description |
---|---|
`imageEmbedding` | 128, 256, 512, or 1408 dimension list of floats. |
`textEmbedding` | 128, 256, 512, or 1408 dimension list of floats. |
`videoEmbeddings` | 1408 dimension list of floats, with the start and end time (in seconds) of the video segment that the embeddings are generated for. |
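Given the response schema above, client code typically walks the `predictions` array and pulls out each embedding along with the video segment offsets. A minimal sketch (the helper name and the tiny stand-in vectors are hypothetical; real embeddings have 128 to 1408 floats):

```python
def summarize_prediction(prediction: dict) -> dict:
    """Summarize one entry of the 'predictions' array from the response body."""
    summary = {}
    if "textEmbedding" in prediction:
        summary["text_dims"] = len(prediction["textEmbedding"])
    if "imageEmbedding" in prediction:
        summary["image_dims"] = len(prediction["imageEmbedding"])
    for segment in prediction.get("videoEmbeddings", []):
        summary.setdefault("video_segments", []).append(
            (segment["startOffsetSec"], segment["endOffsetSec"],
             len(segment["embedding"])))
    return summary

# A tiny stand-in response.
response = {
    "predictions": [{
        "textEmbedding": [0.01, -0.02],
        "videoEmbeddings": [
            {"startOffsetSec": 0, "endOffsetSec": 7, "embedding": [0.1, 0.2]},
        ],
    }],
    "deployedModelId": "DEPLOYED_MODEL_ID",
}
print(summarize_prediction(response["predictions"][0]))
# {'text_dims': 2, 'video_segments': [(0, 7, 2)]}
```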