Multimodal embeddings

The Embeddings for Multimodal (multimodalembedding) model generates dimension vectors (128, 256, 512, or 1408 dimensions) based on the input that you provide. This input can include any combination of text, image, or video. The embedding vectors can then be used for other subsequent tasks like image classification or content moderation.

The text, image, and video embedding vectors are in the same semantic space with the same dimensionality. Therefore, these vectors can be used interchangeably for use cases like searching images by text, or searching video by image.

Use cases

Some common use cases for multimodal embeddings are:

  • Image or video classification: Takes an image or video as input and predicts one or more classes (labels).
  • Image search: Search for relevant or similar images.
  • Video content search
    • Using semantic search: Take a text as an input, and return a set of ranked frames matching the query.
    • Using similarity search:
      • Take a video as an input, and return a set of videos matching the query.
      • Take an image as an input, and return a set of videos matching the query.
  • Recommendations: Generate product or advertisement recommendations based on images or videos (similarity search).

To explore this model in the console, see the Embeddings for Multimodal model card in the Model Garden.

Go to the Model Garden

HTTP request

POST https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/us-central1/publishers/google/models/multimodalembedding:predict

Request body

{
  "instances": [
    {
      "text": string,
      "image": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "mimeType": string
      },
      "video": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "videoSegmentConfig": {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "intervalSec": integer
        }
      },
      "parameters": {
        "dimension": integer
      }
    }
  ]
}

Use the following parameters for the multimodal generation model multimodal embeddings. For more information, see Get multimodal embeddings.

Parameter Description Acceptable values
instances An array that contains the object with data (text, image, and video) to get information about. array (1 object allowed)
text The input text you want to create an embedding for. String (32 tokens max)
image.bytesBase64Encoded The image to get embeddings for. If you specify image.bytesBase64Encoded you can't set image.gcsUri. Base64-encoded image string (BMP, GIF, JPG, or PNG file, 20 MB max)
image.gcsUri The Cloud Storage URI of the image to get embeddings for. If you specify image.gcsUri you can't set image.bytesBase64Encoded. string URI of the image file in Cloud Storage (BMP, GIF, JPG, or PNG file, 20 MB max)
image.mimeType Optional. The MIME type of the image you specify. string (image/bmp, image/gif, image/jpeg, or image/png)
video.bytesBase64Encoded The video to get embeddings for. If you specify video.bytesBase64Encoded you can't set video.gcsUri. Base64-encoded video string (AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, or WMV file)
video.gcsUri The Cloud Storage URI of the video to get embeddings for. If you specify video.gcsUri you can't set video.bytesBase64Encoded. string URI of the video file in Cloud Storage (AVI, FLV, MKV, MOV, MP4, MPEG, MPG, WEBM, or WMV file)
videoSegmentConfig.startOffsetSec Optional. The time (in seconds) where the model begins embedding detection. Default: 0 integer
videoSegmentConfig.endOffsetSec Optional. The time (in seconds) where the model ends embedding detection. Default: 120 integer
videoSegmentConfig.intervalSec Optional. The time (in seconds) of video data segments that embeddings are generated for. This value corresponds to the video embedding mode (Essential, Standard, or Plus), which affects feature pricing.

Essential mode (intervalSec >= 15): Fewest segments of video that embeddings are generated for. The lowest cost option.
Standard tier (8 <= intervalSec < 15): More segments of video that embeddings are generated for than Essential mode, but fewer than Plus mode. Intermediate cost option.
Plus mode (4 <= intervalSec < 8): Most segments of video that embeddings are generated for. The highest cost option.

Default: 16 (Essential mode)
integer (minimum value: 4)
parameters.dimension Optional. The vector dimension to generate embeddings for (text or image only). If not set, the default value of 1408 is used. integer (128, 256, 512 or 1408 [default])

Sample request

REST

The following example uses image, text, and video data. You can use any combination of these data types in your request body.

Additionally, this sample uses a video located in Cloud Storage. You can also use the video.bytesBase64Encoded field to provide a base64-encoded string representation of the video.

Before using any of the request data, make the following replacements:

  • LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
  • PROJECT_ID: Your Google Cloud project ID.
  • TEXT: The target text to get embeddings for. For example, a cat.
  • IMAGE_URI: The Cloud Storage URI of the target video to get embeddings for. For example, gs://my-bucket/embeddings/supermarket-img.png.

    You can also provide the image as a base64-encoded byte string:

    [...]
    "image": {
      "bytesBase64Encoded": "B64_ENCODED_IMAGE"
    }
    [...]
    
  • VIDEO_URI: The Cloud Storage URI of the target video to get embeddings for. For example, gs://my-bucket/embeddings/supermarket-video.mp4.

    You can also provide the video as a base64-encoded byte string:

    [...]
    "video": {
      "bytesBase64Encoded": "B64_ENCODED_VIDEO"
    }
    [...]
    
  • videoSegmentConfig (START_SECOND, END_SECOND, INTERVAL_SECONDS). Optional. The specific video segments (in seconds) the embeddings are generated for.

    For example:

    [...]
    "videoSegmentConfig": {
      "startOffsetSec": 10,
      "endOffsetSec": 60,
      "intervalSec": 10
    }
    [...]

    Using this config specifies video data from 10 seconds to 60 seconds and generates embeddings for the following 10 second video intervals: [10, 20), [20, 30), [30, 40), [40, 50), [50, 60). This video interval ("intervalSec": 10) falls in the Standard video embedding mode, and the user is charged at the Standard mode pricing rate.

    If you omit videoSegmentConfig, the service uses the following default values: "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }. This video interval ("intervalSec": 16) falls in the Essential video embedding mode, and the user is charged at the Essential mode pricing rate.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

Request JSON body:

{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "gcsUri": "IMAGE_URI"
      },
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content
The embedding the model returns is a 1408 float vector. The following sample response is shortened for space.
{
  "predictions": [
    {
      "textEmbedding": [
        0.0105433334,
        -0.00302835181,
        0.00656806398,
        0.00603460241,
        [...]
        0.00445805816,
        0.0139605571,
        -0.00170318608,
        -0.00490092579
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 7,
          "embedding": [
            -0.00673126569,
            0.0248149596,
            0.0128901172,
            0.0107588246,
            [...]
            -0.00180952181,
            -0.0054573305,
            0.0117037306,
            0.0169312079
          ]
        }
      ],
      "imageEmbedding": [
        -0.00728622358,
        0.031021487,
        -0.00206603738,
        0.0273937676,
        [...]
        -0.00204976718,
        0.00321615417,
        0.0121978866,
        0.0193375275
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from typing import Optional

import vertexai
from vertexai.vision_models import (
    Image,
    MultiModalEmbeddingModel,
    MultiModalEmbeddingResponse,
    Video,
    VideoSegmentConfig,
)


def get_image_video_text_embeddings(
    project_id: str,
    location: str,
    image_path: str,
    video_path: str,
    contextual_text: Optional[str] = None,
    dimension: Optional[int] = 1408,
    video_segment_config: Optional[VideoSegmentConfig] = None,
) -> MultiModalEmbeddingResponse:
    """Example of how to generate multimodal embeddings from image, video, and text.

    Args:
        project_id: Google Cloud Project ID, used to initialize vertexai
        location: Google Cloud Region, used to initialize vertexai
        image_path: Path to image (local or Google Cloud Storage) to generate embeddings for.
        video_path: Path to video (local or Google Cloud Storage) to generate embeddings for.
        contextual_text: Text to generate embeddings for.
        dimension: Dimension for the returned embeddings.
            https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-multimodal-embeddings#low-dimension
        video_segment_config: Define specific segments to generate embeddings for.
            https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-multimodal-embeddings#video-best-practices
    """

    vertexai.init(project=project_id, location=location)

    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
    image = Image.load_from_file(image_path)
    video = Video.load_from_file(video_path)

    embeddings = model.get_embeddings(
        image=image,
        video=video,
        video_segment_config=video_segment_config,
        contextual_text=contextual_text,
        dimension=dimension,
    )

    print(f"Image Embedding: {embeddings.image_embedding}")

    # Video Embeddings are segmented based on the video_segment_config.
    print("Video Embeddings:")
    for video_embedding in embeddings.video_embeddings:
        print(
            f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
        )
        print(f"Embedding: {video_embedding.embedding}")

    print(f"Text Embedding: {embeddings.text_embedding}")

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';
// const bastImagePath = "YOUR_BASED_IMAGE_PATH"
// const textPrompt = 'YOUR_TEXT_PROMPT';
const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction service client
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects.
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};
const publisher = 'google';
const model = 'multimodalembedding@001';

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function predictImageFromImageAndText() {
  // Configure the parent resource
  const endpoint = `projects/${project}/locations/${location}/publishers/${publisher}/models/${model}`;

  const fs = require('fs');
  const imageFile = fs.readFileSync(baseImagePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const prompt = {
    text: textPrompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(prompt);
  const instances = [instanceValue];

  const parameter = {
    sampleCount: 1,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  console.log('Get image embedding response');
  const predictions = response.predictions;
  console.log('\tPredictions :');
  for (const prediction of predictions) {
    console.log(`\t\tPrediction : ${JSON.stringify(prediction)}`);
  }
}

await predictImageFromImageAndText();

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.cloud.aiplatform.v1beta1.EndpointName;
import com.google.cloud.aiplatform.v1beta1.PredictResponse;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictImageFromImageAndTextSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace this variable before running the sample.
    String project = "YOUR_PROJECT_ID";
    String textPrompt = "YOUR_TEXT_PROMPT";
    String baseImagePath = "YOUR_BASE_IMAGE_PATH";

    // Learn how to use text prompts to update an image:
    // https://cloud.google.com/vertex-ai/docs/generative-ai/image/edit-images
    Map<String, Object> parameters = new HashMap<String, Object>();
    parameters.put("sampleCount", 1);

    String location = "us-central1";
    String publisher = "google";
    String model = "multimodalembedding@001";

    predictImageFromImageAndText(
        project, location, publisher, model, textPrompt, baseImagePath, parameters);
  }

  // Update images using text prompts
  public static void predictImageFromImageAndText(
      String project,
      String location,
      String publisher,
      String model,
      String textPrompt,
      String baseImagePath,
      Map<String, Object> parameters)
      throws IOException {
    final String endpoint = String.format("%s-aiplatform.googleapis.com:443", location);
    final PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder().setEndpoint(endpoint).build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings)) {
      final EndpointName endpointName =
          EndpointName.ofProjectLocationPublisherModelName(project, location, publisher, model);

      // Convert the image to Base64
      byte[] imageData = Base64.getEncoder().encode(Files.readAllBytes(Paths.get(baseImagePath)));
      String encodedImage = new String(imageData, StandardCharsets.UTF_8);

      JsonObject jsonInstance = new JsonObject();
      jsonInstance.addProperty("text", textPrompt);
      JsonObject jsonImage = new JsonObject();
      jsonImage.addProperty("bytesBase64Encoded", encodedImage);
      jsonInstance.add("image", jsonImage);

      Value instanceValue = stringToValue(jsonInstance.toString());
      List<Value> instances = new ArrayList<>();
      instances.add(instanceValue);

      Gson gson = new Gson();
      String gsonString = gson.toJson(parameters);
      Value parameterValue = stringToValue(gsonString);

      PredictResponse predictResponse =
          predictionServiceClient.predict(endpointName, instances, parameterValue);
      System.out.println("Predict Response");
      System.out.println(predictResponse);
      for (Value prediction : predictResponse.getPredictionsList()) {
        System.out.format("\tPrediction: %s\n", prediction);
      }
    }
  }

  // Convert a Json string to a protobuf.Value
  static Value stringToValue(String value) throws InvalidProtocolBufferException {
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(value, builder);
    return builder.build();
  }
}

Response body

{
  "predictions": [
    {
      "textEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "imageEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "embedding": [
            float,
            // array of 1408 float values
            float
          ]
        }
      ]
    }
  ],
  "deployedModelId": string
}
Response element Description
imageEmbedding 128, 256, 512, or 1408 dimension list of floats.
textEmbedding 128, 256, 512, or 1408 dimension list of floats.
videoEmbeddings 1408 dimension list of floats with the start and end time (in seconds) of the video segment that the embeddings are generated for.