Class MultiModalEmbeddingModel (1.48.0)

MultiModalEmbeddingModel(model_id: str, endpoint_name: typing.Optional[str] = None)

Generates embedding vectors from images and videos.

Examples::

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file("image.png")
video = Video.load_from_file("video.mp4")

embeddings = model.get_embeddings(
    image=image,
    video=video,
    contextual_text="Hello world",
)
image_embedding = embeddings.image_embedding
video_embeddings = embeddings.video_embeddings
text_embedding = embeddings.text_embedding

Methods

MultiModalEmbeddingModel

MultiModalEmbeddingModel(model_id: str, endpoint_name: typing.Optional[str] = None)

Creates a _ModelGardenModel.

This constructor should not be called directly. Use {model_class}.from_pretrained(model_name=...) instead.

Parameters
NameDescription
model_id str

Identifier of a Model Garden Model. Example: "text-bison@001"

endpoint_name typing.Optional[str]

Vertex Endpoint resource name for the model

from_pretrained

from_pretrained(model_name: str) -> vertexai._model_garden._model_garden_models.T

Loads a _ModelGardenModel.

Parameter
NameDescription
model_name str

Name of the model.

Exceptions
TypeDescription
ValueErrorIf model_name is unknown.
ValueErrorIf model does not support this class.

get_embeddings

get_embeddings(
    image: typing.Optional[vertexai.vision_models.Image] = None,
    video: typing.Optional[vertexai.vision_models.Video] = None,
    contextual_text: typing.Optional[str] = None,
    dimension: typing.Optional[int] = None,
    video_segment_config: typing.Optional[
        vertexai.vision_models.VideoSegmentConfig
    ] = None,
) -> vertexai.vision_models.MultiModalEmbeddingResponse

Gets embedding vectors from the provided image.

Parameters
NameDescription
image Image

Optional. The image to generate embeddings for. One of image, video, or contextual_text is required.

video Video

Optional. The video to generate embeddings for. One of image, video or contextual_text is required.

contextual_text str

Optional. Contextual text for your input image or video. If provided, the model will also generate an embedding vector for the provided contextual text. The returned image and text embedding vectors are in the same semantic space with the same dimensionality, and the vectors can be used interchangeably for use cases like searching image by text or searching text by image. One of image, video or contextual_text is required.

dimension int

Optional. The number of embedding dimensions. Lower values offer decreased latency when using these embeddings for subsequent tasks, while higher values offer better accuracy. Available values: 128, 256, 512, and 1408 (default).

video_segment_config VideoSegmentConfig

Optional. The specific video segments (in seconds) the embeddings are generated for.

Returns
TypeDescription
MultiModalEmbeddingResponseThe image and text embedding vectors.