Supported models:
- multimodalembedding@001
Syntax
- PROJECT_ID = PROJECT_ID
- REGION = us-central1
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
    "instances": [
      ...
    ]
  }'
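For example, the placeholders can be exported as shell variables before running the request. This is a minimal sketch; the values mirror the list above, and PROJECT_ID remains a placeholder to replace with your own project ID:

export PROJECT_ID="PROJECT_ID"  # replace with your project ID
export REGION="us-central1"
export MODEL_ID="multimodalembedding@001"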
Python
from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
model.get_embeddings(...)
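As an illustration of this syntax, here is a minimal sketch using the SDK's companion `Image` class from `vertexai.vision_models`; the project, location, and Cloud Storage URI are placeholder values taken from the examples below:

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="PROJECT_ID", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
embeddings = model.get_embeddings(
    image=Image.load_from_file("gs://your-public-uri-test/flower.jpg"),  # placeholder URI
    contextual_text="white shoes",
)
print(len(embeddings.image_embedding), len(embeddings.text_embedding))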
Parameter list
Request body
| Parameter | Description |
|---|---|
| `text` | Optional. The text for which you want to generate an embedding. |
| `image` | Optional. The image for which you want to generate an embedding. |
| `video` | Optional. The video segment for which you want to generate an embedding. |
| `dimension` | Optional. Accepts one of the following values: 128, 256, 512, or 1408. The response contains an embedding of that dimension. This applies only to text and image inputs. |
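Taken together, a request body combining these fields might look like the following sketch. The URIs are placeholders, and, as in the examples later on this page, `dimension` is passed under `parameters`:

{
  "instances": [
    {
      "text": "white shoes",
      "image": {
        "gcsUri": "gs://your-public-uri-test/flower.jpg"
      },
      "video": {
        "gcsUri": "gs://your-public-uri-test/Okabashi.mp4"
      }
    }
  ],
  "parameters": {
    "dimension": 128
  }
}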
Image

| Parameter | Description |
|---|---|
| `bytesBase64Encoded` | Optional. Image bytes encoded as a base64 string. |
| `gcsUri` | Optional. The Cloud Storage location of the image to generate the embedding for. |
| `mimeType` | Optional. The MIME type of the image content. |
VideoSegmentConfig

| Parameter | Description |
|---|---|
| `startOffsetSec` | Optional. The start offset of the video segment, in seconds. If the start offset isn't specified, it is calculated as max(0, endOffsetSec - 120). |
| `endOffsetSec` | Optional. The end offset of the video segment, in seconds. If the end offset isn't specified, it is calculated as min(video length, startOffsetSec + 120). |
| `intervalSec` | Optional. The interval of the video for which embeddings are generated. |
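For example, the following configuration (also used in the curl example later on this page) spans seconds 10 through 60 and should yield one embedding per 10-second interval, that is, five video embeddings:

"videoSegmentConfig": {
  "startOffsetSec": 10,
  "endOffsetSec": 60,
  "intervalSec": 10
}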
Video

| Parameter | Description |
|---|---|
| `bytesBase64Encoded` | Optional. Video bytes encoded as a base64 string. |
| `gcsUri` | Optional. The Cloud Storage location of the video to generate the embedding for. |
| `videoSegmentConfig` | Optional. The video segment configuration. |
Examples

- PROJECT_ID = PROJECT_ID
- REGION = us-central1
- MODEL_ID = multimodalembedding@001
Basic use case
The multimodal embedding model generates vectors based on the input you provide, which can include a combination of image, text, and video data.
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
    "instances": [
      {
        "image": {
          "gcsUri": "gs://your-public-uri-test/flower.jpg"
        },
        "text": "white shoes",
        "video": {
          "gcsUri": "gs://your-public-uri-test/Okabashi.mp4"
        }
      }
    ]
  }'
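The response contains one embedding per input modality. Its shape is roughly the following sketch, with the embedding arrays elided; by default each text and image embedding has 1408 values:

{
  "predictions": [
    {
      "textEmbedding": [ ... ],
      "imageEmbedding": [ ... ],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 16,
          "embedding": [ ... ]
        }
      ]
    }
  ]
}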
Python
# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass

# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID"  # Replace with your project ID.
IMAGE_URI = "gs://your-public-uri-test/flower.jpg"  # @param {type:"string"}
TEXT = "white shoes"  # @param {type:"string"}
VIDEO_URI = "gs://your-public-uri-test/Okabashi.mp4"  # @param {type:"string"}
VIDEO_START_OFFSET_SEC = 0
VIDEO_END_OFFSET_SEC = 120
VIDEO_EMBEDDING_INTERVAL_SEC = 16


# Inspired from https://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):

    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]


class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        # Build the request instance from whichever inputs were provided.
        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances)

        # Unpack the embeddings corresponding to the inputs that were set.
        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(
                    start_offset_sec=v['startOffsetSec'],
                    end_offset_sec=v['endOffsetSec'],
                    embedding=[x for x in v['embedding']])
                for v in video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)


# The client can be reused for multiple requests.
client = EmbeddingPredictionClient(project=PROJECT_ID)

start = time.time()
response = client.get_embedding(
    text=TEXT,
    image_uri=IMAGE_URI,
    video_uri=VIDEO_URI,
    start_offset_sec=VIDEO_START_OFFSET_SEC,
    end_offset_sec=VIDEO_END_OFFSET_SEC,
    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC)
end = time.time()

print(response)
print('Time taken: ', end - start)
Advanced use case
You can specify the size of text and image embeddings. For video embeddings, you can specify the video segment and the embedding density.
curl - image
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
    "instances": [
      {
        "image": {
          "gcsUri": "gs://your-public-uri-test/flower.jpg"
        },
        "text": "white shoes"
      }
    ],
    "parameters": {
      "dimension": 128
    }
  }'
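With `"dimension": 128`, the returned `textEmbedding` and `imageEmbedding` arrays each contain 128 values instead of the default 1408.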
curl - video
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
    "instances": [
      {
        "video": {
          "gcsUri": "gs://your-public-uri-test/Okabashi.mp4",
          "videoSegmentConfig": {
            "startOffsetSec": 10,
            "endOffsetSec": 60,
            "intervalSec": 10
          }
        }
      }
    ]
  }'
Python
# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass

# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID"  # Replace with your project ID.
IMAGE_URI = "gs://your-public-uri-test/flower.jpg"
TEXT = "white shoes"
VIDEO_URI = "gs://your-public-uri-test/brahms.mp4"
VIDEO_START_OFFSET_SEC = 10
VIDEO_END_OFFSET_SEC = 60
VIDEO_EMBEDDING_INTERVAL_SEC = 10
DIMENSION = 128


# Inspired from https://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):

    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]


class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16,
                      dimension=1408):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        # Build the request instance from whichever inputs were provided.
        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        # The dimension parameter applies to text and image embeddings.
        parameters = struct_pb2.Struct()
        parameters.fields['dimension'].number_value = dimension

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances,
                                       parameters=parameters)

        # Unpack the embeddings corresponding to the inputs that were set.
        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(
                    start_offset_sec=v['startOffsetSec'],
                    end_offset_sec=v['endOffsetSec'],
                    embedding=[x for x in v['embedding']])
                for v in video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)


# The client can be reused for multiple requests.
client = EmbeddingPredictionClient(project=PROJECT_ID)

start = time.time()
response = client.get_embedding(
    text=TEXT,
    image_uri=IMAGE_URI,
    video_uri=VIDEO_URI,
    start_offset_sec=VIDEO_START_OFFSET_SEC,
    end_offset_sec=VIDEO_END_OFFSET_SEC,
    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC,
    dimension=DIMENSION)
end = time.time()

print(response)
print('Time taken: ', end - start)
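The returned vectors can be compared directly. As an illustration (not part of the original sample), the cosine similarity between the text and image embeddings could be computed like this:

import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(response.text_embedding, response.image_embedding))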
Learn more
For more information, see the detailed documentation.