Multimodal Embeddings API

The Multimodal Embeddings API generates vectors based on the input you provide, which can include a combination of image, text, and video data. The embedding vectors can then be used for downstream tasks such as image classification or video content moderation.

Supported models:

  • multimodalembedding@001

Syntax

  • PROJECT_ID = PROJECT_ID
  • REGION = us-central1

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    ...
  ]
}'

Python

from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
model.get_embeddings(...)

Parameter list

Request body

Parameters

image

Optional. Image

The image to generate embeddings for.

text

Optional. String

The text to generate embeddings for.

video

Optional. Video

The video segment to generate embeddings for.

dimension

Optional. Int

This parameter accepts one of the following values: 128, 256, 512, or 1408. The response includes embeddings of that dimension. This only applies to text and image input.

Image

Parameters

bytesBase64Encoded

Optional. String

Image bytes encoded as a base64 string. One of bytesBase64Encoded or gcsUri must be set.

gcsUri

Optional. String

The Cloud Storage location of the image to generate embeddings for. One of bytesBase64Encoded or gcsUri must be set.

mimeType

Optional. String

The MIME type of the image content. image/jpeg and image/png are supported.
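
When building requests programmatically, the mimeType field can be filled in from the filename using Python's standard mimetypes module. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import mimetypes

# Per the table above, only these image MIME types are supported.
SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}

def image_mime_type(filename: str) -> str:
    """Guess the MIME type of an image file and verify it is supported."""
    mime, _ = mimetypes.guess_type(filename)
    if mime not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime}")
    return mime

print(image_mime_type("flower.jpg"))   # image/jpeg
print(image_mime_type("diagram.png"))  # image/png
```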

VideoSegmentConfig

Parameters

startOffsetSec

Optional. Int

The start offset of the video segment, in seconds. If the start offset is not specified, it is calculated as max(0, endOffsetSec - 120).

endOffsetSec

Optional. Int

The end offset of the video segment, in seconds. If the end offset is not specified, it is calculated as min(video length, startOffsetSec + 120). If both startOffsetSec and endOffsetSec are specified, endOffsetSec is adjusted to min(startOffsetSec + 120, endOffsetSec).

intervalSec

Optional. Int

The interval of the video at which embeddings are generated. The minimum value for intervalSec is 4; if the interval is less than 4, an InvalidArgumentError is returned. There is no limit on the maximum value of the interval, but if the interval is larger than min(video length, 120s), the quality of the generated embeddings is affected. The default value is 16.
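
As a rough sanity check, the number of embeddings returned for a segment can be estimated from the config. This sketch assumes the service emits one embedding per intervalSec window within [startOffsetSec, endOffsetSec); the exact server-side behavior may differ:

```python
import math

def estimated_embedding_count(start_offset_sec: int,
                              end_offset_sec: int,
                              interval_sec: int) -> int:
    """Estimate how many video embeddings a segment config yields,
    assuming one embedding per interval_sec window in the segment."""
    if interval_sec < 4:
        raise ValueError("intervalSec must be >= 4")
    return math.ceil((end_offset_sec - start_offset_sec) / interval_sec)

# startOffsetSec=10, endOffsetSec=60, intervalSec=10 -> 5 windows
print(estimated_embedding_count(10, 60, 10))  # 5
```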

Video

Parameters

bytesBase64Encoded

Optional. String

Video bytes encoded as a base64 string. One of bytesBase64Encoded or gcsUri must be set.

gcsUri

Optional. String

The Cloud Storage location of the video to generate embeddings for. One of bytesBase64Encoded or gcsUri must be set.

videoSegmentConfig

Optional. VideoSegmentConfig

The video segment config.
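
For the bytesBase64Encoded fields, the raw image or video bytes must be base64-encoded before being placed in the request body as a string. A minimal sketch using in-memory bytes (a real request would read the file from disk; the helper name is illustrative):

```python
import base64
import json

def instance_with_image_bytes(image_bytes: bytes,
                              mime_type: str = "image/jpeg") -> dict:
    """Build a request instance carrying base64-encoded image bytes."""
    return {
        "image": {
            "bytesBase64Encoded": base64.b64encode(image_bytes).decode("utf-8"),
            "mimeType": mime_type,
        }
    }

raw = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # stand-in for real JPEG data
body = {"instances": [instance_with_image_bytes(raw)]}

# The encoding is lossless: decoding recovers the original bytes.
encoded = body["instances"][0]["image"]["bytesBase64Encoded"]
assert base64.b64decode(encoded) == raw
print(json.dumps(body)[:80])
```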

Examples

  • PROJECT_ID = PROJECT_ID
  • REGION = us-central1
  • MODEL_ID = multimodalembedding@001

Basic use case

The multimodal embedding model generates vectors based on the input you provide, which can include a combination of image, text, and video data.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "image": {
        "gcsUri": "gs://your-public-uri-test/flower.jpg"
      },
      "text": "white shoes",
      "video": {
        "gcsUri": "gs://your-public-uri-test/Okabashi.mp4"
      }
    }
  ]
}'

Python

# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass
# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID"  # replace with your project ID
IMAGE_URI = "gs://your-public-uri-test/flower.jpg" # @param {type:"string"}
TEXT = "white shoes" # @param {type:"string"}
VIDEO_URI = "gs://your-public-uri-test/Okabashi.mp4" # @param {type:"string"}
VIDEO_START_OFFSET_SEC = 0
VIDEO_END_OFFSET_SEC = 120
VIDEO_EMBEDDING_INTERVAL_SEC = 16

# Inspired by https://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):
    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]

class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances)

        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(start_offset_sec=v['startOffsetSec'], end_offset_sec=v['endOffsetSec'],
                                                 embedding=[x for x in v['embedding']])
                for v in
                video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)

# client can be reused.
client = EmbeddingPredictionClient(project=PROJECT_ID)
start = time.time()
response = client.get_embedding(text=TEXT, image_uri=IMAGE_URI, video_uri=VIDEO_URI,
                                    start_offset_sec=VIDEO_START_OFFSET_SEC,
                                    end_offset_sec=VIDEO_END_OFFSET_SEC,
                                    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC)
end = time.time()

print(response)
print('Time taken: ', end - start)
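
The returned vectors share one embedding space, so a text embedding can be scored against image or video embeddings directly. A minimal cosine-similarity sketch, using synthetic vectors in place of real API output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Synthetic stand-ins for response.text_embedding / response.image_embedding:
text_embedding = [0.1, 0.3, 0.5]
image_embedding = [0.2, 0.6, 1.0]
print(round(cosine_similarity(text_embedding, image_embedding), 4))  # 1.0 (parallel vectors)
```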

Advanced use case

You can specify the dimension for text and image embeddings. For video embeddings, you can specify the video segment and the embedding density.

curl - image

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "image": {
        "gcsUri": "gs://your-public-uri-test/flower.jpg"
      },
      "text": "white shoes"
    }
  ],
  "parameters": {
    "dimension": 128
  }
}'

curl - video

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL_ID}:predict \
  -d '{
  "instances": [
    {
      "video": {
        "gcsUri": "gs://your-public-uri-test/Okabashi.mp4",
        "videoSegmentConfig": {
          "startOffsetSec": 10,
          "endOffsetSec": 60,
          "intervalSec": 10
        }
      }
    }
  ]
}'

Python

# @title Client for multimodal embedding
import time
import typing
from dataclasses import dataclass
# Need to do pip install google-cloud-aiplatform for the following two imports.
# Also run: gcloud auth application-default login.
from google.cloud import aiplatform
from google.protobuf import struct_pb2

PROJECT_ID = "PROJECT_ID"  # replace with your project ID
IMAGE_URI = "gs://your-public-uri-test/flower.jpg"
TEXT = "white shoes"
VIDEO_URI = "gs://your-public-uri-test/brahms.mp4"
VIDEO_START_OFFSET_SEC = 10
VIDEO_END_OFFSET_SEC = 60
VIDEO_EMBEDDING_INTERVAL_SEC = 10
DIMENSION = 128

# Inspired by https://stackoverflow.com/questions/34269772/type-hints-in-namedtuple.
class EmbeddingResponse(typing.NamedTuple):
    @dataclass
    class VideoEmbedding:
        start_offset_sec: int
        end_offset_sec: int
        embedding: typing.Sequence[float]

    text_embedding: typing.Sequence[float]
    image_embedding: typing.Sequence[float]
    video_embeddings: typing.Sequence[VideoEmbedding]

class EmbeddingPredictionClient:
    """Wrapper around Prediction Service Client."""

    def __init__(self, project: str,
                 location: str = "us-central1",
                 api_regional_endpoint: str = "us-central1-aiplatform.googleapis.com"):
        client_options = {"api_endpoint": api_regional_endpoint}
        # Initialize client that will be used to create and send requests.
        # This client only needs to be created once, and can be reused for multiple requests.
        self.client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
        self.location = location
        self.project = project

    def get_embedding(self, text: str = None, image_uri: str = None, video_uri: str = None,
                      start_offset_sec: int = 0, end_offset_sec: int = 120, interval_sec: int = 16, dimension=1408):
        if not text and not image_uri and not video_uri:
            raise ValueError('At least one of text or image_uri or video_uri must be specified.')

        instance = struct_pb2.Struct()
        if text:
            instance.fields['text'].string_value = text

        if image_uri:
            image_struct = instance.fields['image'].struct_value
            image_struct.fields['gcsUri'].string_value = image_uri

        if video_uri:
            video_struct = instance.fields['video'].struct_value
            video_struct.fields['gcsUri'].string_value = video_uri
            video_config_struct = video_struct.fields['videoSegmentConfig'].struct_value
            video_config_struct.fields['startOffsetSec'].number_value = start_offset_sec
            video_config_struct.fields['endOffsetSec'].number_value = end_offset_sec
            video_config_struct.fields['intervalSec'].number_value = interval_sec

        parameters = struct_pb2.Struct()
        parameters.fields['dimension'].number_value = dimension

        instances = [instance]
        endpoint = (f"projects/{self.project}/locations/{self.location}"
                    "/publishers/google/models/multimodalembedding@001")
        response = self.client.predict(endpoint=endpoint, instances=instances, parameters=parameters)

        text_embedding = None
        if text:
            text_emb_value = response.predictions[0]['textEmbedding']
            text_embedding = [v for v in text_emb_value]

        image_embedding = None
        if image_uri:
            image_emb_value = response.predictions[0]['imageEmbedding']
            image_embedding = [v for v in image_emb_value]

        video_embeddings = None
        if video_uri:
            video_emb_values = response.predictions[0]['videoEmbeddings']
            video_embeddings = [
                EmbeddingResponse.VideoEmbedding(start_offset_sec=v['startOffsetSec'], end_offset_sec=v['endOffsetSec'],
                                                 embedding=[x for x in v['embedding']])
                for v in
                video_emb_values]

        return EmbeddingResponse(
            text_embedding=text_embedding,
            image_embedding=image_embedding,
            video_embeddings=video_embeddings)

# client can be reused.
client = EmbeddingPredictionClient(project=PROJECT_ID)
start = time.time()
response = client.get_embedding(text=TEXT, image_uri=IMAGE_URI, video_uri=VIDEO_URI,
                                    start_offset_sec=VIDEO_START_OFFSET_SEC,
                                    end_offset_sec=VIDEO_END_OFFSET_SEC,
                                    interval_sec=VIDEO_EMBEDDING_INTERVAL_SEC,
                                    dimension=DIMENSION)
end = time.time()

print(response)
print('Time taken: ', end - start)

Explore further

For detailed documentation, see the following: