Multimodal Embeddings API

Multimodal Embeddings API는 제공된 입력에 따라 벡터를 생성하며, 여기에는 이미지, 텍스트, 동영상 데이터의 조합이 포함될 수 있습니다. 그런 다음 임베딩 벡터를 이미지 분류 또는 동영상 콘텐츠 검토와 같은 후속 태스크에 사용할 수 있습니다.

추가적인 개념 정보는 멀티모달 임베딩을 참조하세요.

지원되는 모델:

모델	코드
멀티모달 임베딩	`multimodalembedding@001`

예시 문법

멀티모달 임베딩 API 요청을 보내는 문법입니다.

curl

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \

https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:predict \
-d '{
"instances": [
  ...
],
}'

Python

from vertexai.vision_models import MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding")
model.get_embeddings(...)

매개변수 목록

구현 세부정보는 예시를 참고하세요.

요청 본문

{
  "instances": [
    {
      "text": string,
      "image": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "mimeType": string
      },
      "video": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "videoSegmentConfig": {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "intervalSec": integer
        }
      },
      "parameters": {
        "dimension": integer
      }
    }
  ]
}

매개변수
`image`	선택사항: `Image` 임베딩을 생성할 이미지입니다.
`text`	선택사항: `String` 임베딩을 생성할 텍스트입니다.
`video`	선택사항: `Video` 임베딩을 생성할 동영상 세그먼트입니다.
`dimension`	선택사항: `Int` 응답에 포함된 임베딩의 측정기준입니다. 텍스트 및 이미지 입력에만 적용됩니다. 허용되는 값은 `128`, `256`, `512`, 또는 `1408`입니다.

이미지

매개변수

매개변수
`bytesBase64Encoded`	선택사항: `String` base64 문자열로 인코딩된 이미지 바이트입니다. `bytesBase64Encoded` 또는 `gcsUri` 중 하나여야 합니다.
`gcsUri`	선택사항. `String` 임베딩을 수행할 이미지의 Cloud Storage 위치입니다. `bytesBase64Encoded` 또는 `gcsUri` 중 하나입니다.
`mimeType`	선택사항. `String` 이미지 콘텐츠의 MIME 유형입니다. 지원되는 값은 `image/jpeg` 및 `image/png`입니다.

bytesBase64Encoded

선택사항: String

base64 문자열로 인코딩된 이미지 바이트입니다. bytesBase64Encoded 또는 gcsUri 중 하나여야 합니다.

gcsUri

선택사항. String

임베딩을 수행할 이미지의 Cloud Storage 위치입니다. bytesBase64Encoded 또는 gcsUri 중 하나입니다.

mimeType

선택사항. String

이미지 콘텐츠의 MIME 유형입니다. 지원되는 값은 image/jpeg 및 image/png입니다.

동영상

매개변수

매개변수
`bytesBase64Encoded`	선택사항: `String` base64 문자열로 인코딩된 동영상 바이트입니다. `bytesBase64Encoded` 또는 `gcsUri` 중 하나입니다.
`gcsUri`	선택사항: `String` 임베딩을 수행할 동영상의 Cloud Storage 위치입니다. `bytesBase64Encoded` 또는 `gcsUri` 중 하나입니다.
`videoSegmentConfig`	선택사항: `VideoSegmentConfig` 동영상 세그먼트 구성입니다.

bytesBase64Encoded

선택사항: String

base64 문자열로 인코딩된 동영상 바이트입니다. bytesBase64Encoded 또는 gcsUri 중 하나입니다.

gcsUri

선택사항: String

임베딩을 수행할 동영상의 Cloud Storage 위치입니다. bytesBase64Encoded 또는 gcsUri 중 하나입니다.

videoSegmentConfig

선택사항: VideoSegmentConfig

동영상 세그먼트 구성입니다.

VideoSegmentConfig

매개변수

매개변수
`startOffsetSec`	선택사항: `Int` 동영상 세그먼트의 시작 오프셋(초)입니다. 지정하지 않으면 `max(0, endOffsetSec - 120)`으로 계산됩니다.
`endOffsetSec`	선택사항: `Int` 동영상 세그먼트의 종료 오프셋(초)입니다. 지정하지 않으면 `min(video length, startOffSec + 120)`으로 계산됩니다. `startOffSec`와 `endOffSec`가 모두 지정된 경우 `endOffsetSec`가 `min(startOffsetSec+120, endOffsetSec)`로 조정됩니다.
`intervalSec`	선택사항. `Int` 임베딩이 생성되는 동영상의 간격입니다. `interval_sec`의 최솟값은 4입니다. 간격이 `4`보다 작으면 `InvalidArgumentError`가 반환됩니다. 간격의 최댓값에는 제한이 없습니다. 그러나 간격이 `min(video length, 120s)`보다 크면 생성된 임베딩의 품질에 영향을 미칩니다. 기본값은 `16`입니다.

startOffsetSec

선택사항: Int

동영상 세그먼트의 시작 오프셋(초)입니다. 지정하지 않으면 max(0, endOffsetSec - 120)으로 계산됩니다.

endOffsetSec

선택사항: Int

동영상 세그먼트의 종료 오프셋(초)입니다. 지정하지 않으면 min(video length, startOffSec + 120)으로 계산됩니다. startOffSec와 endOffSec가 모두 지정된 경우 endOffsetSec가 min(startOffsetSec+120, endOffsetSec)로 조정됩니다.

intervalSec

선택사항. Int

임베딩이 생성되는 동영상의 간격입니다. interval_sec의 최솟값은 4입니다. 간격이 4보다 작으면 InvalidArgumentError가 반환됩니다. 간격의 최댓값에는 제한이 없습니다. 그러나 간격이 min(video length, 120s)보다 크면 생성된 임베딩의 품질에 영향을 미칩니다. 기본값은 16입니다.

응답 본문

{
  "predictions": [
    {
      "textEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "imageEmbedding": [
        float,
        // array of 128, 256, 512, or 1408 float values
        float
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": integer,
          "endOffsetSec": integer,
          "embedding": [
            float,
            // array of 1408 float values
            float
          ]
        }
      ]
    }
  ],
  "deployedModelId": string
}

응답 요소	설명
`imageEmbedding`	부동 소수점 수의 128, 256, 512 또는 1,408 차원 목록입니다.
`textEmbedding`	부동 소수점 수의 128, 256, 512 또는 1,408 차원 목록입니다.
`videoEmbeddings`	임베딩이 생성된 동영상 세그먼트의 시작 및 종료 시간(초)이 포함된 부동 소수점 수의 1,408 차원 목록입니다.

예시

기본 사용 사례

이미지에서 임베딩 생성

다음 샘플을 사용하여 이미지의 임베딩을 생성합니다.

REST

요청 데이터를 사용하기 전에 다음을 바꿉니다.

LOCATION: 프로젝트의 리전. 예를 들면 us-central1, europe-west2, asia-northeast3입니다. 사용 가능한 리전 목록은 Vertex AI의 생성형 AI 위치를 참조하세요.
PROJECT_ID: Google Cloud 프로젝트 ID입니다.
TEXT: 임베딩을 가져올 대상 텍스트. 예를 들면 a cat입니다.
B64_ENCODED_IMG: 임베딩을 가져올 대상 이미지. 이미지는 base64 인코딩 바이트 문자열로 지정되어야 합니다.

HTTP 메서드 및 URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

JSON 요청 본문:

{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "bytesBase64Encoded": "B64_ENCODED_IMG"
      }
    }
  ]
}

요청을 보내려면 다음 옵션 중 하나를 선택합니다.

curl

참고: 다음 명령어는 gcloud init 또는 gcloud auth login을 실행하거나 gcloud CLI에 자동으로 로그인하는 Cloud Shell을 사용하여 사용자 계정으로 gcloud CLI에 로그인했다고 가정합니다. gcloud auth list를 실행하면 현재 활성 계정을 확인할 수 있습니다.

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

참고: 다음 명령어는 gcloud init 또는 gcloud auth login을 실행하여 사용자 계정으로 gcloud CLI에 로그인했다고 가정합니다. gcloud auth list를 실행하면 현재 활성 계정을 확인할 수 있습니다.

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

모델이 반환하는 임베딩은 1408 부동 소수점 벡터입니다. 다음은 공간 절약을 위해 짧게 줄인 샘플 응답입니다.

{
  "predictions": [
    {
      "textEmbedding": [
        0.010477379,
        -0.00399621,
        0.00576670747,
        [...]
        -0.00823613815,
        -0.0169572588,
        -0.00472954148
      ],
      "imageEmbedding": [
        0.00262696808,
        -0.00198890246,
        0.0152047109,
        -0.0103145819,
        [...]
        0.0324628279,
        0.0284924973,
        0.011650892,
        -0.00452344026
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Python

Vertex AI SDK for Python을 설치하거나 업데이트하는 방법은 Vertex AI SDK for Python 설치를 참조하세요. 자세한 내용은 Python API 참고 문서를 참조하세요.

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

embeddings = model.get_embeddings(
    image=image,
    contextual_text="Colosseum",
    dimension=1408,
)
print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123147098, 0.0727171078, ...]
# Text Embedding: [0.00230263756, 0.0278981831, ...]

Node.js

이 샘플을 사용해 보기 전에 Vertex AI 빠른 시작: 클라이언트 라이브러리 사용의 Node.js 설정 안내를 따르세요. 자세한 내용은 Vertex AI Node.js API 참고 문서를 참조하세요.

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';
// const baseImagePath = 'YOUR_BASE_IMAGE_PATH';
// const textPrompt = 'YOUR_TEXT_PROMPT';
const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction service client
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects.
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};
const publisher = 'google';
const model = 'multimodalembedding@001';

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function predictImageFromImageAndText() {
  // Configure the parent resource
  const endpoint = `projects/${project}/locations/${location}/publishers/${publisher}/models/${model}`;

  const fs = require('fs');
  const imageFile = fs.readFileSync(baseImagePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const prompt = {
    text: textPrompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(prompt);
  const instances = [instanceValue];

  const parameter = {
    sampleCount: 1,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  console.log('Get image embedding response');
  const predictions = response.predictions;
  console.log('\tPredictions :');
  for (const prediction of predictions) {
    console.log(`\t\tPrediction : ${JSON.stringify(prediction)}`);
  }
}

await predictImageFromImageAndText();

Java

이 샘플을 사용해 보기 전에 Vertex AI 빠른 시작: 클라이언트 라이브러리 사용의 Java 설정 안내를 따르세요. 자세한 내용은 Vertex AI Java API 참고 문서를 참조하세요.

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


import com.google.cloud.aiplatform.v1beta1.EndpointName;
import com.google.cloud.aiplatform.v1beta1.PredictResponse;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictImageFromImageAndTextSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace this variable before running the sample.
    String project = "YOUR_PROJECT_ID";
    String textPrompt = "YOUR_TEXT_PROMPT";
    String baseImagePath = "YOUR_BASE_IMAGE_PATH";

    // Learn how to use text prompts to update an image:
    // https://cloud.google.com/vertex-ai/docs/generative-ai/image/edit-images
    Map<String, Object> parameters = new HashMap<String, Object>();
    parameters.put("sampleCount", 1);

    String location = "us-central1";
    String publisher = "google";
    String model = "multimodalembedding@001";

    predictImageFromImageAndText(
        project, location, publisher, model, textPrompt, baseImagePath, parameters);
  }

  // Update images using text prompts
  public static void predictImageFromImageAndText(
      String project,
      String location,
      String publisher,
      String model,
      String textPrompt,
      String baseImagePath,
      Map<String, Object> parameters)
      throws IOException {
    final String endpoint = String.format("%s-aiplatform.googleapis.com:443", location);
    final PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder().setEndpoint(endpoint).build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings)) {
      final EndpointName endpointName =
          EndpointName.ofProjectLocationPublisherModelName(project, location, publisher, model);

      // Convert the image to Base64
      byte[] imageData = Base64.getEncoder().encode(Files.readAllBytes(Paths.get(baseImagePath)));
      String encodedImage = new String(imageData, StandardCharsets.UTF_8);

      JsonObject jsonInstance = new JsonObject();
      jsonInstance.addProperty("text", textPrompt);
      JsonObject jsonImage = new JsonObject();
      jsonImage.addProperty("bytesBase64Encoded", encodedImage);
      jsonInstance.add("image", jsonImage);

      Value instanceValue = stringToValue(jsonInstance.toString());
      List<Value> instances = new ArrayList<>();
      instances.add(instanceValue);

      Gson gson = new Gson();
      String gsonString = gson.toJson(parameters);
      Value parameterValue = stringToValue(gsonString);

      PredictResponse predictResponse =
          predictionServiceClient.predict(endpointName, instances, parameterValue);
      System.out.println("Predict Response");
      System.out.println(predictResponse);
      for (Value prediction : predictResponse.getPredictionsList()) {
        System.out.format("\tPrediction: %s\n", prediction);
      }
    }
  }

  // Convert a Json string to a protobuf.Value
  static Value stringToValue(String value) throws InvalidProtocolBufferException {
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(value, builder);
    return builder.build();
  }
}

Go

이 샘플을 사용해 보기 전에 Vertex AI 빠른 시작: 클라이언트 라이브러리 사용의 Go 설정 안내를 따르세요. 자세한 내용은 Vertex AI Go API 참고 문서를 참조하세요.

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

import (
	"context"
	"encoding/json"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForTextAndImage shows how to use the multimodal model to generate embeddings for
// text and image inputs.
func generateForTextAndImage(w io.Writer, project, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instance, err := structpb.NewValue(map[string]any{
		"image": map[string]any{
			// Image input can be provided either as a Google Cloud Storage URI or as
			// base64-encoded bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
		},
		"text": "Colosseum",
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instance},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		ImageEmbeddings []float32 `json:"imageEmbedding"`
		TextEmbeddings  []float32 `json:"textEmbedding"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal JSON: %w", err)
	}

	imageEmbedding := instanceEmbeddings.ImageEmbeddings
	textEmbedding := instanceEmbeddings.TextEmbeddings

	fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
	fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
	// Example response:
	// Text embedding (length=1408): [0.0023026613 0.027898183 -0.011858357 ... ]
	// Image embedding (length=1408): [-0.012314269 0.07271844 0.00020170923 ... ]

	return nil
}

동영상에서 임베딩 생성

다음 샘플을 사용하여 동영상 콘텐츠의 임베딩을 생성합니다.

REST

다음 예시에서는 Cloud Storage에 있는 동영상을 사용합니다. video.bytesBase64Encoded 필드를 사용하여 동영상의 base64 인코딩 문자열 표현을 제공할 수도 있습니다.

요청 데이터를 사용하기 전에 다음을 바꿉니다.

LOCATION: 프로젝트의 리전. 예를 들면 us-central1, europe-west2, asia-northeast3입니다. 사용 가능한 리전 목록은 Vertex AI의 생성형 AI 위치를 참조하세요.
PROJECT_ID: Google Cloud 프로젝트 ID입니다.
VIDEO_URI: 임베딩을 가져올 대상 동영상의 Cloud Storage URI입니다. 예를 들면 gs://my-bucket/embeddings/supermarket-video.mp4입니다.
동영상을 base64 인코딩 바이트 문자열로 제공할 수도 있습니다.
```
[...]
"video": {
  "bytesBase64Encoded": "B64_ENCODED_VIDEO"
}
[...]
```
videoSegmentConfig(START_SECOND, END_SECOND, INTERVAL_SECONDS). 선택사항. 임베딩이 생성될 특정 동영상 세그먼트(초)입니다.
videoSegmentConfig.intervalSec에 설정한 값은 요금이 청구되는 가격 책정 등급에 영향을 줍니다. 자세한 내용은 동영상 임베딩 모드 섹션 및 가격 책정 페이지를 참조하세요.

예를 들면 다음과 같습니다.
```
[...]
"videoSegmentConfig": {
  "startOffsetSec": 10,
  "endOffsetSec": 60,
  "intervalSec": 10
}
[...]
```
이 구성을 사용하면 10초에서 60초 사이의 동영상 데이터가 지정되고 다음 10초 비디오 간격([10, 20), [20, 30), [30, 40), [40, 50), [50, 60))에 대한 임베딩이 생성됩니다. 이 동영상 간격("intervalSec": 10)은 Standard 동영상 임베딩 모드에 속하며 사용자에게 Standard 모드 가격 책정 요금이 부과됩니다.

videoSegmentConfig를 생략하면 서비스에서 기본값 "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }을 사용합니다. 이 동영상 간격("intervalSec": 16)은 Essential 동영상 임베딩 모드에 속하며 사용자에게 Essential 모드 가격 책정 요금이 부과됩니다.

HTTP 메서드 및 URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

JSON 요청 본문:

{
  "instances": [
    {
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}

요청을 보내려면 다음 옵션 중 하나를 선택합니다.

curl

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

모델이 반환하는 임베딩은 1408 부동 소수점 벡터입니다. 다음은 공간 절약을 위해 짧게 줄인 샘플 응답입니다.

응답(7초 동영상, 지정된 videoSegmentConfig 없음):

{
  "predictions": [
    {
      "videoEmbeddings": [
        {
          "endOffsetSec": 7,
          "embedding": [
            -0.0045467657,
            0.0258095954,
            0.0146885719,
            0.00945400633,
            [...]
            -0.0023291884,
            -0.00493789,
            0.00975185353,
            0.0168156829
          ],
          "startOffsetSec": 0
        }
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

응답(59초 동영상, "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 60, "intervalSec": 10 } 동영상 세그먼트 구성 포함):

{
  "predictions": [
    {
      "videoEmbeddings": [
        {
          "endOffsetSec": 10,
          "startOffsetSec": 0,
          "embedding": [
            -0.00683252793,
            0.0390476175,
            [...]
            0.00657121744,
            0.013023301
          ]
        },
        {
          "startOffsetSec": 10,
          "endOffsetSec": 20,
          "embedding": [
            -0.0104404651,
            0.0357737206,
            [...]
            0.00509833824,
            0.0131902946
          ]
        },
        {
          "startOffsetSec": 20,
          "embedding": [
            -0.0113538112,
            0.0305239167,
            [...]
            -0.00195809244,
            0.00941874553
          ],
          "endOffsetSec": 30
        },
        {
          "embedding": [
            -0.00299320649,
            0.0322436653,
            [...]
            -0.00993082579,
            0.00968887936
          ],
          "startOffsetSec": 30,
          "endOffsetSec": 40
        },
        {
          "endOffsetSec": 50,
          "startOffsetSec": 40,
          "embedding": [
            -0.00591270532,
            0.0368893594,
            [...]
            -0.00219071587,
            0.0042470959
          ]
        },
        {
          "embedding": [
            -0.00458270218,
            0.0368121453,
            [...]
            -0.00317760976,
            0.00595594104
          ],
          "endOffsetSec": 59,
          "startOffsetSec": 50
        }
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Python

Vertex AI SDK for Python을 설치하거나 업데이트하는 방법은 Vertex AI SDK for Python 설치를 참조하세요. 자세한 내용은 Python API 참고 문서를 참조하세요.

import vertexai

from vertexai.vision_models import MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    video=Video.load_from_file(
        "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
    ),
    video_segment_config=VideoSegmentConfig(end_offset_sec=1),
)

# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
    )
    print(f"Embedding: {video_embedding.embedding}")

# Example response:
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0123456789, ...]

Go

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"time"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForVideo shows how to use the multimodal model to generate embeddings for video input.
func generateForVideo(w io.Writer, project, location string) error {
	// location = "us-central1"

	// The default context timeout may be not enough to process a video input.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instances, err := structpb.NewValue(map[string]any{
		"video": map[string]any{
			// Video input can be provided either as a Google Cloud Storage URI or as base64-encoded
			// bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4",
			"videoSegmentConfig": map[string]any{
				"startOffsetSec": 1,
				"endOffsetSec":   5,
			},
		},
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instances},
	}
	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		VideoEmbeddings []struct {
			Embedding      []float32 `json:"embedding"`
			StartOffsetSec float64   `json:"startOffsetSec"`
			EndOffsetSec   float64   `json:"endOffsetSec"`
		} `json:"videoEmbeddings"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal json: %w", err)
	}
	// Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
	// each processed segment).
	videoEmbedding := instanceEmbeddings.VideoEmbeddings[0]

	fmt.Fprintf(w, "Video embedding (seconds: %.f-%.f; length=%d): %v\n",
		videoEmbedding.StartOffsetSec,
		videoEmbedding.EndOffsetSec,
		len(videoEmbedding.Embedding),
		videoEmbedding.Embedding,
	)
	// Example response:
	// Video embedding (seconds: 1-5; length=1408): [-0.016427778 0.032878537 -0.030755188 ... ]

	return nil
}

고급 사용 사례

다음 샘플을 사용하여 동영상, 텍스트, 이미지 콘텐츠의 임베딩을 가져옵니다.

동영상 임베딩의 경우 동영상 세그먼트와 임베딩 밀도를 지정할 수 있습니다.

REST

다음 예시에서는 이미지, 텍스트, 동영상 데이터를 사용합니다. 요청 본문에서 이러한 데이터 유형의 조합을 사용할 수 있습니다.

이 샘플에서는 Cloud Storage에 있는 동영상을 사용합니다. video.bytesBase64Encoded 필드를 사용하여 동영상의 base64 인코딩 문자열 표현을 제공할 수도 있습니다.

요청 데이터를 사용하기 전에 다음을 바꿉니다.

LOCATION: 프로젝트의 리전. 예를 들면 us-central1, europe-west2, asia-northeast3입니다. 사용 가능한 리전 목록은 Vertex AI의 생성형 AI 위치를 참조하세요.
PROJECT_ID: Google Cloud 프로젝트 ID입니다.
TEXT: 임베딩을 가져올 대상 텍스트. 예를 들면 a cat입니다.
IMAGE_URI: 임베딩을 가져올 대상 이미지의 Cloud Storage URI입니다. 예를 들면 gs://my-bucket/embeddings/supermarket-img.png입니다.
이미지를 base64 인코딩 바이트 문자열로 제공할 수도 있습니다.
```
[...]
"image": {
  "bytesBase64Encoded": "B64_ENCODED_IMAGE"
}
[...]
```
VIDEO_URI: 임베딩을 가져올 대상 동영상의 Cloud Storage URI입니다. 예를 들면 gs://my-bucket/embeddings/supermarket-video.mp4입니다.
동영상을 base64 인코딩 바이트 문자열로 제공할 수도 있습니다.
```
[...]
"video": {
  "bytesBase64Encoded": "B64_ENCODED_VIDEO"
}
[...]
```
videoSegmentConfig(START_SECOND, END_SECOND, INTERVAL_SECONDS). 선택사항. 임베딩이 생성될 특정 동영상 세그먼트(초)입니다.
videoSegmentConfig.intervalSec에 설정한 값은 요금이 청구되는 가격 책정 등급에 영향을 줍니다. 자세한 내용은 동영상 임베딩 모드 섹션 및 가격 책정 페이지를 참조하세요.

예를 들면 다음과 같습니다.
```
[...]
"videoSegmentConfig": {
  "startOffsetSec": 10,
  "endOffsetSec": 60,
  "intervalSec": 10
}
[...]
```
이 구성을 사용하면 10초에서 60초 사이의 동영상 데이터가 지정되고 다음 10초 비디오 간격([10, 20), [20, 30), [30, 40), [40, 50), [50, 60))에 대한 임베딩이 생성됩니다. 이 동영상 간격("intervalSec": 10)은 Standard 동영상 임베딩 모드에 속하며 사용자에게 Standard 모드 가격 책정 요금이 부과됩니다.

videoSegmentConfig를 생략하면 서비스에서 기본값 "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 }을 사용합니다. 이 동영상 간격("intervalSec": 16)은 Essential 동영상 임베딩 모드에 속하며 사용자에게 Essential 모드 가격 책정 요금이 부과됩니다.

HTTP 메서드 및 URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

JSON 요청 본문:

{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "gcsUri": "IMAGE_URI"
      },
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}

요청을 보내려면 다음 옵션 중 하나를 선택합니다.

curl

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

모델이 반환하는 임베딩은 1408 부동 소수점 벡터입니다. 다음은 공간 절약을 위해 짧게 줄인 샘플 응답입니다.

{
  "predictions": [
    {
      "textEmbedding": [
        0.0105433334,
        -0.00302835181,
        0.00656806398,
        0.00603460241,
        [...]
        0.00445805816,
        0.0139605571,
        -0.00170318608,
        -0.00490092579
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 7,
          "embedding": [
            -0.00673126569,
            0.0248149596,
            0.0128901172,
            0.0107588246,
            [...]
            -0.00180952181,
            -0.0054573305,
            0.0117037306,
            0.0169312079
          ]
        }
      ],
      "imageEmbedding": [
        -0.00728622358,
        0.031021487,
        -0.00206603738,
        0.0273937676,
        [...]
        -0.00204976718,
        0.00321615417,
        0.0121978866,
        0.0193375275
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Python

Vertex AI SDK for Python을 설치하거나 업데이트하는 방법은 Vertex AI SDK for Python 설치를 참조하세요. 자세한 내용은 Python API 참고 문서를 참조하세요.

import vertexai

from vertexai.vision_models import Image, MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)
video = Video.load_from_file(
    "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
)

embeddings = model.get_embeddings(
    image=image,
    video=video,
    video_segment_config=VideoSegmentConfig(end_offset_sec=1),
    contextual_text="Cars on Highway",
)

print(f"Image Embedding: {embeddings.image_embedding}")

# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
    )
    print(f"Embedding: {video_embedding.embedding}")

print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123144267, 0.0727186054, 0.000201397663, ...]
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0345234685, ...]
# Text Embedding: [-0.0207006838, -0.00251058186, ...]

Go

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"time"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForImageTextAndVideo shows how to use the multimodal model to generate embeddings for
// image, text and video data.
func generateForImageTextAndVideo(w io.Writer, project, location string) error {
	// location = "us-central1"

	// The default context timeout may be not enough to process a video input.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instance, err := structpb.NewValue(map[string]any{
		"text": "Domestic cats in natural conditions",
		"image": map[string]any{
			// Image and video inputs can be provided either as a Google Cloud Storage URI or as
			// base64-encoded bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/generative-ai/image/320px-Felis_catus-cat_on_snow.jpg",
		},
		"video": map[string]any{
			"gcsUri": "gs://cloud-samples-data/video/cat.mp4",
		},
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instance},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		ImageEmbeddings []float32 `json:"imageEmbedding"`
		TextEmbeddings  []float32 `json:"textEmbedding"`
		VideoEmbeddings []struct {
			Embedding      []float32 `json:"embedding"`
			StartOffsetSec float64   `json:"startOffsetSec"`
			EndOffsetSec   float64   `json:"endOffsetSec"`
		} `json:"videoEmbeddings"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal JSON: %w", err)
	}

	imageEmbedding := instanceEmbeddings.ImageEmbeddings
	textEmbedding := instanceEmbeddings.TextEmbeddings
	// Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
	// each processed segment).
	videoEmbedding := instanceEmbeddings.VideoEmbeddings[0].Embedding

	fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
	fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
	fmt.Fprintf(w, "Video embedding (length=%d): %v\n", len(videoEmbedding), videoEmbedding)
	// Example response:
	// Image embedding (length=1408): [-0.01558477 0.0258355 0.016342038 ... ]
	// Text embedding (length=1408): [-0.005894961 0.008349559 0.015355394 ... ]
	// Video embedding (length=1408): [-0.018867437 0.013997682 0.0012682161 ... ]

	return nil
}

다음 단계

자세한 문서는 다음을 참조하세요.

멀티모달 임베딩 가져오기