Get image information using Visual Question Answering (VQA)

Caution: Starting on June 24, 2025, Imagen versions 1 and 2 are deprecated. The Imagen models imagegeneration@002, imagegeneration@005, and imagegeneration@006 will be removed on September 24, 2025. For more information about migrating to Imagen 3, see Migrate to Imagen 3.

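Visual Question Answering (VQA) lets you pass an image to the model and ask a question about the image's contents. In response to your question, the model returns one or more natural-language answers.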

Figure: Sample VQA image, question, and answers as shown in the console. Image source (shown in the Google Cloud console): Sharon Pittaway on Unsplash.

Prompt question: What is in the image?

Answer 1: marbles

Answer 2: glass marbles

Supported languages

VQA is available in the following languages:

English (en)

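Performance and limitations

The following limits apply when you use this model:

Limit	Value
Maximum number of API requests (short-form) per minute per project	500
Maximum number of tokens returned in a response (short-form)	64 tokens
Maximum number of tokens accepted in a request (VQA short-form only)	80 tokens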
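A client can stay under the per-minute request limit by pacing its own calls. The following is a minimal sketch of a sliding-window throttle, not part of any Google client library; the class name `RequestThrottle` and its methods are hypothetical, and the server-side quota remains the authoritative limit.

```python
import time
from collections import deque


class RequestThrottle:
    """Sliding-window limiter: at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests=500, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._sent = deque()  # timestamps of requests inside the current window

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The oldest request in the window determines when a slot frees up.
        return self.window_seconds - (now - self._sent[0])

    def record(self):
        """Record that a request was sent now."""
        self._sent.append(self._clock())
```

Before each API call, sleep for `wait_time()` seconds and then call `record()`. The throttle only tracks requests made by one process, so several workers sharing a project would need a shared counter instead.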

The following service latency estimates apply when you use this model. These values are illustrative and are not a promise of service.

Latency	Value
API requests (short-form)	1.5 seconds

Locations

A location is a region that you can specify in a request to control where data is stored at rest. For a list of available regions, see Generative AI on Vertex AI locations.

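Responsible AI safety filtering

The image captioning and Visual Question Answering (VQA) feature models don't support user-configurable safety filters. However, Imagen's overall safety filtering is applied to the following data:

User input
Model output

As a result, your output may differ from the sample output when Imagen applies these safety filters. Consider the following examples.

Filtered input

If the input is filtered, the response is similar to the following: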

{
  "error": {
    "code": 400,
    "message": "Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::invalid_argument: Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394 [google.rpc.error_details_ext] { message: \"Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394\" }"
      }
    ]
  }
}
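Client code may want to tell a safety-filter rejection apart from other 400 errors. The following is a minimal sketch that inspects an error body shaped like the sample above. The function name `is_filtered_input` is hypothetical, and matching on the message text is a heuristic assumption: the wording and error codes could change, so treat a match as a hint rather than a guarantee.

```python
import json


def is_filtered_input(response_body: str) -> bool:
    """Return True if an error body looks like the safety-filter rejection."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return False
    message = error.get("message", "")
    # Heuristic: the sample rejection is a 400 INVALID_ARGUMENT whose
    # message mentions a possible policy violation.
    return (
        error.get("code") == 400
        and error.get("status") == "INVALID_ARGUMENT"
        and "may violate our policies" in message
    )
```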

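Filtered output

If the number of responses returned is less than the sample count you specified, the missing responses were filtered by Responsible AI. For example, the following is the response to a request with "sampleCount": 2, where one of the two responses was filtered out: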

{
  "predictions": [
    "cappuccino"
  ]
}

If all of the output is filtered, the response is an empty object similar to the following:

{}
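The comparison described in this section can be sketched as a small helper that takes the parsed response and the sample count that was requested. The function name `count_filtered` is hypothetical; the sketch assumes the response has already been parsed into a Python dict.

```python
def count_filtered(response: dict, sample_count: int) -> int:
    """Return how many requested answers are missing from the response."""
    # A fully filtered response is an empty object, so "predictions" may be absent.
    predictions = response.get("predictions", [])
    return max(sample_count - len(predictions), 0)
```

For example, `count_filtered({"predictions": ["cappuccino"]}, 2)` gives 1, and `count_filtered({}, 2)` gives 2.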

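Use VQA on an image (short-form responses)

Use the following samples to ask a question about an image and get an answer.

REST

For more information about imagetext model requests, see the imagetext model API reference.

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
VQA_PROMPT: The question about your image that you want answered. For example:
- What color is this shoe?
- What type of sleeves are on the shirt?
B64_IMAGE: The image to ask the question about. The image must be specified as a base64-encoded byte string. Size limit: 10 MB.
RESPONSE_COUNT: The number of answers you want to generate. Accepted integer values: 1-3.

HTTP method and URL: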

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

Request JSON body:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT
  }
}
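The request body and URL above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names `build_vqa_request` and `predict_url` are hypothetical. It assumes the 10 MB limit is measured in units of 2**20 bytes and applies to the raw image bytes before encoding, which the text above doesn't specify.

```python
import base64

# Assumption: "10 MB" means 10 * 2**20 bytes, measured on the raw image.
MAX_IMAGE_BYTES = 10 * 1024 * 1024


def build_vqa_request(image_bytes: bytes, question: str, response_count: int = 1) -> dict:
    """Build the JSON body for an imagetext:predict VQA request."""
    if not 1 <= response_count <= 3:
        raise ValueError("response_count must be between 1 and 3")
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB size limit")
    # The API expects the image as a base64-encoded string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "instances": [
            {"prompt": question, "image": {"bytesBase64Encoded": encoded}}
        ],
        "parameters": {"sampleCount": response_count},
    }


def predict_url(project_id: str, location: str) -> str:
    """Fill PROJECT_ID and LOCATION into the predict endpoint URL."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/imagetext:predict"
    )
```

Serializing the result with `json.dumps` produces a body that can be saved as request.json.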

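To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or that you are using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: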

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"

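PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: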

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content

The following sample response is for a request with "sampleCount": 2 and "prompt": "What is this?". The response returns two prediction strings as answers.

{
  "predictions": [
    "cappuccino",
    "coffee"
  ]
}

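Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This sample uses the load_from_file method to reference a local file as the base Image to get information about. After you specify the base image, it calls the ask_question method on the ImageTextModel and prints the answers.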


import vertexai
from vertexai.preview.vision_models import Image, ImageTextModel

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
# input_file = "input-image.png"
# question = "" # The question about the contents of the image.

vertexai.init(project=PROJECT_ID, location="us-central1")

model = ImageTextModel.from_pretrained("imagetext@001")
source_img = Image.load_from_file(location=input_file)

answers = model.ask_question(
    image=source_img,
    question=question,
    # Optional parameters
    number_of_results=1,
)

print(answers)
# Example response:
# ['tabby']

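Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

This sample calls the predict method on a PredictionServiceClient. The service returns answers to the provided question.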

/**
 * TODO(developer): Update these variables before running the sample.
 */
const projectId = process.env.CAIP_PROJECT_ID;
const location = 'us-central1';
const inputFile = 'resources/cat.png';
// The question about the contents of the image.
const prompt = 'What breed of cat is this a picture of?';

const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction Service Client library
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: `${location}-aiplatform.googleapis.com`,
};

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function getShortFormImageResponses() {
  const fs = require('fs');
  // Configure the parent resource
  const endpoint = `projects/${projectId}/locations/${location}/publishers/google/models/imagetext@001`;

  const imageFile = fs.readFileSync(inputFile);
  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const instance = {
    prompt: prompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(instance);
  const instances = [instanceValue];

  const parameter = {
    // Optional parameters
    sampleCount: 2,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  const predictions = response.predictions;
  if (predictions.length === 0) {
    console.log(
      'No responses were generated. Check the request parameters and image.'
    );
  } else {
    predictions.forEach(prediction => {
      console.log(prediction.stringValue);
    });
  }
}
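// Top-level await is not available in a CommonJS module (this file uses
// require), so run the async function and report any error.
getShortFormImageResponses().catch(err => {
  console.error(err);
  process.exitCode = 1;
});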

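Use parameters for VQA

When you get VQA responses, you can set several parameters depending on your use case.

Number of results

Use the number of results parameter to limit the number of answers returned for each request you send. For more information, see the imagetext (VQA) model API reference.

Seed number

A number that you add to a request to make the generated answers deterministic. Adding a seed number to your request ensures that you get the same predictions (answers) each time. However, the answers aren't necessarily returned in the same order. For more information, see the imagetext (VQA) model API reference.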
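Because a seeded request can return the same answers in a different order, a reproducibility check should compare the answers without regard to order. The following is a minimal sketch. The helper names are hypothetical, and the parameter name `seed` is an assumption that isn't shown in the request samples on this page, so confirm it against the imagetext (VQA) model API reference.

```python
from collections import Counter


def vqa_parameters(sample_count: int, seed: int) -> dict:
    """Build the "parameters" object for a seeded VQA request."""
    # Assumption: the seed field is named "seed" in the REST API.
    return {"sampleCount": sample_count, "seed": seed}


def same_answers(first: list, second: list) -> bool:
    """Compare two answer lists while ignoring the order of the answers."""
    # Counter keeps duplicates, so ["a", "a"] and ["a"] are not equal.
    return Counter(first) == Counter(second)
```

With this helper, `same_answers(["cappuccino", "coffee"], ["coffee", "cappuccino"])` is treated as a match even though the order differs.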

What's next

Read articles about Imagen and other Generative AI on Vertex AI products.
