시각적 질의 응답(VQA)을 사용하여 이미지 정보 가져오기

주의: 2025년 6월 24일부터 Imagen 버전 1 및 2가 지원 중단됩니다. 2025년 9월 24일에 Imagen 모델 imagegeneration@002, imagegeneration@005, imagegeneration@006이 삭제됩니다. Imagen 3으로 마이그레이션하는 방법에 대한 자세한 내용은 Imagen 3으로 마이그레이션을 참조하세요.

시각적 질의 응답(VQA)을 사용하면 모델에 이미지를 제공하고 이미지 콘텐츠에 대해 질문할 수 있습니다. 질문에 대한 응답으로 자연어 답변을 하나 이상 받습니다.

콘솔의 샘플 VQA 이미지, 질문, 답변 — ^{이미지 소스( Google Cloud 콘솔에 표시): 샤론 피타웨이(출처: Unsplash)

프롬프트 질문: 이미지에 포함된 객체는 무엇인가요?

답변 1: 구슬

답변 2: 유리 구슬}

지원 언어

VQA는 다음 언어로 제공됩니다.

영어(en)

성능 및 제한사항

모델을 사용할 때 다음 한도가 적용됩니다.

한도	값
프로젝트별 분당 최대 API 요청 수(짧은 형식)	500
응답으로 반환되는 최대 토큰 수(짧은 형식)	토큰 64개
요청에 허용되는 최대 토큰 수(VQA 짧은 형식만)	토큰 80개

이 모델을 사용할 때 다음 서비스 지연 시간 추정치가 적용됩니다. 이러한 값은 설명을 위한 것이며 서비스가 보장되지 않습니다.

지연 시간	값
API 요청(짧은 형식)	1.5초

위치

위치는 데이터가 영구 저장되는 위치를 제어하기 위해 요청에서 지정할 수 있는 리전입니다. 사용 가능한 리전 목록은 Vertex AI의 생성형 AI 위치를 참조하세요.

책임감 있는 AI 안전 필터링

이미지 캡션 및 시각적 질의 응답(VQA) 기능 모델은 사용자가 구성할 수 있는 안전 필터를 지원하지 않습니다. 그러나 전반적인 Imagen 안전 필터링은 다음 데이터에 적용됩니다.

사용자 입력
모델 출력

따라서 Imagen에서 이러한 안전 필터를 적용하면 출력이 샘플 출력과 다를 수 있습니다. 다음 예시를 고려하세요.

필터링된 입력

입력이 필터링된 경우 응답은 다음과 비슷합니다.

{
  "error": {
    "code": 400,
    "message": "Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.DebugInfo",
        "detail": "[ORIGINAL ERROR] generic::invalid_argument: Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394 [google.rpc.error_details_ext] { message: \"Media reasoning failed with the following error: The response is blocked, as it may violate our policies. If you believe this is an error, please send feedback to your account team. Error Code: 63429089, 72817394\" }"
      }
    ]
  }
}

필터링된 출력

반환된 응답 수가 지정한 샘플 수보다 적으면 책임감 있는 AI에서 누락된 응답을 필터링한 것입니다. 예를 들어 다음은 "sampleCount": 2가 포함된 요청에 대한 응답이지만 응답 중 하나가 필터링됩니다.

{
  "predictions": [
    "cappuccino"
  ]
}

모든 출력이 필터링되면 응답은 다음과 유사한 빈 객체입니다.

{}

이미지에 VQA 사용(짧은 형식 응답)

다음 샘플을 사용하여 질문하고 이미지에 대한 답변을 얻습니다.

REST

imagetext 모델 요청에 대한 자세한 내용은 imagetext 모델 API 참조를 확인하세요.

요청 데이터를 사용하기 전에 다음을 바꿉니다.

PROJECT_ID: Google Cloud 프로젝트 ID
LOCATION: 프로젝트의 리전입니다. 예를 들면 us-central1, europe-west2, asia-northeast3입니다. 사용 가능한 리전 목록은 Vertex AI의 생성형 AI 위치를 참조하세요.
VQA_PROMPT: 이미지에 대한 답변을 받는 질문
- 이 신발은 무슨 색인가요?
- 셔츠의 소매 유형은 무엇인가요?
B64_IMAGE: 자막을 가져올 이미지입니다. 이미지는 base64 인코딩 바이트 문자열로 지정되어야 합니다. 크기 제한: 10MB
RESPONSE_COUNT: 생성하려는 답변의 수. 허용되는 정수 값: 1~3.

HTTP 메서드 및 URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

JSON 요청 본문:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT
  }
}

요청을 보내려면 다음 옵션 중 하나를 선택합니다.

curl

참고: 다음 명령어는 gcloud init 또는 gcloud auth login을 실행하거나 gcloud CLI에 자동으로 로그인하는 Cloud Shell을 사용하여 사용자 계정으로 gcloud CLI에 로그인했다고 가정합니다. gcloud auth list를 실행하면 현재 활성 계정을 확인할 수 있습니다.

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"

PowerShell

참고: 다음 명령어는 gcloud init 또는 gcloud auth login을 실행하여 사용자 계정으로 gcloud CLI에 로그인했다고 가정합니다. gcloud auth list를 실행하면 현재 활성 계정을 확인할 수 있습니다.

요청 본문을 request.json 파일에 저장하고 다음 명령어를 실행합니다.

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content

다음 샘플 응답은 "sampleCount": 2 및 "prompt": "What is this?"가 포함된 요청에 대한 응답입니다. 응답은 예측 문자열 답변 2개를 반환합니다.

{
  "predictions": [
    "cappuccino",
    "coffee"
  ]
}

Python

이 샘플을 사용해 보기 전에 Vertex AI 빠른 시작: 클라이언트 라이브러리 사용의 Python 설정 안내를 따르세요. 자세한 내용은 Vertex AI Python API 참고 문서를 참조하세요.

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

이 샘플에서는 load_from_file 메서드를 사용하여 로컬 파일을 기본 Image로 참조해 정보를 가져옵니다. 기본 이미지를 지정한 후 ImageTextModel에서 ask_question 메서드를 사용하고 답변을 출력합니다.


import vertexai
from vertexai.preview.vision_models import Image, ImageTextModel

# TODO(developer): Update and un-comment below lines
# PROJECT_ID = "your-project-id"
# input_file = "input-image.png"
# question = "" # The question about the contents of the image.

vertexai.init(project=PROJECT_ID, location="us-central1")

model = ImageTextModel.from_pretrained("imagetext@001")
source_img = Image.load_from_file(location=input_file)

answers = model.ask_question(
    image=source_img,
    question=question,
    # Optional parameters
    number_of_results=1,
)

print(answers)
# Example response:
# ['tabby']

Node.js

이 샘플을 사용해 보기 전에 Vertex AI 빠른 시작: 클라이언트 라이브러리 사용의 Node.js 설정 안내를 따르세요. 자세한 내용은 Vertex AI Node.js API 참고 문서를 참조하세요.

Vertex AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

이 샘플에서는 PredictionServiceClient에서 predict 메서드를 호출합니다. 서비스는 제공된 질문에 대한 답변을 반환합니다.

/**
 * TODO(developer): Update these variables before running the sample.
 */
const projectId = process.env.CAIP_PROJECT_ID;
const location = 'us-central1';
const inputFile = 'resources/cat.png';
// The question about the contents of the image.
const prompt = 'What breed of cat is this a picture of?';

const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction Service Client library
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: `${location}-aiplatform.googleapis.com`,
};

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function getShortFormImageResponses() {
  const fs = require('fs');
  // Configure the parent resource
  const endpoint = `projects/${projectId}/locations/${location}/publishers/google/models/imagetext@001`;

  const imageFile = fs.readFileSync(inputFile);
  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const instance = {
    prompt: prompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(instance);
  const instances = [instanceValue];

  const parameter = {
    // Optional parameters
    sampleCount: 2,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  const predictions = response.predictions;
  if (predictions.length === 0) {
    console.log(
      'No responses were generated. Check the request parameters and image.'
    );
  } else {
    predictions.forEach(prediction => {
      console.log(prediction.stringValue);
    });
  }
}
await getShortFormImageResponses();

VQA 매개변수 사용

VQA 응답을 가져오면 사용 사례에 따라 여러 매개변수를 설정할 수 있습니다.

알림당 결과 수

결과 수 매개변수를 사용하여 전송하는 요청마다 반환되는 응답 양을 제한합니다. 자세한 내용은 imagetext(VQA) 모델 API 참조를 확인하세요.

시드 번호

생성된 응답을 확정하도록 요청에 추가하는 숫자입니다. 요청에 시드 번호를 추가하면 매번 같은 예측(응답)을 가져올 수 있습니다. 그러나 응답이 반드시 동일한 순서로 반환되지 않습니다. 자세한 내용은 imagetext(VQA) 모델 API 참조를 확인하세요.

다음 단계

Imagen 및 Vertex AI의 기타 생성형 AI 제품 관련 문서 읽기:

시각적 질의 응답(VQA)을 사용하여 이미지 정보 가져오기 컬렉션을 사용해 정리하기 내 환경설정을 기준으로 콘텐츠를 저장하고 분류하세요.

지원 언어

성능 및 제한사항

위치

책임감 있는 AI 안전 필터링

필터링된 입력

필터링된 출력

이미지에 VQA 사용(짧은 형식 응답)

REST

curl

PowerShell

Python

Node.js

VQA 매개변수 사용

알림당 결과 수

시드 번호

다음 단계

시각적 질의 응답(VQA)을 사용하여 이미지 정보 가져오기