Visual question and answering (VQA)

Imagen for Captioning & VQA (imagetext) is the model that supports visual question answering. Given an image and a question about it, the model returns an answer, even for images it has not seen before.

To explore this model in the console, see the Imagen for Captioning & VQA model card in the Model Garden.

Use cases

Some common use cases for image question and answering include:

- Empower users to engage with visual content through Q&A.
- Enable customers to engage with product images shown on retail apps and websites.
- Provide accessibility options for visually impaired users.

HTTP request

POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/imagetext:predict

Request body

{
  "instances": [
    {
      "prompt": string,
      "image": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "mimeType": string
      }
    }
  ],
  "parameters": {
    "sampleCount": integer,
    "seed": integer
  }
}

Use the following parameters for the visual Q&A generation model imagetext. For more information, see Use Visual Question Answering (VQA).

Parameter	Description	Acceptable values
`instances`	An array that contains one object with the prompt and the image to get information about.	array (1 image object allowed)
`prompt`	The question you want to get answered about your image.	string (80 tokens max)
`bytesBase64Encoded`	The image to get information about.	Base64-encoded image string (PNG or JPEG, 20 MB max)
`gcsUri`	The Cloud Storage URI of the image to get information about.	string URI of the image file in Cloud Storage (PNG or JPEG, 20 MB max)
`mimeType`	Optional. The MIME type of the image you specify.	string (`image/jpeg` or `image/png`)
`sampleCount`	Number of generated text strings.	Int value: 1-3
`seed`	Optional. The seed for the random number generator (RNG). If the RNG seed is the same for requests with the same inputs, the prediction results are the same.	integer
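
The constraints in the table can be checked on the client before a request is sent. The following Python sketch is illustrative only: the helper name `build_vqa_request` and its validation behavior are assumptions and not part of the API. The field names and limits are taken from the table above.

```python
from typing import Any, Dict, Optional


def build_vqa_request(
    prompt: str,
    *,
    image_b64: Optional[str] = None,
    gcs_uri: Optional[str] = None,
    mime_type: Optional[str] = None,
    sample_count: int = 1,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a request body for imagetext:predict (hypothetical helper)."""
    # The image is a union field: exactly one of the two sources may be set.
    if (image_b64 is None) == (gcs_uri is None):
        raise ValueError("Provide exactly one of image_b64 or gcs_uri")
    # sampleCount accepts integer values from 1 to 3.
    if not 1 <= sample_count <= 3:
        raise ValueError("sample_count must be between 1 and 3")
    if mime_type is not None and mime_type not in ("image/jpeg", "image/png"):
        raise ValueError("mime_type must be image/jpeg or image/png")

    image: Dict[str, Any] = {}
    if image_b64 is not None:
        image["bytesBase64Encoded"] = image_b64
    else:
        image["gcsUri"] = gcs_uri
    if mime_type is not None:
        image["mimeType"] = mime_type

    # seed is optional, so it is only included when the caller sets it.
    parameters: Dict[str, Any] = {"sampleCount": sample_count}
    if seed is not None:
        parameters["seed"] = seed

    # "instances" holds a single object because only one image is allowed.
    return {
        "instances": [{"prompt": prompt, "image": image}],
        "parameters": parameters,
    }
```

Rejecting invalid combinations locally avoids a round trip for errors that the table already rules out. The 80-token prompt limit is not checked here because counting tokens requires the model's tokenizer.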

Sample request

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
VQA_PROMPT: The question you want to get answered about your image.
- What color is this shoe?
- What type of sleeves are on the shirt?
B64_IMAGE: The image to get information about. The image must be specified as a base64-encoded byte string. Size limit: 10 MB.
RESPONSE_COUNT: The number of answers you want to generate. Accepted integer values: 1-3.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

Request JSON body:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
          "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT
  }
}
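
One way to produce this request body from a local image file is sketched below in Python. The helper name `write_request_file` and its arguments are hypothetical; the sketch only base64-encodes the file and fills in the VQA_PROMPT, B64_IMAGE, and RESPONSE_COUNT placeholders shown above.

```python
import base64
import json
from pathlib import Path


def write_request_file(image_path, prompt, response_count, out_path="request.json"):
    """Write a request body for a local image (hypothetical helper)."""
    # B64_IMAGE: the raw image bytes, base64-encoded and stored as text.
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    body = {
        "instances": [
            {
                "prompt": prompt,  # VQA_PROMPT
                "image": {"bytesBase64Encoded": image_b64},
            }
        ],
        "parameters": {"sampleCount": response_count},  # RESPONSE_COUNT
    }
    Path(out_path).write_text(json.dumps(body, indent=2), encoding="utf-8")
    return body
```

For example, calling `write_request_file("shoe.png", "What color is this shoe?", 2)` would write a `request.json` file in the current directory, assuming a file named `shoe.png` exists there.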

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content
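
The same request can also be sent from a script. The Python sketch below mirrors the curl command using only the standard library; the function names `predict_url` and `send_vqa_request` are hypothetical, and the sketch assumes the gcloud CLI is installed and authenticated as described in the notes above.

```python
import json
import subprocess
import urllib.request


def predict_url(project_id: str, location: str) -> str:
    """Return the imagetext:predict endpoint for a project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/imagetext:predict"
    )


def send_vqa_request(project_id: str, location: str,
                     request_path: str = "request.json") -> dict:
    """POST the saved request body and return the decoded JSON response."""
    # Same token source as the curl and PowerShell commands.
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    with open(request_path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        predict_url(project_id, location),
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send_vqa_request("PROJECT_ID", "LOCATION")` with your own values would read `request.json` from the current directory and return the parsed response.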

The following sample response is for a request with "sampleCount": 2 and "prompt": "What is this?". The response returns two prediction string answers.

{
  "predictions": [
    "cappuccino",
    "coffee"
  ]
}

Response body


{
  "predictions": [
    string
  ]
}

Response element	Description
`predictions`	A list of text strings representing VQA answers, sorted by confidence.

Sample response

The following sample response is for a request with "sampleCount": 2 and "prompt": "What is this?". The response returns two prediction string answers.

{
  "predictions": [
    "cappuccino",
    "coffee"
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID",
  "model": "projects/PROJECT_ID/locations/us-central1/models/MODEL_ID",
  "modelDisplayName": "MODEL_DISPLAYNAME",
  "modelVersionId": "1"
}
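
A response like this can be reduced to its answers with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and it assumes that "sorted by confidence" means the most confident answer comes first.

```python
import json
from typing import List, Optional


def extract_answers(response_text: str) -> List[str]:
    """Return the predictions list from a raw JSON response string."""
    payload = json.loads(response_text)
    # Metadata fields such as "model" and "modelVersionId" are ignored.
    return list(payload.get("predictions", []))


def top_answer(response_text: str) -> Optional[str]:
    """Return the first prediction, or None if there are no predictions.

    Assumption: the first entry is the most confident answer.
    """
    answers = extract_answers(response_text)
    return answers[0] if answers else None
```

Using `payload.get` with a default keeps the helpers from raising a `KeyError` if a response contains no `predictions` field.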
