imagetext is the name of the model that supports image captioning. imagetext generates a caption from an image you provide based on the language that you specify. The model supports the following languages: English (en), German (de), French (fr), Spanish (es), and Italian (it).
To explore this model in the console, see the Image Captioning model card in the Model Garden.
Use cases
Some common use cases for image captioning include:
- Creators can generate captions for uploaded images and videos (for example, a short description of a video sequence)
- Generate captions to describe products
- Integrate captioning with an app using the API to create new experiences
HTTP request
POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/imagetext:predict
Request body
{
  "instances": [
    {
      "image": {
        // Union field can be only one of the following:
        "bytesBase64Encoded": string,
        "gcsUri": string,
        // End of list of possible types for union field.
        "mimeType": string
      }
    }
  ],
  "parameters": {
    "sampleCount": integer,
    "storageUri": string,
    "language": string,
    "seed": integer
  }
}
Use the following parameters for the Imagen model imagetext. For more information, see Get image descriptions using visual captioning.
Parameter | Description | Acceptable values |
---|---|---|
instances | An array that contains one object with details about the image to caption. | array (1 image object allowed) |
bytesBase64Encoded | The image to caption. | Base64-encoded image string (PNG or JPEG, 20 MB max) |
gcsUri | The Cloud Storage URI of the image to caption. | string URI of the image file in Cloud Storage (PNG or JPEG, 20 MB max) |
mimeType | Optional. The MIME type of the image you specify. | string (image/jpeg or image/png) |
sampleCount | Number of generated text strings. | Int value: 1-3 |
seed | Optional. The seed for the random number generator (RNG). If the RNG seed is the same for requests with the same inputs, the prediction results are the same. | integer |
storageUri | Optional. The Cloud Storage location to save the generated text responses. | string |
language | Optional. The language code of the generated captions. | string: en (default), de, fr, it, es |
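For illustration, the following Python sketch builds a request body that matches this schema. It reads a local image file, base64-encodes it, and sets sampleCount and language; the file path cat.png and the chosen parameter values are placeholder assumptions, not values required by the API.

import base64
import json

def build_imagetext_request(image_path: str, sample_count: int = 2, language: str = "en") -> dict:
    # Build a visual captioning request body for the imagetext model.
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    return {
        "instances": [
            {
                "image": {
                    # The image must be base64-encoded (PNG or JPEG, 20 MB max).
                    "bytesBase64Encoded": base64.b64encode(image_bytes).decode("utf-8")
                }
            }
        ],
        "parameters": {
            "sampleCount": sample_count,  # 1-3 generated captions
            "language": language,         # en (default), de, fr, it, es
        },
    }

if __name__ == "__main__":
    # Write the body to request.json so it can be used with the samples below.
    body = build_imagetext_request("cat.png", sample_count=2, language="en")
    with open("request.json", "w") as f:
        json.dump(body, f)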
Sample request
REST
To test image captioning by using the Vertex AI API, send a POST request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
- B64_IMAGE: The image to get captions for. The image must be specified as a base64-encoded byte string. Size limit: 10 MB.
- RESPONSE_COUNT: The number of image captions you want to generate. Accepted integer values: 1-3.
- LANGUAGE_CODE: One of the supported language codes. Languages supported:
  - English (en)
  - French (fr)
  - German (de)
  - Italian (it)
  - Spanish (es)
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict
Request JSON body:
{ "instances": [ { "image": { "bytesBase64Encoded": "B64_IMAGE" } } ], "parameters": { "sampleCount": RESPONSE_COUNT, "language": "LANGUAGE_CODE" } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content
"sampleCount": 2
. The response returns two prediction strings.
English (en):
{
  "predictions": [
    "a yellow mug with a sheep on it sits next to a slice of cake",
    "a cup of coffee with a heart shaped latte art next to a slice of cake"
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID",
  "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID",
  "modelDisplayName": "MODEL_DISPLAYNAME",
  "modelVersionId": "1"
}
Spanish (es):
{
  "predictions": [
    "una taza de café junto a un plato de pastel de chocolate",
    "una taza de café con una forma de corazón en la espuma"
  ]
}
Response body
{
"predictions": [ string ]
}
Response element | Description |
---|---|
predictions | List of text strings representing captions, sorted by confidence. |
Sample response
{
  "predictions": [
    "text1",
    "text2"
  ]
}
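As a small illustration of handling this response shape, the Python snippet below pulls the caption list out of a parsed response dictionary. It assumes the response has already been decoded from JSON (for example, with response.json() in the sketch above), and the sample data mirrors the sample response.

def extract_captions(response_body: dict) -> list[str]:
    # Captions are returned in the "predictions" list, sorted by confidence.
    return list(response_body.get("predictions", []))

sample = {"predictions": ["text1", "text2"]}
for rank, caption in enumerate(extract_captions(sample), start=1):
    print(f"{rank}. {caption}")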