Visual captioning lets you generate a relevant description for an image. You can use this information in a variety of ways:
- Get more detailed metadata about images for storing and searching.
- Generate automated captioning to support accessibility use cases.
- Receive quick descriptions of products and visual assets.
Image source: Santhosh Kumar on Unsplash (cropped)
Caption (short-form): a blue shirt with white polka dots is hanging on a hook
Supported languages
Visual captioning is available in the following languages:
- English (`en`)
- French (`fr`)
- German (`de`)
- Italian (`it`)
- Spanish (`es`)
Performance and limitations
The following limits apply when you use this model:
| Limits | Value |
| --- | --- |
| Maximum number of API requests (short-form) per minute per project | 500 |
| Maximum number of tokens returned in response (short-form) | 64 tokens |
| Maximum number of tokens accepted in request (VQA short-form only) | 80 tokens |
The following service latency estimates apply when you use this model. These values are illustrative and are not a service commitment:
| Latency | Value |
| --- | --- |
| API requests (short-form) | 1.5 seconds |
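When captioning images in bulk, pacing requests on the client side helps you stay under the per-minute quota. A minimal sketch of a client-side throttle, assuming the 500-requests-per-minute limit above (the pacing strategy and class name are illustrative, not part of any Google Cloud library):

```python
import time

class RequestThrottle:
    """Paces calls so that no more than `limit` happen per `period` seconds."""

    def __init__(self, limit=500, period=60.0):
        self.min_interval = period / limit  # minimum seconds between requests
        self.last_call = 0.0

    def wait(self):
        """Sleep just long enough to respect the pacing, then record the call."""
        now = time.monotonic()
        delay = self.min_interval - (now - self.last_call)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()

throttle = RequestThrottle(limit=500, period=60.0)
# Call throttle.wait() before each predict request in a batch loop.
```

A token-bucket limiter would allow short bursts; this even-pacing version is simpler and still keeps the sustained rate under the quota.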
Locations
A location is a region you can specify in a request to control where data is stored at rest. For a list of available regions, see Generative AI on Vertex AI locations.
Get short-form image captions
Use the following samples to generate short-form captions for an image.
Console
1. In the Google Cloud console, open the Vertex AI Studio > Vision tab in the Vertex AI dashboard.
2. In the lower menu, click Caption.
3. Click Upload image to select your local image to caption.
4. In the Parameters panel, choose your Number of captions and Language.
5. Click Generate caption.
REST
For more information about `imagetext` model requests, see the `imagetext` model API reference.
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: Your project's region. For example, `us-central1`, `europe-west2`, or `asia-northeast3`. For a list of available regions, see Generative AI on Vertex AI locations.
- B64_IMAGE: The image to get captions for. The image must be specified as a base64-encoded byte string. Size limit: 10 MB.
- RESPONSE_COUNT: The number of image captions you want to generate. Accepted integer values: 1-3.
- LANGUAGE_CODE: One of the supported language codes:
  - English (`en`)
  - French (`fr`)
  - German (`de`)
  - Italian (`it`)
  - Spanish (`es`)
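The replacements above can also be assembled programmatically. A sketch in Python that builds the request body from a local image file (the function name and file path are placeholders for illustration):

```python
import base64
import json

def build_caption_request(image_path, response_count=2, language_code="en"):
    """Builds the imagetext:predict request body as a JSON string."""
    # B64_IMAGE: the image bytes, base64-encoded (size limit: 10 MB).
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "instances": [{"image": {"bytesBase64Encoded": b64_image}}],
        "parameters": {
            "sampleCount": response_count,   # RESPONSE_COUNT: 1-3
            "language": language_code,       # LANGUAGE_CODE: en, fr, de, it, es
        },
    }
    return json.dumps(body)
```

The returned string can be saved as `request.json` for the curl and PowerShell commands below, or sent directly as the POST body.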
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict
Request JSON body:
{
  "instances": [
    {
      "image": {
        "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT,
    "language": "LANGUAGE_CODE"
  }
}
To send your request, choose one of these options:
curl
Save the request body in a file named `request.json`, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"
PowerShell
Save the request body in a file named `request.json`, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict" | Select-Object -Expand Content
The following sample responses are for a request with `"sampleCount": 2`. The response returns two prediction strings.
English (`en`):
{
  "predictions": [
    "a yellow mug with a sheep on it sits next to a slice of cake",
    "a cup of coffee with a heart shaped latte art next to a slice of cake"
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID",
  "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID",
  "modelDisplayName": "MODEL_DISPLAYNAME",
  "modelVersionId": "1"
}
Spanish (`es`):
{
  "predictions": [
    "una taza de café junto a un plato de pastel de chocolate",
    "una taza de café con una forma de corazón en la espuma"
  ]
}
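The caption strings are returned under the top-level `predictions` key, so pulling them out of a decoded response is straightforward. A minimal sketch, using the Spanish sample response as a stand-in for a real API response:

```python
import json

# Stand-in for the JSON body returned by imagetext:predict.
response_json = """
{
  "predictions": [
    "una taza de café junto a un plato de pastel de chocolate",
    "una taza de café con una forma de corazón en la espuma"
  ]
}
"""

response = json.loads(response_json)
captions = response.get("predictions", [])
for i, caption in enumerate(captions, start=1):
    print(f"{i}. {caption}")
```

Using `.get("predictions", [])` rather than direct indexing keeps the loop safe if a response contains no predictions.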
Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
In this sample you use the `load_from_file` method to reference a local file as the base `Image` to get a caption for. After you specify the base image, you use the `get_captions` method on the `ImageTextModel` and print the output.
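The steps just described look roughly like the following sketch with the Vertex AI SDK for Python. The model version string `imagetext@001`, the project and location values, and the file path are assumptions for illustration; check the current SDK reference for the exact module path and parameters before relying on them:

```python
import vertexai
from vertexai.preview.vision_models import Image, ImageTextModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="PROJECT_ID", location="us-central1")

model = ImageTextModel.from_pretrained("imagetext@001")

# Reference a local file as the base image to get a caption for.
image = Image.load_from_file("my-image.png")

captions = model.get_captions(
    image=image,
    number_of_results=2,  # accepted values: 1-3
    language="en",        # one of: en, fr, de, it, es
)
print(captions)
```

Running this requires Application Default Credentials and a project with the Vertex AI API enabled, as described in the setup instructions above.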
Use parameters for image captioning
When you get image captions, there are several parameters you can set depending on your use case.
Number of results
Use the number of results parameter to limit the number of captions returned for each request you send. For more information, see the `imagetext` (image captioning) model API reference.
Seed number
A number you add to a request to make generated descriptions deterministic. Adding a seed number to your request is a way to help ensure you get the same predictions (descriptions) each time. However, the image captions aren't necessarily returned in the same order. For more information, see the `imagetext` (image captioning) model API reference.
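For example, a request body combining both parameters might look like the following sketch. The `seed` field name and its placement under `parameters` are assumptions based on the pattern of the other parameters; verify them against the current `imagetext` API reference:

```json
{
  "instances": [
    { "image": { "bytesBase64Encoded": "B64_IMAGE" } }
  ],
  "parameters": {
    "sampleCount": 2,
    "seed": 100,
    "language": "en"
  }
}
```

Repeating this request with the same image and the same seed should produce the same two captions, though possibly in a different order.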
What's next
- View videos describing Vertex AI foundation models including Imagen, the text-to-image foundation model that lets you generate and edit images:
- Read blog posts describing Imagen on Vertex AI and Generative AI on Vertex AI: