Cloud Vision API Requests and Responses

This document provides a guide to Google Cloud Vision API REST requests and responses. We recommend that you read Getting Started before going through this material. For quick tutorials on performing Cloud Vision API tasks, consult the How-To Guides.

POST Requests

The Cloud Vision API is an easy-to-use REST API that uses HTTP POST operations to perform data analysis on images you send in the request. The API uses JSON for both requests and responses. A typical Vision API JSON request includes the contents of the image(s) on which to perform detection and a set of operations (called features) to run against each image.

Currently, the Vision API consists of one collection (images), which supports one HTTP request method (annotate):

POST https://vision.googleapis.com/v1/images:annotate

(For full reference documentation, consult the Vision API Reference; other collections and methods can be added in the future.)

This POST request must be authenticated by passing either an API key or an OAuth token with the POST request (for example, within the request header). The API uses a POST request (instead of a GET request) so that request parameters can be passed within the request payload rather than embedded in the request URL. The POST request and response JSON formats are explained below.
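As a minimal illustration (not an official client-library example), the following Python sketch posts a previously assembled request body to the annotate endpoint; the vision.json filename and the API key value are assumptions for this example:

import requests  # third-party HTTP library, also used later in this guide

API_KEY = 'your-api-key'  # assumption: replace with a real API key
ENDPOINT = 'https://vision.googleapis.com/v1/images:annotate'

# Read a request body assembled as described under "JSON Request Format" below.
with open('vision.json', 'rb') as f:
    body = f.read()

# API-key authentication: the key is passed as a query parameter.
resp = requests.post(ENDPOINT, params={'key': API_KEY}, data=body,
                     headers={'Content-Type': 'application/json'})

# OAuth authentication would instead send a Bearer token in the request header:
#   headers={'Content-Type': 'application/json',
#            'Authorization': 'Bearer ' + access_token}

print(resp.status_code)
print(resp.text)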

JSON Request Format

The annotate request passes a JSON request of type AnnotateImageRequest. An example is shown below:

{
  "requests":[
    {
      "image":{
        "content":"/9j/7QBEUGhvdG9...image contents...eYxxxzj/Coa6Bax//Z"
      },
      "features":[
        {
          "type":"LABEL_DETECTION",
          "maxResults":1
        }
      ]
    }
  ]
}

These fields are summarized below. For full documentation on AnnotateImageRequest fields, consult the Vision API Reference.

  • requests - An array of requests, one for each image.
    • image - The image data for this request. The image field must contain either a child content field or a child source field.
      • content - The contents of the image, provided as a base64-encoded string (see the encoding sketch after this list).
      • source - The image source for this request.
        • gcs_image_uri - The Google Cloud Storage URI of the specific image to use for this request. Note that wildcards are not supported.
    • features - The array of features to detect for this image.
      • type - The feature type (see Types of Vision API Requests).
      • maxResults - The maximum number of results to return for this feature type. The API can return fewer results.
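For illustration, the following Python sketch assembles the single-image request shown above; the local image path is an assumption, and the base64 encoding matches the approach used by the script in Generating JSON Requests, below.

import base64
import json

# Assumption: a local image at this path; replace with your own file.
with open('image.jpg', 'rb') as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode('UTF-8')

request = {
    'requests': [{
        'image': {'content': encoded_image},
        # To reference a Cloud Storage object instead of inline content, use:
        # 'image': {'source': {'gcs_image_uri': 'gs://bucket_name/path_to_image_object'}},
        'features': [{'type': 'LABEL_DETECTION', 'maxResults': 1}],
    }]
}

print(json.dumps(request, indent=2))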

You can provide links to images stored within Google Cloud Storage. Such images must be accessible using whatever credentials you're using for the Vision API request. For example:

  • Accessible by you, the user, if accessed using OAuth user credentials
  • Accessible by your vision API service account, if accessed through an application using OAuth service account credentials
  • Publicly readable

A sample AnnotateImageRequest using a Cloud Storage URI is shown below:

{
  "requests":[
    {
      "image":{
        "source":{
          "gcs_image_uri":
            "gs://bucket_name/path_to_image_object"
        }
      },
      "features":[
        {
          "type":"LABEL_DETECTION",
          "maxResults":1
        }
      ]
    }
  ]
}

Types of Vision API Requests

For each image within a Vision API request, you can specify one or more features to detect, passing those within the features array for each image. This set of types is listed below:

Feature Type              Description
LABEL_DETECTION           Execute image content analysis on the entire image and return labels
TEXT_DETECTION            Perform Optical Character Recognition (OCR) on text within the image (a character limit applies)
DOCUMENT_TEXT_DETECTION   Perform Optical Character Recognition (OCR) on a dense text image (premium feature, not subject to the character limit)
FACE_DETECTION            Detect faces within the image
LANDMARK_DETECTION        Detect geographic landmarks within the image
LOGO_DETECTION            Detect company logos within the image
SAFE_SEARCH_DETECTION     Determine Safe Search properties of the image
IMAGE_PROPERTIES          Compute a set of properties about the image (such as the image's dominant colors)

You can perform multiple feature detections on an image by enumerating each feature to run within the features array:

{
  "requests":[
    {
      "image":{
        "content":"/9j/7QBEUGhvdG9zaG9...image contents...fXNWzvDEeYxxxzj/Coa6Bax//Z"
      },
      "features":[
        {
          "type":"FACE_DETECTION",
          "maxResults":10
        },
        {
          "type":"LABEL_DETECTION",
          "maxResults":10
        }
      ]
    }
  ]
}

JSON Response Format

The annotate request receives a JSON response of type AnnotateImageResponse. Although the requests are similar for each feature type, the responses for each feature type can be quite different. Consult the Vision API Reference for complete information.

LABEL_DETECTION Responses

A LABEL_DETECTION request produces a response containing a set of labelAnnotations of type EntityAnnotation.

The following example shows a sample label detection response containing the top five matches for a photo of a dog:

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/0bt9lr",
          "description": "dog",
          "score": 0.97346616
        },
        {
          "mid": "/m/09686",
          "description": "vertebrate",
          "score": 0.85700572
        },
        {
          "mid": "/m/01pm38",
          "description": "clumber spaniel",
          "score": 0.84881884
        },
        {
          "mid": "/m/04rky",
          "description": "mammal",
          "score": 0.847575
        },
        {
          "mid": "/m/02wbgd",
          "description": "english cocker spaniel",
          "score": 0.75829375
        }
      ]
    }
  ]
}
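As a sketch of how a client might read this structure (assuming the parsed JSON shown above is held in a Python dict named response, for example via response = resp.json() when using the requests library), the labels can be iterated like this:

# Assumption: `response` holds the parsed JSON shown above.
for result in response['responses']:
    for label in result.get('labelAnnotations', []):
        # Each entry is an EntityAnnotation: a Knowledge Graph mid, a
        # human-readable description, and a confidence score.
        print('%s (mid=%s, score=%.2f)' % (
            label['description'], label['mid'], label['score']))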

FACE_DETECTION Responses

A FACE_DETECTION request produces a response containing a set of faceAnnotations of type FaceAnnotation.

The following example shows a sample face detection response for a photo containing a single face:

{
  "responses":[
    {
      "faceAnnotations":[
        {
          "boundingPoly":{
            "vertices":[
              {
                "x":1916,
                "y":870
              },
              {
                "x":2106,
                "y":870
              },
              {
                "x":2106,
                "y":1091
              },
              {
                "x":1916,
                "y":1091
              }
            ]
          },
          "fdBoundingPoly":{
            "vertices":[
              {
                "x":1923,
                "y":910
              },
              {
                "x":2081,
                "y":910
              },
              {
                "x":2081,
                "y":1068
              },
              {
                "x":1923,
                "y":1068
              }
            ]
          },
          "landmarks":[
            {
              "type":"LEFT_EYE",
              "position":{
                "x":1969.4862,
                "y":955.17334,
                "z":-0.0016533856
              }
            },
            {
              "type":"RIGHT_EYE",
              "position":{
                "x":2019.262,
                "y":967.91278,
                "z":-28.925787
              }
            },
            {
              "type":"LEFT_OF_LEFT_EYEBROW",
              "position":{
                "x":1959.3801,
                "y":939.696,
                "z":14.981886
              }
            },
            {
              "type":"RIGHT_OF_LEFT_EYEBROW",
              "position":{
                "x":1980.2725,
                "y":943.3717,
                "z":-15.975462
              }
            },
            {
              "type":"LEFT_OF_RIGHT_EYEBROW",
              "position":{
                "x":2003.1469,
                "y":948.81323,
                "z":-29.651102
              }
            },
            {
              "type":"RIGHT_OF_RIGHT_EYEBROW",
              "position":{
                "x":2040.1477,
                "y":961.9339,
                "z":-32.134441
              }
            },
            {
              "type":"MIDPOINT_BETWEEN_EYES",
              "position":{
                "x":1987.4386,
                "y":956.79248,
                "z":-24.352777
              }
            },
            {
              "type":"NOSE_TIP",
              "position":{
                "x":1969.6227,
                "y":985.49719,
                "z":-41.193481
              }
            },
            {
              "type":"UPPER_LIP",
              "position":{
                "x":1972.1095,
                "y":1007.2608,
                "z":-30.672895
              }
            },
            {
              "type":"LOWER_LIP",
              "position":{
                "x":1968.515,
                "y":1027.6235,
                "z":-28.315508
              }
            },
            {
              "type":"MOUTH_LEFT",
              "position":{
                "x":1957.8792,
                "y":1013.6796,
                "z":-6.6342912
              }
            },
            {
              "type":"MOUTH_RIGHT",
              "position":{
                "x":1998.7747,
                "y":1022.9999,
                "z":-28.734522
              }
            },
            {
              "type":"MOUTH_CENTER",
              "position":{
                "x":1971.396,
                "y":1017.4032,
                "z":-27.534792
              }
            },
            {
              "type":"NOSE_BOTTOM_RIGHT",
              "position":{
                "x":1993.8416,
                "y":995.19,
                "z":-29.759504
              }
            },
            {
              "type":"NOSE_BOTTOM_LEFT",
              "position":{
                "x":1965.5908,
                "y":989.42383,
                "z":-13.663703
              }
            },
            {
              "type":"NOSE_BOTTOM_CENTER",
              "position":{
                "x":1974.8154,
                "y":995.68555,
                "z":-30.112482
              }
            },
            {
              "type":"LEFT_EYE_TOP_BOUNDARY",
              "position":{
                "x":1968.6737,
                "y":950.9704,
                "z":-3.0559144
              }
            },
            {
              "type":"LEFT_EYE_RIGHT_CORNER",
              "position":{
                "x":1978.8079,
                "y":958.23712,
                "z":-5.4053364
              }
            },
            {
              "type":"LEFT_EYE_BOTTOM_BOUNDARY",
              "position":{
                "x":1967.8793,
                "y":959.22345,
                "z":-0.62461489
              }
            },
            {
              "type":"LEFT_EYE_LEFT_CORNER",
              "position":{
                "x":1962.1622,
                "y":954.26093,
                "z":10.204804
              }
            },
            {
              "type":"LEFT_EYE_PUPIL",
              "position":{
                "x":1967.9233,
                "y":954.9704,
                "z":-0.77994776
              }
            },
            {
              "type":"RIGHT_EYE_TOP_BOUNDARY",
              "position":{
                "x":2016.6268,
                "y":962.88623,
                "z":-31.205936
              }
            },
            {
              "type":"RIGHT_EYE_RIGHT_CORNER",
              "position":{
                "x":2029.2314,
                "y":970.985,
                "z":-29.216293
              }
            },
            {
              "type":"RIGHT_EYE_BOTTOM_BOUNDARY",
              "position":{
                "x":2017.429,
                "y":972.17621,
                "z":-28.954475
              }
            },
            {
              "type":"RIGHT_EYE_LEFT_CORNER",
              "position":{
                "x":2007.4708,
                "y":965.36237,
                "z":-22.286636
              }
            },
            {
              "type":"RIGHT_EYE_PUPIL",
              "position":{
                "x":2017.0439,
                "y":967.18329,
                "z":-29.732374
              }
            },
            {
              "type":"LEFT_EYEBROW_UPPER_MIDPOINT",
              "position":{
                "x":1969.7963,
                "y":934.11523,
                "z":-3.3017645
              }
            },
            {
              "type":"RIGHT_EYEBROW_UPPER_MIDPOINT",
              "position":{
                "x":2021.7909,
                "y":947.04419,
                "z":-33.841984
              }
            },
            {
              "type":"LEFT_EAR_TRAGION",
              "position":{
                "x":1963.6063,
                "y":987.89252,
                "z":77.398705
              }
            },
            {
              "type":"RIGHT_EAR_TRAGION",
              "position":{
                "x":2075.2998,
                "y":1016.2071,
                "z":13.859237
              }
            },
            {
              "type":"FOREHEAD_GLABELLA",
              "position":{
                "x":1991.0243,
                "y":945.11224,
                "z":-24.655386
              }
            },
            {
              "type":"CHIN_GNATHION",
              "position":{
                "x":1964.3625,
                "y":1055.4045,
                "z":-23.147352
              }
            },
            {
              "type":"CHIN_LEFT_GONION",
              "position":{
                "x":1948.226,
                "y":1019.5986,
                "z":52.048538
              }
            },
            {
              "type":"CHIN_RIGHT_GONION",
              "position":{
                "x":2046.8456,
                "y":1044.8068,
                "z":-6.1001
              }
            }
          ],
          "rollAngle":16.066454,
          "panAngle":-29.752207,
          "tiltAngle":3.7352962,
          "detectionConfidence":0.98736823,
          "landmarkingConfidence":0.57041687,
          "joyLikelihood":0.90647823,
          "sorrowLikelihood":4.1928422e-05,
          "angerLikelihood":0.00033951481,
          "surpriseLikelihood":0.0024809798,
          "underExposedLikelihood":3.5745124e-06,
          "blurredLikelihood":0.00038755304,
          "headwearLikelihood":1.1718362e-05
        }
      ]
    }
  ]
}
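As a sketch (again assuming the parsed response shown above is held in a Python dict named response), the face bounding box, landmarks, and detection confidence can be read like this:

# Assumption: `response` holds the parsed JSON shown above.
for result in response['responses']:
    for face in result.get('faceAnnotations', []):
        vertices = face['boundingPoly']['vertices']
        print('face box: (%d, %d) to (%d, %d), confidence %.2f' % (
            vertices[0]['x'], vertices[0]['y'],
            vertices[2]['x'], vertices[2]['y'],
            face['detectionConfidence']))
        # Landmarks arrive as a list of {type, position} entries;
        # index them by type for convenient lookup.
        landmarks = {lm['type']: lm['position'] for lm in face['landmarks']}
        print('left eye position: %s' % landmarks['LEFT_EYE'])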

Generating JSON Requests

The examples in this section use a Python script to construct Cloud Vision AnnotateImageRequests.

import argparse
import base64
import json
import sys


def main(input_file, output_filename):
    """Translates the input file into a json output file.

    Args:
        input_file: a file object, containing lines of input to convert.
        output_filename: the name of the file to output the json to.
    """
    request_list = []
    for line in input_file:
        image_filename, features = line.strip().split(' ', 1)

        with open(image_filename, 'rb') as image_file:
            content_json_obj = {
                'content': base64.b64encode(image_file.read()).decode('UTF-8')
            }

        feature_json_obj = []
        for word in features.split(' '):
            feature, max_results = word.split(':', 1)
            # Echo a summary of the requested annotation to the console.
            print('detect:%s' % get_detection_type(feature))
            print('results: %s' % max_results)
            feature_json_obj.append({
                'type': get_detection_type(feature),
                'maxResults': int(max_results),
            })

        request_list.append({
            'features': feature_json_obj,
            'image': content_json_obj,
        })

    with open(output_filename, 'w') as output_file:
        json.dump({'requests': request_list}, output_file)


DETECTION_TYPES = [
    'TYPE_UNSPECIFIED',
    'FACE_DETECTION',
    'LANDMARK_DETECTION',
    'LOGO_DETECTION',
    'LABEL_DETECTION',
    'TEXT_DETECTION',
    'SAFE_SEARCH_DETECTION',
]


def get_detection_type(detect_num):
    """Return the Vision API symbol corresponding to the given number."""
    detect_num = int(detect_num)
    if 0 < detect_num < len(DETECTION_TYPES):
        return DETECTION_TYPES[detect_num]
    else:
        return DETECTION_TYPES[0]
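

# Command-line entry point (a minimal sketch, not part of the original
# listing) so the script can be invoked as shown below:
#     python generatejson.py -i inputfile -o outputfile
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Generate a Cloud Vision API JSON request file.')
    parser.add_argument(
        '-i', dest='input_file', required=True, type=argparse.FileType('r'),
        help='Input text file: an image path and feature:max_results pairs per line.')
    parser.add_argument(
        '-o', dest='output_filename', required=True,
        help='Filename for the generated JSON request.')
    args = parser.parse_args()
    main(args.input_file, args.output_filename)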

The Python script reads an input text file that specifies the feature detection to perform on a set of images. Each line in the input file contains the path to an image followed by one or more feature specifiers of the form feature:max_results. Features are specified by an integer value from 1 to 6, mapped to feature types as shown in the DETECTION_TYPES list and the get_detection_type() function above.

For example, the following input file to the script requests face and label detection annotations for image1, and landmark and logo detection annotations for image2, each with a maximum of 10 results per annotation.

filepath_to_image1.jpg 1:10 4:10
filepath_to_image2.png 2:10 3:10

You can download the script and run it, specifying the input and output files as follows:

python generatejson.py -i inputfile -o outputfile

The next subsections show examples that use the script to generate Cloud Vision AnnotateImageRequests for sample images, use curl and Python to send the requests, then display the AnnotateImageResponses returned by Cloud Vision.

Using Curl to Send Generated Requests

This example asks Cloud Vision to annotate an image of a desk with a television (desktv.jpg):

The input file to the script, visioninfile.txt, lists the image file path and specifies three feature annotations for the image: landmark (2), logo (3), and label (4). A maximum of 10 results is specified for each annotation.

/Users/username/testdata/desktv.jpg 2:10 3:10 4:10

The script is executed with input and output file arguments; the output is written to vision.json. Note: to aid readability, line-continuation "\" characters are used below to split the command across several lines; when executing the command, supply the command and all of its arguments on a single line.

python generatejson.py \
    -i /Users/username/testdata/visioninfile.txt \
    -o /Users/username/testdata/vision.json

The script processes the input file and displays a summary of the requested feature annotations for the image on the console, while it writes the assembled request to vision.json:

/Users/username/testdata/desktv.jpg 2:10 3:10 4:10
detect:LANDMARK_DETECTION
results: 10
detect:LOGO_DETECTION
results: 10
detect:LABEL_DETECTION
results: 10

The resulting JSON content is truncated and formatted, below:

{
   "requests": [
      {
         "image": {
            "content": "/9j/4AAQS...0LSP//Z"
         },
         "features": [
            {
               "type": "LANDMARK_DETECTION",
               "maxResults": "10"
            },
            {
               "type": "LOGO_DETECTION",
               "maxResults": "10"
            },
            {
               "type": "LABEL_DETECTION",
               "maxResults": "10"
            }
         ]
      }
   ]
}

curl is used to send the request to Cloud Vision:

curl -v -k -s -H "Content-Type: application/json" \
    https://vision.googleapis.com/v1/images:annotate?key=API-key \
    --data-binary @/Users/username/testdata/vision.json

The Cloud Vision API returns the response. Because no landmarks or logos are detected in the image, the landmarkAnnotations and logoAnnotations fields are absent from the response. Note that the multiple label annotations in the response are sorted from highest to lowest confidence score.

 ...
 {
   "responses": [
     {
       "labelAnnotations": [
         {
           "mid": "/m/0c_jw",
           "description": "furniture",
           "score": 0.91564965
         },
         {
           "mid": "/m/01y9k5",
           "description": "desk",
           "score": 0.2922152
         },
         {
           "mid": "/m/04bcr3",
           "description": "table",
           "score": 0.12354217
         },
         {
           "mid": "/m/03vf67",
           "description": "sideboard",
           "score": 0.10264879
         },
         {
           "mid": "/m/06ht1",
           "description": "room",
           "score": 0.056943923
         },
         {
           "mid": "/m/078n6m",
           "description": "coffee table",
           "score": 0.051967125
         },
         {
           "mid": "/m/03qh03g",
           "description": "media",
           "score": 0.031322964
         },
         {
           "mid": "/m/0fqfqc",
           "description": "drawer",
           "score": 0.030417876
         },
         {
           "mid": "/m/02z51p",
           "description": "nightstand",
           "score": 0.024424918
         },
         {
           "mid": "/m/07c52",
           "description": "television",
           "score": 0.01784919
         }
       ]
     }
   ]
 }
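Because annotation types that find nothing are simply omitted from the response, client code should not assume every requested field is present. A minimal sketch of such a check (assuming the parsed response above is held in a Python dict named response):

# Assumption: `response` holds the parsed JSON shown above.
result = response['responses'][0]
for field in ('landmarkAnnotations', 'logoAnnotations', 'labelAnnotations'):
    annotations = result.get(field, [])   # empty list when the field is absent
    print('%s: %d result(s)' % (field, len(annotations)))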

Using Python to Send Generated Requests

This example requests Cloud Vision annotations for an image of the Google logo (google.jpg):

The input file to the script, visioninfile.txt, contains the image file path and specifies three feature annotations: logo (3), label (4), and text (5), with a maximum of 10 results for each annotation:

/Users/username/testdata/google.jpg 3:10 4:10 5:10

The script is executed with input and output file arguments; the output is written to vision.json. Note: to aid readability, line-continuation "\" characters are used below to split the command across several lines; when executing the command, supply the command and all of its arguments on a single line.

python generatejson.py \
    -i /Users/username/testdata/visioninfile.txt \
    -o /Users/username/testdata/vision.json

The script processes the input file and displays a summary of the requested feature annotations on the console,

detect:LOGO_DETECTION
results: 10
detect:LABEL_DETECTION
results: 10
detect:TEXT_DETECTION
results: 10

and writes the assembled JSON request to vision.json (the content is truncated and formatted, below):

{
  "requests": [
    {
      "image": {
        "content": "/9j/4...A//9k="
      },
      "features": [
        {
          "type": "LOGO_DETECTION",
          "maxResults": 10
        },
        {
          "type": "LABEL_DETECTION",
          "maxResults": 10
        },
        {
          "type": "TEXT_DETECTION",
          "maxResults": 10
        }
      ]
    }
  ]
}

Python is used to send the request to Cloud Vision and display the response. Note that multiple label annotations in the response are sorted in order of highest-to-lowest confidence score.

python
...
>>> import requests
>>> data = open('/Users/<username>/testdata/vision.json', 'rb').read()
>>> response = requests.post(url='https://vision.googleapis.com/v1/images:annotate?key=<API-key>',
    data=data,
    headers={'Content-Type': 'application/json'})
>>> print response.text
{
  "responses": [
    {
      "logoAnnotations": [
        {
          "mid": "/m/045c7b",
          "description": "Google",
          "score": 0.35000956,
          "boundingPoly": {
            "vertices": [
              {
                "x": 158,
                "y": 50
              },
              {
                "x": 515,
                "y": 50
              },
              {
                "x": 515,
                "y": 156
              },
              {
                "x": 158,
                "y": 156
              }
            ]
          }
        }
      ],
      "labelAnnotations": [
        {
          "mid": "/m/021sdg",
          "description": "graphics",
          "score": 0.67143095
        },
        {
          "mid": "/m/0dgsmq8",
          "description": "artwork",
          "score": 0.66358012
        },
        {
          "mid": "/m/0dwx7",
          "description": "logo",
          "score": 0.31318793
        },
        {
          "mid": "/m/01mf0",
          "description": "software",
          "score": 0.23124418
        },
        {
          "mid": "/m/03g09t",
          "description": "clip art",
          "score": 0.20368107
        },
        {
          "mid": "/m/02ngh",
          "description": "emoticon",
          "score": 0.19831011
        },
        {
          "mid": "/m/0h8npc5",
          "description": "digital content software",
          "score": 0.1769385
        },
        {
          "mid": "/m/03tqj",
          "description": "icon",
          "score": 0.097528793
        },
        {
          "mid": "/m/0hr95w1",
          "description": "pointer",
          "score": 0.03663468
        },
        {
          "mid": "/m/0n0j",
          "description": "area",
          "score": 0.033584446
        }
      ],
      "textAnnotations": [
        {
          "locale": "en",
          "description": "Google\n",
          "boundingPoly": {
            "vertices": [
              {
                "x": 61,
                "y": 26
              },
              {
                "x": 598,
                "y": 26        },

              },
              {
                "x": 598,
                "y": 227
              },
              {
                "x": 61,
                "y": 227
              }
            ]
          }
        }
      ]
    }
  ]
}
>>>
