Document Text Tutorial

Audience

The goal of this tutorial is to help you develop applications using Google Cloud Vision API Document Text Detection. It assumes you are familiar with basic programming constructs and techniques, but even if you are a beginning programmer, you should be able to follow along and run this tutorial without difficulty, then use the Cloud Vision API reference documentation to create basic applications.

Prerequisites

This tutorial assumes that you have:

  • A Google Cloud Platform project with the Vision API enabled
  • Application Default Credentials set up for a service account (see Authenticating to the API, below)
  • Python, together with the google-api-python-client, oauth2client, and Pillow libraries used in the code listing

Annotating an image using Document Text OCR

This tutorial walks you through a basic Vision API application that makes a DOCUMENT_TEXT_DETECTION request, then processes the fullTextAnnotation response.

A fullTextAnnotation is a structured hierarchical response for the text extracted from the image, organized as Pages→Blocks→Paragraphs→Words→Symbols:

  • Page is a collection of blocks, plus meta-information about the page: its size and resolution (the X resolution and Y resolution may differ).

  • Block represents one "logical" element of the page—for example, an area covered by text, or a picture or separator between columns. The text and table blocks contain the main information needed to extract the text.

  • Paragraph is a structural unit of text representing an ordered sequence of words. By default, words are considered to be separated by word breaks.

  • Word is the smallest unit of text. It is represented as an array of Symbols.

  • Symbol represents a character or a punctuation mark.

The fullTextAnnotation also includes the page's full extracted text, concatenated in reading order, in its text field.
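
To make the hierarchy concrete, the sketch below shows the rough shape of a fullTextAnnotation as a Python dictionary. It is heavily abbreviated: the field names are real, but the values are invented, real responses contain four vertices per bounding box, and many fields (detected languages, confidence scores, and so on) are omitted.

fullTextAnnotation = {
    'pages': [{
        'width': 1062, 'height': 1413,  # page size; these values are invented
        'blocks': [{
            'blockType': 'TEXT',
            'boundingBox': {'vertices': [{'x': 42, 'y': 61}]},  # 4 vertices in real responses
            'paragraphs': [{
                'words': [{
                    'symbols': [{'text': 'H'}, {'text': 'i'}]  # spells the word "Hi"
                }]
            }]
        }]
    }],
    'text': 'Hi'  # full concatenated text of the page
}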

Complete code listing

As you read the code, we recommend that you follow along by referring to the Cloud Vision API Python reference.

import argparse
import base64

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

from PIL import Image
from PIL import ImageDraw

# Check the item's level against the requested output level; if they match,
# draw the polybox and print the text. (The page level has no polybox, so
# only its text is printed.)
def checkAndDrawBox(draw, checkType, itemName, outputText, item):
    if itemName == checkType:
        if itemName == 'page':
            print(outputText)
        else:
            box = [(v.get('x', 0.0), v.get('y', 0.0))
                   for v in item['boundingBox']['vertices']]
            draw.line(box + [box[0]], width=5, fill='#00ff00')
            print(outputText)

def main(image_file, output_level, output_file):
    """Run a document text detection request on a single image."""
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('vision', 'v1', credentials=credentials)

    with open(image_file, 'rb') as image:
        image_content = base64.b64encode(image.read())
        service_request = service.images().annotate(body={
            'requests': [{
                'image': {
                    'content': image_content.decode('UTF-8')
                },
                'features': [{
                    'type': 'DOCUMENT_TEXT_DETECTION',
                    'maxResults': 10
                }]
            }]
        })

        # Re-open the image for drawing. Rewind the file handle first,
        # because the base64 encoding above consumed the stream.
        image.seek(0)
        im = Image.open(image)
        draw = ImageDraw.Draw(im)

        # Walk through the fullTextAnnotation in the response:
        # for each page, walk through its blocks; for each block, its
        # paragraphs; for each paragraph, its words; and for each word,
        # consolidate its symbols into the word's text. At each level,
        # draw a polybox and print the text if the level matches the
        # requested output level.
        apiresponse = service_request.execute()
        for response in apiresponse['responses']:
            if 'fullTextAnnotation' not in response:
                print('full text not available')
                return
            fullText = response['fullTextAnnotation']
            for page in fullText['pages']:
                pageText = ''
                for block in page['blocks']:
                    blockText = ''
                    for para in block['paragraphs']:
                        paraText = ''
                        for word in para['words']:
                            wordText = ''
                            for symbol in word['symbols']:
                                wordText = wordText + symbol['text']
                            checkAndDrawBox(draw, output_level, 'word', wordText, word)
                            paraText = paraText + wordText
                        blockText = blockText + paraText
                        checkAndDrawBox(draw, output_level, 'para', paraText, para)
                    checkAndDrawBox(draw, output_level, 'block', blockText, block)
                    pageText = pageText + blockText
                checkAndDrawBox(draw, output_level, 'page', pageText, page)

        # Save the output image with the polyboxes drawn for the requested level.
        im.save(output_level + '_' + output_file)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('image_file', help='The image you\'d like to annotate.')
    parser.add_argument('output_level', help='Level of the output. Can be one of four options: page, block, para, word')
    parser.add_argument('output_file', help='Output file containing the input image with boxes drawn around the text')
    args = parser.parse_args()
    main(args.image_file, args.output_level, args.output_file)

This simple application performs the following tasks:

  • Imports the libraries necessary to run the application
  • Takes three arguments and passes them to the main() function:
    • image_file: the input image file to be annotated
    • output_level: the output annotation level, which must be one of the following: page, block, para, or word
    • output_file: the filename of the output image, into which the application draws polyboxes for the requested level
  • Gets credentials to run the Cloud Vision API service
  • Constructs a Document Text Detection request to send to the service
  • Sends the request and receives a response
  • Walks through the response, printing the extracted text at the requested level
  • Creates an output image with boxes drawn around the text at that level

A closer look at the code

Importing libraries

import argparse
import base64

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

from PIL import Image
from PIL import ImageDraw

We import standard libraries:

  • argparse to allow the application to accept input filenames as arguments
  • base64 to encode the binary image data for embedding in the JSON request

Other imports:

  • The discovery module within the googleapiclient library builds a service object for the Vision API from its published discovery document.
  • The GoogleCredentials module within the oauth2client.client library handles authentication to the service.
  • The Image and ImageDraw modules from the PIL (Pillow) library are used to draw boxes on the input image and create the output image.
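
If these libraries are not already installed, they can typically be added with pip; the package names below are the standard PyPI names for the modules imported above.

pip install google-api-python-client oauth2client Pillow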

Parsing arguments

def main(image_file, output_level, output_file):
    """Run a document text detection request on a single image."""
    ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('image_file', help='The image you\'d like to annotate.')
    parser.add_argument('output_level', help='Level of the output. Can be one of four options: page, block, para, word')
    parser.add_argument('output_file', help='Output file containing the input image with boxes drawn around the text')
    args = parser.parse_args()
    main(args.image_file, args.output_level, args.output_file)

Here, we simply parse the passed-in arguments and pass them to the main() function.
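
For example, the following invocation (using the receipt.jpg file described at the end of this tutorial) would draw polyboxes around individual words and, because the application prefixes the output level to the output filename, save the result as word_out.jpg:

python ocr1.1.py receipt.jpg word out.jpg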

Authenticating to the API

    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('vision', 'v1', credentials=credentials)

Before communicating with the Vision API service, you must authenticate using previously acquired credentials. Within an application, the simplest way to obtain credentials is to use Application Default Credentials (ADC). We obtain these credentials using the get_application_default() method, which by default reads the GOOGLE_APPLICATION_CREDENTIALS environment variable; that variable should point to your service account's JSON key file (see Setting Up a Service Account for more information).
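
For example, you might set the environment variable as shown below before running the application (the key-file path is a placeholder; substitute your own):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json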

We then build a service object for the Vision API by calling discovery.build(), which gives us access to the API's images().annotate() method.

Constructing the request

service_request = service.images().annotate(body={
    'requests': [{
        'image': {
            'content': image_content.decode('UTF-8')
        },
        'features': [{
            'type': 'DOCUMENT_TEXT_DETECTION',
            'maxResults': 10
        }]
    }]
})

Now that our Vision API service is ready, we can construct a request to the service. Requests to the Google Cloud Vision API are provided as JSON objects. See the Vision API Reference for complete information on the structure of a request.

This code snippet performs the following tasks:

  1. Constructs the JSON body for a POST request to the images().annotate() method.
  2. Injects the local image file, base64-encoded, into the request.
  3. Indicates that the annotate method should perform DOCUMENT_TEXT_DETECTION.
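
Inline base64 content is only one way to supply the image. As a variation (not used in this tutorial, and the bucket name below is a placeholder), the same request can instead reference an image stored in Google Cloud Storage via the source field:

service_request = service.images().annotate(body={
    'requests': [{
        'image': {
            # Placeholder URI; replace with your own bucket and object.
            'source': {'gcsImageUri': 'gs://your-bucket/receipt.jpg'}
        },
        'features': [{
            'type': 'DOCUMENT_TEXT_DETECTION'
        }]
    }]
})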

Parsing the response

apiresponse = service_request.execute()
for response in apiresponse['responses']:
    if 'fullTextAnnotation' not in response:
        print('full text not available')
        return
    fullText = response['fullTextAnnotation']
    for page in fullText['pages']:
        pageText = ''
        for block in page['blocks']:
            blockText = ''
            for para in block['paragraphs']:
                paraText = ''
                for word in para['words']:
                    wordText = ''
                    for symbol in word['symbols']:
                        wordText = wordText + symbol['text']
                    checkAndDrawBox(draw, output_level, 'word', wordText, word)
                    paraText = paraText + wordText
                blockText = blockText + paraText
                checkAndDrawBox(draw, output_level, 'para', paraText, para)
            checkAndDrawBox(draw, output_level, 'block', blockText, block)
            pageText = pageText + blockText
        checkAndDrawBox(draw, output_level, 'page', pageText, page)

# Save the output image with the polyboxes drawn for the requested level.
im.save(output_level + '_' + output_file)

Once the request completes, the response contains an AnnotateImageResponse, which consists of a list of image annotation results, one for each image sent in the request. Because we sent only one image in the request, we walk through its fullTextAnnotation, printing the extracted text and drawing boxes around it at the requested output level.
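
If you only need the raw extracted text, without the polyboxes, a minimal sketch (reusing the service_request built above) can read the annotation's top-level text field directly:

# Minimal sketch: print just the page text, skipping the tree walk.
apiresponse = service_request.execute()
annotation = apiresponse['responses'][0].get('fullTextAnnotation')
if annotation:
    print(annotation['text'])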

Running the application

To run the application, you can download this receipt.jpg file (you may need to right-click the link), then pass the location of the downloaded file on your local machine to the tutorial application (ocr1.1.py).

Here is the Python command, followed by the Text Annotation output images.

python ocr1.1.py receipt.jpg para out.jpg

Output Image Annotations at the Para output level

Output Image Annotations at the Word output level

Congratulations! You've performed Text Detection using Google Cloud Vision Full Text Annotations!
