Detect text in images

Optical Character Recognition (OCR) is one of the three Vertex AI pre-trained APIs available on Google Distributed Cloud (GDC) air-gapped.

Use the OCR feature of Vertex AI to detect text in various file types. Vertex AI detects typed text in a photo image or handwritten text.

Learn more about OCR-supported languages detected by the text-recognition feature of OCR in a single image.

Before you begin

To get the permissions you need to use the Vertex AI Optical Character Recognition (OCR) pre-trained API, ask your Project IAM Admin to grant you the AI OCR Developer (ai-ocr-developer) role in your project namespace.

Examples of files with detected text

The examples illustrate how Vertex AI detects and extracts text from images.

Road sign photograph

Figure 1 is a photograph that contains a street or traffic sign. Vertex AI returns a JSON file with the extracted string, individual words, and their bounding boxes.

Road sign

Figure 1. Road sign photograph where Vertex AI detects words and their bounding boxes.

Scanned image of typed text

Figure 2 is a scanned image of typed text. Vertex AI returns a JSON file containing page, block, paragraph, word, and break information.

Dense figure with annotations

Figure 2. Scanned image of typed text where Vertex AI detects information such as words, pages, and paragraphs.

Image of handwriting

Figure 3 is an image of handwritten text. Vertex AI detects and extracts text from these images. For a list of handwriting scripts that are supported for handwriting recognition, see Handwriting scripts.

Handwriting figure

Figure 3. Handwriting image where Vertex AI detects text.

Feature differences from Google Cloud

This section describes the differences between OCR on Distributed Cloud and Vision and OCR on Google Cloud.

The primary difference is that Distributed Cloud only supports OCR. Vision on Distributed Cloud doesn't provide other features available in Vision on Google Cloud, such as image recognition, facial recognition, and crop hint detection.

The following table describes the supported features in Distributed Cloud.

Feature GDC feature
OCR API methods OCR on GDC supports the following two methods:
  • BatchAnnotateImages
  • BatchAnnoteFiles
Language support OCR on GDC supports a subset of the languages supported on Google Cloud.
Asynchronous methods AsyncBatchAnnotateFiles
BatchAnnotateFiles method The following subset of fields supported on Google Cloud are supported on GDC:
  • type
  • content
  • language_hints
  • mime_type
  • pages

If you set any other fields in a request, they are ignored or cause an error.
BatchAnnotateImages method The following subset of fields supported on Google Cloud are supported on GDC:
  • type
  • content
  • language_hints

If you set any other fields in a request, they are ignored or cause an error.
File location In GDC, you can only process images for OCR if they are stored locally.

Use the API to detect text in files

Vertex AI on Distributed Cloud supports the following two methods for extracting text from files and images:

Vertex AI on Distributed Cloud doesn't support any other OCR API methods that are supported on Google Cloud.

Use the BatchAnnotateImages method

Use the BatchAnnotateImages method to detect and extract text from a batch of JPEG and PNG files. Specify the following fields in the request:

  • type: The type of the text to extract. Specify one of two OCR types: TEXT_DETECTION or DOCUMENT_TEXT_DETECTION.

  • content: The images with text to detect. You can only process images that are stored locally in your Distributed Cloud environment. The system can't access images available to the public or stored in a Google Cloud bucket. Therefore, the system doesn't support them.

  • language_hints: Optional. List of languages to use for the TEXT_DETECTION or DOCUMENT_TEXT_DETECTION OCR types. In most cases, an empty value yields the best results, because it enables automatic language detection. For languages based on the Latin alphabet, you don't need to set the language_hints field. In rare cases, when you know the language of the text in the image, setting a hint improves results. Use the language_hints field with caution. If a hint is wrong, it can significantly impede text detection.

The BatchAnnotateImages method on Distributed Cloud supports a subset of the parameters that you can specify when you call BatchAnnotateImages in Vertex AI on Google Cloud. If you specify any other parameters in a BatchAnnotateImages request on Distributed Cloud, they are ignored or result in an error.

For more information, see BatchAnnotateImages in the Distributed Cloud API reference.

Use the BatchAnnotateFiles method

Use the BatchAnnotateFiles method to detect and extract text from a batch of PDF and TIFF files. Specify the following fields in the request:

  • type: The type of the text to extract. Specify one of two OCR types: TEXT_DETECTION or DOCUMENT_TEXT_DETECTION.

  • content: The images with text to detect. You can only process images that are stored locally in your Distributed Cloud environment. The system can't access images available to the public or stored in a Google Cloud bucket. Therefore, the system doesn't support them.

  • language_hints: Optional. List of languages to use for the TEXT_DETECTION or DOCUMENT_TEXT_DETECTION OCR types. In most cases, an empty value yields the best results, because it enables automatic language detection. For languages based on the Latin alphabet, you don't need to set the language_hints field. In rare cases, when you know the language of the text in the image, setting a hint improves results. Use the language_hints field with caution. If a hint is wrong, it can significantly impede text detection.

  • mime_type: The type of the file. You must set it to one of the following values:

    • application/pdf
    • image/tiff
  • pages: Optional. The pages of the file that are processed for text detection. The maximum number of pages that you can specify is five. If you don't specify the number of pages, the first five pages of the file are processed.

The BatchAnnotateFiles method on GDC supports a subset of the parameters that you can specify on Google Cloud. If you specify any other parameters in a BatchAnnotateFiles request on Distributed Cloud, they are ignored or result in an error.

For more information, see BatchAnnotateFiles in the Distributed Cloud API reference.

Use the OCR client library

The Optical Character Recognition (OCR) API detects text on a local image by sending the contents of an image file as a base64-encoded string in the body of a request.

Prerequisites to use the OCR client library

To use the OCR client library, make sure the following prerequisites are complete:

  • Install the Vertex AI client libraries. For more information, see Install Vertex AI client libraries.

  • Get the OCR endpoint. Make a note of the endpoint and use it where you see OCR_ENDPOINT in the following client library sample code. For more information, see Get the OCR endpoint.

Create OCR text detection requests

REST


To send an OCR request using Python, first install the Vertex AI client libraries.

Before you send an OCR request using REST, replace the BASE64_ENCODED_IMAGE variable with the base64 ASCII string representation of your binary image data. This string begins with characters that look similar to the following string:

  • /9j/4QAYRXhpZgAA...9tAVx/zDQDlGxn//2Q==

The sample JSON request body:

{
  "requests": [
    {
      "image": {
        "content": BASE64_ENCODED_IMAGE
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}

To send your request, choose one of the following options:

curl


Save the request body in a file named request.json, then run the following command:

curl -X POST \
-H "Content-Type: application/json; charset=utf-8" \
-H "x-goog-user-project: projects/PROJECT_ID" \
-d @request.json \
"OCR_ENDPOINT:v1/images:annotate"

PowerShell


Save the request body in a file named request.json, then run the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{
  "Authorization" = "Bearer $cred"
  "x-goog-user-project" = "projects/PROJECT_ID"
}

Invoke-WebRequest
  -Method POST
  -Headers $headers
  -ContentType: "application/json; charset=utf-8"
  -InFile request.json
  -Uri "OCR_ENDPOINT/v1/images:annotate" | Select-Object -Expand Content
 

Python


To send an OCR request using Python, first install the Vertex AI client libraries and Python 3.7.

The Python code sample:

import io
from google.cloud import vision
import google.auth
from google.auth.transport import requests
from google.api_core.client_options import ClientOptions

# Get your credentials
def get_credentials():
  credentials = None
  try:
    credentials, project_id = google.auth.default()
    credentials = credentials.with_gdch_audience('https://OCR_ENDPOINT:443')
    req = requests.Request()
    credentials.refresh(req)
  except Exception as e:
    print("Caught exception" + str(e))
    raise e
  return credentials

# Use your credentials to access the client library
def get_vision_client(credentials):
  opts = ClientOptions(api_endpoint='OCR_ENDPOINT:443')
  return vision.ImageAnnotatorClient(credentials= credentials, client_options=opts)

# Define the function that detects text in an image file
def detect_text(path):
  credentials  = get_credentials()
  client = get_vision_client(credentials)

  metadata = [("x-goog-user-project", "projects/PROJECT_ID")]

  with io.open(path, 'rb') as image_file:
    content = image_file.read()

  image = vision.Image(content=content)

  response = client.text_detection(image=image, metadata=metadata)
  texts = response.text_annotations
  print('Texts:')

  for text in texts:
    print('\n"{}"'.format(text.description))
    vertices = (['({},{})'.format(vertex.x, vertex.y)
      for vertex in text.bounding_poly.vertices])
    print('bounds: {}'.format(','.join(vertices)))

  # Add error handling code here.
  if response.error.message:
    raise Exception('Error handling code goes here.')

# Call the "detect_text" function using the path to your text file
if __name__=="__main__": 
  detect_text("PATH_TO_IMAGE_FILE")

Optional: Specify the language in a request

The BatchAnnotateFiles and BatchAnnotateImages methods support one or more language hints to specify the language of any text in the image. If you don't specify a language, Vertex AI enables automatic language detection, which usually results in the most accurate results. For languages based on the Latin alphabet, you don't need to set language hints. In rare cases, when the language of the text in the image is known, setting a hint improves the results. However, if the hint is incorrect, it might cause a significant impediment. Text detection returns an error if a specified language isn't one of the supported languages.

Add one or more supported languages to the imageContext.languageHints request in the request.json file to provide a language hint. The following code sample is a demonstration:

{
  "requests": [
    {
      "image": {
        "content": BASE64_ENCODED_IMAGE
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "imageContext": {
        "languageHints": ["en-t-i0-handwrit"]
      }
    }
  ]
}

Get the OCR endpoint

To get the endpoint for OCR, see View service statuses and endpoints.

OCR limits

The following table lists the current limits in Optical Character Recognition (OCR) on Google Distributed Cloud (GDC) air-gapped.

File limit for OCR Value
Maximum number of pages 5
Maximum file size 20 MB
Maximum image size 20 million pixels (length x width)

Submitted files for OCR that exceed the maximum number of pages or the maximum file size return an error. Submitted files that exceed the maximum image size are downsized to 20 million pixels.

Supported file types for OCR

The Optical Character Recognition (OCR) pre-trained API detects and transcribes text from the following file types:

  • PDF
  • TIFF
  • JPG
  • PNG

You must store the files locally in your Google Distributed Cloud air-gapped (GDC) environment. You can't access files hosted in Cloud Storage or files that are publicly available for text detection.