Detect text in images offline

Asynchronous image detection and annotation can be performed on a list of generic files, such as PDF files. The files might contain multiple pages and multiple images per page. The files to be processed must be stored in an Object storage bucket, and the detected text is written to an Object storage bucket in JSON format.

You can retrieve the progress and results of an asynchronous batch annotation request by using the google.longrunning.Operations interface. The Operation.metadata field contains OperationMetadata, and the Operation.response field contains the AsyncBatchAnnotateFilesResponse (the results).
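To make the Operation fields more concrete, here is a minimal sketch that summarizes an operation after it has been decoded from JSON into a Python dict. The fields name, done, metadata, error, and response come from the standard google.longrunning.Operation message. The helper name summarize_operation and the metadata "state" key are assumptions for illustration, not part of the OCR service documentation.

```python
def summarize_operation(op):
    """Return a one-line status for a google.longrunning.Operation given as a dict."""
    name = op.get("name", "<unnamed>")
    if not op.get("done", False):
        # Until done is true, neither response nor error is set.
        # The "state" key inside metadata is an assumption for this sketch.
        state = op.get("metadata", {}).get("state", "UNKNOWN")
        return f"{name}: running (state={state})"
    if "error" in op:
        err = op["error"]
        return f"{name}: failed (code={err.get('code')}, message={err.get('message')})"
    return f"{name}: succeeded"
```

A finished operation carries either error or response, never both, so checking done first and error second covers every case.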

Before you begin

Follow these steps before using an asynchronous OCR API:

  1. Create a project.

    You can create the project using a custom resource (CR):

    apiVersion: resourcemanager.gdc.goog/v1
    kind: Project
    metadata:
      labels:
        atat.config.google.com/clin-number: CLIN_NUMBER
        atat.config.google.com/task-order-number: TASK_ORDER_NUMBER
      name: ocr-async-project
      namespace: platform
    
  2. Ask your Project IAM Admin to grant you the AI OCR Developer (ai-ocr-developer) role in your project namespace.

Prepare your environment

Before using an asynchronous OCR API to detect text offline, you must do the following:

  1. Create a storage bucket in the project.
  2. Select the Standard class.
  3. Grant read and write permissions on the bucket to the service account (ai-ocr-system-sa) used by the OCR service.

Alternatively, you can follow these steps to create the storage bucket, role, and role binding using custom resources (CR):

  1. Create the storage bucket.

    apiVersion: object.gdc.goog/v1
    kind: Bucket
    metadata:
      name: ocr-async-bucket
      namespace: ocr-async-project
    spec:
      description: bucket for async ocr
      storageClass: Standard
      bucketPolicy:
        lockingPolicy:
          defaultObjectRetentionDays: 90
    
  2. Create the role.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: ocr-async-reader-writer
      namespace: ocr-async-project
    rules:
      -
        apiGroups:
          - object.gdc.goog
        resources:
          - buckets
        verbs:
          - read-object
          - write-object
    
  3. Create the role binding.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: ocr-async-reader-writer-rolebinding
      namespace: ocr-async-project
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: ocr-async-reader-writer
    subjects:
      -
        kind: ServiceAccount
        name: ai-ocr-system-sa
        namespace: ai-ocr-system
    

Upload files to the object storage bucket

For the OCR service to process your files, you must first upload them to the object storage bucket.

Follow these steps:

  1. To configure the gdcloud CLI storage, see Install and configure the storage CLI for projects.
  2. For the steps to upload objects to a storage bucket, see Upload and download storage objects in projects.

Trigger the AsyncBatchAnnotateFilesRequest request

AsyncBatchAnnotateFilesRequest initiates the offline processing and returns the ID of the long-running process that performs text detection on the file. The returned ID can be used to track the status of the offline processing. If there are too many ongoing operations, the offline processing might not start immediately.

Before sending a request, ensure that the OCR service account has read permission on your input bucket and write permission on your output bucket. The input and output buckets can be different, and they can be in different project namespaces. However, we recommend using the same bucket for input and output so that a mistyped bucket name doesn't cause results to be written to a bucket that doesn't belong to you.
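As a small safeguard for the recommendation above, you could check that the input and output URIs point at the same bucket before sending the request. This is only a sketch: the helper names bucket_of and same_bucket are hypothetical, and it assumes URIs of the form scheme://bucket/path (the exact scheme depends on your storage configuration).

```python
from urllib.parse import urlparse


def bucket_of(uri):
    """Return the bucket part of a URI shaped like scheme://bucket/path."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a bucket URI: {uri!r}")
    return parsed.netloc


def same_bucket(input_uri, output_uri):
    """Return True when both URIs refer to the same bucket."""
    return bucket_of(input_uri) == bucket_of(output_uri)
```

The check compares bucket names only, so it can't tell you whether the OCR service account actually has the required permissions.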

To call the AsyncBatchAnnotateFilesRequest, you must specify the following:

  • Input file: The file that you want to annotate.
  • Output destination: The location where you want to store the annotated results.
  • Project ID: The ID of the project that you want to use.
  • Endpoint: The endpoint that you want to use.

curl

    echo '{"parent": "PROJECT_ID", "requests": [{"features": [{"type": "DOCUMENT_TEXT_DETECTION"}], "input_config": {"gcs_source": {"uri": "INPUT_FILE"}, "mime_type": "application/pdf"}, "output_config": {"gcs_destination": {"uri": "OUTPUT_DESTINATION"}}}]}' \
      | curl --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" ENDPOINT/v1/files:asyncBatchAnnotate

grpcurl

  cat <<- EOF > request.json
  {
    "requests": [{
      "features": [{
        "type": "DOCUMENT_TEXT_DETECTION"
      }],
      "input_config": {
        "gcs_source": {
          "uri": "INPUT_FILE"
        },
        "mime_type": "application/pdf"
      },
      "output_config": {
        "gcs_destination": {
          "uri": "OUTPUT_DESTINATION"
        }
      }
    }],
    "parent": "PROJECT_ID"
  }
  EOF

  grpcurl -max-msg-sz 50000000 -d @ -plaintext ENDPOINT \
    google.cloud.vision.v1.ImageAnnotator.AsyncBatchAnnotateFiles < request.json

Python

The vc.async_batch_annotate_files() function returns a Google API Core operation object. This object wraps a long-running operation (LRO), which you can access through the operation.operation attribute. You can get the operation name from the LRO and use that name to query the status of the LRO. Calling operation.result() blocks until the LRO is complete and then returns the result.

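  from google.cloud import vision

  def vision_func_async(creds):
    # vision_client() is assumed to be a helper defined elsewhere that returns the Vision client.
    vc = vision_client(creds)
    features = [{"type_": vision.Feature.Type.DOCUMENT_TEXT_DETECTION}]
    # Replace the INPUT_FILE, OUTPUT_DESTINATION, and PROJECT_ID placeholders with your values.
    input_config = {"gcs_source": {"uri": "INPUT_FILE"}, "mime_type": "application/pdf"}
    output_config = {"gcs_destination": {"uri": "OUTPUT_DESTINATION"}}
    req = {"input_config": input_config, "output_config": output_config, "features": features}
    reqs = {"requests": [req], "parent": "PROJECT_ID"}
    operation = vc.async_batch_annotate_files(request=reqs)
    lro = operation.operation  # underlying LRO; lro.name is the operation name
    resp = operation.result()  # blocks until the LRO completes
    return resp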
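After the operation completes, the detected text is available as JSON in the output location. As a rough sketch of how you might read it, the function below pulls the text for each page out of one output file. It assumes the file follows the Cloud Vision AnnotateFileResponse JSON layout (a "responses" list whose items carry "fullTextAnnotation.text"). That layout is an assumption here, so check it against a real output file. The function name extract_page_texts is hypothetical.

```python
import json


def extract_page_texts(output_json):
    """Return the detected text of each page from one output JSON document.

    Assumes a top-level "responses" list with one entry per page, each
    optionally holding {"fullTextAnnotation": {"text": ...}}.
    """
    doc = json.loads(output_json)
    texts = []
    for page in doc.get("responses", []):
        # Pages without detected text yield an empty string.
        texts.append(page.get("fullTextAnnotation", {}).get("text", ""))
    return texts
```

Returning an empty string for pages without text keeps the list index aligned with the page order in the file.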

Validate the jobs and check the status

The OPERATION_NAME returned by the AsyncBatchAnnotateFiles function is required to check the status of the operation.

Get operation

The get method returns the latest state of a long-running operation. Use this method to poll the operation result generated by the OCR service. To call the get method, specify your OPERATION_NAME and the ENDPOINT.

curl

curl -X GET "http://ENDPOINT/v1/OPERATION_NAME"

grpcurl

grpcurl -plaintext -d '{"name": "OPERATION_NAME"}' ENDPOINT google.longrunning.Operations/GetOperation
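Polling with the get method usually means repeating the call until the operation reports done. The following is a minimal sketch of such a loop. The fetch callable is a stand-in you supply (for example, a wrapper around the curl request above that returns the decoded JSON), and the function name, interval, and timeout values are assumptions rather than service recommendations.

```python
import time


def wait_for_operation(fetch, name, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch(name) until the operation is done or the timeout elapses.

    fetch must return the operation as a dict with the standard
    google.longrunning.Operation fields (done, error, response).
    """
    waited = 0.0
    while True:
        op = fetch(name)
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(f"{name} failed: {op['error']}")
            return op.get("response", {})
        if waited >= timeout:
            raise TimeoutError(f"{name} not done after {timeout} seconds")
        sleep(interval)
        waited += interval
```

Passing sleep as a parameter keeps the loop easy to test without real delays.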

List operations

The list method returns a list of the operations that match a specified filter in the request. The method can return operations from a specific project. To call the list method, specify your PROJECT_ID and the ENDPOINT.

curl

curl -X GET "http://ENDPOINT/v1/PROJECT_ID?page_size=10"

grpcurl

grpcurl -plaintext -d '{"name": "PROJECT_ID", "page_size": 10}' ENDPOINT google.longrunning.Operations/ListOperations
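Because the list method is paginated with page_size, a project with many operations needs more than one call. The sketch below walks through all pages. It assumes the JSON response uses the standard ListOperationsResponse fields "operations" and "nextPageToken". The fetch_page callable and the function name iter_operations are hypothetical stand-ins for your own request code.

```python
def iter_operations(fetch_page, project_id, page_size=10):
    """Yield every operation in a project, following page tokens.

    fetch_page(project_id, page_size, page_token) must return one page of
    results as a dict with "operations" and, optionally, "nextPageToken".
    """
    token = ""
    while True:
        page = fetch_page(project_id, page_size, token)
        yield from page.get("operations", [])
        token = page.get("nextPageToken", "")
        if not token:
            # No token means this was the last page.
            return
```

Writing it as a generator lets callers stop early, for example after finding the operation they are looking for.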

Delete the bucket

For more information, see Delete objects in storage buckets.