Detect text in images offline

Asynchronous text detection and annotation can be performed on a list of generic files, such as PDF files. The files might contain multiple pages and multiple images per page. The files to be processed must be stored in an object storage bucket, and the detected text is written to an object storage bucket in JSON format.

You can retrieve the progress and results of an asynchronous batch annotation request by using the google.longrunning.Operations interface. The Operation.metadata field contains OperationMetadata. The Operation.response field contains the AsyncBatchAnnotateFilesResponse (the results).
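The fields described above can be inspected once an Operation is available as JSON. The following is a minimal sketch, assuming the standard google.longrunning.Operation JSON mapping (`done`, `error`, `response`); the helper name `summarize_operation` is hypothetical and not part of the OCR API.

```python
def summarize_operation(op: dict) -> str:
    """Return 'running', 'failed', or 'succeeded' for an Operation dict."""
    # `done` is absent or false while the operation is still in progress.
    if not op.get("done", False):
        return "running"
    # A finished operation carries either `error` or `response`.
    if "error" in op:
        return "failed"
    return "succeeded"
```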

Prepare your environment

Before using an asynchronous OCR API to detect text offline, you must do the following:

  1. Create a project.
  2. Create a storage bucket under the project.
  3. Select the Standard storage class for the bucket.
  4. Grant read and write permissions on the bucket to the service account (ai-ocr-system-sa) used by the OCR service.
  5. Enter the following sample code to create the project, storage bucket, role, and role binding:

    1. Create the project.

        apiVersion: resourcemanager.gdc.goog/v1
        kind: Project
        metadata:
          labels:
            atat.config.google.com/clin-number: <clin-number>
            atat.config.google.com/task-order-number: <task-order-number>
          name: ocr-async-project
          namespace: platform
      
    2. Create the storage bucket.

        apiVersion: object.gdc.goog/v1
        kind: Bucket
        metadata:
          name: ocr-async-bucket
          namespace: ocr-async-project
        spec:
          description: bucket for async ocr
          storageClass: Standard
          bucketPolicy:
            lockingPolicy:
              defaultObjectRetentionDays: 90
      
    3. Create the role.

        apiVersion: rbac.authorization.k8s.io/v1
        kind: Role
        metadata:
          name: ocr-async-reader-writer
          namespace: ocr-async-project
        rules:
          -
            apiGroups:
              - object.gdc.goog
            resources:
              - buckets
            verbs:
              - read-object
              - write-object
      
    4. Create the role binding.

        apiVersion: rbac.authorization.k8s.io/v1
        kind: RoleBinding
        metadata:
          name: ocr-async-reader-writer-rolebinding
          namespace: ocr-async-project
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: Role
          name: ocr-async-reader-writer
        subjects:
          -
            kind: ServiceAccount
            name: ai-ocr-system-sa
            namespace: ai-ocr-system
      

Upload files to the object storage bucket

For the OCR service to process your files, you must first upload them to the object storage bucket.

Follow these steps:

  1. To configure the gdcloud CLI storage, see Install and configure the storage CLI for projects.
  2. For the steps to upload objects to a storage bucket, see Upload and download storage objects in projects.

Trigger the AsyncBatchAnnotateFilesRequest request

AsyncBatchAnnotateFilesRequest initiates the offline processing and returns the ID of the long-running process that performs text detection on the file. The returned ID can be used to track the status of the offline processing. If there are too many ongoing operations, the offline processing might not start immediately.

Before sending a request, ensure that the OCR service account has read permission on your input bucket and write permission on your output bucket. The input and output buckets can be different and can belong to different project namespaces. However, we recommend using the same bucket for input and output: if you provide the wrong output bucket name, the results might be written to a bucket that doesn't belong to you.
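The same-bucket recommendation can be checked before a request is sent. The following is a minimal sketch, assuming that the input and output URIs have the form `scheme://bucket/path` (the `s3://` scheme in the usage note is only an illustration); the helper names `bucket_of` and `same_bucket` are hypothetical.

```python
from urllib.parse import urlsplit


def bucket_of(uri: str) -> str:
    """Return the bucket name from a URI of the form scheme://bucket/path."""
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not a bucket URI: {uri!r}")
    return parts.netloc


def same_bucket(input_uri: str, output_uri: str) -> bool:
    """Return True if both URIs point into the same bucket."""
    return bucket_of(input_uri) == bucket_of(output_uri)
```

For example, `same_bucket("s3://ocr-async-bucket/in/doc.pdf", "s3://ocr-async-bucket/out/")` returns `True`, while two different bucket names return `False`.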

To send an AsyncBatchAnnotateFilesRequest, you must specify the following:

  • Input file: The file that you want to annotate.
  • Output destination: The location where you want to store the annotated results.
  • Project ID: The ID of the project that you want to use.
  • Endpoint: The endpoint that you want to use.
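The request body built from these values can be sketched as follows. This is an illustration only: the field names mirror the request examples in this section, and the helper name `build_request` is hypothetical. Serializing with `json.dumps` guarantees that every value is quoted correctly.

```python
import json


def build_request(project_id: str, input_uri: str, output_uri: str) -> str:
    """Serialize an AsyncBatchAnnotateFilesRequest body for one PDF file."""
    body = {
        "parent": project_id,
        "requests": [{
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            "input_config": {
                "gcs_source": {"uri": input_uri},
                "mime_type": "application/pdf",
            },
            "output_config": {"gcs_destination": {"uri": output_uri}},
        }],
    }
    return json.dumps(body)
```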

curl

    echo '{"parent": "PROJECT_ID", "requests": [{"features": [{"type": "DOCUMENT_TEXT_DETECTION"}], "input_config": {"gcs_source": {"uri": "INPUT_FILE"}, "mime_type": "application/pdf"}, "output_config": {"gcs_destination": {"uri": "OUTPUT_DESTINATION"}}}]}' | curl -k --data-binary @- -H "Content-Type: application/json" -H "Authorization: Bearer TOKEN" ENDPOINT/v1/files:asyncBatchAnnotate

grpcurl

  cat << EOF > request.json
  {
    "requests": [{
      "features": [{
        "type": "DOCUMENT_TEXT_DETECTION"
      }],
      "input_config": {
        "gcs_source": {
          "uri": "INPUT_FILE"
        },
        "mime_type": "application/pdf"
      },
      "output_config": {
        "gcs_destination": {
          "uri": "OUTPUT_DESTINATION"
        }
      }
    }],
    "parent": "PROJECT_ID"
  }
  EOF

  grpcurl -max-msg-sz 50000000 -d @ -plaintext ENDPOINT \
    google.cloud.vision.v1.ImageAnnotator.AsyncBatchAnnotateFiles < request.json

Python

The vc.async_batch_annotate_files() function returns a Google API Core operation object. This object wraps a long-running operation (LRO), which you can access through operation.operation. The LRO provides the operation name, which you can use to query the status of the LRO. Calling operation.result() blocks until the LRO is complete and then returns the result.

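  from google.cloud import vision

  def vision_func_async(creds):
    # vision_client() is assumed to be a client helper defined elsewhere; it is not shown here.
    vc = vision_client(creds)
    features = [{"type_": vision.Feature.Type.DOCUMENT_TEXT_DETECTION}]
    # Replace INPUT_FILE, OUTPUT_DESTINATION, and PROJECT_ID with your own string values.
    input_config = {"gcs_source": {"uri": INPUT_FILE}, "mime_type": "application/pdf"}
    output_config = {"gcs_destination": {"uri": OUTPUT_DESTINATION}}
    req = {"input_config": input_config, "output_config": output_config, "features": features}
    reqs = {"requests": [req], "parent": PROJECT_ID}
    operation = vc.async_batch_annotate_files(request=reqs)
    lro = operation.operation  # the underlying long-running operation
    print(lro.name)            # operation name, used to query the status later
    resp = operation.result()  # blocks until the LRO is complete
    return resp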
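After the operation completes, the detected text is available in the JSON output written to the output destination. The following is a minimal sketch for reading that output, assuming it follows the Cloud Vision file-annotation JSON layout (a top-level `responses` list whose entries carry `fullTextAnnotation.text`). That layout is an assumption to verify against your actual output files, and the helper name `extract_page_texts` is hypothetical.

```python
import json
from typing import List


def extract_page_texts(output_json: str) -> List[str]:
    """Return the detected text of each per-page response, in order."""
    doc = json.loads(output_json)
    texts = []
    for response in doc.get("responses", []):
        # Pages without detected text have no fullTextAnnotation entry.
        annotation = response.get("fullTextAnnotation", {})
        texts.append(annotation.get("text", ""))
    return texts
```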

Validate the jobs and check the status

The OPERATION_NAME returned by the AsyncBatchAnnotateFiles function is required to check the status of the operation.

Get operation

The get method returns the latest state of a long-running operation. Use this method to poll the operation result generated by the OCR service. To call the get method, specify your OPERATION_NAME and the ENDPOINT.

curl

curl -X GET "http://ENDPOINT/v1/OPERATION_NAME"

grpcurl

grpcurl -plaintext -d '{"name": "OPERATION_NAME"}' ENDPOINT google.longrunning.Operations/GetOperation
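Polling with the get method can be sketched as a simple loop. This is a minimal illustration, not the service's client: `get_operation` is a hypothetical callable that you supply (for example, a wrapper around the curl request above) and that returns the Operation as a dict, and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_operation(get_operation: Callable[[str], dict], name: str,
                       interval_s: float = 5.0, timeout_s: float = 600.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the operation reports done, then return its final state."""
    deadline = time.monotonic() + timeout_s
    while True:
        op = get_operation(name)
        # `done` is absent or false while the operation is still running.
        if op.get("done", False):
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {name} not done after {timeout_s}s")
        sleep(interval_s)
```

The returned dict contains either `error` or `response`, so the caller still has to check which one is present.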

List operations

The list method returns a list of the operations that match a specified filter in the request. The method can return operations under a specific project. To call the list method, specify your PROJECT_ID and the ENDPOINT.

curl

curl -X GET "http://ENDPOINT/v1/PROJECT_ID?page_size=10"

grpcurl

grpcurl -plaintext -d '{"name": "PROJECT_ID", "page_size": 10}' ENDPOINT google.longrunning.Operations/ListOperations
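When a project has more operations than fit in one page, the list method has to be called repeatedly. The following is a minimal sketch, assuming the standard ListOperationsResponse JSON fields `operations` and `nextPageToken`; `list_page` is a hypothetical callable that you supply, which takes a page token (or `None` for the first page) and returns one response as a dict.

```python
from typing import Callable, Iterator, Optional


def iter_operations(list_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every operation across all pages of a list call."""
    token = None
    while True:
        page = list_page(token)
        yield from page.get("operations", [])
        # An absent or empty token means this was the last page.
        token = page.get("nextPageToken")
        if not token:
            return
```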

Delete the bucket

For more information, see Delete objects in storage buckets.