Process-with-Document-AI pipeline

The Process-with-Document-AI pipeline lets you process documents that already exist in Document AI Warehouse with a Document AI processor, and then updates the document properties with the newly extracted entities.

Prerequisites

Before you begin, you need the following:

  1. A Document AI processor in the same Google Cloud project.

    • If you don't have a processor, follow the steps to create one. You can create any processor type, as long as it matches the type of the documents you want to process.
  2. Dedicated Cloud Storage folders for storing exported documents and processed documents.

    • Make sure the folders are empty before you start the pipeline.
  3. A schema with mappings between Document AI entities and Document AI Warehouse properties.

    • Without such a mapping, the newly extracted entities might not be converted correctly to Document AI Warehouse properties.

    • To add mappings to the schema, follow the steps for setting schemas with mapping.
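Before starting the pipeline, it can help to sanity-check the two folder paths from prerequisite 2. The following is a minimal Python sketch and not part of the Document AI Warehouse API: the helper names split_gs_path and check_pipeline_folders are hypothetical, and the rule that the two folders must not contain each other is an assumption based on the requirement for dedicated folders. The sketch only inspects the path strings; confirming that the folders are empty still requires listing them in Cloud Storage.

```python
def split_gs_path(path: str) -> tuple[str, str]:
    """Split 'gs://bucket/folder' into (bucket, folder prefix)."""
    if not path.startswith("gs://"):
        raise ValueError(f"not a Cloud Storage path: {path!r}")
    bucket, _, prefix = path[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {path!r}")
    # Normalize the prefix so 'out' and 'out/' compare equal.
    return bucket, prefix.strip("/")


def check_pipeline_folders(export_path: str, results_path: str) -> None:
    """Raise ValueError if either path is malformed or the folders overlap."""
    export_bucket, export_prefix = split_gs_path(export_path)
    results_bucket, results_prefix = split_gs_path(results_path)
    if export_bucket != results_bucket:
        return  # Different buckets can never overlap.
    # Assumption: "dedicated" means neither folder contains the other.
    # An empty prefix is the bucket root, which contains every folder.
    a = export_prefix + "/" if export_prefix else ""
    b = results_prefix + "/" if results_prefix else ""
    if a.startswith(b) or b.startswith(a):
        raise ValueError("export and results folders overlap")
```

For example, check_pipeline_folders("gs://EXPORT_FOLDER", "gs://PROCESS_FOLDER") returns without error because the two paths name different buckets.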

Run the pipeline

REST

curl --location --request POST 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION:runPipeline' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${AUTH_TOKEN}" \
--data '{
    "name": "projects/PROJECT_NUMBER/locations/LOCATION",
    "process_with_doc_ai_pipeline": {
        "documents": [
          "projects/PROJECT_NUMBER/locations/LOCATION/documents/DOCUMENT"
        ],
        "export_folder_path": "gs://EXPORT_FOLDER",
        "processor_info": {
          "processor_name": "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR"
        },
        "processor_results_folder_path": "gs://PROCESS_FOLDER"
    },
    "request_metadata": {
        "user_info": {
            "id": "user:USER_EMAIL_ADDRESS"
        }
    }
}'

The documents field lists the resource names of the documents to process. The pipeline stores the exported documents in the Cloud Storage folder export_folder_path before sending them to the processor, and the processed documents are stored in the folder processor_results_folder_path. For more information about the request body fields, refer to the API documentation.
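When the request is generated from a script rather than typed by hand, the URL and body of the curl command above can be assembled programmatically. The following is a minimal Python sketch under stated assumptions: build_run_pipeline_request is a hypothetical helper, not a client-library function, and it mirrors the curl sample by using the same project number and location for the documents and the processor.

```python
import json


def build_run_pipeline_request(project_number, location, document_ids,
                               processor_id, export_folder,
                               results_folder, user_email):
    """Return (url, json_body) mirroring the runPipeline curl sample."""
    parent = f"projects/{project_number}/locations/{location}"
    url = f"https://contentwarehouse.googleapis.com/v1/{parent}:runPipeline"
    body = {
        "name": parent,
        "process_with_doc_ai_pipeline": {
            # Full resource names of the documents to process.
            "documents": [
                f"{parent}/documents/{doc_id}" for doc_id in document_ids
            ],
            "export_folder_path": export_folder,
            "processor_info": {
                # Assumption: the processor shares the project and location,
                # as in the curl sample.
                "processor_name": f"{parent}/processors/{processor_id}",
            },
            "processor_results_folder_path": results_folder,
        },
        "request_metadata": {"user_info": {"id": f"user:{user_email}"}},
    }
    return url, json.dumps(body, indent=2)
```

The returned URL and body can be sent with any HTTP client as a POST request, using the same Content-Type and Authorization headers as the curl command.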

This command returns the resource name of a long-running operation. Use this resource name to track the progress of the pipeline, as described in the next section.

Get long-running operation result

REST

curl --location --request GET 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION' \
--header "Authorization: Bearer ${AUTH_TOKEN}"
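Because the pipeline runs asynchronously, the GET request above usually has to be repeated until the operation finishes. The following is a minimal Python polling sketch, assuming the response follows the standard Google long-running operation shape (a done flag, plus either an error or a response field). The function wait_for_operation and its parameters are hypothetical, and fetch_operation stands for any caller-supplied function that performs the GET request and returns the decoded JSON as a dict.

```python
import time


def wait_for_operation(fetch_operation, interval_seconds=10.0, max_polls=60,
                       sleep=time.sleep):
    """Poll until the operation reports done, then return its response.

    Raises RuntimeError if the finished operation carries an error, and
    TimeoutError if it is still running after max_polls attempts.
    """
    for attempt in range(max_polls):
        operation = fetch_operation()
        if operation.get("done"):
            if "error" in operation:
                error = operation["error"]
                raise RuntimeError(
                    f"pipeline failed: code={error.get('code')} "
                    f"message={error.get('message')}")
            return operation.get("response", {})
        # Wait between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"operation not done after {max_polls} polls")
```

The polling interval and the maximum number of polls are arbitrary defaults; choose values that suit the number and size of the documents being processed.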

Next steps

Go to the Document AI Warehouse UI or call the document:get API to check whether the documents were updated successfully.