Instructions on project setup

This is a step-by-step guide on how to set up and use the Cloud Storage - Document AI Warehouse connector.

Prerequisites

  1. Ask to join the gcw-earlyaccess Google Group to get access to the source code repository. The product team should approve your request within 24 hours. If your request is urgent, reach out to the product team directly to have it approved.

  2. A valid project: Follow the create a project steps to create a project if you don't have an existing one.
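
    If you prefer the command line, here is a hedged sketch of creating a project with gcloud (the project ID is a placeholder of your choice):

        # Create a new project (the project ID must be globally unique)
        gcloud projects create <project_id>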

  3. Billing: Ensure that billing is enabled for your project.
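
    For example, a billing account can be linked from the command line (the billing account ID is a placeholder; on older gcloud versions this command may be under gcloud beta billing):

        # Link the project to a billing account
        gcloud billing projects link <project_id> --billing-account=<billing_account_id>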

  4. Project provisioning: Follow provision through Google Cloud console to provision the project. Follow the guide all the way through, including the access token steps, so that you can run any curl commands that require an authorization token. For example, you can run the following command in a shell to get an authorization token:

    AUTH_TOKEN=$(gcloud auth print-identity-token)
    
  5. A valid service account: A service account in the format dw-ui-service-account@<project_id>.iam.gserviceaccount.com should have been created during the provisioning process. If not, follow managing service accounts to create one.
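
    If the account is missing, here is a sketch of creating it with gcloud (the display name is arbitrary):

        # Create the service account expected by the connector
        gcloud iam service-accounts create dw-ui-service-account \
            --project=<project_id> \
            --display-name="Document AI Warehouse UI service account"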

  6. A Cloud Storage bucket: Buckets are the basic containers that hold your data; everything that you store in Cloud Storage must be contained in a bucket. Follow create buckets to create a Cloud Storage bucket. If you need to use a Document AI parser, you also need a separate bucket for storing the Document AI parsing outputs.
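
    For example, buckets can be created from the command line (the bucket names and location are placeholders; on older gcloud versions use gsutil mb instead):

        # Bucket for the raw input files
        gcloud storage buckets create gs://<raw_files_bucket> --project=<project_id> --location=us

        # Separate bucket for Document AI parsing outputs (only needed if you use a parser)
        gcloud storage buckets create gs://<docai_output_bucket> --project=<project_id> --location=us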

  7. Document schema: A document schema is used to define the document structure. Follow create a document schema to create one if it does not already exist.

    A property named DocaiEntities is required to store Document AI parsed entities as a map property. If you are using an existing schema, you need to update the schema with this new property. Note that this map property is not searchable or filterable.

    If you need to convert entities to searchable properties, you need to know the Document AI processor schema and add its entity types to the Document AI Warehouse property schema. If an entity type is not in the Document AI Warehouse property schema (or the names do not match), that entity will not be searchable or filterable. Currently, nested entities are not supported.

    For example, if the Document AI processor schema contains "line_item", "line_item/amount", "line_item/description", "receiver_name", "receiver_address", and "total_amount", then the Document AI Warehouse schema should be as follows (the nested "line_item/amount" and "line_item/description" entities are omitted because nested entities are not supported):

    {
        "property_definitions": [
            {
                "name": "DocaiEntities",
                "display_name": "DocaiEntities",
                "is_repeatable": false,
                "is_filterable": true,
                "is_searchable": true,
                "is_metadata": false,
                "is_required": false,
                "map_type_options": {}
            }, {
                "name": "line_item",
                "display_name": "line_item",
                "is_searchable": true,
                "is_filterable": true,
                "text_type_options": {}
            }, {
                "name": "receiver_name",
                "display_name": "receiver_name",
                "is_searchable": true,
                "is_filterable": true,
                "text_type_options": {}
            }, {
                "name": "receiver_address",
                "display_name": "receiver_address",
                "is_searchable": true,
                "is_filterable": true,
                "text_type_options": {}
            }, {
                "name": "total_amount",
                "display_name": "total_amount",
                "is_searchable": true,
                "is_filterable": true,
                "float_type_options": {}
            }
        ]
    }
    

    The full list of fields/entities that can be detected by each Document AI parser can be found at Fields detected.
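
    As a hedged sketch, a schema like the one above could also be created through the Document AI Warehouse REST API; the contentwarehouse.googleapis.com endpoint shape, the us location, and the schema.json file name below are assumptions, so confirm them against the create a document schema guide:

        # Assumes schema.json contains {"display_name": "...", "property_definitions": [...]} as above
        curl -X POST \
            -H "Authorization: Bearer $(gcloud auth print-access-token)" \
            -H "Content-Type: application/json" \
            -d @schema.json \
            "https://contentwarehouse.googleapis.com/v1/projects/<project_number>/locations/us/documentSchemas"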

  8. Project-level ACL: To invoke the Cloud Storage connector, the user needs to have project-level document creation permission. Refer to the fetchACL/setACL guide or the preview Admin UI guide for how to set it up.
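
    As a hedged sketch of the setAcl call (the role name, request shape, and endpoint below are assumptions; confirm them against the fetchACL/setACL guide):

        # Grant a user project-level document creation permission (role name is an assumption)
        curl -X POST \
            -H "Authorization: Bearer $(gcloud auth print-access-token)" \
            -H "Content-Type: application/json" \
            -d '{"policy": {"bindings": [{"role": "roles/contentwarehouse.documentCreator", "members": ["user:<user_email>"]}]}}' \
            "https://contentwarehouse.googleapis.com/v1/projects/<project_number>:setAcl"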

  9. Document AI processors and a staging bucket: To use Document AI processors, follow the guide to create processors in the Google Cloud console. You also need a Cloud Storage bucket to store the parsed results; it is recommended to use a bucket separate from the raw file bucket for the staging outputs.
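
    If you prefer the API over the console, here is a hedged sketch of creating a processor with a REST call (the processor type is only an example, and the regional endpoint is an assumption):

        # Create an invoice parser processor (processor type shown is an example)
        curl -X POST \
            -H "Authorization: Bearer $(gcloud auth print-access-token)" \
            -H "Content-Type: application/json" \
            -d '{"displayName": "my-invoice-parser", "type": "INVOICE_PROCESSOR"}' \
            "https://us-documentai.googleapis.com/v1/projects/<project_id>/locations/us/processors"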

  10. UI Configuration: If you need to use our preview UI, please follow the admin UI guide for setting up the project.

Initialize Cloud Storage - Document AI Warehouse connector using Cloud Shell

  1. Go to Google Cloud console.
  2. Be sure to choose the project that you want to set up the Cloud Storage - Document AI Warehouse connector for. (screenshot: set-up1)
  3. Click Activate Cloud Shell at the top of the Google Cloud console. (screenshot: set-up2)

    A Cloud Shell session opens inside a new frame at the bottom of the Google Cloud console and displays a command-line prompt. (screenshot: set-up3)

  4. Check that the selected project is the one you want to use. If not, run:

        gcloud config set project <project_id>
    

    (screenshot: set-up4)

  5. Clone Cloud Storage-Document AI Warehouse connector source repository:

        gcloud source repos clone gcs-connector --project=docai-warehouse-gcs-connector
    
  6. Navigate to the directory where the bash script is located (likely via cd gcs-connector). (screenshot: set-up5)

  7. Run the initialization script using the following command:

        bash initialize.sh -s <SERVICE_ACCOUNT_EMAIL>
    

    (screenshot: set-up6)

  8. The script does the following (a rough sketch of equivalent gcloud commands is shown after this list):

    1. Grants the required roles to the service account. See roles required for service account for the roles you need.
    2. Enables the required APIs. See APIs required.
    3. Creates two Cloud Tasks queues.
    4. Deploys two Cloud Workflows.
    5. Deploys three Cloud Functions.
  9. If there is no error, the script runs through the first four steps without requiring any interaction with the terminal. However, when deploying the Cloud Functions, there is a prompt:

    Allow unauthenticated invocations of new function [<cloud function name>]? (y/N)?
    

    Press Enter to accept the default (N) for each prompt (there are three prompts in total).
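
For reference, here is a rough, hedged sketch of commands similar to what the initialization script runs; the API list, role, queue name, and region below are illustrative assumptions, and the script itself is the source of truth:

    # Enable the APIs used by the connector (list is illustrative, not exhaustive)
    gcloud services enable workflows.googleapis.com cloudfunctions.googleapis.com \
        cloudtasks.googleapis.com contentwarehouse.googleapis.com

    # Grant a role to the service account (role shown is an example)
    gcloud projects add-iam-policy-binding <project_id> \
        --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
        --role="roles/workflows.invoker"

    # Create a Cloud Tasks queue (queue name and location are assumptions)
    gcloud tasks queues create <queue_name> --location=us-central1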

Verify the setup

  1. Navigate to the Workflows UI page; the deployed workflows should be shown as in the screenshot below. (screenshot: set-up7)
  2. Navigate to the Cloud Functions UI page to validate that your functions have been deployed. (screenshot: set-up8)
  3. Navigate to the Cloud Tasks UI page. There should be two queues created. (screenshot: set-up9)
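
The same checks can also be made from Cloud Shell; the location flags below are assumptions and should match the region used by the initialization script:

    # List the deployed workflows, functions, and task queues
    gcloud workflows list --location=us-central1
    gcloud functions list
    gcloud tasks queues list --location=us-central1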

Update to the latest version

We continue to release new versions of the connector in the Document AI Warehouse - Cloud Storage connector repo. To update your project to the latest version:

  1. Navigate to the Cloud Shell location where you cloned the repo.

  2. Pull the latest code via:

    git pull origin master
    
  3. At the top level of the repo, run the deploy.sh command:

    bash deploy.sh -s <SERVICE_ACCOUNT_EMAIL>
    

After the script finishes, navigate to the Workflows UI and Cloud Functions UI to check that the last updated or deployed time has changed.
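
If you prefer the command line, the update timestamps can also be checked with gcloud; the function and workflow names and the location are placeholders:

    # Check when a function or workflow was last deployed or updated
    gcloud functions describe <function_name> --format="value(updateTime)"
    gcloud workflows describe <workflow_name> --location=us-central1 --format="value(updateTime)"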

Next step

Proceed to Run Cloud Storage connector.