Manage document schemas

This document describes how to manage the document schemas in Document AI Warehouse, including create, fetch, list, update, and delete operations.

What are document schemas

Each document has a document type, and each document type is defined by a schema.

A document schema defines the structure for a document type (for example, Invoice or Paystub) in Document AI Warehouse, where admins can specify Properties of different data types (Text | Numeric | Date | Enumeration).

Properties are used to represent the extracted data, classification tags or other business tags appended to documents by AI or human users - for example, Invoice_Amount (numeric), Due_Date (date), or Supplier_Name (text).

  1. Property Attributes: Each property can be declared as

    1. Filterable - can be used to filter search results

    2. Searchable - indexed so it can be found in search queries

    3. Required/Optional - a required property must be present in every document of this type. We recommend keeping most properties optional, and marking a property as required only when every document must carry it.

  2. Extensible Schema: in some cases, end users with Edit access need to add schema properties to documents or delete them. This is supported by a "MAP property", which is a list of key-value pairs.

    1. Each key-value pair in a MAP property can have any of the data types (Text | Numeric | Date | Enumeration).

    2. For example, an Invoice schema may contain a MAP property "Invoice_Entities" with the following key-value pairs:

      • Invoice_Amount (numeric) 1000

      • Due_Date (date) 12/24/2021

      • Supplier_Name (text) ABC Corp

  3. Immutability of Schema: schemas and schema properties can be added, but currently they cannot be freely edited or deleted afterward (see "Update a schema" for the limited changes that are supported), so define each schema carefully.
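
To make the concepts above concrete, the following Python sketch assembles a schema body as a plain dictionary, reusing the Invoice example. It is only an illustration: the helper name property_definition is hypothetical, and the field names other than those shown in the REST examples on this page (is_filterable, is_required, float_type_options, date_time_type_options, map_type_options) are assumptions that should be checked against the API reference.

```python
import json


def property_definition(name, display_name, type_options_field, **flags):
    """Build one property definition as a plain dict (hypothetical helper).

    type_options_field is the key that selects the data type, for example
    "text_type_options". Field names are assumptions; verify them against
    the API reference.
    """
    definition = {"name": name, "display_name": display_name}
    definition.update(flags)             # for example is_searchable=True
    definition[type_options_field] = {}  # empty options object for the chosen type
    return definition


invoice_schema = {
    "document_schema": {
        "display_name": "Invoice",
        "property_definitions": [
            property_definition(
                "invoice_amount", "Invoice_Amount", "float_type_options",
                is_filterable=True,
            ),
            property_definition(
                "due_date", "Due_Date", "date_time_type_options",
                is_filterable=True,
            ),
            property_definition(
                "supplier_name", "Supplier_Name", "text_type_options",
                is_searchable=True, is_required=True,
            ),
            # A MAP property holds key-value pairs that editors add later.
            property_definition(
                "invoice_entities", "Invoice_Entities", "map_type_options",
            ),
        ],
    }
}

print(json.dumps(invoice_schema, indent=2))
```

Only Supplier_Name is marked as required here, in line with the recommendation to keep most properties optional.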

Prerequisite

You need an access token to run the following examples. To generate one, see Quickstart.

Create a schema

Create a document schema.

REST

curl --location --request POST 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {AUTH_TOKEN}' \
--data '{
    "document_schema": {
        "display_name":"Test Doc Schema",
        "property_definitions": [
            {
                "name": "plaintiff",
                "display_name": "Plaintiff",
                "is_searchable": true,
                "is_repeatable": true,
                "text_type_options": {}
            }
        ]
    }
}'
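
As a rough Python equivalent of the curl command above, the following sketch builds the same POST request with the standard library only. The function name build_create_schema_request is hypothetical, and the request is constructed but not sent, so the sketch runs without credentials.

```python
import json
import urllib.request

API_ROOT = "https://contentwarehouse.googleapis.com/v1"


def build_create_schema_request(project_number, location, auth_token, schema):
    """Return an unsent urllib Request that mirrors the curl command."""
    url = f"{API_ROOT}/projects/{project_number}/locations/{location}/documentSchemas"
    # json.dumps always emits valid JSON, which avoids hand-editing mistakes
    # such as a trailing comma inside property_definitions.
    body = json.dumps({"document_schema": schema}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {auth_token}",
        },
    )


schema = {
    "display_name": "Test Doc Schema",
    "property_definitions": [
        {
            "name": "plaintiff",
            "display_name": "Plaintiff",
            "is_searchable": True,
            "is_repeatable": True,
            "text_type_options": {},
        }
    ],
}

request = build_create_schema_request("PROJECT_NUMBER", "LOCATION", "AUTH_TOKEN", schema)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and decode the JSON response.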

Get a schema

Get details of a document schema.

REST

curl --request GET --url 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/{document_schema_id}' \
--header 'Authorization: Bearer {AUTH_TOKEN}' \
--header 'Content-Type: application/json; charset=UTF-8'

List schemas

List document schemas.

REST

curl --request GET --url 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas' \
--header 'Authorization: Bearer {AUTH_TOKEN}' \
--header 'Content-Type: application/json; charset=UTF-8'
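
When a project holds many schemas, the list call may return results in pages. The following Python sketch shows one way to walk through all pages. It assumes the usual Google API paging fields (pageSize and pageToken in the query, documentSchemas and nextPageToken in the response), which this page does not document, so confirm them in the API reference. The fetch_json argument is a hypothetical caller-supplied function that performs the authenticated GET; a stand-in is used here so the sketch runs without credentials.

```python
from urllib.parse import urlencode

API_ROOT = "https://contentwarehouse.googleapis.com/v1"


def list_all_schemas(project_number, location, fetch_json, page_size=50):
    """Yield every schema, following page tokens until none is returned.

    fetch_json(url) must perform an authenticated GET and return the
    decoded JSON body as a dict (hypothetical helper supplied by the caller).
    """
    base = f"{API_ROOT}/projects/{project_number}/locations/{location}/documentSchemas"
    page_token = ""
    while True:
        params = {"pageSize": page_size}
        if page_token:
            params["pageToken"] = page_token
        page = fetch_json(f"{base}?{urlencode(params)}")
        yield from page.get("documentSchemas", [])
        page_token = page.get("nextPageToken", "")
        if not page_token:
            break


# Stand-in for a real HTTP call: returns two pages of made-up data.
def fake_fetch(url):
    if "pageToken=t2" in url:
        return {"documentSchemas": [{"displayName": "B"}]}
    return {"documentSchemas": [{"displayName": "A"}], "nextPageToken": "t2"}


names = [s["displayName"] for s in list_all_schemas("123", "us", fake_fetch)]
print(names)
```

Passing the HTTP call in as a function keeps the paging logic separate from authentication, so the same loop works with any HTTP client.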

Delete a schema

Delete a document schema.

REST

curl --request DELETE --url 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/{document_schema_id}' \
--header 'Authorization: Bearer {AUTH_TOKEN}' \
--header 'Content-Type: application/json; charset=UTF-8'

Update a schema

Update a document schema. Currently the update logic supports only a limited set of changes, listed below, the main one being the addition of new property definitions. The updated document schema must include all property definitions present in the existing schema.

  • Supported:

    • For existing properties, users can change the following metadata settings: is_repeatable, is_metadata, is_required.
    • For existing ENUM properties, users can add new ENUM possible values or delete existing ENUM possible values. They can update the EnumTypeOptions.validation_check_disabled flag to disable the validation check. The validation check is used to make sure the ENUM values specified in the documents are in the range of possible ENUM values defined in the property definition when calling the CreateDocument API.
    • Adding new property definitions is supported.
  • Not supported:

    • For existing schema, updates to display_name and document_is_folder are not allowed.
    • For existing properties, updates to name, display_name and value_type_options are not allowed.
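
The rules above can be checked locally before sending an update. The following Python sketch is a client-side pre-check only, not the service's own validation. The function names are hypothetical, and the sketch assumes schemas are represented as dictionaries in the snake_case form used by the REST examples on this page.

```python
def _type_keys(prop):
    """Return the *_type_options keys present in a property definition."""
    return {key for key in prop if key.endswith("_type_options")}


def check_schema_update(existing, proposed):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []

    # Schema-level fields that must stay the same.
    for field in ("display_name", "document_is_folder"):
        if existing.get(field) != proposed.get(field):
            problems.append(f"schema field '{field}' cannot be changed")

    proposed_by_name = {p["name"]: p for p in proposed.get("property_definitions", [])}
    for old in existing.get("property_definitions", []):
        new = proposed_by_name.get(old["name"])
        if new is None:
            # Covers both dropped and renamed properties.
            problems.append(f"property '{old['name']}' is missing from the update")
            continue
        if old.get("display_name") != new.get("display_name"):
            problems.append(f"property '{old['name']}': display_name cannot be changed")
        # Only the kind of type is compared, because ENUM values may change.
        if _type_keys(old) != _type_keys(new):
            problems.append(f"property '{old['name']}': value type cannot be changed")
    return problems


existing = {
    "display_name": "Test Doc Schema",
    "property_definitions": [
        {"name": "plaintiff", "display_name": "Plaintiff",
         "is_repeatable": True, "text_type_options": {}},
    ],
}
allowed = {
    "display_name": "Test Doc Schema",
    "property_definitions": [
        {"name": "plaintiff", "display_name": "Plaintiff",
         "is_repeatable": False, "text_type_options": {}},
        {"name": "court", "display_name": "Court", "text_type_options": {}},
    ],
}
rejected = {"display_name": "Test Doc Schema", "property_definitions": []}

print(check_schema_update(existing, allowed))
print(check_schema_update(existing, rejected))
```

The check treats any existing property that is absent from the update as a violation, which reflects the requirement that the updated schema include all existing property definitions.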

REST

curl --request PATCH --url 'https://contentwarehouse.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION/documentSchemas/{document_schema_id}' \
--header 'Authorization: Bearer {AUTH_TOKEN}' \
--header 'Content-Type: application/json; charset=UTF-8' \
--data '{
    "document_schema": {
        "display_name": "Test Doc Schema",
        "property_definitions": [
            {
                "name": "plaintiff",
                "display_name": "Plaintiff",
                "is_searchable": true,
                "is_repeatable": true,
                "text_type_options": {}
            }
        ]
    }
}'

Python

For more information, see the Document AI Warehouse Python API reference documentation.

To authenticate to Document AI Warehouse, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from google.cloud import contentwarehouse

# TODO(developer): Uncomment these variables before running the sample.
# project_number = "YOUR_PROJECT_NUMBER"
# location = "us" # Format is 'us' or 'eu'
# document_schema_id = "YOUR_SCHEMA_ID"


def update_document_schema(
    project_number: str, location: str, document_schema_id: str
) -> None:
    # Create a Schema Service client
    document_schema_client = contentwarehouse.DocumentSchemaServiceClient()

    # The full resource name of the document schema, e.g.:
    # projects/{project_number}/locations/{location}/documentSchemas/{document_schema_id}
    document_schema_path = document_schema_client.document_schema_path(
        project=project_number,
        location=location,
        document_schema=document_schema_id,
    )

    # Define Schema Property of Text Type with updated values
    updated_property_definition = contentwarehouse.PropertyDefinition(
        name="stock_symbol",  # Must be unique within a document schema (case insensitive)
        display_name="Searchable text",
        is_searchable=True,
        is_repeatable=False,
        is_required=True,
        text_type_options=contentwarehouse.TextTypeOptions(),
    )

    # Define Update Document Schema Request
    update_document_schema_request = contentwarehouse.UpdateDocumentSchemaRequest(
        name=document_schema_path,
        document_schema=contentwarehouse.DocumentSchema(
            display_name="My Test Schema",
            property_definitions=[updated_property_definition],
        ),
    )

    # Update Document schema
    updated_document_schema = document_schema_client.update_document_schema(
        request=update_document_schema_request
    )

    # Read the output
    print(f"Updated Document Schema: {updated_document_schema}")

Next steps