Export metadata to BigQuery

Introduction to BigQuery connector

The BigQuery connector helps you export document metadata (including properties) stored in Document AI Warehouse to your BigQuery table. With your data in BigQuery, you can run analyses and create reports and dashboards to support your business decisions.

To enable the BigQuery connector, set up a BigQuery table with the necessary permissions granted, and configure the asynchronous export tasks through the API. The BigQuery connector then exports the data from Document AI Warehouse to your BigQuery tables.

Before you start

Set up Document AI Warehouse, and ingest your documents. For more information, follow the quick start.

You must ensure that the project that hosts your BigQuery table is the same project used by Document AI Warehouse to store your documents. In other words, data is always exported from Document AI Warehouse to a BigQuery table in the same project.

In the project, you must have the Owner (roles/owner) role, or you must have the resourcemanager.projects.getIamPolicy and resourcemanager.projects.setIamPolicy permissions.

Set up BigQuery access

Bind the service account doc-ai-warehouse-dw-bq-connector@system.gserviceaccount.com to the BigQuery Admin role:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:doc-ai-warehouse-dw-bq-connector@system.gserviceaccount.com" \
  --role="roles/bigquery.admin"

Set up BigQuery dataset and table

Set up a BigQuery dataset and table for Document AI Warehouse to export the data. If you don't have a BigQuery dataset, follow creating datasets to create one.
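If you prefer to stay in SQL, the following sketch creates the dataset with BigQuery's CREATE SCHEMA DDL statement; PROJECT_ID, DATASET_NAME, and DATASET_LOCATION are placeholders for your own values, and the description is only illustrative.

-- Create the dataset that will hold the export table.
-- The dataset location must match the location you use in the export configuration.
CREATE SCHEMA IF NOT EXISTS `PROJECT_ID.DATASET_NAME`
OPTIONS (
  location = "DATASET_LOCATION",
  description = "Dataset for Document AI Warehouse metadata exports"
);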

Create a BigQuery table in your BigQuery dataset. Following the BigQuery instructions, you can create the table with the following sample DDL statement:

CREATE TABLE `PROJECT_ID.DATASET_NAME.TABLE_NAME`
(
  project_number INT64,
  location STRING,
  mod_type STRING,
  document_id STRING,
  document_json JSON,
  create_time TIMESTAMP,
  creator STRING,
  update_time TIMESTAMP,
  updater STRING,
  document_state STRING,
  export_time TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(export_time, HOUR)
OPTIONS(
  partition_expiration_days=150,
  description="table partitioned by export_time on hour with expiry"
);

The DDL creates a new BigQuery table for you. The table is partitioned by hour on export_time, and each partition expires 150 days after it is created.
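To confirm that the expiration and description options were applied, you can optionally query the dataset's INFORMATION_SCHEMA.TABLE_OPTIONS view. This is only a suggested sanity check, not a required setup step.

-- Optional check: list the options set on the new table,
-- including partition_expiration_days and description.
SELECT
  option_name,
  option_value
FROM
  `PROJECT_ID.DATASET_NAME.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE
  table_name = 'TABLE_NAME';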

Configure BigQuery connector

Create data export configuration

The following instructions create a new data export configuration, which sets up the asynchronous jobs that export your data. We recommend starting with an empty table for each new data export job. Refer to the API reference for the details of the configuration.

You have the following run frequency options, which you configure through the FREQUENCY field. Refer to the API reference.

  • ADHOC: The job runs only once. All data is exported to your BigQuery table.
  • DAILY: The job runs daily. The first run exports all data to your BigQuery table. After the initial export completes, only the previous day's data changes (the delta since the last successful synchronization) are exported to your BigQuery table.
  • HOURLY: The job runs hourly. The first run exports all data to your BigQuery table. After the initial export completes, only the previous hour's data changes (the delta since the last successful synchronization) are exported to your BigQuery table.

Before using any of the request data, make the following replacements:

  • PROJECT_NUMBER: your Google Cloud project number
  • LOCATION: your Document AI Warehouse location (such as `us`)
  • DATASET_LOCATION: your dataset location
  • DATASET_NAME: your dataset name
  • TABLE_NAME: your table name
  • FREQUENCY: one of ADHOC, DAILY or HOURLY.

Request JSON body:

{
  "projectNumber": PROJECT_NUMBER,
  "location": "DATASET_LOCATION",
  "dataset": "DATASET_NAME",
  "table": "TABLE_NAME",
  "frequency": "FREQUENCY",
  "state": "ACTIVE"
}

Send the request to the Document AI Warehouse API. You should receive a JSON response that includes the ID of the new export job; note this job ID, because you need it to delete the configuration later.

Job execution

Once you have successfully created a job, it runs based on your configuration. Jobs run asynchronously because exports take time to execute, and depending on the amount of data to be exported, the first run can take a while to finish. For a daily job, allow up to 24 hours for the results to appear in the BigQuery table.
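After a run finishes, a quick query against the export table shows how many rows arrived in each hourly partition and when the most recent export happened. This is only a suggested check, using the columns defined in the DDL above.

-- Suggested check: exported rows per hourly partition and the latest export time.
SELECT
  TIMESTAMP_TRUNC(export_time, HOUR) AS export_hour,
  COUNT(*) AS exported_rows,
  MAX(export_time) AS latest_export_time
FROM
  `PROJECT_ID.DATASET_NAME.TABLE_NAME`
GROUP BY
  export_hour
ORDER BY
  export_hour DESC;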

Delete data export configuration

The following command deletes (by archiving) a job that you created.

Before using any of the request data, make the following replacements:

  • PROJECT_NUMBER: your Google Cloud project number
  • LOCATION: your Document AI Warehouse location (such as `us`)
  • JOB_ID: your job ID, in the response when you created it

Request JSON body:

{}

Send the request to the Document AI Warehouse API. A successful request returns a JSON response.

After this, your export job is deleted (archived) and Document AI Warehouse doesn't run it any more.

Explore data ingested into BigQuery

To extract document metadata and properties into distinct table fields in BigQuery for your analysis needs, you can use the sample queries below. The extracted fields can also be used in Looker Studio or any BI dashboard tool to visualize relationships within the data.

Extract key fields from document_json

This query selects relevant fields from the data export including key fields from document metadata (stored in the document_json field).

DROP VIEW IF EXISTS
  `DATASET_NAME.VIEW_NAME_1`;
CREATE VIEW
  `DATASET_NAME.VIEW_NAME_1` AS
SELECT
  project_number,
  document_id,
  mod_type,
  create_time,
  update_time,
  location,
  creator,
  updater,
  document_state,
  SPLIT(JSON_EXTRACT_SCALAR(document_json, '$.documentSchemaName'), '/')[SAFE_OFFSET(ARRAY_LENGTH(SPLIT(JSON_EXTRACT_SCALAR(document_json, '$.documentSchemaName'), '/')) - 1)] AS document_schema_name,
  JSON_EXTRACT_SCALAR(document_json, '$.name') AS document_name,
  JSON_EXTRACT_SCALAR(document_json, '$.rawDocumentFileType') AS raw_document_file_type,
  JSON_EXTRACT(document_json, '$.properties') AS properties
FROM
  `DATASET_NAME.SYSTEM_METADATA_AND_DOC_PROPERTIES_TABLE_EXPORT_NAME`;

Unnesting properties from document_json

This query unnests properties from the document metadata (document_json) to create key-value pairs (property name, value). These key-value pairs are transformed into individual table fields in the next query to enable property-level data exploration and dashboard visualization.

DROP VIEW IF EXISTS
  `DATASET_NAME.VIEW_NAME_2`;
CREATE VIEW
  `DATASET_NAME.VIEW_NAME_2` AS
SELECT
  * EXCEPT(key_value_pair, properties, raw_document_file_type)
FROM (
  SELECT
    *,
    REPLACE(JSON_VALUE(key_value_pair, '$.name'), '/', '-') AS property_name,
    -- Note: values are either text OR float values
    CASE
      WHEN JSON_VALUE(key_value_pair, '$.textValues.values[0]') IS NULL
        THEN JSON_VALUE(key_value_pair, '$.floatValues.values[0]')
      ELSE JSON_VALUE(key_value_pair, '$.textValues.values[0]')
    END AS value,
    CASE
      WHEN raw_document_file_type IS NULL THEN "RAW_DOCUMENT_FILE_TYPE_UNSPECIFIED"
      ELSE raw_document_file_type
    END AS document_file_type
  FROM
    `DATASET_NAME.VIEW_NAME_1`,
    UNNEST(JSON_EXTRACT_ARRAY(properties)) AS key_value_pair
);
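Before running the pivot step in the next section, you can optionally list the distinct property names that it will turn into columns. This is only a sanity check against the view created above.

-- Optional check: the distinct property names that the pivot step turns into columns.
SELECT DISTINCT
  property_name
FROM
  `DATASET_NAME.VIEW_NAME_2`
ORDER BY
  property_name;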

Pivoting properties from document_json to create table fields in BigQuery

The following procedure creates a table with all of the document properties transformed into individual table fields by pivoting the properties and their associated values. The resulting table can be used to derive further insights through subsequent queries, in Looker Studio, and in other BI visualization tools.

DECLARE
  property_field STRING;

-- Extracting distinct property_names from the previous view and storing them in property_field, declared above
EXECUTE IMMEDIATE
  """SELECT STRING_AGG(CONCAT("'", property_name, "'")) FROM (SELECT DISTINCT property_name FROM DATASET_NAME.VIEW_NAME_2)""" INTO property_field;

DROP TABLE IF EXISTS `DATASET_NAME.ANALYTICS_TABLE_NAME`;

-- Creating the pivot table with the aid of the extracted distinct property_names
-- Pivot on property_name and value (that is, create a new column for each property_name and substitute its value)
-- Values remain STRING here; cast numerical values to FLOAT64/INT64 in later cleaning steps as needed
EXECUTE IMMEDIATE
  FORMAT("""
CREATE TABLE `DATASET_NAME.ANALYTICS_TABLE_NAME` AS
SELECT * FROM `DATASET_NAME.VIEW_NAME_2`
PIVOT(MIN(value) FOR property_name IN (%s))""", property_field);
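As a starting point for exploring the pivoted table, the following sample query summarizes document counts by schema and file type. It uses only fields produced by the views above.

-- Example exploration query: document counts by schema and file type.
SELECT
  document_schema_name,
  document_file_type,
  COUNT(DISTINCT document_id) AS document_count
FROM
  `DATASET_NAME.ANALYTICS_TABLE_NAME`
GROUP BY
  document_schema_name,
  document_file_type
ORDER BY
  document_count DESC;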

Data cleaning and transformation procedures (business-case specific)

Depending on the data ingested into BigQuery, you might need to perform additional data cleaning and transformation procedures to enable further analysis. Such procedures vary from case to case (dataset to dataset) and should be performed as appropriate.

Some examples of data cleaning procedures include (but are not limited to) the following; a sample sketch appears after this list:

  • Unifying date formats.
  • Consolidating property values.
  • Casting data types to strings, floats, and integers, for example.
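As an illustration only, the sketch below shows what such cleaning might look like for the pivoted table. The columns invoice_date and total_amount are hypothetical property names from an invoice schema, and CLEANED_VIEW_NAME is a placeholder; adapt the statement to the property columns that actually exist in your table.

-- Hypothetical cleaning sketch: unify two date formats and cast an amount column
-- to FLOAT64. invoice_date and total_amount are placeholder property columns.
CREATE OR REPLACE VIEW `DATASET_NAME.CLEANED_VIEW_NAME` AS
SELECT
  * EXCEPT(invoice_date, total_amount),
  COALESCE(
    SAFE.PARSE_DATE('%Y-%m-%d', invoice_date),
    SAFE.PARSE_DATE('%m/%d/%Y', invoice_date)
  ) AS invoice_date,
  SAFE_CAST(total_amount AS FLOAT64) AS total_amount
FROM
  `DATASET_NAME.ANALYTICS_TABLE_NAME`;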

Visualizing the data in Looker Studio

Once your data has been extracted, cleansed, and transformed in BigQuery, your final dataset can be connected to Looker Studio for visual analysis.

Looker Dashboards

The sample dashboards outlined here showcase possible visualizations that can be created from your dataset. In this scenario, the sample data export from Document AI Warehouse consists of W2s and invoices (two schemas).

Public Facing Looker Dashboards

Sample view: Document AI Warehouse analytics overview

The following dashboard provides you with high-level insight into the variety of documents ingested into your Document AI Warehouse instance.


You are able to view document-level details including:

  • Total number of documents.
  • Total number of document schemas.
  • Record count by document schema.
  • Document file type (such as PDF, text, or unspecified type).

You are further able to use properties extracted from the document metadata (document_json) to build key breakdowns for the Invoices and W2s ingested into BigQuery.

Sample view: business-specific insights dashboard (invoices)

The following dashboard provides the user with a detailed look into a single document schema (invoices) to enable insights on all of the invoices ingested into Document AI Warehouse.


You are able to view schema-specific details on invoices, for example (a sample query for the first of these follows the list):

  • Top suppliers by invoice amounts.
  • Suppliers by location.
  • Invoice dates and their corresponding due dates.
  • Trends in invoices month over month by amount and record count.
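If you want to reproduce the first of these breakdowns directly in BigQuery, a query along the following lines would work, assuming your invoice schema exposes hypothetical property columns named supplier_name and total_amount in the pivoted table, and that the schema itself is named invoice.

-- Hypothetical example: top suppliers by total invoice amount.
-- supplier_name, total_amount, and the schema name 'invoice' are placeholders;
-- use the property and schema names defined in your own documents.
SELECT
  supplier_name,
  SUM(SAFE_CAST(total_amount AS FLOAT64)) AS total_invoice_amount,
  COUNT(DISTINCT document_id) AS invoice_count
FROM
  `DATASET_NAME.ANALYTICS_TABLE_NAME`
WHERE
  document_schema_name = 'invoice'
GROUP BY
  supplier_name
ORDER BY
  total_invoice_amount DESC
LIMIT 10;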

Connecting data source to dashboards

To use these dashboard samples as a starting point to visualize your dataset, you can connect them to your BigQuery data source.

Before connecting the sample dashboards to your BigQuery data source, ensure you are signed in to your account associated with your Google Cloud environment.


In the sample report, select the More options menu to reveal the dropdown options.


Select Make a Copy.


Under the New Data Source subsection, select Create data source.


Select BigQuery.


Select the project where your dataset is stored, then follow the prompts to select your dataset and table. Click Connect.


Click Add to Report.


Click Copy Report.


Because you have your own copy of the dashboard connected to the extracted properties, you can edit and update its widgets as needed.