
Create object tables

This document describes how to access unstructured data in BigQuery by creating an object table.

To create an object table, you must complete the following tasks:

  1. Create a connection to read object information from Cloud Storage.
  2. Grant permission to read Cloud Storage information to the service account associated with the connection.
  3. Create the object table and associate it with the connection by using the CREATE EXTERNAL TABLE statement.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. Enable the BigQuery and BigQuery Connection APIs.

    Enable the APIs

  5. Ensure that your BigQuery administrator has created a connection and set up access to Cloud Storage.

Required roles

To work with object tables, your users need the following IAM permissions based on their role in your organization. For more information on user roles, see Security model. For more information about giving permissions, see Viewing the grantable roles on resources.

  • Data lake administrator

    To get the permissions that you need to connect to Cloud Storage, ask your administrator to grant you the BigQuery Connection Admin (roles/bigquery.connectionAdmin) role on the project.

    To get the permissions that you need to create and manage Cloud Storage buckets, ask your administrator to grant you the Storage Admin (roles/storage.admin) role on the project.

    These predefined roles contain the permissions required to connect to Cloud Storage and to create and manage Cloud Storage buckets. To see the exact permissions that are required, expand the Required permissions section:

    Required permissions

    • bigquery.connections.create
    • bigquery.connections.get
    • bigquery.connections.list
    • bigquery.connections.update
    • bigquery.connections.use
    • bigquery.connections.delete
    • storage.buckets.*
    • storage.objects.*

  • Data warehouse administrator

    To get the permissions that you need to create object tables, ask your administrator to grant you the following roles on the project:

    • BigQuery Data Editor (roles/bigquery.dataEditor) role
    • BigQuery Connection Admin (roles/bigquery.connectionAdmin) role

    These predefined roles contain the permissions required to create object tables. To see the exact permissions that are required, expand the Required permissions section:

    Required permissions

    • bigquery.tables.create
    • bigquery.tables.update
    • bigquery.connections.delegate

  • Data analyst

    To get the permissions that you need to query object tables, ask your administrator to grant you the following roles on the project:

    • BigQuery Data Viewer (roles/bigquery.dataViewer) role
    • BigQuery Connection User (roles/bigquery.connectionUser) role

    These predefined roles contain the permissions required to query object tables. To see the exact permissions that are required, expand the Required permissions section:

    Required permissions

    • bigquery.jobs.create
    • bigquery.tables.get
    • bigquery.tables.getData
    • bigquery.readsessions.create

    You might also be able to get these permissions with custom roles or other predefined roles.

Create object tables

Before you create an object table, you must have an existing dataset to contain it. For more information, see Creating datasets.

To create an object table:

SQL

Use the CREATE EXTERNAL TABLE statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET_ID.TABLE_NAME`
    WITH CONNECTION `REGION.CONNECTION_NAME`
    OPTIONS(
      object_metadata = 'SIMPLE',
      uris = ['BUCKET_PATH'[,...]],
      max_staleness = STALENESS_INTERVAL,
      metadata_cache_mode = 'CACHE_MODE');
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • DATASET_ID: the ID of the dataset to contain the object table.
    • TABLE_NAME: the name of the object table.
    • REGION: the region or multi-region that contains the connection.
    • CONNECTION_NAME: the name of the cloud resource connection to use with this object table. The connection determines which service account is used to read data from Cloud Storage.
    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the objects represented by the object table, in the format ['gs://bucket_name/[folder_name/]*'].

      You can use one asterisk (*) wildcard character in each path to limit the objects included in the object table. For example, if the bucket contains several types of unstructured data, you could create the object table over only PDF objects by specifying ['gs://bucket_name/*.pdf']. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths, for example ['gs://mybucket1/*', 'gs://mybucket2/folder5/*'].

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the object table, and how fresh the cached metadata must be in order for the operation to use it. For more information on metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Examples

The following example creates an object table with a metadata cache staleness interval of 1 day:

CREATE EXTERNAL TABLE `my_dataset.object_table`
WITH CONNECTION `us.my-connection`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['gs://mybucket/*'],
  max_staleness = INTERVAL 1 DAY,
  metadata_cache_mode = 'AUTOMATIC'
);

The following example creates an object table over the objects in three Cloud Storage buckets:

CREATE EXTERNAL TABLE `my_dataset.object_table`
WITH CONNECTION `us.my-connection`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['gs://bucket1/*','gs://bucket2/folder1/*','gs://bucket3/*']
);

The following example creates an object table over just the PDF objects in a Cloud Storage bucket:

CREATE EXTERNAL TABLE `my_dataset.object_table`
WITH CONNECTION `us.my-connection`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['gs://bucket1/*.pdf']
);
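
If you set metadata_cache_mode to MANUAL, you are responsible for refreshing the cache yourself by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure. The following sketch reuses the table and connection names from the earlier examples:

-- Create an object table whose metadata cache is refreshed manually.
CREATE EXTERNAL TABLE `my_dataset.object_table`
WITH CONNECTION `us.my-connection`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['gs://mybucket/*'],
  max_staleness = INTERVAL 4 HOUR,
  metadata_cache_mode = 'MANUAL'
);

-- Later, refresh the cached metadata on your own schedule.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('my_dataset.object_table');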

bq

Use the bq mk command.

bq mk --table \
--external_table_definition=BUCKET_PATH@REGION.CONNECTION_NAME \
--object_metadata=SIMPLE \
--max_staleness=STALENESS_INTERVAL \
--metadata_cache_mode=CACHE_MODE \
PROJECT_ID:DATASET_ID.TABLE_NAME

Replace the following:

  • PROJECT_ID: your project ID.
  • DATASET_ID: the ID of the dataset to contain the object table.
  • TABLE_NAME: the name of the object table.
  • REGION: the region or multi-region that contains the connection.
  • CONNECTION_NAME: the name of the cloud resource connection to use with this external table. The connection determines which service account is used to read data from Cloud Storage.
  • BUCKET_PATH: the path to the Cloud Storage bucket that contains the objects represented by the object table, in the format gs://bucket_name/[folder_name/]*.

    You can use one asterisk (*) wildcard character in each path to limit the objects included in the object table. For example, if the bucket contains several types of unstructured data, you could create the object table over only PDF objects by specifying gs://bucket_name/*.pdf. For more information, see Wildcard support for Cloud Storage URIs.

    You can specify multiple buckets for the uris option by providing multiple paths, for example gs://mybucket1/*,gs://mybucket2/folder5/*.

    For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

  • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the object table, and how fresh the cached metadata must be in order for the operation to use it. For more information on metadata caching considerations, see Metadata caching for performance.

    To disable metadata caching, specify 0. This is the default.

    To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval; because the value contains spaces, quote it in your shell. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

  • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

    Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

    Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

    You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

Examples

The following example creates an object table with a metadata cache staleness interval of 1 day:

bq mk --table \
--external_table_definition=gs://mybucket/*@us.my-connection \
--object_metadata=SIMPLE \
--max_staleness='0-0 1 0:0:0' \
--metadata_cache_mode=AUTOMATIC \
my_dataset.object_table

The following example creates an object table over the objects in three Cloud Storage buckets:

bq mk --table \
--external_table_definition=gs://bucket1/*,gs://bucket2/folder1/*,gs://bucket3/*@us.my-connection \
--object_metadata=SIMPLE \
my_dataset.object_table

The following example creates an object table over just the PDF objects in a Cloud Storage bucket:

bq mk --table \
--external_table_definition=gs://bucket1/*.pdf@us.my-connection \
--object_metadata=SIMPLE \
my_dataset.object_table

API

Call the tables.insert method. Include an ExternalDataConfiguration object with the objectMetadata field set to SIMPLE in the Table resource that you pass in.

The following example shows how to call this method by using curl:

ACCESS_TOKEN=$(gcloud auth print-access-token)
curl \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X POST \
-d '{"tableReference": {"projectId": "my_project", "datasetId": "my_dataset", "tableId": "object_table_name"}, "externalDataConfiguration": {"objectMetadata": "SIMPLE", "sourceUris": ["gs://mybucket/*"]}}' \
https://www.googleapis.com/bigquery/v2/projects/my_project/datasets/my_dataset/tables
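
Whichever method you use, you can confirm that the table was created by querying INFORMATION_SCHEMA. This is a quick check rather than a required step; object tables, like other external tables, are listed with a table_type of EXTERNAL:

SELECT table_name, table_type
FROM my_dataset.INFORMATION_SCHEMA.TABLES
WHERE table_name = 'object_table_name';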

Query object tables

You can query an object table like any other BigQuery table, for example:

SELECT *
FROM mydataset.myobjecttable;

Querying an object table returns metadata for the underlying objects. For more information, see Object table schema.
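
Because the returned rows are object metadata rather than object contents, you can filter and sort on the metadata columns directly. For example, the following query, a sketch based on the columns in the object table schema (uri, size, content_type, updated), lists the most recently updated PDF objects:

SELECT uri, size, content_type, updated
FROM mydataset.myobjecttable
WHERE content_type = 'application/pdf'
ORDER BY updated DESC;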

What's next