Introduction to BigQuery Omni

With BigQuery Omni, you can run BigQuery analytics on data stored in Amazon Simple Storage Service (Amazon S3) or Azure Blob Storage using BigLake tables.

Many organizations store data in multiple public clouds. Often, this data ends up siloed, because it's hard to get insights across all of it. You need a multi-cloud data tool that is inexpensive, fast, and doesn't add the overhead of decentralized data governance. BigQuery Omni reduces this friction by providing a unified interface.

To run BigQuery analytics on your external data, you first connect to Amazon S3 or Blob Storage. To query that external data, you then create a BigLake table that references the Amazon S3 or Blob Storage data.
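
For example, a BigLake table that references Amazon S3 data can be created with a CREATE EXTERNAL TABLE statement similar to the following sketch. The dataset, connection, and bucket names are placeholders; the dataset must be in a BigQuery Omni region.

-- Hypothetical BigLake table over Parquet files in Amazon S3.
CREATE EXTERNAL TABLE mydataset.orders
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-bucket/orders/*']
);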

You can also move data between clouds to combine data across clouds using cross-cloud transfer or query the data across clouds using cross-cloud joins. BigQuery Omni offers a cross-cloud analytics solution with the ability to analyze data where it is and the flexibility to replicate data when necessary. For more information, see Load data with cross-cloud transfer and Cross-cloud joins.
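
For example, a cross-cloud transfer that loads Amazon S3 data into a standard BigQuery table might look like the following sketch; the table, connection, and bucket names are placeholders:

-- Copy Parquet files from Amazon S3 into a BigQuery-managed table.
LOAD DATA INTO mydataset.orders_copy
FROM FILES (
  format = 'PARQUET',
  uris = ['s3://my-bucket/orders/*']
)
WITH CONNECTION `aws-us-east-1.my_aws_connection`;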

Architecture

BigQuery's architecture separates compute from storage, allowing BigQuery to scale out as needed to handle very large workloads. BigQuery Omni extends this architecture by running the BigQuery query engine in other clouds. As a result, you don't have to physically move data into BigQuery storage. Processing happens where that data already sits.

BigQuery Omni architecture

Query results can be returned to Google Cloud over a secure connection — for example, to be displayed in the Google Cloud console. Alternatively, you can write the results directly to Amazon S3 buckets or Blob Storage. In that case, there is no cross-cloud movement of the query results.

BigQuery Omni uses standard AWS IAM roles or Azure Active Directory principals to access the data in your subscription. You delegate read or write access to BigQuery Omni, and you can revoke access at any time.

Data flow when querying data

The following image describes how the data moves between Google Cloud and AWS or Azure for the following queries:

  • SELECT statement
  • CREATE EXTERNAL TABLE statement
Figure 1: Data movement between Google Cloud and AWS or Azure for queries.
  1. The BigQuery control plane receives query jobs from you through the Google Cloud console, the bq command-line tool, an API method, or a client library.
  2. The BigQuery control plane sends query jobs for processing to the BigQuery data plane on AWS or Azure.
  3. The BigQuery data plane receives the query from the control plane through a VPN connection.
  4. The BigQuery data plane reads table data from your Amazon S3 bucket or Blob Storage.
  5. The BigQuery data plane runs the query job on the table data. Processing of the table data occurs in the specified AWS or Azure region.
  6. The query result is transmitted from the data plane to the control plane through the VPN connection.
  7. The BigQuery control plane receives the query job results for display to you in response to the query job. This data is stored for up to 24 hours.
  8. The query result is returned to you.

For more information, see Query Amazon S3 data and Blob Storage data.
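
For example, an ordinary GoogleSQL query over a BigLake table, such as the hypothetical orders table shown earlier, follows this flow; only the query result travels back to Google Cloud:

-- The scan and aggregation run in the AWS or Azure region where the data resides.
SELECT order_date, SUM(amount) AS total_amount
FROM mydataset.orders
GROUP BY order_date;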

Data flow when exporting data

The following image describes how data moves between Google Cloud and AWS or Azure during an EXPORT DATA statement.

Figure 2: Data movement between Google Cloud and AWS or Azure for export queries.
  1. The BigQuery control plane receives export query jobs from you through the Google Cloud console, the bq command-line tool, an API method, or a client library. The query contains the destination path for the query result in your Amazon S3 bucket or Blob Storage.
  2. The BigQuery control plane sends export query jobs for processing to the BigQuery data plane on AWS or Azure.
  3. The BigQuery data plane receives the export query from the control plane through the VPN connection.
  4. The BigQuery data plane reads table data from your Amazon S3 bucket or Blob Storage.
  5. The BigQuery data plane runs the query job on the table data. Processing of the table data occurs in the specified AWS or Azure region.
  6. BigQuery writes the query result to the specified destination path in your Amazon S3 bucket or Blob Storage.

For more information, see Export query results to Amazon S3 and Blob Storage.
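
For example, an export to Amazon S3 might look like the following sketch; the connection, bucket path, and table names are placeholders:

-- Write the query result directly to Amazon S3; the wildcard lets BigQuery shard the output.
EXPORT DATA
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  uri = 's3://my-bucket/exports/orders-*.csv',
  format = 'CSV'
)
AS
SELECT order_id, order_date, amount
FROM mydataset.orders
WHERE order_date >= DATE '2024-01-01';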

Benefits

Performance. You can get insights faster, because data is not copied across clouds, and queries run in the same region where your data resides.

Cost. You save on outbound data transfer costs because the data doesn't move. There are no additional charges to your AWS or Azure account related to BigQuery Omni analytics, because the queries run on clusters managed by Google. You are only billed for running the queries, using the BigQuery pricing model.

Security and data governance. You manage the data in your own AWS or Azure subscription. You don't need to move or copy the raw data out of your public cloud. All computation happens in the BigQuery multi-tenant service, which runs within the same region as your data.

Serverless architecture. Like the rest of BigQuery, BigQuery Omni is a serverless offering. Google deploys and manages the clusters that run BigQuery Omni. You don't need to provision any resources or manage any clusters.

Ease of management. BigQuery Omni provides a unified management interface through Google Cloud. BigQuery Omni can use your existing Google Cloud account and BigQuery projects. You can write a GoogleSQL query in the Google Cloud console to query data in AWS or Azure, and see the results displayed in the Google Cloud console.

Cross-cloud transfer. You can load data into standard BigQuery tables from S3 buckets and Blob Storage. For more information, see Transfer Amazon S3 data and Blob Storage data to BigQuery.

Metadata caching for performance

You can use cached metadata to improve query performance on BigLake tables that reference Amazon S3 data. Metadata caching is especially helpful when you are working with large numbers of files or when the data is Hive partitioned.

The metadata includes file names, partitioning information, and physical metadata from files such as row counts. You can choose whether or not to enable metadata caching on a table. Queries with a large number of files and with Hive partition filters benefit the most from metadata caching.

If you don't enable metadata caching, queries on the table must read the external data source to get object metadata. Reading this data increases the query latency; listing millions of files from the external data source can take several minutes. If you enable metadata caching, queries can avoid listing files from the external data source and can partition and prune files more quickly.

There are two properties that control this feature:

  • Maximum staleness specifies when queries use cached metadata.
  • Metadata cache mode specifies how the metadata is collected.

When you have metadata caching enabled, you specify the maximum interval of metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, then operations against the table use cached metadata if it has been refreshed within the past hour. If the cached metadata is older than that, the operation falls back to retrieving metadata from Amazon S3 instead. You can specify a staleness interval between 30 minutes and 7 days.
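
For example, both properties can be set in the table options when you create the BigLake table, as in the following sketch (the names are placeholders); with these settings, queries use cached metadata if it was refreshed within the past hour, and the cache is refreshed automatically:

CREATE EXTERNAL TABLE mydataset.orders
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-bucket/orders/*'],
  max_staleness = INTERVAL 1 HOUR,
  metadata_cache_mode = 'AUTOMATIC'
);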

You can choose to refresh the cache either automatically or manually:

  • For automatic refreshes, the cache is refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes. Refreshing the cache automatically is a good approach if the files in Amazon S3 are added, deleted, or modified at random intervals. If you need to control the timing of the refresh, for example to trigger the refresh at the end of an extract-transform-load job, use manual refresh.
  • For manual refreshes, you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure (shown in the example after this list) to refresh the metadata cache on a schedule that meets your requirements. Refreshing the cache manually is a good approach if the files in Amazon S3 are added, deleted, or modified at known intervals, for example as the output of a pipeline.

    If you issue multiple concurrent manual refreshes, only one will succeed.
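
For a table whose metadata cache mode is set to manual, a refresh might look like the following; the table path is a placeholder:

-- Refresh the metadata cache for one BigLake table on demand.
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('myproject.mydataset.orders');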

The metadata cache expires after 7 days if it isn't refreshed.

Both manual and automatic cache refreshes are executed with INTERACTIVE query priority.

If you choose to use automatic refreshes, we recommend that you create a reservation, and then create an assignment with a BACKGROUND job type for the project that runs the metadata cache refresh jobs. This prevents the refresh jobs from competing with user queries for resources, and potentially failing if there aren't sufficient resources available for them.
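
As a rough sketch of that setup (the admin project, location, reservation name, assignee project, and slot count are all placeholders, and the exact reservation DDL for your organization may differ), the reservation and assignment might look like this:

-- Reserve slots and dedicate them to BACKGROUND jobs such as metadata cache refreshes.
CREATE RESERVATION `admin-project.region-us.cache-refresh`
OPTIONS (slot_capacity = 100);

CREATE ASSIGNMENT `admin-project.region-us.cache-refresh.refresh-assignment`
OPTIONS (
  assignee = 'projects/my-project',
  job_type = 'BACKGROUND'
);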

You should consider how the staleness interval and metadata caching mode values will interact before you set them. Consider the following examples:

  • If you are manually refreshing the metadata cache for a table, and you set the staleness interval to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure every 2 days or less if you want operations against the table to use cached metadata.
  • If you are automatically refreshing the metadata cache for a table, and you set the staleness interval to 30 minutes, some of your operations against the table might read metadata from Amazon S3 if the automatic refresh falls toward the longer end of the usual 30 to 60 minute window.

To find information about metadata refresh jobs, query the INFORMATION_SCHEMA.JOBS view, as shown in the following example:

-- List recent metadata cache refresh jobs, newest first.
SELECT *
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE job_id LIKE '%metadata_cache_refresh%'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
ORDER BY start_time DESC
LIMIT 10;

For more information, see Metadata caching.

Cache-enabled tables with materialized views

You can use materialized views over Amazon Simple Storage Service (Amazon S3) metadata cache-enabled tables to improve performance and efficiency when querying structured data stored in Amazon S3. These materialized views function like materialized views over BigQuery-managed storage tables, including the benefits of automatic refresh and smart tuning.

To make Amazon S3 data in a materialized view available in a supported BigQuery region for joins, create a replica of the materialized view. You can only create materialized view replicas over authorized materialized views.
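
As a sketch under those assumptions (the dataset, table, and view names are placeholders, and your view definition might require additional options such as max_staleness), a materialized view and its cross-region replica might look like this:

-- Materialized view over a metadata cache-enabled BigLake table in a BigQuery Omni region.
CREATE MATERIALIZED VIEW mydataset.orders_mv
AS
SELECT order_date, SUM(amount) AS total_amount
FROM mydataset.orders
GROUP BY order_date;

-- Replica in a supported, colocated BigQuery region so that the data can be joined there.
CREATE MATERIALIZED VIEW other_dataset.orders_mv_replica
AS REPLICA OF mydataset.orders_mv;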

Limitations

In addition to the limitations for BigLake tables, the following limitations apply to BigQuery Omni, which includes BigLake tables based on Amazon S3 and Blob Storage data:

  • Working with data in any of the BigQuery Omni regions is not supported by the Standard and Enterprise Plus editions. For more information about editions, see Introduction to BigQuery editions.
  • The OBJECT_PRIVILEGES, STREAMING_TIMELINE_BY_*, TABLE_SNAPSHOTS, TABLE_STORAGE, TABLE_CONSTRAINTS, KEY_COLUMN_USAGE, CONSTRAINT_COLUMN_USAGE, and PARTITIONS INFORMATION_SCHEMA views are not available for BigLake tables based on Amazon S3 and Blob Storage data.
  • Materialized views are not supported for Blob Storage.
  • JavaScript UDFs are not supported.
  • The following SQL statements are not supported:

  • The following limitations apply to querying and reading destination temporary tables (preview):

    • Querying destination temporary tables with the SELECT statement is not supported.
    • Using the BigQuery Storage Read API to read data from destination temporary tables is not supported.
    • When using the ODBC driver, high-throughput reads (EnableHTAPI option) are not supported.
  • Scheduled queries are supported only through the API or the bq command-line tool. The destination table option is disabled for queries; only EXPORT DATA queries are allowed.

  • BigQuery Storage API is not available in the BigQuery Omni regions.

  • If your query uses the ORDER BY clause and has a result size larger than 256 MB, then your query fails. To resolve this, either reduce the result size or remove the ORDER BY clause from the query. For more information about BigQuery Omni quotas, see Quotas and limits.

  • Using customer-managed encryption keys (CMEK) with datasets and external tables is not supported.

Pricing

For information about pricing and limited-time offers in BigQuery Omni, see BigQuery Omni pricing.

Quotas and limits

For information about BigQuery Omni quotas, see Quotas and limits.

If your query result is larger than 20 GiB, consider exporting the results to Amazon S3 or Blob Storage. To learn about quotas for BigQuery Connection API, see BigQuery Connection API.

Locations

BigQuery Omni processes queries in the same location as the dataset that contains the tables you're querying. After you create the dataset, the location cannot be changed. Your data resides within your AWS or Azure account. BigQuery Omni regions support Enterprise edition reservations and on-demand compute (analysis) pricing. For more information about editions, see Introduction to BigQuery editions.
Region description | Region name | Colocated BigQuery region
AWS
AWS - US East (N. Virginia) | aws-us-east-1 | us-east4
AWS - US West (Oregon) | aws-us-west-2 | us-west1
AWS - Asia Pacific (Seoul) | aws-ap-northeast-2 | asia-northeast3
AWS - Asia Pacific (Sydney) | aws-ap-southeast-2 | australia-southeast1
AWS - Europe (Ireland) | aws-eu-west-1 | europe-west1
AWS - Europe (Frankfurt) | aws-eu-central-1 | europe-west3
Azure
Azure - East US 2 | azure-eastus2 | us-east4

What's next