Migrating data warehouses to BigQuery: Performance optimization

This document is part of a series that helps you transition from an on-premises data warehouse to BigQuery on Google Cloud. This document describes essential techniques for optimizing query performance in your BigQuery-powered data warehouse. It assumes that you have migrated the schema and data from your legacy data warehouse to BigQuery and that you have translated your legacy jobs' queries to BigQuery Standard SQL.

As you read through this article, consider the reasons for optimizing in the context of your migration effort. The goal of this document isn't to provide an exhaustive list of all available optimization techniques; for that, refer to the official documentation on BigQuery best practices. Instead, this document aims to help you think about whether, and when, you should use those techniques.

General performance considerations

BigQuery processes data efficiently for both small and petabyte-scale datasets. Your data analytics jobs should therefore perform well in your newly migrated data warehouse without modification. If you find that, under certain circumstances, query performance doesn't match your expectations, consider the key principles described here to understand how various factors can affect performance.

Optimization tradeoffs

As with any system, optimizing for performance sometimes involves tradeoffs. For example, advanced SQL syntax can introduce complexity and make queries harder to understand for people who aren't SQL experts. Spending time on micro-optimizations for noncritical workloads can also divert resources from building new features for your applications or from making optimizations that have a bigger impact. Therefore, to achieve the highest possible return on investment, we recommend that you focus your optimizations on the workloads that matter most to your data analytics pipelines.

To evaluate whether specific queries are problematic, you can use Cloud Monitoring to monitor how your BigQuery jobs consume resources over time. If you identify a slow or resource-intensive query, you can drill down and focus your performance optimizations on that query.

Capacity and concurrency

BigQuery breaks down the computational capacity required to execute SQL queries into units called slots. BigQuery then automatically calculates how many slots each query requires, depending on the query's size and complexity.

BigQuery automatically manages the slot quota that your running queries share, based on customer history, usage, and spending. Most users find the default per-project slot capacity more than sufficient. Access to more slots doesn't guarantee faster per-query performance. However, a larger pool of slots might improve the performance of large or complex queries and of highly concurrent workloads. To further improve query performance, consider purchasing additional reserved slots, in addition to optimizing your data model and queries where appropriate.

BigQuery offers two pricing models for queries: on-demand pricing and flat-rate pricing. On-demand pricing is based on the amount of data processed by each query you run. Flat-rate pricing is suitable if you require predictable monthly spending on analytics. When you enroll in flat-rate pricing, you purchase dedicated query processing capacity that is measured in BigQuery slots. The cost of all bytes processed is included in the monthly flat-rate price. If your queries exceed your flat-rate capacity, they are queued until your flat-rate resources become available.

Query plan and timeline

The BigQuery web UI lets you visually inspect the query plan and timeline for your queries. You can use the jobs.get API method to retrieve the query plan and timeline information. You might also be interested in trying out BigQuery Visualiser, an open source tool that visually represents the flow of execution stages in a BigQuery job.

When BigQuery executes a query job, it converts the declarative SQL statement into a graph of execution. This graph is broken up into a series of query stages, which themselves are composed of more granular sets of execution steps. BigQuery uses a heavily distributed parallel architecture to run these queries. The BigQuery stages model the units of work that many potential workers might execute in parallel. Stages communicate with one another through a fast, distributed shuffle architecture.

Query execution plan.

In addition to the query plan, query jobs also expose a timeline of execution. This timeline provides an accounting of units of work completed, pending, and active within query workers. A query might have multiple stages with active workers simultaneously, so the timeline is intended to show overall progress of the query.

Timeline statistics.

To estimate how computationally expensive a query is, you can look at the total number of slot seconds the query consumes. The lower the number of slot seconds, the better, because it means more resources are available to other queries running in the same project at the same time.
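For example, the following query is a sketch that assumes your jobs run in the US multi-region and that you have permission to view project-level job metadata; it lists the most slot-intensive queries from the past seven days, together with the bytes they processed, so you can identify candidates for optimization:

-- Most slot-intensive queries over the past 7 days (assumes the US multi-region).
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_slot_ms DESC
LIMIT 20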

The query plan and timeline statistics can help you understand how BigQuery executes queries and if certain stages dominate resource utilization. For example, a JOIN stage that generates far more output rows than input rows might indicate an opportunity to filter earlier in the query. However, the managed nature of the service limits whether some details are directly actionable. If you want to improve query execution and performance, we recommend that you follow the techniques in the Query optimization strategies section and in the BigQuery best practices documentation.

External data sources

An external data source (also known as a federated data source) is a data source that you can query directly from BigQuery even though the data itself isn't stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. External sources supported by BigQuery include Cloud Bigtable, Cloud Storage, and Google Drive.

Query performance for external data sources might not be as fast as querying data in a native BigQuery table. If query speed is a priority, we recommend loading the data into BigQuery instead of setting up an external data source. The performance of a query that includes an external data source depends on the external storage type. For example, querying data stored in Cloud Storage is faster than querying data stored in Google Drive. In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.

When you query an external data source, the results aren't cached. You are charged for each query against an external table even if you issue the same query multiple times. If you must repeatedly issue a query against an external table that doesn't change frequently, consider writing the query results to a permanent table and running the queries against the permanent table instead.
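For example, the following sketch defines an external table over CSV files in Cloud Storage and then materializes the data into a native, partitioned table for repeated querying. The bucket path and table names are hypothetical, and the schema is assumed to match the orders table used later in this document:

CREATE EXTERNAL TABLE `your-project.sales.orders_external` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-bucket/exports/orders/*.csv'],
  skip_leading_rows = 1  -- assumes the CSV files have a header row
);

-- Materialize the external data into a native, partitioned table so that
-- repeated queries benefit from BigQuery storage, pruning, and caching.
CREATE TABLE `your-project.sales.orders_from_gcs`
PARTITION BY DATE(orderDate)
AS
SELECT * FROM `your-project.sales.orders_external`;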

Generally, consider using an external data source only for these use cases:

  • Performing extract, transform, and load (ETL) operations when loading data.
  • Querying data that changes frequently.
  • Handling periodic loads such as recurring ingestion of data from Cloud Bigtable.

For more details on the opportunities, limitations, and best practices for using external data sources, see the official BigQuery documentation.

Query optimization strategies

After reviewing the previous section, if you find that the performance of specific queries doesn't match your expectations, consider optimizing those queries by following the strategies presented in this section.

Partitioning

Table partitioning consists of dividing a table into segments, called partitions. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query. We recommend partitioning for tables that are larger than 10 GB, if the table's schema allows for it.

There are two types of table partitioning in BigQuery:

  • Tables partitioned by ingestion time, where BigQuery automatically loads data into daily, date-based partitions that reflect the data's arrival time.
  • Tables partitioned on a DATE or TIMESTAMP column.

You can create a partitioned table in BigQuery by using the Cloud Console, a DDL CREATE TABLE statement with a PARTITION BY clause, the bq command-line tool, or the BigQuery API.

As an example, assume that you have migrated an orders table from your legacy data warehouse to BigQuery, with the following schema:

orders
Column name    Type
salesOrderKey  STRING
orderDate      TIMESTAMP
customerKey    STRING
totalSale      FLOAT
currencyKey    INTEGER

You can now partition the orders table by date on the orderDate column as follows:

CREATE TABLE `your-project.sales.orders_partitioned` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64
)
PARTITION BY DATE(orderDate)
AS
SELECT * FROM `your-project.sales.orders`

The following query should run much faster on the partitioned table than on the non-partitioned table, because the WHERE clause restricts the scan to the partitions that contain data for January 2015 instead of the entire table:

SELECT
 DATE_TRUNC(DATE(orderDate), MONTH) AS firstOfMonth,
 currencyKey,
 COUNT(*) AS numberOfOrders,
 SUM(totalSale) AS totalSales
FROM `your-project.sales.orders_partitioned`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-01-01'
GROUP BY DATE_TRUNC(DATE(orderDate), MONTH), currencyKey
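If you prefer to make partition pruning explicit, you can filter directly on the partitioning column instead of on an expression over it. The following variant returns the same results for this example by bounding orderDate to January 2015:

SELECT
  DATE_TRUNC(DATE(orderDate), MONTH) AS firstOfMonth,
  currencyKey,
  COUNT(*) AS numberOfOrders,
  SUM(totalSale) AS totalSales
FROM `your-project.sales.orders_partitioned`
-- Filtering directly on the partitioning column limits the scan to the
-- partitions for January 2015.
WHERE orderDate >= TIMESTAMP('2015-01-01')
  AND orderDate < TIMESTAMP('2015-02-01')
GROUP BY firstOfMonth, currencyKey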

For more information about when to consider partitioning as it relates to schema improvement, see the schema and data transfer overview.

Clustering

Table clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When a query job or load job writes data to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data, which makes queries faster and cheaper.

Currently, BigQuery supports clustering only over a partitioned table. Table clustering is supported for both ingestion-time partitioned tables and for tables partitioned on a DATE or TIMESTAMP column.

You can use clustering over a partitioned table under the following circumstances:

  • Your data is already partitioned on a date or timestamp column.
  • You commonly use filters or aggregation against particular columns in your queries.

When you use clustering and partitioning together, you can partition the data by a date or timestamp column and then cluster on a different set of columns. In this case, data in each partition is clustered based on the values of the clustering columns. Partitioning provides a way to get accurate cost estimates for queries (based on the partitions that are scanned).

To further improve on the example shown in the previous section, you can create a table named orders_clustered that is both partitioned and clustered, based on the original orders table:

CREATE TABLE `your-project.sales.orders_clustered` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64
)
PARTITION BY DATE(orderDate)
CLUSTER BY customerKey
AS
SELECT * FROM `your-project.sales.orders`

The following query should run much faster on the orders_clustered table than on the original orders table because of the combination of partitioning and the WHERE filter. It should also run faster than on the orders_partitioned table because clustering, combined with the customerKey IN (...) filter and the GROUP BY customerKey clause, lets BigQuery skip blocks that don't contain the selected customers.

SELECT
  customerKey,
  DATE_TRUNC(DATE(orderDate), MONTH) AS firstOfMonth,
  currencyKey,
  COUNT(*) AS numberOfOrders,
  SUM(totalSale) AS totalSales
FROM `your-project.sales.orders_clustered`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
  AND customerKey IN (
    '1292803200000-402',
    '1298764800000-414',
    '1267401600000-212',
    '1267488000000-209')
GROUP BY customerKey, DATE_TRUNC(DATE(orderDate), MONTH), currencyKey

For more information about when to consider clustering as it relates to schema improvement, see the schema and data transfer overview.

Denormalization

Denormalization is a common strategy for increasing read performance on relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. This strategy works best when the relationships are hierarchical and frequently queried together, as in parent-child relationships.

Using nested and repeated fields localizes the data to individual slots so that execution can be done in parallel. This approach maintains relationships without completely flattening the denormalized relational data, which further improves performance by reducing network communication during the shuffle phase.

Denormalizing data increases the amount of storage required. However, this increase isn't a concern in most use cases, thanks to BigQuery's low-cost storage. By trading compute resources for storage resources, denormalization helps reduce the overall cost of applications with high read rates. This cost reduction is the result of overall speed gains and decreased compute and network utilization.

Nested data is represented as a STRUCT type in BigQuery standard SQL. Repeated data is represented as an ARRAY type and can be manipulated with ARRAY functions. Nesting and repetition also complement each other (an ARRAY of STRUCTs). To learn more, see BigQuery's best practices for denormalization.
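As a quick illustration (the table name here is hypothetical; the example that follows builds an equivalent table from existing data), an order and its line items can be declared as a single table in which the line items are stored as an ARRAY of STRUCTs:

CREATE TABLE `your-project.sales.orders_nested_example` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64,
  -- Each order stores its line items as a repeated (ARRAY) nested (STRUCT) field.
  lineItem ARRAY<STRUCT<
    salesOrderLineKey STRING,
    productKey STRING,
    totalSale FLOAT64>>
)
PARTITION BY DATE(orderDate)
CLUSTER BY customerKey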

As an example, consider two tables orders_clustered and order_lines_clustered with the following schema:

orders_clustered
Column name    Type
salesOrderKey  STRING
orderDate      TIMESTAMP
customerKey    STRING
totalSale      FLOAT
currencyKey    INTEGER

order_lines_clustered
Column name        Type
salesOrderKey      STRING
salesOrderLineKey  STRING
productKey         STRING
quantity           INTEGER
unitPrice          FLOAT
orderDate          TIMESTAMP
totalSale          FLOAT

You can create a denormalized table (orders_denormalized) that pre-joins data from the two other tables into nested, repeated fields:

CREATE TABLE `your-project.sales.orders_denormalized`
PARTITION BY DATE(orderDate)
CLUSTER BY customerKey
AS
SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  o.customerKey,
  o.totalSale,
  ARRAY_AGG(
    STRUCT(
      ol.salesOrderLineKey,
      ol.productKey,
      ol.totalSale
    )
  ) AS lineItem
FROM `your-project.sales.orders_clustered` AS o
  JOIN `your-project.sales.order_lines_clustered` AS ol
    ON ol.orderDate = o.orderDate
    AND ol.salesOrderKey = o.salesOrderKey
GROUP BY 1,2,3,4,5

Now, consider the following two SELECT queries:

SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  ol.salesOrderLineKey,
  ol.productKey,
  ol.totalSale
FROM `your-project.sales.orders_clustered` AS o
  JOIN `your-project.sales.order_lines_clustered` AS ol
    ON ol.orderDate = o.orderDate
    AND ol.salesOrderKey = o.salesOrderKey
WHERE DATE_TRUNC(DATE(o.orderDate), MONTH) = '2015-02-01'
  AND o.customerKey = '1201392000000-325'

SELECT
 salesOrderKey,
 orderDate,
 currencyKey,
 lineItem
FROM `your-project.sales.orders_denormalized`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
 AND customerKey = '1201392000000-325'

The second query should run much faster than the first one because it operates on a denormalized table with data that has already been pre-joined.

The second query returns the lineItem values in the ARRAY(STRUCT()) data type:

Results of the second query.

You can flatten the array by using the UNNEST operator:

SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  ol.*
FROM `your-project.sales.orders_denormalized` AS o
  CROSS JOIN UNNEST(o.lineItem) AS ol
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
  AND customerKey = '1201392000000-325'

The query returns the result in a flattened form, as shown in this screenshot:

Query result shown in flattened form.

For more information about when to consider denormalization as it relates to schema improvement, see the guide on schema and data transfer.

Approximate functions

Approximate aggregation is the estimation of aggregate function outputs, such as cardinality and quantiles. BigQuery supports a range of approximate aggregate functions; for example, APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm to estimate the number of distinct values in a large dataset.

Approximate aggregation functions require less memory than the corresponding exact aggregation functions, but they also introduce statistical uncertainty. You can use approximate aggregation functions for large data streams for which linear memory usage is impractical, and for data that's already approximate.

For example, consider the following two queries:

SELECT
  COUNT(DISTINCT salesOrderKey)
FROM `your-project.sales.orders`

SELECT
 APPROX_COUNT_DISTINCT(salesOrderKey)
FROM `your-project.sales.orders`

The first query returns the exact count. The second query returns an approximate count that typically deviates only slightly from the exact value, but it should run much faster than the first query.
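The same approach applies to quantiles. For example, the following query (a hypothetical extension of the example above) estimates the median order value with APPROX_QUANTILES instead of computing an exact median:

SELECT
  -- APPROX_QUANTILES returns an array of 101 boundaries; element 50 is the
  -- approximate median.
  APPROX_QUANTILES(totalSale, 100)[OFFSET(50)] AS approxMedianSale
FROM `your-project.sales.orders`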

Anti-patterns

For further performance improvements, see the BigQuery documentation on anti-patterns to avoid common mistakes such as writing joins that generate more output rows than input rows, using point-specific DML statements, or querying skewed data without filtering it.

More best practices

This document highlights only a few of the most prominent techniques that you can use to improve query performance for your migrated data warehouse. To make further improvements and learn more best practices, see the extensive documentation on optimizing query performance.

What's next

Try out other Google Cloud features for yourself. Have a look at our tutorials.