This document is part of a series that helps you transition from an on-premises data warehouse to BigQuery on Google Cloud. This document describes essential techniques for optimizing query performance in your BigQuery-powered data warehouse. It assumes that you have migrated the schema and data from your legacy data warehouse to BigQuery and that you have translated your legacy jobs' queries to BigQuery Standard SQL.
As you read through this document, consider the reasons for optimizing in the context of your migration effort. The goal of this document isn't to provide an exhaustive list of all available optimization techniques; for those, refer to the official documentation on BigQuery best practices. Instead, this document aims to help you think about whether, and when, you should use those techniques.
The migration series consists of the following parts:
- Introduction and overview
- Schema and data transfer overview
- Data governance
- Data pipelines
- Reporting and analysis
- Performance optimization (this document)
General performance considerations
BigQuery processes data efficiently for both small and petabyte-scale datasets. With the help of BigQuery, your data analytics jobs should perform well without modification in your newly migrated data warehouse. If you find that, under certain circumstances, query performance doesn't match your expectations, consider the key principles described here to understand how various factors can impact performance.
As with any system, optimizing for performance sometimes involves tradeoffs. For example, using advanced SQL syntax can introduce complexity and make queries harder to understand for people who aren't SQL experts. Spending time on micro-optimizations for noncritical workloads can also divert resources away from building new features for your applications or from making optimizations that are more impactful. Therefore, to help you achieve the highest possible return on investment, we recommend that you focus your optimizations on the workloads that matter most to your data analytics pipelines.
To evaluate whether specific queries are problematic, you can use Cloud Monitoring to monitor how your BigQuery jobs consume resources over time. If you identify a slow or resource-intensive query, you can drill down and focus your performance optimizations on that query.
Capacity and concurrency
BigQuery breaks down the computational capacity required to execute SQL queries into units called slots. BigQuery then automatically calculates how many slots each query requires, depending on the query's size and complexity.
BigQuery automatically manages the slot quota that your running queries share, based on customer history, usage, and spending. Most users find the default per-project slot capacity more than sufficient. Access to more slots doesn't guarantee faster per-query performance. However, a larger pool of slots might improve performance of large or complex queries, as well as performance of highly concurrent workloads. To further improve query performance, consider the benefits of purchasing more reserved slots, in addition to data model and query optimizations where appropriate.
BigQuery offers two pricing models for queries: on-demand pricing and flat-rate pricing. On-demand pricing is based on the amount of data processed by each query you run. Flat-rate pricing is suitable if you require predictable monthly analytics spending. When you enroll in flat-rate pricing, you purchase dedicated query processing capacity, measured in BigQuery slots. The cost of all bytes processed is included in the monthly flat-rate price. If your queries exceed your flat-rate capacity, they are queued until your flat-rate resources become available.
Query plan and timeline
The BigQuery web UI lets you visually inspect the query plan and timeline for your queries. You can also use the jobs.get API method to retrieve query plan and timeline information, or try out an open source tool that visually represents the flow of execution stages in a BigQuery job.
When BigQuery executes a query job, it converts the declarative SQL statement into a graph of execution. This graph is broken up into a series of query stages, which themselves are composed of more granular sets of execution steps. BigQuery uses a heavily distributed parallel architecture to run these queries. The BigQuery stages model the units of work that many potential workers might execute in parallel. Stages communicate with one another through a fast, distributed shuffle architecture.
In addition to the query plan, query jobs also expose a timeline of execution. This timeline provides an accounting of units of work completed, pending, and active within query workers. A query might have multiple stages with active workers simultaneously, so the timeline is intended to show overall progress of the query.
To estimate how computationally expensive a query is, you can look at the total number of slot seconds the query consumes. The lower the number of slot seconds, the better, because it means more resources are available to other queries running in the same project at the same time.
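One practical way to find slot-hungry queries is to rank recent jobs by their slot consumption using the INFORMATION_SCHEMA jobs views, which expose per-job statistics such as total_slot_ms. The following is a sketch, assuming your jobs run in the `region-us` location (adjust the region qualifier to match your project):

```sql
-- Rank the past day's queries in this project by approximate slot seconds.
-- Assumes jobs run in the `region-us` location.
SELECT
  job_id,
  user_email,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY slot_seconds DESC
LIMIT 10
```

The queries at the top of this result are usually the best candidates for the optimization techniques described later in this document.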
The query plan and timeline statistics can help you understand how BigQuery executes queries and whether certain stages dominate resource utilization. For example, a JOIN stage that generates far more output rows than input rows might indicate an opportunity to filter earlier in the query. However, the managed nature of the service limits whether some details are directly actionable. If you want to improve query execution and performance, we recommend that you follow the techniques in the Query optimization strategies section and in the BigQuery best practices documentation.
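As one illustration, the per-stage breakdown of a completed job can also be retrieved in SQL through the job_stages column of the INFORMATION_SCHEMA jobs views. This is a sketch, assuming the `region-us` location and a placeholder job ID:

```sql
-- Inspect per-stage work for one job; 'your_job_id' is a placeholder.
SELECT
  stage.name,
  stage.slot_ms,
  stage.records_read,
  stage.records_written
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(job_stages) AS stage
WHERE job_id = 'your_job_id'
```

A stage whose records_written far exceeds its records_read can point at the kind of row-expanding JOIN mentioned above.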
External data sources
An external data source (also known as a federated data source) is a data source that you can query directly from BigQuery even though the data itself isn't stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source. External sources supported by BigQuery include Cloud Bigtable, Cloud Storage, and Google Drive.
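As a sketch of how such a reference is set up, an external table over Cloud Storage files can be defined with DDL. The dataset name, bucket path, and file format here are hypothetical:

```sql
-- Define a table that reads CSV files directly from Cloud Storage.
CREATE EXTERNAL TABLE `your-project.sales.orders_external`
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-bucket/orders/*.csv'],
  skip_leading_rows = 1
);
```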
Query performance for external data sources might not be as fast as querying data in a native BigQuery table. If query speed is a priority, we recommend loading the data into BigQuery instead of setting up an external data source. The performance of a query that includes an external data source depends on the external storage type. For example, querying data stored in Cloud Storage is faster than querying data stored in Google Drive. In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.
When you query an external data source, the results aren't cached. You are charged for each query against an external table even if you issue the same query multiple times. If you must repeatedly issue a query against an external table that doesn't change frequently, consider writing the query results to a permanent table and running the queries against the permanent table instead.
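The pattern of materializing results from an external table into a permanent table can be sketched as follows (the table names are hypothetical):

```sql
-- Materialize frequently queried data from an external table once,
-- then run repeated queries against the permanent table instead.
CREATE OR REPLACE TABLE `your-project.sales.orders_snapshot` AS
SELECT *
FROM `your-project.sales.orders_external`
WHERE orderDate >= '2015-01-01';
```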
Generally, consider using an external data source only for these use cases:
- Performing extract, transform, and load (ETL) operations when loading data.
- Querying data that changes frequently.
- Handling periodic loads such as recurring ingestion of data from Cloud Bigtable.
For more details on the opportunities, limitations, and best practices for using external data sources, see the official BigQuery documentation.
Query optimization strategies
After reviewing the previous section, if you find that the performance of specific queries doesn't match your expectations, consider optimizing those queries by following the strategies presented in this section.
Table partitioning consists of dividing a table into segments, called partitions. By dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. We recommend partitioning for tables that are larger than 10 GB, if their schema allows for it.
There are two types of table partitioning in BigQuery:
- Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date.
- Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column.
You can create a partitioned table in BigQuery:
- By using a CREATE TABLE statement with a PARTITION BY clause containing a partition expression.
- Manually by using the BigQuery command-line tool's bq mk command.
- Programmatically by calling the tables.insert API method.
- From query results.
- When you load data.
As an example, assume that you have migrated an orders table from your legacy data warehouse to BigQuery. You can partition the orders table by date on the orderDate column as follows:

```sql
CREATE TABLE `your-project.sales.orders_partitioned` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64
)
PARTITION BY DATE(orderDate) AS
SELECT * FROM `your-project.sales.orders`
```
The following query should run much faster on the partitioned table than on the non-partitioned table, because the WHERE clause hints to BigQuery that it only needs to scan data from a single month, which maps to a limited set of partitions:

```sql
SELECT
  DATE_TRUNC(DATE(orderDate), MONTH) AS firstOfMonth,
  currencyKey,
  COUNT(*) AS numberOfOrders,
  SUM(totalSale) AS totalSales
FROM `your-project.sales.orders_partitioned`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-01-01'
GROUP BY DATE_TRUNC(DATE(orderDate), MONTH), currencyKey
```
For more information about when to consider partitioning as it relates to schema improvement, see the schema and data transfer overview.
Table clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When a query job or load job writes data to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data, which makes queries faster and cheaper.
Table clustering is supported for both partitioned tables and for tables that aren't partitioned.
You can use clustering over a partitioned table under the following circumstances:
- Your data is already partitioned on a date or timestamp column.
- You commonly use filters or aggregation against particular columns in your queries.
When you use clustering and partitioning together, you can partition the data by a date or timestamp column and then cluster on a different set of columns. In this case, data in each partition is clustered based on the values of the clustering columns. Partitioning provides a way to get accurate cost estimates for queries (based on the partitions that are scanned).
To further improve on the example shown in the previous section, you can create a table named orders_clustered that is both partitioned and clustered, based on the original orders table:

```sql
CREATE TABLE `your-project.sales.orders_clustered` (
  salesOrderKey STRING,
  orderDate TIMESTAMP,
  customerKey STRING,
  totalSale FLOAT64,
  currencyKey INT64
)
PARTITION BY DATE(orderDate)
CLUSTER BY customerKey AS
SELECT * FROM `your-project.sales.orders`
```
The following query should run much faster on the orders_clustered table than on the original orders table, thanks to the combination of partitioning and the WHERE filter. It should also run faster than on the orders_partitioned table, thanks to the combination of clustering and the customerKey IN (...) and GROUP BY customerKey clauses.

```sql
SELECT
  customerKey,
  DATE_TRUNC(DATE(orderDate), MONTH) AS firstOfMonth,
  currencyKey,
  COUNT(*) AS numberOfOrders,
  SUM(totalSale) AS totalSales
FROM `your-project.sales.orders_clustered`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
  AND customerKey IN (
    '1292803200000-402',
    '1298764800000-414',
    '1267401600000-212',
    '1267488000000-209')
GROUP BY customerKey, DATE_TRUNC(DATE(orderDate), MONTH), currencyKey
```
For more information about when to consider clustering as it relates to schema improvement, see the schema and data transfer overview.
Denormalization is a common strategy for increasing read performance on relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and often queried together as in parent-child relationships.
Using nested and repeated fields localizes the data to individual slots so that execution can be done in parallel. This approach maintains relationships without completely flattening the denormalized relational data, which further improves performance by reducing network communication during the shuffle phase.
Denormalizing data increases the required amount of storage. However, this increase is not a concern in most use cases, thanks to BigQuery's low-cost storage. By trading compute resources for storage resources, denormalization helps reduce overall cost for applications with high read rates. This cost reduction is the result of overall speed gains and decreased compute and network utilization.
Nested data is represented as a STRUCT type in BigQuery standard SQL. Repeated data is represented as an ARRAY type and can use an ARRAY function in standard SQL. Nesting and repetition can also complement each other (an ARRAY of STRUCTs). To learn more, see BigQuery's best practices for denormalization.
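To make these types concrete, here is a minimal standalone illustration of nested and repeated values (an ARRAY of STRUCTs), using hypothetical literal values rather than the tables in this document:

```sql
-- One repeated field (lineItem) holding nested records (STRUCTs).
SELECT
  'SO-1' AS salesOrderKey,
  [
    STRUCT('L-1' AS salesOrderLineKey, 19.99 AS totalSale),
    STRUCT('L-2' AS salesOrderLineKey, 5.00 AS totalSale)
  ] AS lineItem
```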
As an example, consider the two tables orders_clustered and order_lines_clustered. You can create a denormalized table (orders_denormalized) that pre-joins data from the two other tables into repeated, nested fields:
```sql
CREATE TABLE `your-project.sales.orders_denormalized`
PARTITION BY DATE(orderDate)
CLUSTER BY customerKey AS
SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  o.customerKey,
  o.totalSale,
  ARRAY_AGG(
    STRUCT(
      ol.salesOrderLineKey,
      ol.productKey,
      ol.totalSale
    )) AS lineItem
FROM `your-project.sales.orders_clustered` AS o
JOIN `your-project.sales.order_lines_clustered` AS ol
  ON ol.orderDate = o.orderDate
  AND ol.salesOrderKey = o.salesOrderKey
GROUP BY 1, 2, 3, 4, 5
```
Now, consider the following two queries:

```sql
SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  ol.salesOrderLineKey,
  ol.productKey,
  ol.totalSale
FROM `your-project.sales.orders_clustered` AS o
JOIN `your-project.sales.order_lines_clustered` AS ol
  ON ol.orderDate = o.orderDate
  AND ol.salesOrderKey = o.salesOrderKey
WHERE DATE_TRUNC(DATE(o.orderDate), MONTH) = '2015-02-01'
  AND o.customerKey = '1201392000000-325'
```

```sql
SELECT
  salesOrderKey,
  orderDate,
  currencyKey,
  lineItem
FROM `your-project.sales.orders_denormalized`
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
  AND customerKey = '1201392000000-325'
```
The second query should run much faster than the first one because it operates on a denormalized table with data that has already been pre-joined.
The second query returns the lineItem values in the form of an ARRAY of STRUCTs. To flatten the nested results instead, you can use the UNNEST operator:

```sql
SELECT
  o.salesOrderKey,
  o.orderDate,
  o.currencyKey,
  ol.*
FROM `your-project.sales.orders_denormalized` AS o
JOIN UNNEST(lineItem) AS ol
WHERE DATE_TRUNC(DATE(orderDate), MONTH) = '2015-02-01'
  AND customerKey = '1201392000000-325'
```
The query returns the result in flattened form, with one output row per line item.
For more information about when to consider denormalization as it relates to schema improvement, see the guide on schema and data transfer.
Approximate aggregation is the estimation of aggregate function outputs, such as cardinality and quantiles. BigQuery supports a range of approximate aggregate functions that use the HyperLogLog++ algorithm to estimate distinct values in a large data set.
Approximate aggregation functions require less memory than the corresponding exact aggregation functions, but they also introduce statistical uncertainty. You can use approximate aggregation functions for large data streams for which linear memory usage is impractical, and for data that's already approximate.
For example, consider the following two queries:
```sql
SELECT COUNT(DISTINCT salesOrderKey)
FROM `your-project.sales.orders`
```

```sql
SELECT APPROX_COUNT_DISTINCT(salesOrderKey)
FROM `your-project.sales.orders`
```
The first query returns the exact count value. The second query returns an approximate count value, with negligible deviation from the exact value. However, it should run much faster than the first query.
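COUNT(DISTINCT ...) isn't the only function with an approximate counterpart. For instance, quantiles can be estimated with APPROX_QUANTILES; this sketch assumes the same orders table used in the earlier examples:

```sql
-- Returns an array of 5 values: the minimum, the three quartile
-- boundaries, and the maximum of totalSale.
SELECT APPROX_QUANTILES(totalSale, 4) AS quartiles
FROM `your-project.sales.orders`
```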
For further performance improvements, see the BigQuery documentation on anti-patterns to avoid common mistakes like making joins that generate more outputs than inputs, using point-specific DML statements, or querying unfiltered skewed data.
More best practices
This document highlights only a few of the most prominent techniques that you can use to improve query performance for your migrated data warehouse. To make further improvements and learn more best practices, see the extensive documentation on optimizing query performance.
Try out other Google Cloud features for yourself. Have a look at our tutorials.