Introduction to clustered tables
Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.
In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution.
When you cluster a table using multiple columns, the column order determines which columns take precedence when BigQuery sorts and groups the data into storage blocks. The following example compares the logical storage block layout of an unclustered table with the layout of clustered tables that have one or multiple clustered columns:
When you query a clustered table, you do not receive an accurate query cost estimate before query execution because the number of storage blocks to be scanned is not known before query execution. The final cost is determined after query execution is complete and is based on the specific storage blocks that were scanned.
When to use clustering
You might consider clustering in the following scenarios:
- Your queries commonly filter on particular columns. Clustering accelerates queries because the query only scans the blocks that match the filter.
- Your queries filter on columns that have many distinct values (high cardinality). Clustering accelerates these queries by providing BigQuery with detailed metadata for where to get input data.
- You do not need strict cost estimates before query execution.
You might consider alternatives to clustering in the following circumstances:
- You need a strict query cost estimate before you run a query. The cost of queries over clustered tables can only be determined after the query is run.
- Your query tables are smaller than 1 GB. Typically, clustering does not offer significant performance gains on tables less than 1 GB.
Because clustering addresses how a table is stored, it's generally a good first option for improving query performance. If you require different or additional methods, you might consider table partitioning.
Like clustering, partitioning uses user-defined partition columns to specify how data is partitioned and what data is stored in each partition. Unlike clustering, partitioning provides granular query cost estimates before you run a query. Like clustering, partitioning doesn't necessarily reduce the volume of data that's scanned in a query. The volume depends on what you set as the partition columns and how those columns are used as query filters during execution.
Another method is to combine clustering and table partitioning. In this approach, you first segment data into partitions, and then you cluster the data within each partition by the clustering columns. For more information, see Clustered and partitioned tables in this document.
Cluster column types and ordering
This section describes column types and how column order works in table clustering.
Cluster column types
Cluster columns must be top-level, non-repeated columns that are one of the following types:
For more information about data types, see GoogleSQL data types.
Cluster column ordering
The order of clustered columns affects query performance. To benefit from clustering, the query filter order must match the clustered column order and must include at least the first clustered column.
In the following example, the orders table is clustered using a column sort
Status. A query that filters on
Country is optimized for clustering, but a query that filters
Status is not optimized. To optimize your clustering
results, you must filter from clustered columns in order starting from the first
Combining clustered and partitioned tables
You can combine table clustering with table partitioning to achieve finely-grained sorting for further query optimization.
In a partitioned table, data is stored in physical blocks, each of which holds one partition of data. Each partitioned table maintains various metadata about the sort properties across all operations that modify it. The metadata lets BigQuery more accurately estimate a query cost before the query is run. However, partitioning requires BigQuery to maintain more metadata than with an unpartitioned table. As the number of partitions increases, the amount of metadata to maintain increases.
When you create a table that is clustered and partitioned, you can achieve more finely grained sorting, as the following diagram shows:
As data is added to a clustered table, the new data is organized into blocks, which might create new storage blocks or update existing blocks. Block optimization is required for optimal query and storage performance because new data might not be grouped with existing data that has the same cluster values.
To maintain the performance characteristics of a clustered table, BigQuery performs automatic reclustering in the background. For partitioned tables, clustering is maintained for data within the scope of each partition.
- Only GoogleSQL is supported for querying clustered tables and for writing query results to clustered tables.
- You can only specify up to four clustering columns. If you need additional columns, consider combining clustering with partitioning.
- When using
STRINGtype columns for clustering, BigQuery uses only the first 1,024 characters to cluster the data. The values in the columns can themselves be longer than 1,024 characters.
- If you alter an existing non-clustered table to be clustered, the existing data is not clustered. Only new data is stored using the clustered columns and is subject to automatic reclustering.
Clustered table quotas and limits
BigQuery restricts the use of shared Google Cloud resources with quotas and limits, including limitations on certain table operations or the number of jobs run within a day.
When you use the clustered table feature with a partitioned table, you are subject to the limits on partitioned tables.
Quotas and limits also apply to the different types of jobs that you can run against clustered tables. For information about the job quotas that apply to your tables, see Jobs in "Quotas and Limits".
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries that you run against the data. For more information, see Storage pricing and Query pricing.
Like other BigQuery table operations, clustered table operations take advantage of BigQuery free operations such as batch load, table copy, automatic reclustering, and data export. These operations are subject to BigQuery quotas and limits. For information about free operations, see Free operations.
For a detailed clustered table pricing example, see Estimate storage and query costs.
To control access to tables in BigQuery, see Introduction to table access controls.
- To learn how to create and use clustered tables, see Creating and using clustered tables.
- For information about querying clustered tables, see Querying clustered tables.