This document provides an overview of table clustering support in BigQuery.
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
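The block elimination described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not BigQuery's actual storage format: the `Block` class, the fixed `block_size`, and the `customer_id` clustering column are all hypothetical. It sorts rows on the clustering column, cuts them into blocks, and skips any block whose key range cannot contain the filtered value.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int]  # (customer_id, amount); customer_id is the hypothetical clustering column


@dataclass
class Block:
    """A toy storage block holding rows that were sorted on the clustering column."""
    rows: List[Row]

    @property
    def min_key(self) -> str:
        return min(key for key, _ in self.rows)

    @property
    def max_key(self) -> str:
        return max(key for key, _ in self.rows)


def write_clustered(rows: List[Row], block_size: int) -> List[Block]:
    """Sort rows by the clustering column, then cut them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda row: row[0])
    return [Block(ordered[i:i + block_size]) for i in range(0, len(ordered), block_size)]


def query_sum(blocks: List[Block], customer_id: str) -> Tuple[int, int]:
    """Sum amounts for one customer; return (total, number of blocks scanned)."""
    total = 0
    scanned = 0
    for block in blocks:
        if not (block.min_key <= customer_id <= block.max_key):
            continue  # the block's key range excludes the value, so it is skipped
        scanned += 1
        total += sum(amount for key, amount in block.rows if key == customer_id)
    return total, scanned


rows = [("c3", 5), ("c1", 10), ("c2", 7), ("c1", 1),
        ("c4", 2), ("c3", 4), ("c2", 3), ("c4", 8)]
blocks = write_clustered(rows, block_size=2)
total, scanned = query_sum(blocks, "c2")
print(total, scanned, len(blocks))  # prints: 10 1 4
```

In this toy data set, the filter on `customer_id` reads one of four blocks. The same colocation means an aggregation per customer finds all of a customer's rows in adjacent blocks.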
When to use clustering
Currently, BigQuery supports clustering only over a partitioned table. Use clustering over a partitioned table when:
- Your data is already partitioned on a date or timestamp column.
- You commonly use filters or aggregation against particular columns in your queries.
When you use clustering and partitioning together, the data can be partitioned by a date or timestamp column and then clustered on a different set of columns. In this case, data in each partition is clustered based on the values of the clustering columns. Partitioning provides a way to obtain accurate cost estimations for queries (based on the partitions that are scanned).
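As a rough sketch of how partitioning and clustering combine, the helper below assembles a standard SQL `CREATE TABLE` statement with a `PARTITION BY` clause and a `CLUSTER BY` clause. The helper name `clustered_table_ddl`, the table `mydataset.orders`, and all column names are hypothetical and chosen only for illustration. The authoritative syntax is in Creating and Using Clustered Tables.

```python
from typing import Dict, Sequence


def clustered_table_ddl(
    table: str,
    schema: Dict[str, str],
    partition_expr: str,
    cluster_cols: Sequence[str],
) -> str:
    """Build a CREATE TABLE statement for a partitioned, clustered table.

    The order of cluster_cols is preserved, because the order of the
    clustering columns determines the sort order of the data.
    """
    unknown = [col for col in cluster_cols if col not in schema]
    if unknown:
        raise ValueError(f"clustering columns not in schema: {unknown}")
    columns = ",\n".join(f"  {name} {sql_type}" for name, sql_type in schema.items())
    return (
        f"CREATE TABLE {table} (\n{columns}\n)\n"
        f"PARTITION BY {partition_expr}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


# Hypothetical table: partitioned by day, then clustered by country and customer.
ddl = clustered_table_ddl(
    table="mydataset.orders",
    schema={
        "order_ts": "TIMESTAMP",
        "country": "STRING",
        "customer_id": "STRING",
        "amount": "NUMERIC",
    },
    partition_expr="DATE(order_ts)",
    cluster_cols=["country", "customer_id"],
)
print(ddl)
```

With this layout, a query that filters on `country` (and optionally `customer_id`) within a date range would be limited first by partition and then by the sorted blocks inside each partition.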
Clustering partitioned tables
In a table partitioned by a date or timestamp column, each partition contains a single day of data. When the data is stored, BigQuery ensures that all the data in a block belongs to a single partition. A partitioned table maintains these properties across all operations that modify it: query jobs, Data Manipulation Language (DML) statements, Data Definition Language (DDL) statements, load jobs, and copy jobs. This requires BigQuery to maintain more metadata than a non-partitioned table. As the number of partitions increases, the amount of metadata overhead increases.
Although more metadata must be maintained, by ensuring that data is partitioned globally, BigQuery can more accurately estimate the bytes processed by a query before you run it. This cost calculation provides an upper bound on the final cost of the query.
In a clustered table, BigQuery automatically sorts the data based on the values in the clustering columns and organizes it into optimally sized storage blocks. You can achieve finer-grained sorting by creating a table that is both clustered and partitioned. A clustered table maintains the sort properties only in the context of each operation that modifies it. As a result, BigQuery may not be able to accurately estimate the bytes processed by a query, or the query cost, before the query runs. When blocks of data are eliminated during query execution, BigQuery provides a best-effort reduction of the query cost.
Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted. In a partially sorted table, queries that use the clustering columns may need to scan more blocks compared to a table that is fully sorted. You can re-cluster the data in the entire table by running a SELECT * query that selects from and overwrites the table (or any specific partition in it). In addition, any arbitrary portion of the table can be re-clustered using a DML statement.
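The effect of partial sorting can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a description of BigQuery internals: each write is sorted only within itself, blocks are represented by their (min, max) key range, and the helper names `to_blocks` and `blocks_scanned` are hypothetical. Rewriting all rows together stands in for re-clustering by selecting from and overwriting the table.

```python
from typing import List, Tuple

KeyRange = Tuple[int, int]  # (min key, max key) of one toy storage block


def to_blocks(keys: List[int], block_size: int) -> List[KeyRange]:
    """Sort the keys of a single write and cut them into blocks of key ranges."""
    ordered = sorted(keys)
    chunks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    return [(chunk[0], chunk[-1]) for chunk in chunks]


def blocks_scanned(blocks: List[KeyRange], key: int) -> int:
    """Count the blocks whose key range could contain the filtered key."""
    return sum(1 for low, high in blocks if low <= key <= high)


# Three separate writes; each one is sorted on its own, not against the others.
writes = [
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
]

# Partially sorted table: block key ranges from different writes overlap.
partially_sorted = [block for write in writes for block in to_blocks(write, block_size=2)]

# Re-clustered table: all rows are rewritten together, so key ranges no longer overlap.
all_keys = [key for write in writes for key in write]
reclustered = to_blocks(all_keys, block_size=2)

print(blocks_scanned(partially_sorted, 5), blocks_scanned(reclustered, 5))  # prints: 3 1
```

In this toy example, a filter on key 5 touches three blocks before the rewrite and one block after it.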
Clustered table quotas and limits
When you use the clustered table feature with a partitioned table, you are subject to the limits on partitioned tables.
Quotas and limits also apply to the different types of jobs you can run against clustered tables, including:
- Loading data (load jobs)
- Exporting data (export jobs)
- Querying data (query jobs)
- Copying tables (copy jobs)
For more information on all quotas and limits, see Quotas and Limits.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data:
- For information on storage pricing, see Storage pricing.
- For information on query pricing, see Query pricing.
Many clustered table operations are free, including loading data into clustered tables, copying tables and partitions, and exporting data. Though free, these operations are subject to BigQuery's Quotas and Limits. For information on all free operations, see Free operations on the pricing page.
For a detailed clustered table pricing example, see the Pricing page.
Features under development
The following features are being developed but are not currently available in the beta release:
- Support for clustering non-partitioned tables.
Next steps
- To learn how to create and use clustered tables, see Creating and Using Clustered Tables.
- For information on querying clustered tables, see Querying Clustered Tables.
- For an overview of partitioned table support in BigQuery, see Introduction to Partitioned Tables.
- To learn how to create and use ingestion-time partitioned tables, see Creating and Using Ingestion-Time Partitioned Tables.
- To learn how to create and use partitioned tables, see Creating and Using Partitioned Tables.