Querying clustered tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
To optimize performance when you run queries against clustered tables, use an expression that filters on a clustered column or on multiple clustered columns in the order the clustered columns are specified. Queries that filter on clustered columns generally perform better than queries that filter only on non-clustered columns.
BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks.
When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query. This allows BigQuery to only scan the relevant blocks — a process referred to as block pruning.
You can query clustered tables by:
- Using the Google Cloud console
- Using the bq command-line tool's
bq query
command - Calling the
jobs.insert
method and configuring a query job - Using the client libraries
Currently, you can only use GoogleSQL with clustered tables.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Required permissions
To run a query job, you need the
bigquery.jobs.create
Identity and Access Management (IAM) permission on the project that
runs the query job.
Each of the following predefined IAM roles includes the permissions that you need to run a query job:
roles/bigquery.admin
roles/bigquery.jobUser
roles/bigquery.user
You also need the bigquery.tables.getData
permission
on all tables and views that your query references. In addition, when querying
a view you need this permission on all underlying tables and views.
However, if you are using authorized views
or authorized datasets, you don't need
access to the underlying source data.
Each of the following predefined IAM roles includes the permission that you need on all tables and views that the query references:
roles/bigquery.admin
roles/bigquery.dataOwner
roles/bigquery.dataEditor
roles/bigquery.dataViewer
For more information about IAM roles in BigQuery, see Predefined roles and permissions.
Best practices
To get the best performance from queries against clustered tables, use the following best practices.
For context, the sample table used in the best practice examples is a
clustered table that is created by using a DDL statement. The DDL statement
creates a table named ClusteredSalesData
. The table is clustered by the
following columns: customer_id
, product_id
, order_id
, in that sort order.
CREATE TABLE `mydataset.ClusteredSalesData` PARTITION BY DATE(timestamp) CLUSTER BY customer_id, product_id, order_id AS SELECT * FROM `mydataset.SalesData`
Filter clustered columns by sort order
When you specify a filter, use expressions that filter on the clustered columns
in sort order. Sort order is the column order given in the CLUSTER BY
clause.
To get the benefits of clustering, include all of the clustered
columns or a subset of the columns in left-to-right sort order, starting with
the first column. For example, if the column sort order is A
, B
, C
, a
query that filters on A
and B
might benefit from clustering, but a query
that filters on B
and C
does not. The ordering of the column names inside
the filter expression doesn't affect performance.
The following example queries the ClusteredSalesData
clustered table
that was created in the preceding example. The query includes a filter
expression that filters on customer_id
and then on product_id
. This query
optimizes performance by filtering the clustered columns in sort
order—the column order given in the CLUSTER BY
clause.
SELECT SUM(totalSale) FROM `mydataset.ClusteredSalesData` WHERE customer_id = 10000 AND product_id LIKE 'gcp_analytics%'
The following query does not filter the clustered columns in sort order. As a
result, the performance of the query is not optimal. This query filters on
product_id
then on order_id
(skipping customer_id
).
SELECT SUM(totalSale) FROM `mydataset.ClusteredSalesData` WHERE product_id LIKE 'gcp_analytics%' AND order_id = 20000
Do not use clustered columns in complex filter expressions
If you use a clustered column in a complex filter expression, the performance of the query is not optimized because block pruning cannot be applied.
For example, the following query will not prune blocks because a clustered
column—customer_id
—is used in a function in the filter
expression.
SELECT SUM(totalSale) FROM `mydataset.ClusteredSalesData` WHERE CAST(customer_id AS STRING) = "10000"
To optimize query performance by pruning blocks, use simple filter expressions
like the following. In this example, a simple filter is applied to the
clustered column—customer_id
.
SELECT SUM(totalSale) FROM `mydataset.ClusteredSalesData` WHERE customer_id = 10000
Do not compare clustered columns to other columns
If a filter expression compares a clustered column to another column (either a clustered column or a non-clustered column), the performance of the query is not optimized because block pruning cannot be applied.
The following query does not prune blocks because the filter expression compares
a clustered column—customer_id
to another column—order_id
.
SELECT SUM(totalSale) FROM `mydataset.ClusteredSalesData` WHERE customer_id = order_id
Table security
To control access to tables in BigQuery, see Introduction to table access controls.
What's next
- For more information on running queries, see Running interactive and batch queries.
- To learn how to create and use clustered tables, see Creating and using clustered tables.
- For an overview of partitioned table support in BigQuery, see Introduction to partitioned tables.
- To learn how to create partitioned tables, see Creating partitioned tables.