Querying Clustered Tables

Querying clustered tables

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

To optimize performance when you run queries against clustered tables, use an expression that filters on a clustered column or on multiple clustered columns in the order the clustered columns are specified. Queries that filter on clustered columns generally perform better than queries that filter only on non-clustered columns.

BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks.

When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query. This allows BigQuery to only scan the relevant blocks — a process referred to as block pruning.

You can query clustered tables by using the BigQuery web UI, the command-line tool's bq query command, or by calling the jobs.insert API method and configuring a query job.

Currently, you can only use standard SQL with clustered tables.

Required permissions

At the dataset level, querying a clustered table requires READER access to the dataset that contains the table.

Instead of using dataset-level permissions, you can leverage a project-level IAM role that includes bigquery.tables.getData permissions. bigquery.tables.getData permissions are required to read the data in the table being queried.

The following predefined, project-level IAM roles include bigquery.tables.getData permissions for every dataset in the project:

You must also be granted bigquery.jobs.create permissions to run query jobs. The following predefined, project-level IAM roles include bigquery.jobs.create permissions:

For more information on IAM roles and permissions in BigQuery, see access control. For more information on dataset-level roles, see Primitive roles for datasets.

Best practices

To get the best performance from queries against clustered tables, use the following best practices.

Sample table used in the examples

The sample table used in the examples on this page is a clustered table that is created by using a DDL statement. The DDL statement creates a table named ClusteredSalesData. The table is clustered by the following columns: first by customer_id then by product_id then by order_id.

CREATE TABLE
  `mydataset.ClusteredSalesData`
PARTITION BY
  DATE(timestamp)
CLUSTER BY
  customer_id,
  product_id,
  order_id AS
SELECT
  *
FROM
  `mydataset.SalesData`

Filter clustered columns in the order they're specified

When you specify a filter, use expressions that filter on the clustered columns in sort order.

The following query includes a filter expression that filters on customer_id and then on product_id. This query optimizes performance by filtering the clustered columns in sort order.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000
  AND product_id LIKE 'gcp_analytics%'

The following query does not filter the clustered columns in sort order. As a result, the performance of the query is not optimal. This query filters on product_id then on order_id (skipping customer_id).

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  product_id LIKE 'gcp_analytics%'
  AND order_id = 20000

Do not use clustered columns in complex filter expressions

If you use a clustered column in a complex filter expression, the performance of the query is not optimized because block pruning cannot be applied.

For example, the following query will not prune blocks because a clustered column — customer_id — is used in a function in the filter expression.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  CAST(customer_id AS STRING) = "10000"

To optimize query performance by pruning blocks, use simple filter expressions like the following. In this example, a simple filter is applied to the clustered column — customer_id.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = 10000

Do not compare clustered columns to other columns

If a filter expression compares a clustered column to another column (either a clustered column or a non-clustered column), the performance of the query is not optimized because block pruning cannot be applied.

The following query does not prune blocks because the filter expression compares a clustered column — customer_id to another column — order_id.

SELECT
  SUM(totalSale)
FROM
  `mydataset.ClusteredSalesData`
WHERE
  customer_id = order_id

Next steps

Was this page helpful? Let us know how we did:

Send feedback about...