This document describes how to create and use clustered tables in BigQuery. For an overview of clustered table support in BigQuery, see Introduction to clustered tables.
Limitations
Clustered tables in BigQuery are subject to the following limitations:
- Clustering is supported only for partitioned tables.
- Only standard SQL is supported for querying clustered tables and for writing query results to clustered tables.
- You can only specify clustering columns when a table is created.
- After a clustered table is created, you cannot modify the clustering columns.
- Clustering columns must be top-level, non-repeated columns of one of the following types:
  - DATE
  - BOOL
  - GEOGRAPHY
  - INT64
  - NUMERIC
  - STRING
  - TIMESTAMP
  For more information about data types, see Standard SQL data types.
- You can specify up to four clustering columns.
Creating clustered tables
Currently, you can only cluster a partitioned table. This includes both ingestion-time partitioned tables and tables partitioned by a TIMESTAMP or DATE column.
You can create a clustered table in BigQuery:
- From query results:
  - By using a DDL CREATE TABLE AS SELECT statement
  - By running a query that creates a clustered destination table
- By using a DDL CREATE TABLE statement with a CLUSTER BY clause containing a clustering_column_list
- Manually, by using the command-line tool's bq mk command
- Programmatically, by calling the tables.insert API method
- When you load data
- By using the client libraries
Table naming
When you create a table in BigQuery, the table name must be unique per dataset. The table name can:
- Contain up to 1,024 characters
- Contain letters (uppercase or lowercase), numbers, and underscores
Required permissions
At a minimum, to create a table, you must be granted the following permissions:
- bigquery.tables.create permissions to create the table
- bigquery.tables.updateData to write data to the table by using a load job, a query job, or a copy job
- bigquery.jobs.create to run a query job, load job, or copy job that writes data to the table
Additional permissions such as bigquery.tables.getData may be required to access the data you're writing to the table.
The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
The following predefined Cloud IAM roles include bigquery.jobs.create permissions:
- bigquery.user
- bigquery.jobUser
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and update tables in the dataset.
For more information on Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Creating an empty clustered table with a schema definition
You specify clustering columns when you create a table in BigQuery. After the table is created, you cannot modify the clustering columns. You can only specify clustering columns for a partitioned table.
Clustering columns must be top-level, non-repeated columns of one of the following simple data types:
- DATE
- BOOL
- GEOGRAPHY
- INT64
- NUMERIC
- STRING
- TIMESTAMP
You can specify up to four clustering columns. When you specify multiple columns, their order determines how the data is sorted. For example, if the table is clustered by columns a, b, and c, the data is sorted in that same order: first by column a, then by column b, and then by column c. As a best practice, place the most frequently filtered or aggregated column first.
The order of your clustering columns also affects query performance and pricing. For more information about query best practices for clustered tables, see Querying clustered tables.
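As an illustration of this ordering, a DDL statement like the following (the table and column names here are hypothetical) sorts the data within each partition first by customer_id and then by product_id:

```sql
-- Hypothetical table: within each daily partition, rows are sorted
-- first by customer_id, then by product_id.
CREATE TABLE mydataset.sales_by_customer
(
  transaction_time TIMESTAMP,
  customer_id STRING,
  product_id STRING,
  amount NUMERIC
)
PARTITION BY DATE(transaction_time)
CLUSTER BY customer_id, product_id;
```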
To create an empty clustered table with a schema definition:
Console
In the Google Cloud Console, go to the BigQuery web UI.
In the navigation panel, in the Resources section, expand your project and select a dataset.
On the right side of the window, in the details panel, click Create table.
On the Create table page, under Source, for Create table from, select Empty table.
Under Destination:
- For Dataset name, choose the appropriate dataset, and in the Table name field, enter the name of the table you're creating.
- Verify that Table type is set to Native table.
Under Schema, enter the schema definition.
Enter schema information manually by:
- Enabling Edit as text and entering the table schema as a JSON array.
- Using Add field to manually input the schema.
Under Partition and cluster settings, select Partition by field and choose the DATE or TIMESTAMP column. This option is not available if the schema does not contain a DATE or TIMESTAMP column. To create an ingestion-time partitioned table, select Partition by ingestion time instead.
(Optional) For Partitioning filter, click the Require partition filter check box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter can reduce cost and improve performance. For more information, see Querying partitioned tables.
(Optional) For Clustering order, enter between one and four comma-separated column names.
(Optional) Click Advanced options and for Encryption, click Customer-managed key to use a Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
Click Create table.
Classic UI
Go to the classic BigQuery web UI.
In the navigation pane, click the down arrow icon next to your dataset name and click Create new table.
On the Create Table page, in the Source Data section, click Create empty table.
On the Create Table page, in the Destination Table section:
- For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating.
- Verify that Table type is set to Native table.
In the Schema section, manually enter the schema definition.
You can enter schema information manually by:
- Clicking Edit as text and entering the table schema as a JSON array.
- Using Add Field to input the schema.
Under Options:
- For Partitioning Type, click None and choose Day.
- For Partitioning Field, choose one of the following:
  - Choose a TIMESTAMP or DATE column to create a table partitioned by that column.
  - Choose _PARTITIONTIME to create an ingestion-time partitioned table.
- For Clustering Fields, enter between one and four field names.
For Destination Encryption, leave the Default option. This property is for customer-managed encryption keys. By default, BigQuery encrypts customer content stored at rest.
Click Create Table.
After the table is created, you can update the clustered table's table expiration, description, and labels. You cannot add a partition expiration after a table is created using the BigQuery web UI.
CLI
Use the bq mk command with the following flags:
- --table (or the -t shortcut).
- --schema. You can supply the table's schema definition inline or use a JSON schema file.
- Either --time_partitioning_type (for ingestion-time partitioned tables) or --time_partitioning_field (for column-partitioned tables). Currently, DAY is the only supported value for --time_partitioning_type.
- --clustering_fields. You can specify up to four clustering columns.
Optional parameters include --expiration, --description, --time_partitioning_expiration, --destination_kms_key, and --label.
If you are creating a table in a project other than your default project, add the project ID to the dataset in the following format: project_id:dataset.
--destination_kms_key is not demonstrated here. For information about using --destination_kms_key, see customer-managed encryption keys.
Enter the following command to create an empty clustered table with a schema definition:
bq mk \
--table \
--expiration integer1 \
--schema schema \
--time_partitioning_type=DAY \
--time_partitioning_field partition_column \
--clustering_fields cluster_columns \
--time_partitioning_expiration integer2 \
--description "description" \
--label key:value,key:value \
project_id:dataset.table
Replace the following:
- integer1: The default lifetime (in seconds) for the table. The minimum value is 3,600 seconds (one hour). The expiration time evaluates to the current UTC time plus the integer value. If you set the table's expiration time when you create a time-partitioned table, the dataset's default table expiration setting is ignored. Setting this value deletes the table and all partitions after the specified time.
- schema: An inline schema definition in the format column:data_type,column:data_type, or the path to a JSON schema file on your local machine.
- partition_column: The name of the TIMESTAMP or DATE column used to create a column-partitioned table. If you specify --time_partitioning_field, you do not need to specify the --time_partitioning_type=DAY flag.
- cluster_columns: A comma-separated list of up to four clustering columns.
- integer2: The default lifetime (in seconds) for the table's partitions. There is no minimum value. The expiration time evaluates to the partition's date plus the integer value. The partition expiration is independent of the table's expiration but does not override it. If you set a partition expiration that is longer than the table's expiration, the table expiration takes precedence.
- description: A description of the table, in quotes.
- key:value: The key-value pair that represents a label. You can enter multiple labels using a comma-separated list.
- project_id: Your project ID.
- dataset: A dataset in your project.
- table: The name of the partitioned table you're creating.
When you specify the schema on the command line, you cannot include a RECORD (STRUCT) type, you cannot include a column description, and you cannot specify the column's mode. All modes default to NULLABLE. To include descriptions, modes, and RECORD types, supply a JSON schema file instead.
Examples:
Enter the following command to create a clustered table named myclusteredtable in mydataset in your default project. The table is a partitioned table (partitioned by a TIMESTAMP column). The partitioning expiration is set to 86,400 seconds (1 day), the table's expiration is set to 2,592,000 seconds (30 days), the description is set to This is my clustered table, and the label is set to org:dev. The command uses the -t shortcut instead of --table.
The schema is specified inline as timestamp:timestamp,customer_id:string,transaction_amount:float. The specified clustering field customer_id is used to cluster the partitions.
bq mk -t \
--expiration 2592000 \
--schema 'timestamp:timestamp,customer_id:string,transaction_amount:float' \
--time_partitioning_field timestamp \
--clustering_fields customer_id \
--time_partitioning_expiration 86400 \
--description "This is my clustered table" \
--label org:dev \
mydataset.myclusteredtable
Enter the following command to create a clustered table named myclusteredtable in myotherproject, not your default project. The table is an ingestion-time partitioned table. The partitioning expiration is set to 259,200 seconds (3 days), the description is set to This is my partitioned table, and the label is set to org:dev. The command uses the -t shortcut instead of --table. This command does not specify a table expiration. If the dataset has a default table expiration, it is applied. If the dataset has no default table expiration, the table never expires, but the partitions expire in 3 days.
The schema is specified in a local JSON file: /tmp/myschema.json. The customer_id field is used to cluster the partitions.
bq mk -t \
--schema /tmp/myschema.json \
--time_partitioning_type=DAY \
--clustering_fields=customer_id \
--time_partitioning_expiration 259200 \
--description "This is my partitioned table" \
--label org:dev \
myotherproject:mydataset.myclusteredtable
After the table is created, you can update the partitioned table's table expiration, partition expiration, description, and labels.
API
Call the tables.insert method with a defined table resource that specifies the timePartitioning property, the clustering.fields property, and the schema property.
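As a rough sketch, the table resource body for such a call might look like the following. The timePartitioning, clustering.fields, and schema properties are the ones named above; the project, dataset, and column names are illustrative only:

```python
# Illustrative tables.insert request body for a clustered,
# column-partitioned table. The property names follow the BigQuery
# REST API; the identifier values are made up for this example.
table_resource = {
    "tableReference": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "myclusteredtable",
    },
    "schema": {
        "fields": [
            {"name": "timestamp", "type": "TIMESTAMP"},
            {"name": "customer_id", "type": "STRING"},
            {"name": "transaction_amount", "type": "NUMERIC"},
        ]
    },
    # Partition by the timestamp column, one partition per day.
    "timePartitioning": {"type": "DAY", "field": "timestamp"},
    # Cluster each partition by customer_id.
    "clustering": {"fields": ["customer_id"]},
}
```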
Go
Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.
Creating a clustered table from a query result
There are two ways to create a clustered table from a query result:
- Write the results to a new destination table and specify the clustering columns. This method is discussed below.
- Use a DDL CREATE TABLE AS SELECT statement. For more information about this method, see Creating a clustered table from the result of a query on the Using Data Definition Language statements page.
You can create a clustered table by querying either a partitioned table or a non-partitioned table. You cannot change an existing table to a clustered table by using query results.
When you create a clustered table from a query result, you must use standard SQL. Currently, legacy SQL is not supported for querying clustered tables or for writing query results to clustered tables.
Console
You cannot specify clustering options for a destination table when you query data using the Cloud Console unless you use a DDL statement. For more information, see Using Data Definition Language statements.
Classic UI
You cannot specify clustering options for a destination table when you query data using the classic BigQuery web UI unless you use a DDL statement. For more information, see Using Data Definition Language statements.
CLI
Enter the following command to create a new, clustered destination table from a query result:
bq --location=location query \
--use_legacy_sql=false \
'query'
Replace the following:
- location: The name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location by using the .bigqueryrc file.
- query: A query in standard SQL syntax. Currently, you cannot use legacy SQL to query clustered tables or to write query results to clustered tables. The query can contain a CREATE TABLE DDL statement that specifies the options for creating your clustered table. You can use DDL rather than specifying the individual command-line flags.
Examples:
Enter the following command to write query results to a clustered destination table named myclusteredtable in mydataset. mydataset is in your default project. The query retrieves data from a non-partitioned table, mytable. The table's customer_id column is used to cluster the table. The table's timestamp column is used to create a partitioned table.
bq query --use_legacy_sql=false \
'CREATE TABLE
mydataset.myclusteredtable
PARTITION BY
DATE(timestamp)
CLUSTER BY
customer_id AS
SELECT
*
FROM
`mydataset.mytable`'
API
To save query results to a clustered table, call the jobs.insert method, configure a query job, and include a CREATE TABLE DDL statement that creates your clustered table. Specify your location in the location property in the jobReference section of the job resource.
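A minimal sketch of such a job resource follows. The jobReference and configuration.query property names are part of the BigQuery REST API; the project, dataset, and location values are illustrative:

```python
# Illustrative jobs.insert request body: a query job whose SQL is a
# CREATE TABLE DDL statement with PARTITION BY and CLUSTER BY clauses.
ddl = """
CREATE TABLE mydataset.myclusteredtable
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id AS
SELECT * FROM `mydataset.mytable`
"""

job_resource = {
    # The job's location goes in the jobReference section.
    "jobReference": {"projectId": "my-project", "location": "US"},
    "configuration": {
        "query": {
            "query": ddl,
            # Clustered tables require standard SQL.
            "useLegacySql": False,
        }
    },
}
```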
Creating a clustered table when you load data
You can create a clustered table by specifying clustering columns when you load data into a new table. You do not need to create an empty table before loading data into it. You can create the clustered table and load your data at the same time.
For more information about loading data, see Introduction to loading data into BigQuery.
To define clustering when defining a load job:
API
To define a clustering configuration when creating a table through a load job, you can populate the Clustering properties for the table.
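As a sketch, the load job configuration might carry the clustering and partitioning settings like this. The configuration.load property names follow the BigQuery REST API; the bucket, dataset, and column names are made up for this example:

```python
# Illustrative jobs.insert request body for a load job that creates a
# clustered, partitioned destination table in one step.
load_job_resource = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/data.csv"],
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "mydataset",
                "tableId": "myclusteredtable",
            },
            "schema": {
                "fields": [
                    {"name": "timestamp", "type": "TIMESTAMP"},
                    {"name": "customer_id", "type": "STRING"},
                ]
            },
            # Partitioning and clustering are set on the load
            # configuration, so no empty table is needed beforehand.
            "timePartitioning": {"type": "DAY", "field": "timestamp"},
            "clustering": {"fields": ["customer_id"]},
        }
    }
}
```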
Go
Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.
Controlling access to clustered tables
You cannot assign access controls directly to tables or views. The lowest level of BigQuery resources to which you are able to grant access is the dataset level. To configure access to tables and views, you grant a Cloud IAM role to an entity at the dataset level or higher.
Granting a role at the dataset level specifies the operations an entity is allowed to perform on tables and views in that specific dataset. For information on configuring dataset-level access controls, see Controlling access to datasets.
You can also grant Cloud IAM roles at a higher level in the Google Cloud resource hierarchy such as the project, folder, or organization level. Granting roles at a higher level gives the entity access to a broader set of resources. For example, granting a role to an entity at the project level gives that entity permissions that apply to all datasets throughout the project. For more information on granting access to resources, see Granting, changing, and revoking access to resources in the Cloud IAM documentation.
You can also create Cloud IAM custom roles. If you create a custom role, the permissions you grant depend on the specific operations you want the entity to be able to perform.
For more information on roles and permissions, see:
- Understanding roles in the Cloud IAM documentation
- BigQuery Predefined roles and permissions
- Controlling access to datasets
Using clustered tables
Getting information about clustered tables
You can get information about tables by:
- Using the Cloud Console or the classic BigQuery web UI
- Using the bq show CLI command
- Calling the tables.get API method
- Querying INFORMATION_SCHEMA views
Required permissions
At a minimum, to get information about tables, you must be granted bigquery.tables.get permissions. The following predefined Cloud IAM roles include bigquery.tables.get permissions:
- bigquery.metadataViewer
- bigquery.dataViewer
- bigquery.dataOwner
- bigquery.dataEditor
- bigquery.admin
In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to get information about tables in a dataset.
For more information about Cloud IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Getting clustered table information
To view information about a clustered table:
Console
In the Google Cloud Console, go to the Resources pane. Click your dataset name to expand it, and then click the table name you want to view.
Click Details. This page displays the table's details including the clustering columns.
Classic UI
In the navigation pane, click the down arrow icon
to the left of your dataset to expand it, or double-click the dataset name. This displays the tables and views in the dataset.
Click the table name.
Click Details. The Table Details page displays the table's details including the clustering columns.
CLI
Issue the bq show command to display all table information. Use the --schema flag to display only table schema information. The --format flag can be used to control the output.
If you are getting information about a table in a project other than your default project, add the project ID to the dataset in the following format: project_id:dataset.
bq show \
--schema \
--format=prettyjson \
project_id:dataset.table
Replace the following:
- project_id: Your project ID.
- dataset: The name of the dataset.
- table: The name of the table.
Examples:
Enter the following command to display all information about myclusteredtable in mydataset. mydataset is in your default project.
bq show --format=prettyjson mydataset.myclusteredtable
The output should look like the following:
{
  "clustering": {
    "fields": [
      "customer_id"
    ]
  },
  ...
}
API
Call the tables.get method and provide any relevant parameters.
SQL
For clustered tables, you can query the CLUSTERING_ORDINAL_POSITION column in the INFORMATION_SCHEMA.COLUMNS view to retrieve information about a table's clustering columns.
-- Set up a table with clustering.
CREATE TABLE myDataset.data (column1 INT64, column2 INT64)
PARTITION BY _PARTITIONDATE
CLUSTER BY column1, column2;

-- This query returns 1 for column1 and 2 for column2.
SELECT column_name, clustering_ordinal_position
FROM myDataset.INFORMATION_SCHEMA.COLUMNS;
More table metadata is available through the TABLES, TABLE_OPTIONS, COLUMNS, and COLUMN_FIELD_PATH views in INFORMATION_SCHEMA.
Listing clustered tables in a dataset
You can list clustered tables in datasets by using the following approaches:
- Using the Cloud Console or the classic BigQuery web UI
- Using the bq ls CLI command
- Calling the tables.list API method
- Using the client libraries
- Querying the CLUSTERING_ORDINAL_POSITION column in the INFORMATION_SCHEMA.COLUMNS view
The permissions required to list clustered tables, and the steps to list them, are the same as for partitioned tables. For more information about listing tables, see Listing partitioned tables in a dataset.
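For the INFORMATION_SCHEMA approach, a query like the following (the dataset name is illustrative) returns each clustered table in a dataset along with its clustering columns in order:

```sql
-- List clustered tables in mydataset and their clustering columns.
-- Non-clustering columns have a NULL clustering_ordinal_position
-- and are filtered out.
SELECT table_name, column_name, clustering_ordinal_position
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
```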
Features under development
The following features are being developed but are not currently available in the alpha release:
- Support for clustering native (non-partitioned) tables.
- Reducing the costs for particular types of queries that use filters on clustering columns.
Next steps
- For information about querying clustered tables, see Querying clustered tables.
- For an overview of partitioned table support in BigQuery, see Introduction to partitioned tables.
- To learn how to create and use ingestion-time partitioned tables, see Creating and using ingestion-time partitioned tables.
- To learn how to create and use partitioned tables, see Creating and using partitioned tables.
- To see an overview of INFORMATION_SCHEMA, go to Introduction to BigQuery INFORMATION_SCHEMA.