Creating and Using Clustered Tables

This document describes how to create and use clustered tables in BigQuery.

Limitations

Clustered tables in BigQuery are subject to the following limitations:

  • Currently, clustering is supported only for partitioned tables.
  • Only standard SQL is supported for querying clustered tables and for writing query results to clustered tables.
  • You can only specify clustering columns when a table is created.
  • After a clustered table is created, clustering columns cannot be modified.
  • Clustering columns must be top-level, non-repeated columns of one of the following types: INT64, STRING, DATE, TIMESTAMP, BOOL or NUMERIC. For more information on data types, see Standard SQL Data Types.
  • You can specify up to four clustering columns.

Creating clustered tables

Currently, you can only cluster a partitioned table. This includes both ingestion-time partitioned tables and partitioned tables (tables partitioned by a TIMESTAMP or DATE column).

You can create a clustered table in BigQuery by using the web UI, by using the command-line tool's bq mk command, or by calling the tables.insert API method.

When you create a clustered table in BigQuery, the table name must be unique per dataset. The table name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

Required permissions

To create a clustered table, you must have WRITER access at the dataset level, or you must be assigned a project-level IAM role that includes bigquery.tables.create permissions. The following predefined, project-level IAM roles include bigquery.tables.create permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

In addition, because the bigquery.user role has bigquery.datasets.create permissions, a user assigned to the bigquery.user role can create clustered tables in any dataset that user creates. By default, when a user assigned to the bigquery.user role creates a dataset, that user is given OWNER access to the dataset. OWNER access to a dataset gives the user full control over it and all the tables in it.

For more information on IAM roles and permissions in BigQuery, see Access Control. For more information on dataset-level roles, see Primitive roles for datasets.

Creating an empty clustered table with a schema definition

You specify clustering columns when you create a table in BigQuery. After the table is created, you cannot modify the clustering columns. Currently, you can only specify clustering columns for a partitioned table.

Clustering columns must be top-level, non-repeated columns, and they must be one of the following simple data types: INT64, STRING, DATE, TIMESTAMP, BOOL, or NUMERIC.

You can specify up to four clustering columns. When you specify multiple columns, the order of the columns determines how the data is sorted. For example, if the table is clustered by columns a, b and c, the data is sorted in the same order — first by column a, then by column b, and then by column c. As a best practice, the most frequently filtered or aggregated column should appear first.

The order of your clustering columns also affects query performance and pricing. For more information on query best practices for clustered tables, see Querying Clustered Tables.

To create an empty clustered table with a schema definition:

Web UI

  1. Go to the BigQuery web UI.


  2. Click the down arrow icon next to your dataset name in the navigation and click Create new table.

  3. On the Create Table page, in the Source Data section, click Create empty table.

  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating.
    • Verify that Table type is set to Native table.
  5. In the Schema section, manually enter the schema definition.

    • You can enter schema information manually by:

      • Clicking Edit as text and entering the table schema as a JSON array.

      • Using Add Field to input the schema.

  6. In the Options section:

    • For Partitioning Type, click None and choose Day.
    • For Partitioning Field, choose one of the following:
      • Choose timestamp to create a table partitioned by a DATE or TIMESTAMP column.
      • Choose _PARTITIONTIME to create an ingestion-time partitioned table.
    • For Clustering columns, enter between one and four field names.
    • For Encryption Type, leave the Default option. This property is for customer-managed encryption keys. By default, BigQuery encrypts customer content stored at rest.

      Partitioned table details

  7. Click Create Table.

After the table is created, you can update the clustered table's table expiration, description, and labels. You cannot add a partition expiration after a table is created using the BigQuery web UI.

Command-line

Use the bq mk command with the following flags:

  • --table (or the -t shortcut).
  • --schema: You can supply the table's schema definition inline or via a JSON schema file.
  • Either --time_partitioning_type (for ingestion-time partitioned tables) or --time_partitioning_field (for partitioned tables). Currently, DAY is the only supported value for --time_partitioning_type.
  • --clustering_fields to specify up to four clustering columns.

Optional parameters include --expiration, --description, --time_partitioning_expiration, --destination_kms_key, and --label.

If you are creating a table in a project other than your default project, add the project ID to the dataset in the following format: [PROJECT_ID]:[DATASET].

--destination_kms_key is not demonstrated here. For more information on using this flag, see customer-managed encryption keys.

Enter the following command to create an empty clustered table with a schema definition:

bq mk --table --expiration [INTEGER1] --schema [SCHEMA] --time_partitioning_type=DAY --time_partitioning_field [COLUMN] --clustering_fields [COLUMNS] --time_partitioning_expiration [INTEGER2] --description "[DESCRIPTION]" --label [KEY:VALUE, KEY:VALUE] [PROJECT_ID]:[DATASET].[TABLE]

Where:

  • [INTEGER1] is the default lifetime (in seconds) for the table. The minimum value is 3600 seconds (one hour). The expiration time evaluates to the current time plus the integer value. If you set the table's expiration time when you create a time-partitioned table, the dataset's default table expiration setting is ignored. Setting this value deletes the table and all partitions after the specified time.
  • [SCHEMA] is an inline schema definition in the format [FIELD]:[DATA_TYPE],[FIELD]:[DATA_TYPE] or the path to the JSON schema file on your local machine.
  • [COLUMN] is the name of the TIMESTAMP or DATE column used to create a partitioned table. If you specify --time_partitioning_field, you do not need to specify the --time_partitioning_type=DAY flag.
  • [COLUMNS] is a comma-separated list of up to four clustering columns.
  • [INTEGER2] is the default lifetime (in seconds) for the table's partitions. There is no minimum value. The expiration time evaluates to the partition's date plus the integer value. The partition expiration is independent of the table's expiration but does not override it. If you set a partition expiration that is longer than the table's expiration, the table expiration takes precedence.
  • [DESCRIPTION] is a description of the table in quotes.
  • [KEY:VALUE] is the key:value pair that represents a label. You can enter multiple labels using a comma-separated list.
  • [PROJECT_ID] is your project ID.
  • [DATASET] is a dataset in your project.
  • [TABLE] is the name of the partitioned table you're creating.

When you specify the schema on the command line, you cannot include a RECORD (STRUCT) type, you cannot include a column description, and you cannot specify the column's mode. All modes default to NULLABLE. To include descriptions, modes, and RECORD types, supply a JSON schema file instead.
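For example, a JSON schema file that uses modes, descriptions, and a RECORD type might look like the following. The field names here are illustrative; note that the RECORD field itself cannot be used as a clustering column, because clustering columns must be top-level, non-repeated simple types.

```json
[
  {
    "name": "timestamp",
    "type": "TIMESTAMP",
    "mode": "REQUIRED",
    "description": "Time the transaction occurred"
  },
  {
    "name": "customer_id",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "ID of the customer"
  },
  {
    "name": "line_items",
    "type": "RECORD",
    "mode": "REPEATED",
    "description": "Items included in the transaction",
    "fields": [
      {"name": "sku", "type": "STRING"},
      {"name": "amount", "type": "NUMERIC"}
    ]
  }
]
```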

Examples:

Enter the following command to create a clustered table named myclusteredtable in mydataset in your default project. The table is a partitioned table (partitioned by a TIMESTAMP column). The partitioning expiration is set to 86,400 seconds (1 day), the table's expiration is set to 2,592,000 seconds (30 days), the description is set to This is my clustered table, and the label is set to org:dev. The command uses the -t shortcut instead of --table.

The schema is specified inline as: timestamp:timestamp,customer_id:string,transaction_amount:float. The specified clustering field customer_id is used to cluster the partitions.

bq mk -t --expiration 2592000 --schema 'timestamp:timestamp,customer_id:string,transaction_amount:float' --time_partitioning_field timestamp --clustering_fields customer_id --time_partitioning_expiration 86400  --description "This is my clustered table" --label org:dev mydataset.myclusteredtable

Enter the following command to create a clustered table named myclusteredtable in myotherproject, not your default project. The table is an ingestion-time partitioned table. The partitioning expiration is set to 259,200 seconds (3 days), the description is set to This is my partitioned table, and the label is set to org:dev. The command uses the -t shortcut instead of --table. This command does not specify a table expiration. If the dataset has a default table expiration, it is applied. If the dataset has no default table expiration, the table will never expire, but the partitions will expire in 3 days.

The schema is specified in a local JSON file: /tmp/myschema.json. The customer_id field is used to cluster the partitions.

bq mk -t --schema /tmp/myschema.json --time_partitioning_type=DAY --clustering_fields=customer_id --time_partitioning_expiration 259200 --description "This is my partitioned table" --label org:dev myotherproject:mydataset.myclusteredtable
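The contents of /tmp/myschema.json are not shown in this example; a file matching the schema used in the previous inline example might look like the following (a sketch, with illustrative field names):

```json
[
  {"name": "timestamp", "type": "TIMESTAMP"},
  {"name": "customer_id", "type": "STRING"},
  {"name": "transaction_amount", "type": "FLOAT"}
]
```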

After the table is created, you can update the partitioned table's table expiration, partition expiration, description, and labels.

API

Call the tables.insert method with a defined table resource that specifies the timePartitioning property, the clustering.fields property, and the schema property.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import (
// 	"context"
// 	"time"
//
// 	"cloud.google.com/go/bigquery"
// )
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
sampleSchema := bigquery.Schema{
	{Name: "timestamp", Type: bigquery.TimestampFieldType},
	{Name: "origin", Type: bigquery.StringFieldType},
	{Name: "destination", Type: bigquery.StringFieldType},
	{Name: "amount", Type: bigquery.NumericFieldType},
}
metaData := &bigquery.TableMetadata{
	Schema: sampleSchema,
	TimePartitioning: &bigquery.TimePartitioning{
		Field:      "timestamp",
		Expiration: 90 * 24 * time.Hour,
	},
	Clustering: &bigquery.Clustering{
		Fields: []string{"origin", "destination"},
	},
}
tableRef := client.Dataset(datasetID).Table(tableID)
if err := tableRef.Create(ctx, metaData); err != nil {
	return err
}

Creating a clustered table from a query result

There are two ways to create a clustered table from a query result:

  • Write the results to a new destination table and specify the clustering columns. This method is discussed below.
  • Use a DDL CREATE TABLE AS SELECT statement. For more information on this method, see Creating a clustered table from the result of a query on the "Using Data Definition Language Statements" page.

You can create a clustered table by querying either a partitioned table or a non-partitioned table. You cannot change an existing table to a clustered table using query results.

When you create a clustered table from a query result, you must use standard SQL. Currently, legacy SQL is not supported for querying clustered tables or for writing query results to clustered tables.

Web UI

You cannot specify clustering options for a destination table when you query data using the BigQuery web UI unless you use a DDL statement. For more information, see Using Data Definition Language Statements.

CLI

Enter the bq query command, and specify the following flags:

  • Specify the --use_legacy_sql=false flag to use standard SQL syntax.
  • Supply the --location flag and set the value to your location.

Enter the following command to create a new, clustered destination table from a query result:

    bq --location=[LOCATION] query --use_legacy_sql=false '[QUERY]'

Where:

  • [LOCATION] is the name of your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [QUERY] is a query in standard SQL syntax. Currently, you cannot use legacy SQL to query clustered tables or to write query results to clustered tables. The query can contain a CREATE TABLE DDL statement that specifies the options for creating your clustered table. You can use DDL rather than specifying the individual command-line flags.

Examples:

Enter the following command to write query results to a clustered destination table named myclusteredtable in mydataset. mydataset is in your default project. The query retrieves data from a non-partitioned table — mytable. The table's customer_id column is used to cluster the table. The table's timestamp column is used to create a partitioned table.

bq --location=US query --use_legacy_sql=false 'CREATE TABLE mydataset.myclusteredtable PARTITION BY DATE(timestamp) CLUSTER BY customer_id AS SELECT * FROM mydataset.mytable'

API

To save query results to a clustered table, call the jobs.insert method, configure a query job, and include a CREATE TABLE DDL statement that creates your clustered table.

Specify your location in the location property in the jobReference section of the job resource.

Creating a clustered table when you load data

You can create a clustered table by specifying clustering columns when you load data into a new table. You do not need to create an empty table before loading data into it. You can create the clustered table and load your data at the same time.

For more information on loading data, see Introduction to Loading Data into BigQuery.

To specify clustering when you configure a load job:

API

To define the clustering configuration when creating a table via a load job, populate the configuration.load.clustering property, setting configuration.load.clustering.fields to up to four clustering columns in prioritized order.
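Sketched as the REST JSON for a jobs.insert load job, the relevant properties might look like the following (the bucket path, project, dataset, and table names are placeholders):

```json
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://mybucket/transactions.csv"],
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "myclusteredtable"
      },
      "timePartitioning": {
        "type": "DAY",
        "field": "timestamp"
      },
      "clustering": {
        "fields": ["origin", "destination"]
      }
    }
  }
}
```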

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import (
// 	"context"
// 	"fmt"
//
// 	"cloud.google.com/go/bigquery"
// )
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/sample-transactions/transactions.csv")
gcsRef.SkipLeadingRows = 1
gcsRef.Schema = bigquery.Schema{
	{Name: "timestamp", Type: bigquery.TimestampFieldType},
	{Name: "origin", Type: bigquery.StringFieldType},
	{Name: "destination", Type: bigquery.StringFieldType},
	{Name: "amount", Type: bigquery.NumericFieldType},
}
loader := client.Dataset(destDatasetID).Table(destTableID).LoaderFrom(gcsRef)
loader.TimePartitioning = &bigquery.TimePartitioning{
	Field: "timestamp",
}
loader.Clustering = &bigquery.Clustering{
	Fields: []string{"origin", "destination"},
}
loader.WriteDisposition = bigquery.WriteEmpty

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Controlling access to clustered tables

You cannot assign access controls directly to clustered tables or to partitions. You can control table access by configuring access controls at the dataset level or at the project level.

Dataset-level access controls specify the operations users, groups, and service accounts are allowed to perform on tables in that specific dataset. If you assign only dataset-level permissions, you must also assign a primitive or predefined, project-level role that provides access to the project, for example, bigquery.user.

Instead of granting access to individual datasets, you can assign predefined, project-level IAM roles that grant permissions to all table data in all datasets in a project.

You can also create IAM custom roles. If you create a custom role, the permissions you grant depend on the table operations you want the user, group, or service account to be able to perform.

For more information on roles and permissions, see Access Control.

Using clustered tables

Getting information about clustered tables

You can get information about tables by using the BigQuery web UI, by using the bq show CLI command, or by calling the tables.get API method.

Required permissions

To get information about tables, you must be assigned the READER role on the dataset, or you must be assigned a project-level IAM role that includes bigquery.tables.get permissions. If you are granted bigquery.tables.get permissions at the project level, you can get information about all tables in the project. All predefined, project-level IAM roles include bigquery.tables.get permissions except for bigquery.jobUser and bigquery.user.

In addition, a user assigned the bigquery.user role has bigquery.datasets.create permissions. This allows a user assigned to the bigquery.user role to get information about tables in any dataset that user creates. When a user assigned to the bigquery.user role creates a dataset, that user is given OWNER access to the dataset. OWNER access to a dataset gives the user full control over it and all the tables in it.

For more information on IAM roles and permissions in BigQuery, see Access Control. For more information on dataset-level roles, see Primitive roles for datasets.

Getting clustered table information

To view information about a clustered table:

Web UI

  1. In the navigation pane, click the down arrow icon to the left of your dataset to expand it, or double-click the dataset name. This displays the tables and views in the dataset.

  2. Click the table name.

  3. Click Details. The Table Details page displays the table's details including the clustering columns.

    Clustered table details

Command-line

Issue the bq show command to display all table information. Use the --schema flag to display only table schema information. The --format flag can be used to control the output.

If you are getting information about a table in a project other than your default project, add the project ID to the dataset in the following format: [PROJECT_ID]:[DATASET].

bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE]

Where:

  • [PROJECT_ID] is your project ID.
  • [DATASET] is the name of the dataset.
  • [TABLE] is the name of the table.

Examples:

Enter the following command to display all information about myclusteredtable in mydataset. mydataset is in your default project.

bq show --format=prettyjson mydataset.myclusteredtable

The output should look like the following:

{
  "clustering": {
    "fields": [
      "customer_id"
    ]
  },
...
}
   

API

Call the tables.get method and provide any relevant parameters.

Listing clustered tables in a dataset

You can list clustered tables in datasets by using the BigQuery web UI, by using the bq ls CLI command, or by calling the tables.list API method.

The permissions required to list clustered tables, and the steps to list them are the same as for partitioned tables. For more information on listing tables, see Listing partitioned tables in a dataset.

Features under development

The following features are being developed but are not currently available in the alpha release:

  • Support for clustering native (non-partitioned) tables.
  • Reducing the costs for particular types of queries that use filters on clustering columns.
