创建聚簇表

您可以在 BigQuery 中使用聚簇表来减少查询处理的数据量。

对于聚簇表，系统会根据指定列（也称为聚簇列）的值来整理表数据。BigQuery 会按聚簇列对数据进行排序，然后将具有相似值的行存储在相同或附近的物理块中。当查询按聚簇列进行过滤时，BigQuery 只会高效地扫描相关区块，并跳过与过滤条件不匹配的数据。

详情请参阅以下内容：

如需详细了解 BigQuery 中的聚簇表，请参阅聚簇表简介。
如需了解如何使用聚簇表以及如何控制对聚簇表的访问权限，请参阅管理聚簇表。

准备工作

所需的角色

如需获得创建表所需的权限，请让管理员向您授予以下 IAM 角色：

针对项目的 BigQuery Job User (roles/bigquery.jobUser) - 如果您要通过加载数据或将查询结果保存到表来创建表。
针对要在其中创建表的数据集的 BigQuery Data Editor (roles/bigquery.dataEditor)。

如需详细了解如何授予角色，请参阅管理对项目、文件夹和组织的访问权限。

这些预定义角色包含创建表所需的权限。如需查看所需的确切权限，请展开所需权限部分：

所需权限

如需创建表，需要以下权限：

bigquery.tables.create，针对您要在其中创建表的数据集。
bigquery.tables.getData，针对查询引用的所有表和视图（如果您要将查询结果保存为表）。
bigquery.jobs.create，针对项目（如果您要通过加载数据或将查询结果保存到表中来创建表）。
bigquery.tables.updateData，针对表（如果您要使用查询结果附加到表或覆盖表）。

您也可以使用自定义角色或其他预定义角色来获取这些权限。

表命名要求

在 BigQuery 中创建表时，每个数据集的表名称必须是唯一的。表名称可：

包含的字符总数不超过 1024 个 UTF-8 字节。
包含类别 L（字母）、M（符号）、N（数字）、Pc（连接符，包括下划线）、Pd（短划线）、Zs（空格）中的 Unicode 字符。如需了解详情，请参阅常规类别。

以下是有效表名称的所有示例：table 01、ग्राहक、00_お客様、étudiant-01。

注意事项：

默认情况下，表名称区分大小写。mytable 和 MyTable 可以位于同一数据集中，除非它们属于关闭区分大小写的数据集。
某些表名称和表名称前缀已被预留。如果您收到错误，表示您的表名称或前缀已被预留，请选择其他名称并重试。
如果您在序列中添加多个点运算符 (.)，则系统会隐式删除重复的运算符。

例如，project_name....dataset_name..table_name

会变为 project_name.dataset_name.table_name

聚簇列要求

在 BigQuery 中创建表时，您可以指定用于创建聚簇表的列。创建表后，您可以修改用于创建聚簇表的列。如需了解详情，请参阅修改聚簇规范。

聚簇列必须是以下任一数据类型的顶级非重复列：

BIGNUMERIC
BOOL
DATE
DATETIME
GEOGRAPHY
INT64
NUMERIC
RANGE
STRING
TIMESTAMP

您最多可以指定四个聚簇列。指定多个列时，对列进行指定的顺序决定了数据的排序方式。例如，如果表按 a、b 和 c 列进行聚簇，则数据按相同顺序排序：首先按 a 列排序，然后按 b 列排序，最后按 c 列排序。最佳做法是将最常过滤或聚合的列放在首位。

聚簇列的顺序也会影响查询性能和价格。如需详细了解查询聚簇表的最佳做法，请参阅查询聚簇表。

创建遵循架构定义的空聚簇表

要创建遵循某个架构定义的空聚簇表，请执行以下操作：

控制台

在 Google Cloud 控制台中，前往 BigQuery 页面。

转到 BigQuery
在浏览器窗格中，展开您的项目，然后选择数据集。
在数据集信息部分中，点击 创建表。
在创建表面板中，指定以下详细信息：

在来源部分，在基于以下数据源创建表列表中选择空表。
在目标部分，指定以下详细信息：
1. 在数据集部分，选择您要在其中创建表的数据集。
2. 在表字段中，输入您要创建的表的名称。
3. 确认表类型字段是否设置为原生表。
在架构部分，输入架构定义。您可以使用以下任一方法手动输入架构信息：
- 选项 1：点击以文本形式修改，并以 JSON 数组的形式粘贴架构。使用 JSON 数组时，您要使用与创建 JSON 架构文件相同的流程生成架构。您可以输入以下命令，以 JSON 格式查看现有表的架构：
```
    bq show --format=prettyjson dataset.table
    
```
- 选项 2：点击 添加字段，然后输入表架构。指定每个字段的名称、类型和模式。
对于聚簇顺序，请输入一至四个以英文逗号分隔的列名称。
可选：如果要使用客户管理的加密密钥，在高级选项部分，选择使用客户管理的加密密钥 (CMEK) 选项。默认情况下，BigQuery 会使用 Google-owned and Google-managed encryption key对以静态方式存储的客户内容进行加密。
点击创建表。

SQL

使用带有 CLUSTER BY 选项的 CREATE TABLE DDL 语句命令。以下示例在 mydataset 中创建名为 myclusteredtable 的聚簇表：

在 Google Cloud 控制台中，前往 BigQuery 页面。

转到 BigQuery

在查询编辑器中，输入以下语句：

CREATE TABLE mydataset.myclusteredtable
(
  customer_id STRING,
  transaction_amount NUMERIC
)
CLUSTER BY
  customer_id
  OPTIONS (
    description = 'a table clustered by customer_id');

点击运行。

如需详细了解如何运行查询，请参阅运行交互式查询。

bq

使用带有以下标志的 bq mk 命令：

--table（或 -t 快捷方式）。
--schema。您能够以内嵌方式提供表的架构定义或使用 JSON 架构文件。
--clustering_fields。您最多可以指定四个聚簇列。

可选参数包括 --expiration、--description、--time_partitioning_type、--time_partitioning_field、--time_partitioning_expiration、--destination_kms_key 和 --label。

如果您要在非默认项目中创建表，请按以下格式将相应项目 ID 添加到数据集中：project_id:dataset。

本文未演示 --destination_kms_key。要了解如何使用 --destination_kms_key，请参阅客户管理的加密密钥。

输入以下命令可创建包括架构定义的空聚簇表：

bq mk \
    --table \
    --expiration INTEGER1 \
    --schema SCHEMA \
    --clustering_fields CLUSTER_COLUMNS \
    --description "DESCRIPTION" \
    --label KEY:VALUE,KEY:VALUE \
    PROJECT_ID:DATASET.TABLE

替换以下内容：

INTEGER1：表的默认生命周期（以秒为单位）。最小值为 3600 秒（一小时）。到期时间以当前世界协调时间 (UTC) 加上这个整数值为准。如果您在创建表时设置了该表的到期时间，则系统会忽略数据集的默认表到期时间设置。如果设置了此值，系统将在指定的时间后删除表。
SCHEMA：采用 COLUMN:DATA_TYPE,COLUMN:DATA_TYPE 格式的内嵌架构定义，或者是本地机器上 JSON 架构文件的路径。
CLUSTER_COLUMNS：英文逗号分隔的聚簇列（最多 4 个）列表。该列表不能包含任何空格。
DESCRIPTION：加英文引号的表说明。
KEY:VALUE：代表标签的键值对。您可以使用英文逗号分隔列表输入多个标签。
PROJECT_ID：您的项目 ID。
DATASET：项目中的数据集。
TABLE：您要创建的表的名称。

在命令行中指定架构时，您不能添加 RECORD (STRUCT) 类型和列说明，也不能指定列的模式。所有模式均默认为 NULLABLE。如需添加说明、模式和 RECORD 类型，请改为提供 JSON 架构文件。

示例：

输入以下命令，在默认项目的 mydataset 中创建一个名为 myclusteredtable 的聚簇表。该表的到期时间设为 2,592,000（1 个月/30 天），说明设为 This is my clustered table，标签设为 organization:development。该命令使用 -t 快捷键代替 --table。

该架构以内嵌方式指定为：timestamp:timestamp,customer_id:string,transaction_amount:float。指定的聚簇字段 customer_id 用于聚簇表。

bq mk \
    -t \
    --expiration 2592000 \
    --schema 'timestamp:timestamp,customer_id:string,transaction_amount:float' \
    --clustering_fields customer_id \
    --description "This is my clustered table" \
    --label org:dev \
    mydataset.myclusteredtable

输入以下命令，在 myotherproject（非默认项目）中创建一个名为 myclusteredtable 的聚簇表。说明设为 This is my clustered table，标签设为 organization:development。该命令使用 -t 快捷键代替 --table。此命令没有指定表过期时间。如果相关数据集有默认的表过期时间，则系统会应用该过期时间。如果数据集没有默认的表过期时间，则该表将永不过期。

相应架构在如下本地 JSON 文件中指定：/tmp/myschema.json。customer_id 字段用于聚簇表。

bq mk \
    -t \
    --expiration 2592000 \
    --schema /tmp/myschema.json \
    --clustering_fields=customer_id \
    --description "This is my clustered table" \
    --label org:dev \
    myotherproject:mydataset.myclusteredtable

创建表后，您可以更新表的说明和标签。

Terraform

使用 google_bigquery_table 资源。

如需向 BigQuery 进行身份验证，请设置应用默认凭据。如需了解详情，请参阅为客户端库设置身份验证。

以下示例会创建一个名为 mytable 的表，该表按 ID 和 Created 列进行聚簇：

resource "google_bigquery_dataset" "default" {
  dataset_id                      = "mydataset"
  default_partition_expiration_ms = 2592000000  # 30 days
  default_table_expiration_ms     = 31536000000 # 365 days
  description                     = "dataset description"
  location                        = "US"
  max_time_travel_hours           = 96 # 4 days

  labels = {
    billing_group = "accounting",
    pii           = "sensitive"
  }
}

resource "google_bigquery_table" "default" {
  dataset_id          = google_bigquery_dataset.default.dataset_id
  table_id            = "mytable"
  deletion_protection = false # set to "true" in production

  clustering = ["ID", "Created"]

  schema = <<EOF
[
  {
    "name": "ID",
    "type": "INT64",
    "description": "Item ID"
  },
  {
    "name": "Item",
    "type": "STRING",
    "mode": "NULLABLE"
  },
 {
   "name": "Created",
   "type": "TIMESTAMP"
 }
]
EOF

}

如需在 Google Cloud 项目中应用 Terraform 配置，请完成以下部分中的步骤。

准备 Cloud Shell

启动 Cloud Shell。
设置要应用 Terraform 配置的默认 Google Cloud 项目。

您只需为每个项目运行一次以下命令，即可在任何目录中运行它。
```
export GOOGLE_CLOUD_PROJECT=PROJECT_ID
```
如果您在 Terraform 配置文件中设置显式值，则环境变量会被替换。

准备目录

每个 Terraform 配置文件都必须有自己的目录（也称为“根模块”）。

在 Cloud Shell 中，创建一个目录，并在该目录中创建一个新文件。文件名必须具有 .tf 扩展名，例如 main.tf。在本教程中，该文件称为 main.tf。
```
mkdir DIRECTORY && cd DIRECTORY && touch main.tf
```
如果您按照教程进行操作，可以在每个部分或步骤中复制示例代码。

将示例代码复制到新创建的 main.tf 中。

（可选）从 GitHub 中复制代码。如果端到端解决方案包含 Terraform 代码段，则建议这样做。
查看和修改要应用到您的环境的示例参数。
保存更改。
初始化 Terraform。您只需为每个目录执行一次此操作。
```
terraform init
```
（可选）如需使用最新的 Google 提供程序版本，请添加 -upgrade 选项：
```
terraform init -upgrade
```

应用更改

查看配置并验证 Terraform 将创建或更新的资源是否符合您的预期：
```
terraform plan
```
根据需要更正配置。
通过运行以下命令并在提示符处输入 yes 来应用 Terraform 配置：
```
terraform apply
```
等待 Terraform 显示“应用完成！”消息。
打开您的 Google Cloud 项目以查看结果。在 Google Cloud 控制台的界面中找到资源，以确保 Terraform 已创建或更新它们。

API

使用指定了 clustering.fields 属性和 schema 属性的已定义表资源调用 tables.insert 方法。

Python

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Python 设置说明进行操作。如需了解详情，请参阅 BigQuery Python API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

schema = [
    bigquery.SchemaField("full_name", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("zipcode", "INTEGER"),
]

table = bigquery.Table(table_id, schema=schema)
table.clustering_fields = ["city", "zipcode"]
table = client.create_table(table)  # Make an API request.
print(
    "Created clustered table {}.{}.{}".format(
        table.project, table.dataset_id, table.table_id
    )
)

Go

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Go 设置说明进行操作。如需了解详情，请参阅 BigQuery Go API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/bigquery"
)

// createTableClustered demonstrates creating a BigQuery table with advanced properties like
// partitioning and clustering features.
func createTableClustered(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydatasetid"
	// tableID := "mytableid"
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	sampleSchema := bigquery.Schema{
		{Name: "timestamp", Type: bigquery.TimestampFieldType},
		{Name: "origin", Type: bigquery.StringFieldType},
		{Name: "destination", Type: bigquery.StringFieldType},
		{Name: "amount", Type: bigquery.NumericFieldType},
	}
	metaData := &bigquery.TableMetadata{
		Schema: sampleSchema,
		TimePartitioning: &bigquery.TimePartitioning{
			Field:      "timestamp",
			Expiration: 90 * 24 * time.Hour,
		},
		Clustering: &bigquery.Clustering{
			Fields: []string{"origin", "destination"},
		},
	}
	tableRef := client.Dataset(datasetID).Table(tableID)
	if err := tableRef.Create(ctx, metaData); err != nil {
		return err
	}
	return nil
}

Java

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Java 设置说明进行操作。如需了解详情，请参阅 BigQuery Java API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;
import com.google.common.collect.ImmutableList;

public class CreateClusteredTable {
  public static void runCreateClusteredTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    createClusteredTable(datasetName, tableName);
  }

  public static void createClusteredTable(String datasetName, String tableName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      TimePartitioning partitioning = TimePartitioning.of(TimePartitioning.Type.DAY);

      Schema schema =
          Schema.of(
              Field.of("name", StandardSQLTypeName.STRING),
              Field.of("post_abbr", StandardSQLTypeName.STRING),
              Field.of("date", StandardSQLTypeName.DATE));

      Clustering clustering =
          Clustering.newBuilder().setFields(ImmutableList.of("name", "post_abbr")).build();

      StandardTableDefinition tableDefinition =
          StandardTableDefinition.newBuilder()
              .setSchema(schema)
              .setTimePartitioning(partitioning)
              .setClustering(clustering)
              .build();
      TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();

      bigquery.create(tableInfo);
      System.out.println("Clustered table created successfully");
    } catch (BigQueryException e) {
      System.out.println("Clustered table was not created. \n" + e.toString());
    }
  }
}

基于查询结果创建聚簇表

您可以通过以下两种方法，基于查询结果创建聚簇表：

将结果写入新目标表并指定聚簇列。
使用 DDL CREATE TABLE AS SELECT 语句创建。如需详细了解此方法，请参阅使用数据定义语言语句页面上的从查询结果创建聚簇表。

您可以通过查询分区表或非分区表来创建聚簇表。但不能使用查询结果将现有表更改为聚簇表。

基于查询结果创建聚簇表时，您必须使用标准 SQL。目前，无法使用旧版 SQL 查询聚簇表或将查询结果写入聚簇表。

SQL

如需基于查询结果创建聚簇表，请使用带有 CLUSTER BY 选项的 CREATE TABLE DDL 语句。以下示例通过查询现有的未聚簇表创建按 customer_id 聚簇的新表：

在 Google Cloud 控制台中，前往 BigQuery 页面。

转到 BigQuery

在查询编辑器中，输入以下语句：

CREATE TABLE mydataset.clustered_table
(
  customer_id STRING,
  transaction_amount NUMERIC
)
CLUSTER BY
  customer_id
AS (
  SELECT * FROM mydataset.unclustered_table
);

点击运行。

如需详细了解如何运行查询，请参阅运行交互式查询。

bq

输入以下命令，基于查询结果创建一个新的聚簇目标表：

bq --location=LOCATION query \
    --use_legacy_sql=false 'QUERY'

请替换以下内容：

LOCATION：您所在位置的名称。--location 是可选标志。例如，如果您在东京区域使用 BigQuery，可将该标志的值设置为 asia-northeast1。您可以使用 .bigqueryrc 文件设置位置的默认值。
QUERY：采用 GoogleSQL 语法的查询。目前，无法使用旧版 SQL 查询聚簇表或将查询结果写入聚簇表中。该查询可以包含 CREATE TABLE DDL 语句，以便指定用于创建聚簇表的选项。您可以使用 DDL 而不是指定单独的命令行标志。

示例：

输入以下命令可将查询结果写入 mydataset 中名为 myclusteredtable 的聚簇目标表。mydataset 属于默认项目。该查询从非分区表 mytable 中检索数据。该表的 customer_id 列用于创建聚簇该表。该表的 timestamp 列用于创建分区表。

bq query --use_legacy_sql=false \
    'CREATE TABLE
       mydataset.myclusteredtable
     PARTITION BY
       DATE(timestamp)
     CLUSTER BY
       customer_id
     AS (
       SELECT
         *
       FROM
         `mydataset.mytable`
     );'

API

要将查询结果保存到聚簇表，请调用 jobs.insert 方法，配置一个 query 作业，并添加一个用于创建聚簇表的 CREATE TABLE DDL 语句。

在作业资源 jobReference 部分的 location 属性中指定您的位置。

在加载数据时创建聚簇表

您可通过在向新表加载数据的过程中指定聚簇列来创建聚簇表。无需先创建空表再加载数据，创建聚簇表和加载数据可同时进行。

要详细了解如何加载数据，请参阅将数据加载到 BigQuery 简介。

在定义加载作业时定义聚簇：

SQL

使用 LOAD DATA 语句。以下示例会加载 AVRO 数据，以创建按 transaction_date 字段分区并按 customer_id 字段聚簇的表。此外，它还会将分区配置为在 3 天后过期。

在 Google Cloud 控制台中，前往 BigQuery 页面。

转到 BigQuery

在查询编辑器中，输入以下语句：

LOAD DATA INTO mydataset.mytable
PARTITION BY transaction_date
CLUSTER BY customer_id
  OPTIONS (
    partition_expiration_days = 3)
FROM FILES(
  format = 'AVRO',
  uris = ['gs://bucket/path/file.avro']);

点击运行。

如需详细了解如何运行查询，请参阅运行交互式查询。

API

要在通过加载作业创建表时定义聚簇配置，您可以为该表填充 Clustering 属性。

Go

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Go 设置说明进行操作。如需了解详情，请参阅 BigQuery Go API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importClusteredTable demonstrates creating a table from a load job and defining partitioning and clustering
// properties.
func importClusteredTable(projectID, destDatasetID, destTableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/sample-transactions/transactions.csv")
	gcsRef.SkipLeadingRows = 1
	gcsRef.Schema = bigquery.Schema{
		{Name: "timestamp", Type: bigquery.TimestampFieldType},
		{Name: "origin", Type: bigquery.StringFieldType},
		{Name: "destination", Type: bigquery.StringFieldType},
		{Name: "amount", Type: bigquery.NumericFieldType},
	}
	loader := client.Dataset(destDatasetID).Table(destTableID).LoaderFrom(gcsRef)
	loader.TimePartitioning = &bigquery.TimePartitioning{
		Field: "timestamp",
	}
	loader.Clustering = &bigquery.Clustering{
		Fields: []string{"origin", "destination"},
	}
	loader.WriteDisposition = bigquery.WriteEmpty

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}

Java

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Java 设置说明进行操作。如需了解详情，请参阅 BigQuery Java API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TimePartitioning;
import com.google.common.collect.ImmutableList;

public class LoadTableClustered {

  public static void runLoadTableClustered() throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "/path/to/file.csv";
    loadTableClustered(datasetName, tableName, sourceUri);
  }

  public static void loadTableClustered(String datasetName, String tableName, String sourceUri)
      throws Exception {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableId tableId = TableId.of(datasetName, tableName);

      Schema schema =
          Schema.of(
              Field.of("name", StandardSQLTypeName.STRING),
              Field.of("post_abbr", StandardSQLTypeName.STRING),
              Field.of("date", StandardSQLTypeName.DATE));

      TimePartitioning partitioning = TimePartitioning.of(TimePartitioning.Type.DAY);

      Clustering clustering =
          Clustering.newBuilder().setFields(ImmutableList.of("name", "post_abbr")).build();

      LoadJobConfiguration loadJobConfig =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.csv())
              .setSchema(schema)
              .setTimePartitioning(partitioning)
              .setClustering(clustering)
              .build();

      Job loadJob = bigquery.create(JobInfo.newBuilder(loadJobConfig).build());

      // Load data from a GCS parquet file into the table
      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = loadJob.waitFor();

      // Check for errors
      if (completedJob == null) {
        throw new Exception("Job not executed since it no longer exists.");
      } else if (completedJob.getStatus().getError() != null) {
        // You can also look at queryJob.getStatus().getExecutionErrors() for all
        // errors, not just the latest one.
        throw new Exception(
            "BigQuery was unable to load into the table due to an error: \n"
                + loadJob.getStatus().getError());
      }
      System.out.println("Data successfully loaded into clustered table during load job");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Data not loaded into clustered table during load job \n" + e.toString());
    }
  }
}

Python

试用此示例之前，请按照 BigQuery 快速入门：使用客户端库中的 Python 设置说明进行操作。如需了解详情，请参阅 BigQuery Python API 参考文档。

如需向 BigQuery 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为客户端库设置身份验证。

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("timestamp", bigquery.SqlTypeNames.TIMESTAMP),
        bigquery.SchemaField("origin", bigquery.SqlTypeNames.STRING),
        bigquery.SchemaField("destination", bigquery.SqlTypeNames.STRING),
        bigquery.SchemaField("amount", bigquery.SqlTypeNames.NUMERIC),
    ],
    time_partitioning=bigquery.TimePartitioning(field="timestamp"),
    clustering_fields=["origin", "destination"],
)

job = client.load_table_from_uri(
    ["gs://cloud-samples-data/bigquery/sample-transactions/transactions.csv"],
    table_id,
    job_config=job_config,
)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

后续步骤

如需了解如何使用聚簇表，请参阅管理聚簇表。
要了解如何查询聚簇表，请参阅查询聚簇表。
如需大致了解 BigQuery 中的分区表支持，请参阅分区表简介。
如需了解如何创建分区表，请参阅创建分区表。
如需查看 INFORMATION_SCHEMA 的概览，请参阅 BigQuery INFORMATION_SCHEMA 简介。