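Migrate Amazon Redshift data with a VPC network

This document explains how to migrate data from Amazon Redshift to BigQuery by using a VPC.

If you have a private Amazon Redshift instance in AWS, you can migrate that data to BigQuery by creating a virtual private cloud (VPC) network and connecting it with the Amazon Redshift VPC network. The data migration process works as follows:

You create a VPC network in the project that you want to use for the transfer. The VPC network can't be a Shared VPC network.
You set up a virtual private network (VPN) and connect your project VPC network and the Amazon Redshift VPC network.
You specify your project VPC network and a reserved IP address range when you set up the transfer.
BigQuery Data Transfer Service creates a tenant project and attaches it to the project that you use for the transfer.
BigQuery Data Transfer Service creates a VPC network with one subnet in the tenant project, using the reserved IP address range that you specified.
BigQuery Data Transfer Service creates VPC peering between your project VPC network and the tenant project VPC network.
The BigQuery Data Transfer Service migration runs in the tenant project. It triggers an unload operation from Amazon Redshift to a staging area in an Amazon S3 bucket. The unload speed depends on your cluster configuration.
The BigQuery Data Transfer Service migration then transfers your data from the Amazon S3 bucket to BigQuery.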

If you want to transfer data from your Amazon Redshift instance through public IP addresses instead, follow the standard Amazon Redshift migration instructions.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the BigQuery and BigQuery Data Transfer Service APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

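Set required permissions

Before you create an Amazon Redshift transfer, follow these steps:

Ensure that the person creating the transfer has the following required Identity and Access Management (IAM) permissions in BigQuery:
- bigquery.transfers.update permission to create the transfer
- bigquery.datasets.update permission on the target dataset
The roles/bigquery.admin predefined IAM role includes the bigquery.transfers.update and bigquery.datasets.update permissions. For more information about IAM roles in BigQuery Data Transfer Service, see Access control.
Consult the Amazon S3 documentation to ensure that you have configured all permissions needed to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it.
Grant the individual who sets up the transfer the IAM permissions to create and delete VPC Network Peering. The service uses that individual's Google Cloud user credentials to create the VPC peering connection.
- Permission to create VPC peering: compute.networks.addPeering
- Permission to delete VPC peering: compute.networks.removePeering
The roles/owner, roles/editor, and roles/compute.networkAdmin predefined IAM roles include the compute.networks.addPeering and compute.networks.removePeering permissions by default.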
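
The four permissions listed above can be collected into a simple checklist. The following is a minimal sketch, not part of any Google Cloud library: the helper name `missing_permissions` is hypothetical, and it assumes you have already obtained the list of permissions granted to the user by some other means.

```python
# Permissions named in this section. The helper below is a hypothetical
# checklist utility; it does not call any Google Cloud API.
REQUIRED_PERMISSIONS = frozenset({
    "bigquery.transfers.update",
    "bigquery.datasets.update",
    "compute.networks.addPeering",
    "compute.networks.removePeering",
})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


# Example: a user who can create transfers and add peerings only.
print(missing_permissions([
    "bigquery.transfers.update",
    "compute.networks.addPeering",
]))
```

An empty result means that every permission named in this section is present in the list you supplied.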

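Create a dataset

Create a BigQuery dataset to store your data. You don't need to create any tables.

Grant access to your Amazon Redshift cluster

Configure the security group rules of your private Amazon Redshift cluster so that the private IP address range used for the transfer is on the allowlist. You define this private IP address range in the VPC network in a later step, when you set up the transfer.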

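Grant access to your Amazon S3 bucket

You must have an Amazon S3 bucket to use as a staging area for transferring the Amazon Redshift data to BigQuery. For detailed instructions, see the Amazon documentation.

We recommend that you create a dedicated Amazon IAM user and grant that user read-only access to Amazon Redshift and read and write access to Amazon S3. To do this, apply the corresponding AWS policies to that user.
Create an Amazon IAM user access key pair.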

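Configure workload control with a separate migration queue

Optionally, you can define an Amazon Redshift queue that is dedicated to migration, which limits and isolates the resources that the migration uses. You can configure this migration queue with a maximum concurrent query count. You can then associate a specific migration user group with the queue and use those credentials when you set up the migration to transfer data to BigQuery. The transfer service then has access only to the migration queue.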
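
As an illustration of such a queue, the following sketch builds a two-queue workload management definition as JSON. It assumes manual workload management and the `user_group` and `query_concurrency` properties of Amazon Redshift's `wlm_json_configuration` parameter; the group name `migration_users` and the concurrency values are hypothetical, and the result should be checked against the Amazon Redshift documentation before use.

```python
import json


def build_wlm_configuration(migration_user_group, migration_concurrency,
                            default_concurrency=5):
    """Build a two-queue WLM definition: a migration queue, then the default queue.

    Assumes the property names used by Redshift's wlm_json_configuration
    parameter. All values passed in are illustrative.
    """
    if migration_concurrency < 1 or default_concurrency < 1:
        raise ValueError("concurrency values must be at least 1")
    return [
        # Queue reserved for members of the migration user group.
        {
            "user_group": [migration_user_group],
            "query_concurrency": migration_concurrency,
        },
        # The last queue in the list acts as the default queue.
        {"query_concurrency": default_concurrency},
    ]


wlm_json = json.dumps(build_wlm_configuration("migration_users", 2))
print(wlm_json)
```

The database user whose credentials you give to the transfer would then need to belong to the migration user group for its queries to be routed to the first queue.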

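Gather transfer information

Gather the information that you need to set up the migration with BigQuery Data Transfer Service:

Get the VPC and the reserved IP address range in Amazon Redshift.

Follow the instructions for getting the JDBC URL.
Get the username and password of a user with appropriate permissions on your Amazon Redshift database.
Follow the instructions in Grant access to your Amazon S3 bucket to get an AWS access key pair.
Get the URI of the Amazon S3 bucket that you want to use for the transfer. We recommend that you set a lifecycle policy on this bucket to avoid unnecessary charges. The recommended expiration time is 24 hours, which allows enough time to transfer all data to BigQuery.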
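
The recommended 24-hour expiration can be expressed as a one-day S3 lifecycle rule, because S3 lifecycle expirations are specified in whole days. The sketch below only builds the rule as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The rule ID and the `redshift-staging/` prefix are hypothetical.

```python
def build_staging_lifecycle(prefix="", expiration_days=1):
    """Return an S3 lifecycle configuration that expires staged objects.

    An empty prefix applies the rule to the whole bucket. The rule ID is
    an illustrative name, not one required by any service.
    """
    if expiration_days < 1:
        raise ValueError("S3 lifecycle expiration is expressed in whole days")
    return {
        "Rules": [
            {
                "ID": "expire-redshift-staging-files",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


lifecycle = build_staging_lifecycle(prefix="redshift-staging/")
print(lifecycle)
```

With boto3 installed and AWS credentials configured, this dictionary could be passed as `LifecycleConfiguration` together with your bucket name in `Bucket`.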

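Evaluate your data

During the data transfer process, BigQuery Data Transfer Service writes data from Amazon Redshift to Cloud Storage as CSV files. If these files contain the ASCII 0 character, they can't be loaded into BigQuery. We recommend that you assess your data to determine whether this is an issue for you. If it is, you can work around it by exporting your data to Amazon S3 as Parquet files, and then importing those files by using BigQuery Data Transfer Service. For more information, see Overview of Amazon S3 transfers.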
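
One way to assess a sample of your data is to scan it for the ASCII 0 (NUL) character. The following sketch assumes that you already have a sample of the table exported as text lines; the function name `find_nul_lines` and the sample rows are hypothetical.

```python
def find_nul_lines(lines):
    """Return the 1-based numbers of the lines that contain an ASCII 0 character."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if "\x00" in line
    ]


# Illustrative sample: the second row contains an embedded NUL character.
sample_rows = ["1,alice\n", "2,bo\x00b\n", "3,carol\n"]
print(find_nul_lines(sample_rows))
```

Because the function accepts any iterable of strings, an open text file can be passed in directly, which avoids reading a large export into memory at once.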

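Set up the VPC network and the VPN

Ensure that you have permission to enable VPC peering. For more information, see Set required permissions.
Follow the instructions in this guide to set up a Google Cloud VPC network, set up a VPN between your Google Cloud project's VPC network and the Amazon Redshift VPC network, and enable VPC peering.

Note: The service uses your VPC network name as the VPC peering connection name, so make sure that no existing VPC peering connection already uses that name.
Configure Amazon Redshift to allow connections to your VPN. For more information, see Amazon Redshift cluster security groups.
In the Google Cloud console, go to the VPC networks page to verify that your Google Cloud VPC network exists in your Google Cloud project and is connected to Amazon Redshift through the VPN.

Go to VPC networks

The console page lists all of your VPC networks.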

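Set up an Amazon Redshift transfer

Use the following instructions to set up an Amazon Redshift transfer:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
Click Data transfers.
Click Create transfer.
In the Source type section, select Migration: Amazon Redshift from the Source list.
In the Transfer config name section, enter a name for the transfer, such as My migration, in the Display name field. The display name can be any value that lets you easily identify the transfer if you need to modify it later.
In the Destination settings section, choose the dataset that you created from the Dataset list.
In the Data source details section, do the following: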
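1. For JDBC connection url for Amazon Redshift, provide the JDBC URL used to access your Amazon Redshift cluster.
2. For Username of your database, enter the username for the Amazon Redshift database that you want to migrate.
3. For Password of your database, enter the database password.
  
  Note: By providing your Amazon credentials, you agree that BigQuery Data Transfer Service acts as your agent for the sole purpose of accessing your data to perform the transfer.
4. For Access key ID and Secret access key, enter the access key pair that you obtained in Grant access to your Amazon S3 bucket.
5. For Amazon S3 URI, enter the URI of the S3 bucket that you use as a staging area.
6. For Amazon Redshift Schema, enter the Amazon Redshift schema that you want to migrate.
7. For Table name patterns, specify a name or a pattern that matches the table names in the schema. You can use regular expressions to specify the pattern in the form <table1Regex>;<table2Regex>. The pattern must follow Java regular expression syntax. For example:
  - lineitem;ordertb matches the tables named lineitem and ordertb.
  - .* matches all tables.
  If you leave this field empty, all tables from the specified schema are migrated.
  
  Note: For very large tables, we recommend that you migrate one table at a time. BigQuery has a load quota of 15 TB for each load job.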
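8. For VPC and the reserved IP range, specify the VPC network name and the private IP address range to use in the tenant project VPC network. Specify the IP address range as a CIDR block.
  - The format is VPC_network_name:CIDR, for example: my_vpc:10.251.1.0/24.
  - Use a standard private VPC network address range in CIDR notation, starting with 10.x.x.x.
  - The IP address range must contain more than 10 IP addresses.
  - The IP address range must not overlap with any subnet in your project VPC network or the Amazon Redshift VPC network.
  - If you configure multiple transfers for the same Amazon Redshift instance, use the same VPC_network_name:CIDR value in each transfer so that the transfers can reuse the same migration infrastructure.
  Note: After it is configured, the CIDR block value can't be changed.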
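Optional: In the Notification options section, do the following:
1. Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
2. For Select a Pub/Sub topic, choose your topic name or click Create a topic. This option configures Pub/Sub run notifications for your transfer.
Click Save.
The Google Cloud console displays all the transfer setup details, including a Resource name for this transfer.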
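
Two of the fields in the Data source details section have formats that you can check before you save the transfer: the table name patterns and the VPC_network_name:CIDR value. The following is a minimal sketch with hypothetical helper names. It uses Python's `re` module rather than Java's regular expression engine, assumes that each pattern must match a whole table name, and interprets "10.x.x.x" as the 10.0.0.0/8 block. The service itself might behave differently.

```python
import ipaddress
import re


def select_tables(table_names, table_name_patterns):
    """Return the names matched by a ';'-separated list of regular expressions.

    An empty pattern string selects every table, mirroring the empty field.
    """
    if not table_name_patterns:
        return list(table_names)
    patterns = [re.compile(p) for p in table_name_patterns.split(";") if p]
    return [
        name for name in table_names
        if any(p.fullmatch(name) for p in patterns)
    ]


def parse_vpc_cidr(value, existing_subnets=()):
    """Split 'VPC_network_name:CIDR' and check the constraints listed in step 8."""
    name, separator, cidr = value.partition(":")
    if not separator or not name:
        raise ValueError("expected the form VPC_network_name:CIDR")
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4 or not network.subnet_of(
            ipaddress.ip_network("10.0.0.0/8")):
        raise ValueError("the range must be inside 10.0.0.0/8")
    if network.num_addresses <= 10:
        raise ValueError("the range must contain more than 10 addresses")
    for subnet in existing_subnets:
        if network.overlaps(ipaddress.ip_network(subnet)):
            raise ValueError(f"the range overlaps the existing subnet {subnet}")
    return name, network


tables = ["lineitem", "ordertb", "customer"]
print(select_tables(tables, "lineitem;ordertb"))
print(parse_vpc_cidr("my_vpc:10.251.1.0/24", existing_subnets=["10.128.0.0/20"]))
```

Pass the subnets of both your project VPC network and the Amazon Redshift VPC network as `existing_subnets` so that the overlap check covers both sides of the VPN.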

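Quotas and limits

When you migrate a private Amazon Redshift instance with a VPC network, the migration agent runs on a single-tenant infrastructure. Because of compute resource limits, at most 5 concurrent transfer runs are allowed.

BigQuery has a load quota of 15 TB for each load job for each table. Amazon Redshift compresses table data internally, so the exported table size is larger than the table size that Amazon Redshift reports. If you plan to migrate a table that is larger than 15 TB, contact Cloud Customer Care first.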
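
Because the exported size is larger than the size that Amazon Redshift reports, a rough estimate can help you decide which tables to flag before you migrate. The sketch below is plain arithmetic: the `expansion_factor` is a hypothetical ratio of exported size to reported size that you would have to measure for your own data, for example from a sample unload.

```python
LOAD_QUOTA_TB = 15  # per load job, per table, as stated in this section


def may_exceed_load_quota(reported_size_tb, expansion_factor):
    """Return True if the estimated exported size is above the 15 TB quota.

    expansion_factor is an assumed ratio of exported (uncompressed) size to
    the size reported by Amazon Redshift; it is not provided by any service.
    """
    if reported_size_tb < 0 or expansion_factor < 1:
        raise ValueError("size must be non-negative and the factor at least 1")
    return reported_size_tb * expansion_factor > LOAD_QUOTA_TB


# A table reported at 6 TB with an assumed 3x expansion is estimated at 18 TB.
print(may_exceed_load_quota(6, 3))
```

Treat a True result as a prompt to contact Cloud Customer Care or to split the migration, not as an exact measurement.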

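Using this service can incur costs outside of Google. For details, review the Amazon Redshift and Amazon S3 pricing pages.

Because of the Amazon S3 consistency model, some files might not be included in the transfer to BigQuery.

What's next

Learn about standard Amazon Redshift migrations.
Learn more about BigQuery Data Transfer Service.
Migrate SQL code with batch SQL translation.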