Create a Dataflow pipeline using Go

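This page shows you how to use the Apache Beam SDK for Go to build a program that defines a pipeline. You then run the pipeline locally and on the Dataflow service. For an introduction to the WordCount pipeline, watch the video How to use WordCount in Apache Beam.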

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

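If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: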

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, and Cloud Resource Manager APIs:

gcloud services enable dataflow compute_component logging storage_component storage_api cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: your project ID.
USER_IDENTIFIER: the identifier for your user account—for example, myemail@example.com.
ROLE: the IAM role that you grant to your user account.

Grant roles to your Compute Engine default service account. Run the following command once for each of the following IAM roles:
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
```
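- Replace PROJECT_ID with your project ID.
- Replace PROJECT_NUMBER with your project number. To find your project number, see Identify projects or use the gcloud projects describe command.
- Replace SERVICE_ACCOUNT_ROLE with each individual role.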
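
Running the same command three times by hand is easy to get wrong, so the step above can be sketched as a loop. This is a minimal sketch, not part of the original quickstart: the `grant_roles` function and the `DRY_RUN` switch are hypothetical helpers introduced here for illustration, and the `PROJECT_ID` and `PROJECT_NUMBER` values are the same placeholders used above. By default the script only prints the commands so you can review them.

```shell
set -eu

# Placeholders from the step above; replace them or export real values first.
# The project number can be looked up with:
#   gcloud projects describe PROJECT_ID --format="value(projectNumber)"
PROJECT_ID="${PROJECT_ID:-PROJECT_ID}"
PROJECT_NUMBER="${PROJECT_NUMBER:-PROJECT_NUMBER}"

# Hypothetical safety switch (not a gcloud feature): 1 prints, 0 executes.
DRY_RUN="${DRY_RUN:-1}"

# Grants each of the three roles to the Compute Engine default service account.
grant_roles() {
  for role in roles/dataflow.admin roles/dataflow.worker roles/storage.objectAdmin; do
    # Build the command as the function's positional parameters.
    set -- gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
      --role="$role"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$*"   # print the command for review
    else
      "$@"        # run the command
    fi
  done
}

grant_roles
```

Run the script as-is to print the three commands. After you check the project ID and project number, set `DRY_RUN=0` to execute them.
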
Create a Cloud Storage bucket and configure it as follows:
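- Set the storage class to S (Standard).
- Set the storage location to the following: US (United States).
- Replace BUCKET_NAME with a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
- Copy the Google Cloud project ID and the Cloud Storage bucket name. You need these values later in this quickstart.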
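
The bucket settings above name a storage class, a location, and a BUCKET_NAME placeholder, but no command accompanies them in this text. The following is a minimal sketch of one way to apply those settings, assuming you create the bucket with `gcloud storage buckets create`. The `create_bucket` function and the `DRY_RUN` switch are hypothetical helpers added for illustration; by default the script only prints the command.

```shell
set -eu

# Placeholder from the list above; replace it with your unique bucket name.
BUCKET_NAME="${BUCKET_NAME:-BUCKET_NAME}"

# Hypothetical safety switch (not a gcloud feature): 1 prints, 0 executes.
DRY_RUN="${DRY_RUN:-1}"

# Creates a Standard-class bucket in the US multi-region.
create_bucket() {
  # "$1" is expanded to the bucket name before the parameters are replaced.
  set -- gcloud storage buckets create "gs://$1" \
    --default-storage-class=STANDARD \
    --location=US
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"   # print the command for review
  else
    "$@"        # run the command
  fi
}

create_bucket "$BUCKET_NAME"
```

Set `DRY_RUN=0` to create the bucket once the printed command looks right.
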
