Python을 사용하여 Dataflow 파이프라인 만들기

이 문서에서는 Python용 Apache Beam SDK를 사용하여 파이프라인을 정의하는 프로그램을 빌드하는 방법을 설명합니다. 그런 다음 직접 로컬 실행기 또는 Dataflow와 같은 클라우드 기반 실행기를 사용하여 파이프라인을 실행합니다. WordCount 파이프라인 소개는 Apache Beam에서 WordCount를 사용하는 방법 동영상을 참조하세요.

Google Cloud 콘솔에서 이 태스크에 대한 단계별 안내를 직접 수행하려면 둘러보기를 클릭합니다.

둘러보기

시작하기 전에

Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

외부 ID 공급업체(IdP)를 사용하는 경우 먼저 제휴 ID로 gcloud CLI에 로그인해야 합니다.

gcloud CLI를 초기화하려면, 다음 명령어를 실행합니다.

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.supportUser, roles/datastream.admin, roles/monitoring.metricsScopesViewer, roles/cloudaicompanion.settingsAdmin

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: Your project ID.
USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
ROLE: The IAM role that you grant to your user account.

Install the Google Cloud CLI.

외부 ID 공급업체(IdP)를 사용하는 경우 먼저 제휴 ID로 gcloud CLI에 로그인해야 합니다.

gcloud CLI를 초기화하려면, 다음 명령어를 실행합니다.

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: Your project ID.
USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
ROLE: The IAM role that you grant to your user account.

Compute Engine 기본 서비스 계정에 역할을 부여합니다. 다음 IAM 역할마다 다음 명령어를 1회 실행합니다.
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
```
- PROJECT_ID를 프로젝트 ID로 바꿉니다.
- 여기에서 PROJECT_NUMBER를 프로젝트 번호로 바꿉니다. 프로젝트 번호를 찾으려면 프로젝트 식별을 참조하거나 gcloud projects describe 명령어를 사용합니다.
- SERVICE_ACCOUNT_ROLE을 각 개별 역할로 바꿉니다.
Create a Cloud Storage bucket and configure it as follows:
- Set the storage class to S(Standard).
- 스토리지 위치를 다음과 같이 설정합니다. US(미국)
- BUCKET_NAME을 고유한 버킷 이름으로 바꿉니다. 버킷 네임스페이스는 전역적이며 공개로 표시되므로 버킷 이름에 민감한 정보를 포함해서는 안 됩니다.
- Google Cloud 프로젝트 ID 및 Cloud Storage 버킷 이름을 복사합니다. 이 값은 이 문서의 뒷부분에서 필요합니다.

Python을 사용하여 Dataflow 파이프라인 만들기

시작하기 전에

환경 설정

Apache Beam SDK 가져오기

로컬에서 파이프라인 실행

Dataflow 서비스에서 파이프라인 실행

결과 보기

Google Cloud 콘솔

로컬 터미널

파이프라인 코드 수정

삭제

다음 단계