Build and run a Flex Template

Flex Templates let you package a Dataflow pipeline for deployment. This tutorial shows you how to build a Dataflow Flex Template and then run a Dataflow job using that template.

Objectives

Build a Dataflow Flex Template.
Run a Dataflow job using the template.

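Costs

This document uses billable components of Google Cloud Platform.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin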

Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Logging, Cloud Storage, Cloud Storage JSON, Resource Manager, Artifact Registry, and Cloud Build APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable dataflow compute_component logging storage_component storage_api cloudresourcemanager.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com

If you're using a local shell, then create local authentication credentials for your user account:

gcloud auth application-default login

You don't need to do this if you're using Cloud Shell.

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: Your project ID.
USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
ROLE: The IAM role that you grant to your user account.

Grant roles to your Compute Engine default service account. Run the following command once for each of the following IAM roles:
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
- roles/artifactregistry.writer
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
```
Replace the following:
- PROJECT_ID: your project ID
- PROJECT_NUMBER: your project number
- SERVICE_ACCOUNT_ROLE: each individual role
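
Because the command runs once per role, a small loop can save some typing. The following is a minimal sketch and not part of the original tutorial: the function name `print_grant_commands` is hypothetical, and it only prints the commands so that you can review them before running anything.

```shell
# Hypothetical helper: prints one add-iam-policy-binding command per role
# listed in this tutorial for the Compute Engine default service account.
# Arguments: $1 = project ID, $2 = project number.
print_grant_commands() {
  project_id="$1"
  project_number="$2"
  member="serviceAccount:${project_number}-compute@developer.gserviceaccount.com"
  for role in \
    roles/dataflow.admin \
    roles/dataflow.worker \
    roles/storage.objectAdmin \
    roles/artifactregistry.writer
  do
    printf 'gcloud projects add-iam-policy-binding %s --member="%s" --role=%s\n' \
      "$project_id" "$member" "$role"
  done
}
```

To review the commands, run `print_grant_commands PROJECT_ID PROJECT_NUMBER`. To execute them, pipe the output to `sh`. If you don't know the project number, `gcloud projects describe PROJECT_ID --format="value(projectNumber)"` prints it.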

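Prepare the environment

Install the SDK and any requirements for your development environment.

Java

Download and install the Java Development Kit (JDK) version 17. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
Download and install Apache Maven by following Maven's installation guide for your specific operating system.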

Python

Install the Apache Beam SDK for Python.

Go

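Use Go's Download and install guide to download and install Go for your specific operating system. To learn which Go runtime environments are supported by Apache Beam, see Apache Beam runtime support.

Download the code sample.

Java

Clone the java-docs-samples repository.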

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git

Navigate to the code sample for this tutorial.

cd java-docs-samples/dataflow/flex-templates/getting_started

Build the Java project into an Uber JAR file.
```
mvn clean package
```
This Uber JAR file has all the dependencies embedded in it. You can run this file as a standalone application with no external dependencies on other libraries.

Python

Clone the python-docs-samples repository.

git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git

Navigate to the code sample for this tutorial.

cd python-docs-samples/dataflow/flex-templates/getting_started

Go

Clone the golang-samples repository.

git clone https://github.com/GoogleCloudPlatform/golang-samples.git

Navigate to the code sample for this tutorial.
```
cd golang-samples/dataflow/flex-templates/wordcount
```

Compile the Go binary.

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o wordcount .

Create a Cloud Storage bucket

Use the gcloud storage buckets create command to create a Cloud Storage bucket:

gcloud storage buckets create gs://BUCKET_NAME

Replace BUCKET_NAME with a name for your Cloud Storage bucket. Cloud Storage bucket names must be globally unique and meet the bucket naming requirements.
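
As a rough pre-check before calling gcloud, the sketch below tests a candidate name against a few of the bucket naming rules: 3 to 63 characters; only lowercase letters, digits, dashes, underscores, and dots; and a letter or digit at both ends. The function name `is_plausible_bucket_name` is hypothetical, and the check is deliberately conservative and incomplete. It cannot tell whether a name is globally unique, and it doesn't cover every naming rule.

```shell
# Hypothetical helper: returns 0 if $1 passes a basic subset of the
# Cloud Storage bucket naming rules, and non-zero otherwise.
is_plausible_bucket_name() {
  name="$1"
  # Length must be between 3 and 63 characters.
  [ "${#name}" -ge 3 ] || return 1
  [ "${#name}" -le 63 ] || return 1
  # Only lowercase letters, digits, dots, underscores, and dashes,
  # with a letter or digit as the first and last character.
  printf '%s' "$name" | grep -Eq '^[a-z0-9][a-z0-9._-]*[a-z0-9]$'
}
```

One possible use is to guard the create command, for example `is_plausible_bucket_name "$BUCKET" && gcloud storage buckets create "gs://$BUCKET"`, where `BUCKET` is a shell variable that you set yourself.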

Create an Artifact Registry repository

Create an Artifact Registry repository where you will push the Docker container image for the template.

Use the gcloud artifacts repositories create command to create a new Artifact Registry repository.
```
gcloud artifacts repositories create REPOSITORY \
 --repository-format=docker \
 --location=LOCATION
```
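Replace the following:
- REPOSITORY: a name for your repository. Repository names must be unique for each repository location in a project.
- LOCATION: the regional or multi-regional location for the repository.
Use the gcloud auth configure-docker command to configure Docker to authenticate requests for Artifact Registry. This command updates your Docker configuration so that you can connect with Artifact Registry to push images.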
```
gcloud auth configure-docker LOCATION-docker.pkg.dev
```

Flex Templates can also use images stored in private registries. For more information, see Use an image from a private registry.
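
The build commands later in this tutorial refer to images by a path of the form LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. As a small convenience sketch, a hypothetical helper such as `image_path` can assemble that string from its parts so that the same values are reused consistently. The argument values shown in the usage note are placeholders.

```shell
# Hypothetical helper: builds an Artifact Registry Docker image path.
# Arguments: $1 = location, $2 = project ID, $3 = repository,
#            $4 = image name, $5 = tag (defaults to "latest" if omitted).
image_path() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "${5:-latest}"
}
```

For example, `image_path us-central1 my-project my-repo getting-started-python` prints `us-central1-docker.pkg.dev/my-project/my-repo/getting-started-python:latest`.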

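Build the Flex Template

In this step, you use the gcloud dataflow flex-template build command to build the Flex Template.

A Flex Template consists of the following components:

A Docker container image that packages your pipeline code. For Java and Python Flex Templates, the Docker image is built and pushed to your Artifact Registry repository when you run the gcloud dataflow flex-template build command.
A template specification file. This file is a JSON document that contains the location of the container image plus metadata about the template, such as pipeline parameters.

The sample repository in GitHub contains the metadata.json file.

To extend your template with additional metadata, you can create your own metadata.json file.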
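
For illustration, the sketch below writes a small metadata file from the shell. The field names (`name`, `description`, `parameters`, and, for each parameter, `name`, `label`, `helpText`, and `regexes`) are ones commonly used in Flex Template metadata files. The values, the regular expression, and the function name `write_metadata` are made up for this example, so compare the result against the metadata.json file in the sample repository before relying on it.

```shell
# Hypothetical helper: writes an example metadata file to the path given in $1.
# The single "output" parameter mirrors the output parameter that the run
# commands in this tutorial pass to the job.
write_metadata() {
  cat > "$1" <<'EOF'
{
  "name": "Example Flex Template",
  "description": "Illustrative metadata for a pipeline with one parameter.",
  "parameters": [
    {
      "name": "output",
      "label": "Output destination",
      "helpText": "Cloud Storage path prefix for output files.",
      "regexes": ["^gs://.+$"]
    }
  ]
}
EOF
}
```

For example, `write_metadata my-metadata.json` creates the file in the current directory, and you could then pass `--metadata-file "my-metadata.json"` to the build command instead of the sample file.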

Java

gcloud dataflow flex-template build gs://BUCKET_NAME/getting_started-java.json \
 --image-gcr-path "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/getting-started-java:latest" \
 --sdk-language "JAVA" \
 --flex-template-base-image JAVA17 \
 --metadata-file "metadata.json" \
 --jar "target/flex-template-getting-started-1.0.jar" \
 --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.dataflow.FlexTemplateGettingStarted"

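Replace the following:

BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
LOCATION: the location of the Artifact Registry repository that you created earlier
PROJECT_ID: the Google Cloud project ID
REPOSITORY: the name of the Artifact Registry repository that you created earlier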

Python

gcloud dataflow flex-template build gs://BUCKET_NAME/getting_started-py.json \
 --image-gcr-path "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/getting-started-python:latest" \
 --sdk-language "PYTHON" \
 --flex-template-base-image "PYTHON3" \
 --metadata-file "metadata.json" \
 --py-path "." \
 --env "FLEX_TEMPLATE_PYTHON_PY_FILE=getting_started.py" \
 --env "FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=requirements.txt"

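Replace the following:

BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
LOCATION: the location of the Artifact Registry repository that you created earlier
PROJECT_ID: the Google Cloud project ID
REPOSITORY: the name of the Artifact Registry repository that you created earlier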

Go

Use the gcloud builds submit command to build the Docker image from a Dockerfile with Cloud Build. This command builds the image and pushes it to your Artifact Registry repository.
```
gcloud builds submit --tag LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/wordcount-go:latest .
```
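Replace the following:
- LOCATION: the location of the Artifact Registry repository that you created earlier
- PROJECT_ID: the Google Cloud project ID
- REPOSITORY: the name of the Artifact Registry repository that you created earlier

Use the gcloud dataflow flex-template build command to create a Flex Template named wordcount-go.json in your Cloud Storage bucket.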

gcloud dataflow flex-template build gs://BUCKET_NAME/samples/dataflow/templates/wordcount-go.json \
  --image "LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/wordcount-go:latest" \
  --sdk-language "GO" \
  --metadata-file "metadata.json"

Replace BUCKET_NAME with the name of the Cloud Storage bucket that you created earlier.

Run the Flex Template

In this step, you use the template to run a Dataflow job.

Java

Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.
```
gcloud dataflow flex-template run "getting-started-`date +%Y%m%d-%H%M%S`" \
 --template-file-gcs-location "gs://BUCKET_NAME/getting_started-java.json" \
 --parameters output="gs://BUCKET_NAME/output-" \
 --additional-user-labels "LABELS" \
 --region "REGION"
```
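Replace the following:
- BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
- REGION: the region in which to run the job
- LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

Go to Jobs

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/output--00000-of-00001.txt in your Cloud Storage bucket.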
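
The double dash in that file name comes from the output prefix passed on the command line (output-) followed by what appears to be a shard suffix of the form -SSSSS-of-NNNNN and a .txt extension. The sketch below reproduces that naming under this assumption. The function name `shard_file_name` is hypothetical, and the pattern is inferred from the example file name above, not taken from the pipeline source.

```shell
# Hypothetical helper: reconstructs an output file name from its parts.
# Arguments: $1 = output prefix, $2 = shard index, $3 = total shard count.
shard_file_name() {
  printf '%s-%05d-of-%05d.txt\n' "$1" "$2" "$3"
}
```

For example, `shard_file_name "gs://my-bucket/output-" 0 1` prints `gs://my-bucket/output--00000-of-00001.txt`. To see what the job actually wrote, you can list the matching objects with `gcloud storage ls "gs://BUCKET_NAME/output-*"`.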

Python

Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.
```
gcloud dataflow flex-template run "getting-started-`date +%Y%m%d-%H%M%S`" \
 --template-file-gcs-location "gs://BUCKET_NAME/getting_started-py.json" \
 --parameters output="gs://BUCKET_NAME/output-" \
 --additional-user-labels "LABELS" \
 --region "REGION"
```
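Replace the following:
- BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
- REGION: the region in which to run the job
- LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

Go to Jobs

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/output--00000-of-00001.txt in your Cloud Storage bucket.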

Go

Use the gcloud dataflow flex-template run command to run a Dataflow job that uses the Flex Template.
```
gcloud dataflow flex-template run "wordcount-go-`date +%Y%m%d-%H%M%S`" \
 --template-file-gcs-location "gs://BUCKET_NAME/samples/dataflow/templates/wordcount-go.json" \
 --parameters output="gs://BUCKET_NAME/samples/dataflow/templates/counts.txt" \
 --additional-user-labels "LABELS" \
 --region "REGION"
```
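Replace the following:
- BUCKET_NAME: the name of the Cloud Storage bucket that you created earlier
- REGION: the region in which to run the job
- LABELS: Optional. Labels attached to your job, using the format <key1>=<val1>,<key2>=<val2>,...
To view the status of the Dataflow job in the Google Cloud console, go to the Dataflow Jobs page.

Go to Jobs

If the job runs successfully, it writes the output to a file named gs://BUCKET_NAME/samples/dataflow/templates/counts.txt in your Cloud Storage bucket.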

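Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.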

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete the individual resources

Delete the Cloud Storage bucket and all the objects in the bucket.
```
gcloud storage rm gs://BUCKET_NAME --recursive
```

Delete the Artifact Registry repository.

gcloud artifacts repositories delete REPOSITORY \
    --location=LOCATION

Revoke the roles that you granted to the Compute Engine default service account. Run the following command once for each of the following IAM roles:
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
- roles/artifactregistry.writer
```
gcloud projects remove-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com \
    --role=SERVICE_ACCOUNT_ROLE
```
Optional: Revoke the authentication credentials that you created, and delete the local credential file.
```
gcloud auth application-default revoke
```
Optional: Revoke credentials from the gcloud CLI.
```
gcloud auth revoke
```

What's next

Learn how to configure Flex Templates.
See the list of Google-provided templates.