Esta página foi traduzida pela API Cloud Translation.

Criar um pipeline do Dataflow usando Python

Este documento mostra como usar o SDK do Apache Beam para Python na criação de um programa que defina um pipeline. Em seguida, execute o pipeline usando um executor local direto ou um executor baseado na nuvem, como o Dataflow. Para uma introdução ao pipeline do WordCount, consulte o vídeo Como usar o WordCount no Apache Beam.

Para seguir as instruções detalhadas desta tarefa diretamente no console do Google Cloud , clique em Orientação:

Orientações

Antes de começar

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

Se você estiver usando um provedor de identidade externo (IdP), primeiro faça login na CLI gcloud com sua identidade federada.

Para inicializar a CLI gcloud, execute o seguinte comando:

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: your project ID.
USER_IDENTIFIER: the identifier for your user account—for example, myemail@example.com.
ROLE: the IAM role that you grant to your user account.

Install the Google Cloud CLI.

Se você estiver usando um provedor de identidade externo (IdP), primeiro faça login na CLI gcloud com sua identidade federada.

Para inicializar a CLI gcloud, execute o seguinte comando:

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: your project ID.
USER_IDENTIFIER: the identifier for your user account—for example, myemail@example.com.
ROLE: the IAM role that you grant to your user account.

Conceda papéis à conta de serviço padrão do Compute Engine. Execute uma vez o seguinte comando para cada um dos seguintes papéis do IAM:
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
```
- Substitua PROJECT_ID pela ID do seu projeto.
- Substitua PROJECT_NUMBER pelo número do projeto. Para encontrar o número do projeto, consulte Identificar projetos ou use o comando gcloud projects describe.
- Substitua SERVICE_ACCOUNT_ROLE por cada papel individual.
Create a Cloud Storage bucket and configure it as follows:
- Set the storage class to S (Standard).
- Defina o local de armazenamento como o seguinte: US (Estados Unidos).
- Substitua BUCKET_NAME por um nome de bucket exclusivo . Não inclua informações confidenciais no nome do bucket já que o namespace dele é global e visível para o público.
- Copie o ID do projeto Google Cloud e o nome do bucket do Cloud Storage. Você precisará desses valores posteriormente neste documento.

Criar um pipeline do Dataflow usando Python

Antes de começar

Configurar o ambiente

Instale o SDK do Apache Beam

Execute o pipeline localmente

Executar o pipeline no serviço do Dataflow

Ver os resultados

Google Cloud console

Terminal local

Modificar o código do pipeline

Limpar

A seguir