Create a data pipeline
This quickstart shows you how to do the following:
Create a Cloud Data Fusion instance.
Deploy a sample pipeline that's provided with your Cloud Data Fusion instance. The pipeline does the following:
Reads a JSON file containing NYT bestseller data from Cloud Storage.
Runs transformations on the file to parse and clean the data.
Loads the top-rated books added in the last week that cost less than $25 into BigQuery.
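The parse-and-filter logic the sample pipeline applies can be approximated in plain Python. This is a hedged illustration of the transformation, not the actual Wrangler recipe; the record layout and field names (`title`, `rank`, `price`, `weeks_on_list`) are assumptions for demonstration.

```python
import json

# Hypothetical sample of NYT bestseller records, mimicking the JSON file
# the pipeline reads from Cloud Storage (field names are assumptions).
RAW_JSON = """
[
  {"title": "Book A", "rank": 1, "price": "19.99", "weeks_on_list": 1},
  {"title": "Book B", "rank": 2, "price": "27.50", "weeks_on_list": 1},
  {"title": "Book C", "rank": 3, "price": "12.00", "weeks_on_list": 5}
]
"""

def top_rated_inexpensive(records, max_price=25.0):
    """Keep books added in the last week (weeks_on_list == 1)
    that cost less than max_price, as the sample pipeline does."""
    kept = []
    for rec in records:
        price = float(rec["price"])  # clean step: price arrives as a string
        if rec["weeks_on_list"] == 1 and price < max_price:
            kept.append({**rec, "price": price})
    return kept

rows = top_rated_inexpensive(json.loads(RAW_JSON))
print([r["title"] for r in rows])  # Book B costs too much, Book C is too old
```

In the real pipeline these steps run as plugins on a cluster and the surviving rows land in the BigQuery table rather than a Python list.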
Before you begin
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Enable the Cloud Data Fusion API.
Create a Cloud Data Fusion instance
Click Create an instance.
Enter an Instance name.
Enter a Description for your instance.
Enter the Region in which to create the instance.
Choose the Cloud Data Fusion Version to use.
Choose the Cloud Data Fusion Edition.
For Cloud Data Fusion versions 6.2.3 and later, in the Authorization field, choose the Dataproc service account to use for running your Cloud Data Fusion pipeline in Dataproc. The default value, the Compute Engine account, is pre-selected.
Click Create.
It takes up to 30 minutes for the instance creation process to complete. While Cloud Data Fusion creates your instance, a progress indicator displays next to the instance name on the Instances page. After completion, it turns into a green check mark, indicating that you can start using the instance.
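Because instance creation is asynchronous, a client automating this step typically polls the instance state until it reports ready. A minimal polling sketch, with a stubbed status callable standing in for the Cloud Data Fusion API (the state names and the 30-minute budget follow this quickstart; the real call would be a method such as the REST `projects.locations.instances.get`):

```python
import time

def wait_until_ready(get_state, timeout_s=1800, poll_interval_s=30):
    """Poll an instance-state callable until it reports RUNNING.

    get_state is a stand-in for a real API call; timeout_s defaults to
    30 minutes, the upper bound on instance creation in this quickstart."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "RUNNING":
            return state
        if state == "FAILED":
            raise RuntimeError("instance creation failed")
        time.sleep(poll_interval_s)
    raise TimeoutError("instance not ready within timeout")

# Stub: reports CREATING twice, then RUNNING.
states = iter(["CREATING", "CREATING", "RUNNING"])
print(wait_until_ready(lambda: next(states), poll_interval_s=0))  # RUNNING
```

In the console this loop is what the progress indicator and green check mark represent.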
Navigate the Cloud Data Fusion web interface
When using Cloud Data Fusion, you use both the Google Cloud console and the separate Cloud Data Fusion web interface.
In the Google Cloud console, you can do the following:
Create a Google Cloud console project
Create and delete Cloud Data Fusion instances
View the Cloud Data Fusion instance details
In the Cloud Data Fusion web interface, you can use various pages, such as Studio or Wrangler, to access Cloud Data Fusion functionality.
To navigate the Cloud Data Fusion interface, follow these steps:
In the Google Cloud console, open the Instances page.
In the instance Actions column, click the View Instance link.
In the Cloud Data Fusion web interface, use the left navigation panel to go to the page you need.
Deploy a sample pipeline
Sample pipelines are available through the Cloud Data Fusion Hub, which lets you share reusable Cloud Data Fusion pipelines, plugins, and solutions.
In the Cloud Data Fusion web interface, click Hub.
In the left panel, click Pipelines.
Click the Cloud Data Fusion Quickstart pipeline.
Click Create.
In the Cloud Data Fusion Quickstart configuration panel, click Finish.
Click Customize Pipeline.
A visual representation of your pipeline appears on the Studio page, which is a graphical interface for developing data integration pipelines.
Available pipeline plugins are listed on the left, and your pipeline is displayed on the main canvas area. To explore the pipeline, hold the pointer over each node and click Properties. The properties menu for each node lets you view the objects and operations associated with it.
In the top-right menu, click Deploy. This step submits the pipeline to Cloud Data Fusion. You will run the pipeline in the next section of this quickstart.
View your pipeline
The deployed pipeline appears in the pipeline details view, where you can do the following:
View the structure and configuration of the pipeline.
Run the pipeline manually or set up a schedule or a trigger.
View a summary of historical runs of the pipeline, including execution times, logs, and metrics.
Run your pipeline
In the pipeline details view, click Run to execute your pipeline.
When executing a pipeline, Cloud Data Fusion does the following:
Provisions an ephemeral Dataproc cluster.
Executes the pipeline on the cluster using Apache Spark.
Deletes the cluster. The cluster only exists for the duration of the pipeline run.
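The provision, execute, delete sequence above is a try/finally pattern: the ephemeral cluster is torn down whether the run succeeds or fails. A hedged sketch with stubbed callables standing in for the Dataproc and Spark operations (not the actual provisioner code):

```python
def run_on_ephemeral_cluster(provision, execute, delete):
    """Provision a cluster, run the pipeline on it, and always clean up,
    mirroring how Cloud Data Fusion uses an ephemeral Dataproc cluster."""
    cluster = provision()
    try:
        return execute(cluster)
    finally:
        delete(cluster)  # cluster exists only for the duration of the run

# Stubs standing in for the real Dataproc and Spark calls.
events = []
result = run_on_ephemeral_cluster(
    provision=lambda: events.append("provision") or "cluster-1",
    execute=lambda c: events.append(f"execute on {c}") or "Succeeded",
    delete=lambda c: events.append(f"delete {c}"),
)
print(result, events)
```

This is why no cluster remains in your project after the run: deletion is part of the execution lifecycle, not a separate cleanup step you perform.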
View the results
After a few minutes, the pipeline finishes. The pipeline status changes to Succeeded, and the number of records processed by each node is displayed.
To view a sample of the results, go to the DataFusionQuickstart dataset in your project, click the top_rated_inexpensive table, then run a simple query. For example:
SELECT * FROM PROJECT_ID.GCPQuickStart.top_rated_inexpensive LIMIT 10
Replace PROJECT_ID with your project ID.
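You can run the query from the BigQuery console, the bq CLI, or a client library. As a small sketch of the PROJECT_ID substitution, here is the query built as a string in Python; the dataset and table names come from this quickstart, and "my-project" is a placeholder:

```python
def build_query(project_id, limit=10):
    """Build the quickstart's sample query against its output table."""
    return (
        f"SELECT * FROM `{project_id}.GCPQuickStart.top_rated_inexpensive` "
        f"LIMIT {limit}"
    )

print(build_query("my-project"))
```

Backquoting the fully qualified table name is standard in GoogleSQL and avoids parsing issues when a project ID contains a hyphen.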
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the BigQuery dataset that your pipeline wrote to in this quickstart.
Delete the Cloud Data Fusion instance. Note: Deleting your instance does not delete any of your data in the project.
Optional: Delete the project.
What's next
Work through a Cloud Data Fusion tutorial.
Learn about Cloud Data Fusion concepts.
Last updated 2025-09-04 UTC.