This page shows you how to do basic operations in Cloud Data Fusion using the Google Cloud Platform Console. You will create a Cloud Data Fusion pipeline that completes the following tasks:

  1. Reads a JSON file containing NYT bestseller data from Cloud Storage
  2. Runs transformations on the file to parse and clean the data
  3. Loads the top-rated books added in the last week that cost less than $25 into BigQuery
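
The filtering in step 3 can be sketched in plain Python. This is only an illustration of the logic; the field names (`rank`, `price`, `published_date`) are assumptions, not the actual schema of the quickstart's sample data:

```python
import json
from datetime import datetime, timedelta

# Hypothetical bestseller records; the field names are assumptions,
# not the actual schema of the quickstart's sample data.
raw = """[
  {"title": "Book A", "rank": 1,  "price": 19.99, "published_date": "2019-06-24"},
  {"title": "Book B", "rank": 12, "price": 29.99, "published_date": "2019-06-24"},
  {"title": "Book C", "rank": 2,  "price": 14.50, "published_date": "2019-05-01"}
]"""

def top_rated_inexpensive(records, now, max_price=25.0, max_rank=10):
    """Keep top-ranked books added in the last week that cost under max_price."""
    cutoff = now - timedelta(days=7)
    return [
        r for r in records
        if r["price"] < max_price
        and r["rank"] <= max_rank
        and datetime.strptime(r["published_date"], "%Y-%m-%d") >= cutoff
    ]

books = top_rated_inexpensive(json.loads(raw), now=datetime(2019, 6, 28))
print([b["title"] for b in books])  # only "Book A" passes all three filters
```

In the actual pipeline, the same parse-and-filter work is done by Cloud Data Fusion transformation plugins rather than hand-written code.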

This page also shows you how to access your Cloud Data Fusion instance through the GCP Console.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Project selector page

  3. Enable the Cloud Data Fusion API.

    Enable the API

  4. Create a Cloud Data Fusion instance.

    Open the Create Instance page

    It takes up to 20 minutes to deploy Cloud Data Fusion. The instance creation process is completed when a green checkmark displays to the left of the instance name on the Instances page in the GCP Console.

  5. When instance creation completes, grant permissions to the service account associated with the instance.
    1. In the Instances page, under Instance name, click the name of your instance.
    2. On the Instance details page that opens, copy the Service account value.
    3. In the left navigation menu, under IAM & Admin, navigate to the IAM page.
    4. At the top of the IAM page, click Add.
    5. In the Add members panel that opens, in the New members box, paste the service account that you copied.
    6. Under Select a role, start typing Cloud Data Fusion API Service Agent. Click the Cloud Data Fusion API Service Agent role.
    7. Click Save.
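
The grant in steps 4–7 amounts to adding a single IAM policy binding to the project. A minimal sketch of that binding as a data structure, assuming `roles/datafusion.serviceAgent` is the role ID behind the "Cloud Data Fusion API Service Agent" console name (the service account address below is a made-up placeholder; copy the real value from the Instance details page):

```python
# Placeholder service account; use the value copied from the Instance details page.
service_account = "service-1234567890@gcp-sa-datafusion.iam.gserviceaccount.com"

# The "Cloud Data Fusion API Service Agent" grant expressed as an IAM policy binding.
binding = {
    "role": "roles/datafusion.serviceAgent",
    "members": [f"serviceAccount:{service_account}"],
}
print(binding["role"])
```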

View instance details

  1. In the GCP Console, open the Instances page.

    Open the Instances page

  2. Click the name of the instance to see its details. The Instance details page provides information such as a link to the Cloud Data Fusion graphical interface, the zone, networking configuration, labels, and advanced configuration.

Access the Cloud Data Fusion graphical interface

  1. In the GCP Console, open the Instances page.

    Open the Instances page

  2. In the Actions column for the instance, click the View Instance link. The Cloud Data Fusion graphical interface opens.

New user experience

If you are accessing the Cloud Data Fusion graphical interface for the first time, a Welcome prompt takes you through a tour of the interface.

Deploy a sample pipeline

Some sample pipelines are available through the Cloud Data Fusion Hub, which lets you share reusable Cloud Data Fusion pipelines, plugins, and solutions.

  1. Click the HUB link on the navigation bar at the top.
  2. Click the Pipelines tab on the left.
  3. Choose the Cloud Data Fusion Quickstart pipeline.
  4. Click the Customize pipeline button.

The pipeline now appears in the Data Fusion Studio, which is a graphical designer interface to develop data integration pipelines visually. All the available plugins are on the left, and your pipeline is on the main canvas area. You can explore your pipeline by pointing at a node in the pipeline and clicking the Properties button that appears.

For the purposes of this quickstart, deploy the pipeline by clicking the Deploy button. Deploying the pipeline submits it to Cloud Data Fusion so that you can execute it later.


View your pipeline

After a pipeline is deployed, Cloud Data Fusion displays the pipeline on the Pipeline Detail page where you can:

  • View the pipeline's structure and configuration
  • Run the pipeline manually or set up a schedule or a trigger
  • View a summary of the pipeline's historical runs, including execution times, logs, and metrics

Execute your pipeline

To execute your pipeline, click Run in the Pipeline Detail view.

When you run a pipeline, Cloud Data Fusion provisions an ephemeral Cloud Dataproc cluster, executes the pipeline on the cluster by using Apache Hadoop MapReduce or Apache Spark, and then tears down the cluster.
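
Besides the Run button, a deployed pipeline can be started programmatically through the CDAP REST API that underlies Cloud Data Fusion. A hedged sketch that only assembles the request URL (the instance endpoint below is a placeholder, and `DataPipelineWorkflow` is, to my understanding, the standard workflow name for batch pipelines):

```python
def start_pipeline_url(api_endpoint, pipeline_name, namespace="default"):
    """Build the CDAP URL used to start a deployed batch pipeline (via POST)."""
    return (f"{api_endpoint}/v3/namespaces/{namespace}"
            f"/apps/{pipeline_name}/workflows/DataPipelineWorkflow/start")

# Placeholder endpoint; find your instance's real API endpoint on its details page.
url = start_pipeline_url(
    "https://example-instance.datafusion.googleusercontent.com/api",
    "DataFusionQuickstart",
)
print(url)
```

An actual call would POST to this URL with an OAuth 2.0 bearer token for an account that has access to the instance.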

View the results

The pipeline takes a few minutes to complete. After the pipeline completes, its status transitions to Succeeded, and you can see the number of records that each node in the pipeline processed.


To view the results of the DataFusionQuickstart pipeline, go to the BigQuery UI. The results of the pipeline are in the top_rated_inexpensive table, which is inside the DataFusionQuickstart dataset in your project.
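
Instead of browsing in the BigQuery UI, you can also inspect the table with a query. A sketch that just assembles the SQL (the project ID is a placeholder; the dataset and table names are the ones this quickstart's pipeline writes to):

```python
def results_query(project_id, dataset="DataFusionQuickstart",
                  table="top_rated_inexpensive", limit=10):
    """Build a standard-SQL query over the pipeline's output table."""
    return f"SELECT * FROM `{project_id}.{dataset}.{table}` LIMIT {limit}"

# Placeholder project ID; paste the result into the BigQuery UI query editor.
print(results_query("my-project"))
```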


Clean up

To avoid incurring charges to your GCP account for the resources used in this quickstart:

  1. Delete the BigQuery dataset your pipeline wrote to in this quickstart.
  2. Delete the Cloud Data Fusion instance.

  3. (Optional) Delete the project.

    1. In the GCP Console, go to the Projects page.

      Go to the Projects page

    2. In the project list, select the project you want to delete and click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
