This quickstart shows you how to:

  1. Create a Cloud Data Fusion instance.
  2. Deploy a sample pipeline that's provided with your Cloud Data Fusion instance. The pipeline does the following:
    1. Reads a JSON file containing NYT bestseller data from Cloud Storage.
    2. Runs transformations on the file to parse and clean the data.
    3. Loads the top-rated books added in the last week that cost less than $25 into BigQuery.
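
The parse, clean, and filter stages above can be pictured with a small Python sketch. It does not reproduce the sample pipeline's actual configuration: the field names (`title`, `rating`, `price`, `date_added`) and the rating threshold for "top-rated" are assumptions made for illustration only.

```python
from datetime import date, timedelta


def select_books(records, today, min_rating=4.0, max_price=25.0, window_days=7):
    """Keep highly rated, inexpensive books added within the last week.

    Field names and the min_rating threshold are hypothetical; the real
    sample file may use a different schema and a different definition of
    "top-rated".
    """
    cutoff = today - timedelta(days=window_days)
    selected = []
    for rec in records:
        # Parse and clean: skip records with missing or malformed values.
        try:
            price = float(str(rec["price"]).lstrip("$"))
            rating = float(rec["rating"])
            added = date.fromisoformat(rec["date_added"])
        except (KeyError, ValueError, TypeError):
            continue
        # Filter: top-rated, under the price limit, added recently.
        if rating >= min_rating and price < max_price and added >= cutoff:
            selected.append(
                {
                    "title": str(rec.get("title", "")).strip(),
                    "rating": rating,
                    "price": price,
                }
            )
    return selected
```

In the pipeline itself, these stages are separate nodes, and the final node writes the selected rows to a BigQuery table instead of returning a list.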

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

  3. Enable the Cloud Data Fusion API.

Create a Cloud Data Fusion instance

When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI.

  • In the Cloud Console, you can create a Google Cloud project, create and delete Cloud Data Fusion instances, and view Cloud Data Fusion instance details.

  • In the Cloud Data Fusion web UI, you can work with Cloud Data Fusion functionality through pages such as Pipeline Studio or Wrangler.

To navigate the Cloud Data Fusion UI, follow these steps:

  1. In the Cloud Console, open the Instances page.

  2. In the instance Actions column, click the View Instance link.
  3. In the Cloud Data Fusion web UI, use the left navigation panel to navigate to the page you need.

Deploy a sample pipeline

Sample pipelines are available through the Cloud Data Fusion Hub, which allows you to share reusable Cloud Data Fusion pipelines, plugins, and solutions.

  1. In the Cloud Data Fusion web UI, click HUB.
  2. In the left panel, click Pipelines.
  3. Click the Cloud Data Fusion Quickstart pipeline.
  4. Click Create.
  5. In the Cloud Data Fusion Quickstart configuration panel, click Finish.
  6. Click Customize Pipeline. A visual representation of your pipeline appears in the Pipeline Studio, which is a graphical interface for developing data integration pipelines. Available pipeline plugins are listed on the left, and your pipeline is displayed on the main canvas area. You can explore your pipeline by holding the pointer over each pipeline node and clicking the Properties button that appears. The properties menu for each node allows you to view the objects and operations associated with the node.
  7. In the top right menu, click Deploy. This submits the pipeline to Cloud Data Fusion. You run the pipeline later in this quickstart.

View your pipeline

The deployed pipeline appears in the pipeline details view, where you can do the following:

  • View the pipeline's structure and configuration.
  • Run the pipeline manually or set up a schedule or a trigger.
  • View a summary of the pipeline's historical runs, including execution times, logs, and metrics.

Execute your pipeline

In the pipeline details view, click Run to execute your pipeline.
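
This quickstart runs the pipeline from the UI. As a rough programmatic counterpart, the sketch below builds the CDAP REST request that starts a batch pipeline's workflow. It assumes that `api_endpoint` is your instance's API endpoint, that `access_token` is a valid OAuth access token (for example, from `gcloud auth print-access-token`), and that the pipeline was deployed under the name you pass in; the function names are hypothetical.

```python
import urllib.request


def build_start_url(api_endpoint, pipeline_name, namespace="default"):
    """Return the CDAP REST URL that starts a batch pipeline's workflow."""
    return (
        f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
        f"/apps/{pipeline_name}/workflows/DataPipelineWorkflow/start"
    )


def start_pipeline(api_endpoint, pipeline_name, access_token):
    """Send the start request and return the HTTP status code.

    Assumes the caller supplies a valid endpoint and token; no retries
    or error handling are included in this sketch.
    """
    request = urllib.request.Request(
        build_start_url(api_endpoint, pipeline_name),
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the URL construction separate from the network call makes it easy to check the request path before sending anything to your instance.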

View the results

After a few minutes, the pipeline finishes. The pipeline status changes to Succeeded and the number of records processed by each node is displayed.

  1. Go to the BigQuery UI.
  2. Under the DataFusionQuickstart dataset in your project, click the top_rated_inexpensive table, and then run a simple query to view a sample of the results, such as SELECT * FROM `my-project.GCPQuickStart.top_rated_inexpensive` LIMIT 10 (replace my-project with your project ID).
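
If you prefer to run the sample query from code, one possible sketch uses the `google-cloud-bigquery` Python client. It assumes that the library is installed and that application default credentials are configured. The helper names are hypothetical, and the dataset name defaults to the one in the sample query above, so pass the name shown in your project if it differs.

```python
def build_sample_query(project_id, dataset="GCPQuickStart",
                       table="top_rated_inexpensive", limit=10):
    """Build the sample query for the table that the pipeline wrote to."""
    return f"SELECT * FROM `{project_id}.{dataset}.{table}` LIMIT {int(limit)}"


def fetch_sample(project_id):
    """Run the sample query and return the rows as dictionaries."""
    # Imported lazily so that the query builder works without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    rows = client.query(build_sample_query(project_id)).result()
    return [dict(row.items()) for row in rows]
```

For example, `fetch_sample("my-project")` returns up to 10 rows, with my-project replaced by your project ID.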

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps:

  1. Delete the BigQuery dataset your pipeline wrote to in this quickstart.
  2. Delete the Cloud Data Fusion instance.

  3. (Optional) Delete the project.

    1. In the Cloud Console, go to the Manage resources page.

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.
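
The first cleanup step, deleting the BigQuery dataset, can also be scripted. The following sketch uses the `google-cloud-bigquery` Python client and assumes that the library is installed and that credentials are configured. The helper names are hypothetical, and the default dataset name is taken from the sample query earlier on this page, so check the name in your project before running it.

```python
def dataset_ref(project_id, dataset="GCPQuickStart"):
    """Return the fully qualified dataset ID in project.dataset form."""
    return f"{project_id}.{dataset}"


def delete_quickstart_dataset(project_id, dataset="GCPQuickStart"):
    """Delete the dataset and every table in it. This cannot be undone."""
    # Imported lazily so that dataset_ref works without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    client.delete_dataset(
        dataset_ref(project_id, dataset),
        delete_contents=True,  # also remove the tables the pipeline created
        not_found_ok=True,     # do not fail if the dataset is already gone
    )
```

This covers only the dataset. Delete the Cloud Data Fusion instance and, optionally, the project as described in the steps above.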

What's next