Creating a reusable pipeline

This tutorial shows how to build a reusable pipeline that reads data from Cloud Storage, performs data quality checks, and writes to Cloud Storage.

Objectives

  • Use the Argument Setter plugin to allow the pipeline to read different input in every run.
  • Use the Argument Setter plugin to allow the pipeline to perform different quality checks in every run.
  • Write the output data of each run to Cloud Storage.

Costs

This tutorial uses billable components of Google Cloud, including:

  • Cloud Data Fusion
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Data Fusion, Cloud Storage, and Cloud Dataproc APIs.

    Enable the APIs

  5. Create a Cloud Data Fusion instance.

When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI.

  • In the Cloud Console, you can create a Google Cloud project, create and delete Cloud Data Fusion instances, and view Cloud Data Fusion instance details.

  • In the Cloud Data Fusion UI, you can use pages such as Pipeline Studio or Wrangler to access Cloud Data Fusion features.

To navigate the Cloud Data Fusion UI, follow these steps:

  1. In the Cloud Console, open the Instances page.

    Open the Instances page

  2. In the Actions column for the instance, click the View Instance link.
  3. In the Cloud Data Fusion UI, use the left navigation panel to navigate to the page you need.

Deploying the Argument Setter plugin

  1. From any page in the Cloud Data Fusion UI, in the top menu bar, click Hub.
  2. Click the Argument Setter action plugin and click Deploy.
  3. In the Deploy window that opens, click Finish.
  4. Click Create a pipeline. The Pipeline Studio page opens.

Reading from Cloud Storage

  1. In the left panel of the Pipeline Studio page, using the Source drop-down menu, select Google Cloud Storage.
  2. Hover over the Cloud Storage source card and click the Properties button that appears.
  3. In the Reference name field, enter a name.
  4. In the Path field, enter ${input.path}. This macro lets each pipeline run read from a different Cloud Storage input path (see the sketch after this list).
  5. In the right Output Schema panel, remove the offset field from the output schema by clicking the trash icon in the offset field row.
  6. Click the X button to exit the Properties dialog box.
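
At run time, Cloud Data Fusion substitutes the value of a runtime argument named input.path into the ${input.path} macro. The following Python sketch is illustrative only, not Cloud Data Fusion code: it mimics that substitution, using the value from the arguments file shown later in this tutorial.

    # Illustrative sketch only (not Cloud Data Fusion code): mimics how the
    # ${input.path} macro is resolved from a runtime argument of the same
    # name. The value mirrors the arguments file used later in this tutorial.
    import re

    runtime_args = {"input.path": "gs://datafusion-example-data/user-emails.txt"}

    def resolve_macros(text):
        # Replace each ${name} placeholder with its runtime argument value.
        return re.sub(r"\$\{([^}]+)\}", lambda m: runtime_args[m.group(1)], text)

    print(resolve_macros("${input.path}"))  # gs://datafusion-example-data/user-emails.txt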

Transforming your data

  1. In the left panel of the Pipeline Studio page, using the Transform drop-down menu, select Wrangler.
  2. In the Pipeline Studio canvas, drag an arrow from the Cloud Storage card to the Wrangler card.
  3. Hover over the Wrangler card and click the Properties button that appears.
  4. In the Input field name field, enter body.
  5. In the Recipe field, enter ${directives}. This macro lets each pipeline run apply different transform logic (see the sketch after this list).
  6. Click the X button to exit the Properties dialog box.
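
A Wrangler recipe is a sequence of directives, one per line. Because the recipe is delivered later as a single JSON string value, a multi-directive recipe is encoded with \n separators. A hedged Python sketch of that encoding follows; the lowercase body directive is illustrative only, and this tutorial's arguments file uses just the send-to-error directive.

    # Hedged sketch: a Wrangler recipe is newline-separated directives,
    # delivered as one JSON string value. "lowercase body" is illustrative
    # only; this tutorial's arguments file uses just the second directive.
    import json

    recipe = "\n".join([
        "lowercase body",
        "send-to-error !dq:isEmail(body)",
    ])
    print(json.dumps({"name": "directives", "value": recipe}))
    # {"name": "directives", "value": "lowercase body\nsend-to-error !dq:isEmail(body)"}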

Writing to Cloud Storage

  1. In the left panel of the Pipeline Studio page, using the Sink drop-down menu, select Cloud Storage.
  2. In the Pipeline Studio canvas, drag an arrow from the Wrangler card to the Cloud Storage card you just added.
  3. Hover over the Cloud Storage sink card and click the Properties button that appears.
  4. In the Reference name field, enter a name.
  5. In the Path field, enter the path of a Cloud Storage bucket in your project where your pipeline can write the output files. If you don't have a Cloud Storage bucket, create one (see the sketch after this list).
  6. Click the X button to exit the Properties dialog box.
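
You can create a bucket in the Cloud Console or with the gsutil tool. As an alternative, here is a minimal Python sketch using the google-cloud-storage client library; the bucket name my-pipeline-output and the location are placeholders, and bucket names must be globally unique.

    # Minimal sketch: create an output bucket with the google-cloud-storage
    # client library. "my-pipeline-output" and the location are placeholders;
    # bucket names must be globally unique.
    from google.cloud import storage

    client = storage.Client()  # uses your Application Default Credentials
    bucket = client.create_bucket("my-pipeline-output", location="us-central1")
    print(f"Created gs://{bucket.name}")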

Setting the macro arguments

  1. In the left panel of the Pipeline Studio page, using the Conditions and Actions drop-down menu, select the Argument Setter plugin.
  2. In the Pipeline Studio canvas, drag an arrow from the Argument Setter card to the Cloud Storage source card.
  3. Hover over the Argument Setter card and click the Properties button that appears.
  4. In the URL field, enter the following:

    https://storage.googleapis.com/datafusion-example-data/args.json
    

    The URL corresponds to a publicly accessible object in Cloud Storage that contains the following content:

    {
      "arguments" : [
        {
          "name": "input.path",
          "value": "gs://datafusion-example-data/user-emails.txt"
        },
        {
          "name": "directives",
          "value": "send-to-error !dq:isEmail(body)"
        }
      ]
    }
    

    The first of the two arguments is the value for input.path. The path gs://datafusion-example-data/user-emails.txt is a publicly accessible object in Cloud Storage that contains the following test data:

    alice@example.com
    bob@example.com
    craig@invalid@example.com
    

    The second argument is the value for directives. The value send-to-error !dq:isEmail(body) configures Wrangler to filter out any line that is not a valid email address by sending it to the error stream. For example, craig@invalid@example.com is filtered out. To reuse the pipeline with your own data and quality checks, you can host a similar arguments file (see the sketch after this list).

  5. Click the X button to exit the Properties dialog box.
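
The arguments file at the URL above is hosted for you. To reuse the pipeline, you can publish a similar file and point the Argument Setter's URL field at it. The following is a hedged sketch using the google-cloud-storage Python client library; the bucket and object names are placeholders, and the object must be readable over HTTP by the pipeline (for example, through public read access).

    # Hypothetical sketch: publish your own arguments file so each pipeline
    # run reads different input and applies different directives.
    # "my-pipeline-output", "next-batch.txt", and "args.json" are placeholders.
    import json
    from google.cloud import storage

    args = {
        "arguments": [
            {"name": "input.path", "value": "gs://my-pipeline-output/next-batch.txt"},
            {"name": "directives", "value": "send-to-error !dq:isEmail(body)"},
        ]
    }

    client = storage.Client()
    blob = client.bucket("my-pipeline-output").blob("args.json")
    blob.upload_from_string(json.dumps(args), content_type="application/json")
    # The Argument Setter fetches this file over HTTP, for example at
    # https://storage.googleapis.com/<bucket>/<object>, so the object must
    # be reachable from the pipeline.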

Deploying and running your pipeline

  1. In the top bar of the Pipeline Studio page, click Name your pipeline. Enter a name for your pipeline and click Save.
  2. Click Deploy.
  3. Click the drop-down menu on the Run button. Check the boxes for the input.path and directives arguments. This notifies Cloud Data Fusion that the pipeline will set values for these required arguments at runtime. Click Run.
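
After the run succeeds, the sink writes the valid email addresses to the output path you configured. As a quick check, the following sketch lists the objects the pipeline wrote; my-pipeline-output and the output/ prefix are placeholders that should match your sink's Path.

    # Sketch: list the output files the pipeline wrote. "my-pipeline-output"
    # and the "output/" prefix are placeholders for your sink's Path.
    from google.cloud import storage

    client = storage.Client()
    for blob in client.list_blobs("my-pipeline-output", prefix="output/"):
        print(blob.name)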

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the resources you created so that they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the Cloud Data Fusion instance

Follow the instructions to delete your Cloud Data Fusion instance.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • Cloud Data Fusion documentation