This tutorial shows how to build a reusable pipeline that reads data from Cloud Storage, performs data quality checks, and writes to Cloud Storage.
Objectives
- Use the Argument Setter plugin to allow the pipeline to read different input in every run.
- Use the Argument Setter plugin to allow the pipeline to perform different quality checks in every run.
- Write the output data of each run to Cloud Storage.
Costs
This tutorial uses billable components of Google Cloud, including:
- Cloud Data Fusion
- Cloud Storage
Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
- Enable the Cloud Data Fusion, Cloud Storage, and Cloud Dataproc APIs.
- Create a Cloud Data Fusion instance.
Navigating the Cloud Data Fusion UI
When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI.
In the Cloud Console, you can create a Google Cloud project, create and delete Cloud Data Fusion instances, and view Cloud Data Fusion instance details.
In the Cloud Data Fusion UI, you work with pages such as Pipeline Studio and Wrangler to use Cloud Data Fusion features.
To navigate the Cloud Data Fusion UI, follow these steps:
- In the Cloud Console, open the Instances page.
- In the Actions column for the instance, click the View Instance link.
- In the Cloud Data Fusion UI, use the left navigation panel to navigate to the page you need.
Deploying the Argument Setter plugin
- From any page in the Cloud Data Fusion UI, in the top menu bar, click Hub.
- Click the Argument Setter action plugin and click Deploy.
- In the Deploy window that opens, click Finish.
- Click Create a pipeline. The Pipeline Studio page opens.
Reading from Cloud Storage
- In the left panel of the Pipeline Studio page, using the Source drop-down menu, select Google Cloud Storage.
- Hover over the Cloud Storage source card and click the Properties button that appears.
- In the Reference name field, enter a name.
- In the Path field, enter ${input.path}. This macro controls what the Cloud Storage input path will be in the different pipeline runs. (The sketch after this list shows how to preview the sample file that this tutorial later supplies for this path.)
- In the right Output Schema panel, remove the offset field from the output schema by clicking the trash icon in the offset field row.
- Click the X button to exit the Properties dialog box.
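The sample input file referenced later in this tutorial (gs://datafusion-example-data/user-emails.txt) is publicly readable, so you can preview it before running the pipeline. The following Python sketch is one way to do that; it assumes the google-cloud-storage client library is installed and uses an anonymous client because the object is public.

```python
# Sketch only: preview the public sample file that this tutorial later
# supplies as the runtime value of input.path.
from google.cloud import storage

# Anonymous client: the example bucket and object are publicly readable.
client = storage.Client.create_anonymous_client()
blob = client.bucket("datafusion-example-data").blob("user-emails.txt")

print(blob.download_as_text())
```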
Transforming your data
- In the left panel of the Pipeline Studio page, using the Transform drop-down menu, select Wrangler.
- In the Pipeline Studio canvas, drag an arrow from the Cloud Storage card to the Wrangler card.
- Hover over the Wrangler card and click the Properties button that appears.
- In the Input field name field, enter body.
- In the Recipe field, enter ${directives}. This macro controls what the transform logic will be in the different pipeline runs. (The sketch after this list illustrates the kind of check the recipe used in this tutorial performs.)
- Click the X button to exit the Properties dialog box.
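Later in this tutorial, the runtime value supplied for ${directives} is a Wrangler recipe that sends rows that are not valid email addresses to the error collector. Wrangler evaluates that recipe inside the pipeline itself; purely as an illustration of the kind of row-level check it performs, the following Python sketch applies a simplified email test (an approximation, not Wrangler's actual isEmail logic) to the tutorial's sample lines.

```python
# Illustration only: approximates the effect of the Wrangler directive
# `send-to-error !dq:isEmail(body)` using a simplified regex. This is not
# the plugin's implementation.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

lines = [
    "alice@example.com",
    "bob@example.com",
    "craig@invalid@example.com",  # invalid: contains two @ signs
]

valid = [line for line in lines if EMAIL_RE.match(line)]
errors = [line for line in lines if not EMAIL_RE.match(line)]

print("kept:", valid)     # ['alice@example.com', 'bob@example.com']
print("errors:", errors)  # ['craig@invalid@example.com']
```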
Writing to Cloud Storage
- In the left panel of the Pipeline Studio page, using the Sink drop-down menu, select Cloud Storage.
- In the Pipeline Studio canvas, drag an arrow from the Wrangler card to the
Cloud Storage card you just added.
- Hover over the Cloud Storage sink card and click the Properties button that appears.
- In the Reference name field, enter a name.
- In the Path field, enter the path of a Cloud Storage bucket in your project where your pipeline can write the output files. If you don't have a Cloud Storage bucket, create one (one programmatic option is sketched after this list).
- Click the X button to exit the Properties menu.
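If you don't already have an output bucket, you can create one in the Cloud Console or programmatically. The following Python sketch is one option; it assumes the google-cloud-storage library is installed, that your default credentials and project are configured, and that my-output-bucket is a placeholder you replace with a globally unique name.

```python
# Sketch only: create a bucket for the pipeline's output files.
from google.cloud import storage

client = storage.Client()  # uses your default project and credentials
bucket = client.create_bucket("my-output-bucket")  # placeholder; must be globally unique

print(f"Created gs://{bucket.name}")
```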
Setting the macro arguments
- In the left panel of the Pipeline Studio page, using the Conditions and Actions drop-down menu, select the Argument Setter plugin.
- In the Pipeline Studio canvas, drag an arrow from the Argument Setter card to
the Cloud Storage source card.
- Hover over the Argument Setter card and click the Properties button that appears.
- In the URL field, enter the following.
https://storage.googleapis.com/datafusion-example-data/args.json
The URL corresponds to a publicly accessible object in Cloud Storage that contains the following content:
{ "arguments" : [ { "name": "input.path", "value": "gs://datafusion-example-data/user-emails.txt" }, { "name": "directives", "value": "send-to-error !dq:isEmail(body)" } ] }
The first of the two arguments is the value for input.path. The path gs://datafusion-example-data/user-emails.txt is a publicly accessible object in Cloud Storage that contains the following test data:

alice@example.com
bob@example.com
craig@invalid@example.com

The second argument is the value for directives. The value send-to-error !dq:isEmail(body) configures Wrangler to filter out any lines that are not a valid email address. For example, craig@invalid@example.com is filtered out. (The sketch after this list shows how you could host your own arguments file to vary these values per run.)
- Click the X button to exit the Properties menu.
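The example above uses an arguments file that Google hosts. To reuse the pipeline with your own input data and quality checks, one option is to host a similar JSON file in a bucket you control and point the Argument Setter's URL field at it. The Python sketch below shows one way to upload such a file; the bucket name and argument values are placeholders, and the object must be readable from wherever the pipeline runs (the tutorial's example file is public).

```python
# Sketch only: publish your own arguments file for the Argument Setter plugin.
# Bucket name and argument values are placeholders; adapt them to your data.
import json
from google.cloud import storage

arguments = {
    "arguments": [
        {"name": "input.path", "value": "gs://my-input-bucket/my-records.txt"},
        {"name": "directives", "value": "send-to-error !dq:isEmail(body)"},
    ]
}

client = storage.Client()
blob = client.bucket("my-args-bucket").blob("args.json")
blob.upload_from_string(json.dumps(arguments), content_type="application/json")

print(f"Uploaded gs://my-args-bucket/{blob.name}")
```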
Deploying and running your pipeline
- In the top bar of the Pipeline Studio page, click Name your pipeline. Enter a name for your pipeline, and then click Save.
- Click Deploy.
- Click the drop-down menu on the Run button. Check the boxes for the input.path and directives arguments. This notifies Cloud Data Fusion that the pipeline will set values for these required arguments during runtime.
- Click Run. (To start runs programmatically instead, see the sketch after this list.)
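Clicking Run starts the pipeline from the UI. If you later want to trigger runs programmatically, a Cloud Data Fusion instance also exposes the CDAP REST API. The sketch below is an assumption-laden outline rather than an official client: the instance API endpoint and pipeline name are placeholders, it assumes the default namespace and the DataPipelineWorkflow program name used by batch pipelines, and it authenticates with the google-auth library. Because the Argument Setter plugin supplies input.path and directives during the run, the request body can be empty.

```python
# Sketch only: start the deployed pipeline through the CDAP REST API.
# CDAP_ENDPOINT and PIPELINE_NAME are placeholders; use your instance's
# API endpoint (shown in the instance details) and your pipeline's name.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = "https://YOUR_INSTANCE_API_ENDPOINT/api"  # placeholder
PIPELINE_NAME = "my-reusable-pipeline"                    # placeholder

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# Batch pipelines run as the DataPipelineWorkflow program in the default namespace.
url = (
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE_NAME}"
    "/workflows/DataPipelineWorkflow/start"
)

# Empty body: the Argument Setter plugin provides input.path and directives at runtime.
response = session.post(url, json={})
response.raise_for_status()
print("Pipeline start requested:", response.status_code)
```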
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, clean up the resources you created so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
Deleting the Cloud Data Fusion instance
Follow the instructions to delete your Cloud Data Fusion instance.
Deleting the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project you want to delete and click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Read the how-to guides
- Work through another tutorial