This page shows you how to get started using the Dataprep Web application.

Before you begin

Set up a project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Dataflow, BigQuery, and Cloud Storage APIs.

    Enable the APIs

Create a Cloud Storage bucket in your project

  1. In the Cloud Console, go to the Cloud Storage Browser page.

    Go to the Cloud Storage Browser page

  2. Click Create bucket.
  3. In the Create bucket dialog, specify a globally unique bucket name, and accept or adjust the remaining defaults (location, storage class, and access control).
  4. Click Create.

Set up Dataprep

Open Dataprep in the Google Cloud Console. The first time you open Dataprep, the project owner is asked to allow Google and Trifacta to access the project's data. The owner must accept the terms of service, sign in to their Google account, and choose a Cloud Storage bucket to use with Dataprep (see Enabling Dataprep).

After completing these steps, the Dataprep home page appears. You can run the tour or hide it; the tour walks through steps that parallel this quickstart.

Create a flow

Dataprep uses a flow workspace to access and manipulate datasets. From the Dataprep home page, click Create Flow, fill in a flow name and description, then click Create. Because this quickstart uses 2016 data from the United States Federal Election Commission, you may wish to name the flow "FEC-2016", with a description that refers to this data.

The flow screen opens.

Import datasets

From your flow page, click Import & Add Datasets to open the "Import Data and Add to Flow" page. Select GCS in the left panel. Under Choose a file or folder, click the edit path pencil icon, enter gs://dataprep-samples/us-fec in the GCS text box, and then click Go.

Add the cn-2016.txt dataset, naming it “Candidate Master 2016”, and add the itcont-2016.txt dataset, naming it “Campaign Contributions 2016”. After both datasets are listed in the right pane, click Import & Add to Flow to add the datasets.

Prep the Candidate file

On the FEC-2016 flow page, select the Candidate Master 2016 dataset, then click Add new Recipe.

A new recipe icon appears. Click Edit Recipe.

The grid view opens, where you can explore your data and apply transformation steps to the recipe using the Recipe pane displayed at the right side of the page.

To display the Recipe pane: if the Recipe pane is not displayed in Grid view, click the Recipe icon at the top right of the Grid view page.

column5 (Date/Time) contains year values. Select the years 2016 and 2017 in the histogram by dragging across them.

In the right pane titled "Keep rows where value is in selected range", click Add.

The following step is added to the recipe:

Keep rows where (date(2016, 1, 1) <= column5) && (column5 < date(2018, 1, 1))
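For readers who think in code, this recipe step behaves like a date-range row filter. A rough pandas sketch (pandas is not part of Dataprep; the DataFrame and its values here are invented stand-ins for the candidate data):

```python
import pandas as pd

# Toy stand-in for the candidate dataset; in Dataprep the column is "column5".
df = pd.DataFrame({
    "column5": pd.to_datetime(
        ["2015-06-01", "2016-03-15", "2017-11-02", "2018-01-01"]
    ),
})

# Keep rows where (2016-01-01 <= column5) && (column5 < 2018-01-01),
# mirroring the recipe step above. Note the upper bound is exclusive.
kept = df[(df["column5"] >= "2016-01-01") & (df["column5"] < "2018-01-01")]
print(len(kept))  # → 2 (only the 2016 and 2017 rows survive)
```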

Hover over and click the mismatched (red) portion of the column6 (State) header to select the mismatched rows.

Scroll down toward the end of the column. Notice that the red highlighted (mismatched) items have the value "US" in column6 and "P" in column7; these are presidential candidates. The mismatch occurs because column6 is marked as a "State" column (indicated by the flag icon) but also includes non-state values (such as "US").


To correct the mismatch, mark column6 as a "String" column instead: click Cancel in the right pane to cancel the transformation, then click the flag icon above column6 and select "String".

The mismatch is removed and the column marker is now green. Now, let’s filter on just the presidential candidates. In the histogram for column7, click the “P” bin.

Click Add in the right pane suggestion, "Keep rows where column7 is 'P'".

Prep and join the Contributions file

Click FEC-2016 in the upper left of the Candidate Master 2016 grid view to return to the main flow page. Select Campaign Contributions 2016, select Add new Recipe, then click Edit Recipe to open the grid view for the contributions dataset.

In the grid view, add a new step to the recipe that removes extra delimiters from the contributions dataset. To do this, insert the following Wrangle language command in the Search box.

replacepatterns col: * with: '' on: `{start}"|"{end}` global: true

The Transformation Builder parses the Wrangle command and populates the Find and Replace transformation fields. Click Add to add the transform to the recipe.
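In the Wrangle pattern `{start}"|"{end}`, the `{start}` and `{end}` tokens anchor the match to the beginning or end of a value, so the step removes a double quote only when it wraps the value. A plain-regex sketch of the same idea (the function name is hypothetical, for illustration only):

```python
import re

def strip_quote_delims(value: str) -> str:
    # `{start}"|"{end}` in Wrangle is roughly a double quote anchored at the
    # start (^") or end ("$) of the value; a global replace removes both.
    # Quotes in the interior of the value are left untouched.
    return re.sub(r'^"|"$', "", value)

print(strip_quote_delims('"SMITH, JOHN"'))  # → SMITH, JOHN
```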

Add another new step to the recipe, then type "Join" in the Search box.

Click the "Join datasets" link to open the Joins page. Select Candidate Master 2016-2 (the Candidate Master file with the Recipe steps added above), then click Preview.

Select the Join Keys tab. Dataprep infers common keys and suggests several candidate join keys. Edit the Join Keys field and select column2 = column11 as the join key.

Now select the checkbox immediately under the "Columns" label to add all columns of both datasets to the joined dataset, then click Add to Recipe to return to the grid view.
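In dataframe terms, the join above behaves like an inner merge on the candidate ID columns. A hedged pandas sketch with made-up miniature data (the real key is the FEC candidate ID: column2 in the candidate file, column11 in the contributions file):

```python
import pandas as pd

# Miniature stand-ins for the two datasets; values are invented.
candidates = pd.DataFrame({
    "column2": ["P00000001", "P00000002"],    # candidate ID
    "column3": ["CANDIDATE A", "CANDIDATE B"],
})
contributions = pd.DataFrame({
    "column11": ["P00000001", "P00000001", "P00000002"],  # candidate ID
    "column16": [100.0, 250.0, 75.0],                     # contribution amount
})

# Inner join on column2 = column11, keeping all columns of both datasets,
# as in the Join step above.
joined = candidates.merge(contributions, left_on="column2", right_on="column11")
print(len(joined))  # → 3 (one row per matching contribution)
```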

Add the following step to the recipe to generate a useful summary: it sums, averages, and counts the contributions in column16, grouped by candidate ID, name, and party affiliation (column2, column24, and column8, respectively).

pivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column24,column8

An initial sample of the joined and aggregated data is displayed, representing a summary table of US presidential candidates and their 2016 campaign contribution metrics.
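If it helps to see the aggregation outside the Dataprep UI, here is an approximate pandas rendering of the pivot step (column names follow the recipe; the sample values are invented):

```python
import pandas as pd

# Hypothetical joined data: contribution amounts (column16) plus the
# candidate ID, name, and party columns (column2, column24, column8).
joined = pd.DataFrame({
    "column2":  ["P1", "P1", "P2"],
    "column24": ["CANDIDATE A", "CANDIDATE A", "CANDIDATE B"],
    "column8":  ["DEM", "DEM", "REP"],
    "column16": [100.0, -50.0, 75.0],   # refunds can make amounts negative
})

# sum(column16), average(column16), countif(column16 > 0),
# grouped on column2, column24, column8, as in the pivot step above.
summary = (
    joined.groupby(["column2", "column24", "column8"])["column16"]
    .agg(total="sum", average="mean", positive_count=lambda s: (s > 0).sum())
    .reset_index()
)
print(summary)
```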

To view more data, click the "Initial Sample" link at the top-left of the page to open the Samples panel. Select a random, quick sample, then click Collect.

After the job completes, click Load Sample in the Samples panel to view the data.
