Dataset Objects

Introduction

This page provides an overview of Cloud Dataprep datasets and dataset objects.

Datasets

The fundamental area of work within Cloud Dataprep is the dataset. There are two types of datasets:

Imported (Editable? No. Executable? No.)

An imported dataset is a reference to a source of data. This source can be a file, multiple files, database table, or other type of data.

NOTE: An imported dataset is a pointer to an external source of data. It cannot be modified within Cloud Dataprep.

Wrangled (Editable? Yes. Executable? Yes.)

A wrangled dataset is an editable object for which you build your recipes to transform the source data. It contains:

  • A reference to another dataset (imported or wrangled)
  • A recipe of sequential steps that transform your data into the desired output
  • Any number of recipe executions that result in generated results on success or screen information on failure

For detailed information on dataset types, see Imported vs Wrangled Dataset.
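
The difference between the two dataset types can be sketched as a small data model. This is an illustrative Python sketch only, not the Cloud Dataprep API: the class names `ImportedDataset` and `WrangledDataset`, their fields, the helper `is_editable`, and the sample file name are all hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class ImportedDataset:
    """Hypothetical model: a read-only pointer to an external source."""
    name: str
    source: str  # for example a file path or table name (illustrative)


@dataclass
class WrangledDataset:
    """Hypothetical model: an editable object built on another dataset."""
    name: str
    parent: Union[ImportedDataset, "WrangledDataset"]
    recipe: List[str] = field(default_factory=list)      # ordered steps
    executions: List[str] = field(default_factory=list)  # job outcomes


def is_editable(dataset) -> bool:
    # In this model only wrangled datasets can be edited and executed.
    return isinstance(dataset, WrangledDataset)


source = ImportedDataset("I-Dataset 1", "sales.csv")
# A wrangled dataset references another dataset, imported or wrangled.
first = WrangledDataset("W-Dataset 1", parent=source)
second = WrangledDataset("W-Dataset 2", parent=first)
first.recipe.append("drop empty rows")
```

Marking `ImportedDataset` as frozen mirrors the rule that an imported dataset cannot be modified, while `WrangledDataset` stays mutable so that recipe steps can be added.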

The following diagram illustrates the component objects of a dataset and how they are created during dataset development in the application:

[Figure: Dataset Objects]

Data that is imported to the platform is referenced in an imported dataset. This source is simply a reference to the original data; it is not modified or stored within the platform.

  • An imported dataset can be used in multiple wrangled datasets.
  • Imported datasets are created through the Import Dataset Page.
  • For more information on the process, see Import Basics.

Create a wrangled dataset

To begin wrangling data, you must create a wrangled dataset from your imported dataset. You can do this as part of the import process or later, whenever an imported or wrangled dataset is added to a flow.

  • To create a wrangled dataset, you must add it to an existing flow or create a new flow. A flow is a container for holding imported and wrangled datasets.
  • For more information, see Flows below.

Open in Transformer page

When the wrangled dataset is first opened in the Transformer page, the following dataset-related objects become available:

  • A recipe identifies the sequential set of steps that you define to cleanse and transform your data.

    • When the recipe is created, it may contain a set of steps that perform initial parsing of the data into rows and columns. These steps may vary depending on the type of source data. See Initial Parsing Steps.
    • Recipes are interpreted by Cloud Dataprep and turned into commands that can be executed against the wrangled dataset.
    • Recipes are created using the various visual tools in the Transformer page. The Transformer page provides multiple interfaces for quickly selecting and building recipe steps. Your selections are converted into steps written in Wrangle (a domain-specific language for data transformation). For details on the syntax of this language, see Wrangle Language.
    • For more information on the process, see Transform Basics.
  • Within the Transformer page, you build the steps of your recipe against a sample of the dataset.

    • A sample is typically a subset of the entire dataset. For smaller datasets, the sample may be the entire dataset.
    • As you build or modify your recipe, the results of each modification are immediately reflected in the sampled data. So, you can rapidly iterate on the steps of your recipe within the same interface.
    • If you change the number of rows or columns of your dataset, you can generate additional samples, which may offer different perspectives on the data.
    • For more information, see Transform Basics.
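
The relationship between a recipe and a sample can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Wrangle and not how Cloud Dataprep is implemented: the function names (`take_sample`, `apply_recipe`, `drop_blank_names`, `uppercase_names`) and the data are hypothetical, and taking the first rows as the sample is an assumption of this sketch, since this page does not describe how samples are drawn.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def take_sample(rows: List[Row], limit: int) -> List[Row]:
    # The sample is a subset; for small datasets it is the whole dataset.
    return rows[:limit]


def apply_recipe(rows: List[Row], recipe: List[Step]) -> List[Row]:
    # Steps run in order; each step receives the previous step's output.
    for step in recipe:
        rows = step(rows)
    return rows


def drop_blank_names(rows: List[Row]) -> List[Row]:
    # Hypothetical cleansing step: remove rows whose name is blank.
    return [r for r in rows if r["name"].strip()]


def uppercase_names(rows: List[Row]) -> List[Row]:
    # Hypothetical transform step: return copies with upper-cased names.
    return [{**r, "name": r["name"].upper()} for r in rows]


data = [{"name": "ada"}, {"name": " "}, {"name": "grace"}, {"name": "alan"}]
recipe = [drop_blank_names, uppercase_names]

# Build and preview the recipe against a sample of the data.
sample = take_sample(data, limit=3)
preview = apply_recipe(sample, recipe)

# The same recipe can later be applied to the full dataset.
full_result = apply_recipe(data, recipe)
```

Because the recipe is just an ordered list of steps, editing the list and re-running it on the small sample gives quick feedback before the recipe is applied to all rows.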

Flows

A flow is a container for holding one or more datasets and their associated objects. A wrangled dataset must be contained in a flow.

The following diagram illustrates the flexibility of object relationships within a flow. In this example, the first four datasets feed into W-Dataset 3, from which the final outputs are generated.

[Figure: Flow Example]

Dataset       Description
W-Dataset 1   Results of the job are used to create a new imported dataset (I-Dataset 2).
W-Dataset 2   I-Dataset 2 is added directly to W-Dataset 2 (wrangled dataset 2) through the Transformer Page. See Transformer Page.
W-Dataset 3   W-Dataset 2 is included in W-Dataset 3. This step could be a join, union, or similar data blending statement in Recipe 3.
W-Dataset 4   Although stored in the same flow, W-Dataset 4 is independent of the other datasets.
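
The relationships in the flow example can be sketched as a dependency graph. This is an illustrative Python sketch, not a Cloud Dataprep feature: the dictionary encoding and the helper `upstream` are hypothetical, and only the relationships stated in the table are modeled (the source of W-Dataset 1 is left out because the table does not name it). `graphlib.TopologicalSorter` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the datasets in its set (hypothetical encoding).
flow = {
    "I-Dataset 2": {"W-Dataset 1"},  # created from W-Dataset 1 job results
    "W-Dataset 2": {"I-Dataset 2"},
    "W-Dataset 3": {"W-Dataset 2"},  # e.g. a join or union in Recipe 3
    "W-Dataset 4": set(),            # same flow, but independent
}


def upstream(name: str, graph: dict) -> set:
    """Return every dataset that `name` depends on, directly or not."""
    seen = set()
    stack = list(graph.get(name, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


# One valid order in which the datasets could be produced.
order = list(TopologicalSorter(flow).static_order())

print(sorted(upstream("W-Dataset 3", flow)))
print(order)
```

In this model, `upstream("W-Dataset 4", flow)` is empty, which reflects that a flow is only a container: datasets stored in the same flow do not have to depend on each other.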