There are two kinds of Google Cloud Dataprep datasets: imported datasets and wrangled datasets.
An imported dataset is a reference in the platform to a set of one or more assets that constitute source data. This object is simply a pointer to these assets.
- These assets can include one or more files, a database table, or another data storage object.
NOTE: When data is transformed through the Cloud Dataprep application or job execution, all work is done on in-memory versions of the imported dataset, the source of which is untouched. Cloud Dataprep does not modify source data.
A wrangled dataset is a dataset built on top of another dataset. It contains the following:
- A reference to the other dataset, which can be:
  - an imported dataset
  - another wrangled dataset
- A recipe that has been created in the Cloud Dataprep application. This recipe is applied to the data in the dataset when:
  - the sample of the dataset is displayed in the Transformer page
  - a job is executed across the entire dataset
NOTE: Most work done in the Cloud Dataprep application is to build and modify the recipe of the wrangled dataset. You can modify some metadata for each type of dataset (such as its name).
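The relationship between the two dataset types can be sketched in a few lines of Python. This is an illustrative model only, not the Cloud Dataprep implementation or API: the class names `ImportedDataset` and `WrangledDataset`, the `load` method, and the representation of a recipe as a list of functions are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class ImportedDataset:
    """Hypothetical model: a named pointer to source data, nothing more."""
    name: str
    source: List[Row]  # stands in for files or a database table

    def load(self) -> List[Row]:
        # Return copies so that recipe steps never touch the source rows.
        return [dict(row) for row in self.source]


@dataclass
class WrangledDataset:
    """Hypothetical model: a reference to another dataset plus a recipe."""
    name: str
    parent: Union[ImportedDataset, "WrangledDataset"]
    recipe: List[Step] = field(default_factory=list)

    def load(self) -> List[Row]:
        # Load the parent's output, then apply each recipe step in order.
        rows = self.parent.load()
        for step in self.recipe:
            rows = step(rows)
        return rows


def trim(rows: List[Row]) -> List[Row]:
    return [{**r, "city": r["city"].strip()} for r in rows]


def upper(rows: List[Row]) -> List[Row]:
    return [{**r, "city": r["city"].upper()} for r in rows]


source = [{"city": " paris "}, {"city": "Rome"}]
imported = ImportedDataset("cities.csv", source)
wrangled = WrangledDataset("cities", imported, [trim, upper])
result = wrangled.load()
```

In this sketch, `result` holds the transformed rows, while `source` is left exactly as it was, which mirrors the statement that source data is not modified.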
When Data Is Imported
In the Import Dataset page, you can import data into the application.
- Initially, imported data is stored in the application as an imported dataset (a reference).
- To begin working with the data in the application, you must move the imported dataset into a flow. You can do this when:
  - The imported dataset is initially created.
  - You create a flow for the imported dataset in the Dataset Details page.
NOTE: If you create a flow for an imported dataset that is already in use in another flow, the imported dataset remains in use in the original flow, since it is simply a reference to source data that is managed outside of the platform.
- When an imported dataset is moved into a flow, a corresponding wrangled dataset is created.
- When you select the wrangled dataset, you can begin building your recipe in the Transformer page.
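The import steps above can be modeled roughly as follows. This is a sketch under stated assumptions, not the product's API: the `Flow` class, its `add` method, and the recipe stored as a list of step descriptions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImportedDataset:
    name: str  # hypothetical: a reference to source data


@dataclass
class WrangledDataset:
    name: str
    source: ImportedDataset
    recipe: List[str] = field(default_factory=list)  # starts empty


@dataclass
class Flow:
    name: str
    datasets: List[WrangledDataset] = field(default_factory=list)

    def add(self, imported: ImportedDataset) -> WrangledDataset:
        # Moving an imported dataset into a flow creates a wrangled dataset
        # that points back at it; the imported dataset itself is not copied.
        wrangled = WrangledDataset(name=imported.name, source=imported)
        self.datasets.append(wrangled)
        return wrangled


orders = ImportedDataset("orders.csv")
flow_a, flow_b = Flow("Flow A"), Flow("Flow B")

w_a = flow_a.add(orders)  # first use of the imported dataset
w_b = flow_b.add(orders)  # second flow; the first is unaffected
w_b.recipe.append("delete empty rows")
```

Both wrangled datasets reference the same imported dataset, but each has its own recipe, so editing the recipe in one flow does not change the other.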
When Data Is Integrated or Swapped
The following operations can be applied to change the data in one wrangled dataset based on the data in another:
|Operation||Description||Source Dataset Types|
|Join||Join one dataset into another based on a common key between the two datasets||Wrangled|
|Union||Concatenate one or more datasets with another, aligning columns by name or by position in the dataset||Wrangled|
|Lookup||Bring in columns from a dataset typically containing reference data into another wrangled dataset based on a single column||Wrangled|
|Swap||Change the source of the data for a wrangled dataset to another dataset. NOTE: If the new source for your data is a wrangled dataset, your dataset will inherit all subsequent changes to the new source. If you later add or remove recipe steps in the new source, those changes are reflected in your dataset, which can have unexpected consequences, including breaking your recipe. If the new source is an imported dataset, your dataset is impacted by changes only if the source is replaced or updated by an asset with the same name in the same location. This type of change also impacts new sources that are wrangled datasets.||Wrangled or imported|
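The note on changing a dataset's source can be illustrated with a small model. The `Dataset` class, its `parent` attribute, and its `run` method are hypothetical and exist only for this sketch; they are not Cloud Dataprep objects. The point is that a dataset whose source is another wrangled dataset re-reads that source's recipe every time it runs.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


class Dataset:
    """Hypothetical stand-in: an optional parent dataset plus a recipe."""

    def __init__(self, rows=None, parent=None):
        self.rows = rows or []        # only used when there is no parent
        self.parent = parent          # the current source of this dataset
        self.recipe: List[Step] = []

    def run(self) -> List[Row]:
        if self.parent is not None:
            rows = self.parent.run()
        else:
            rows = [dict(r) for r in self.rows]
        for step in self.recipe:
            rows = step(rows)
        return rows


imported = Dataset(rows=[{"price": 10, "qty": 2}])
upstream = Dataset(parent=imported)   # a wrangled dataset
mine = Dataset(parent=imported)       # another wrangled dataset
mine.recipe.append(
    lambda rows: [{**r, "total": r["price"] * r["qty"]} for r in rows]
)

mine.parent = upstream                # change the source to a wrangled dataset
before = mine.run()

# A step added upstream later removes a column that `mine` depends on.
upstream.recipe.append(
    lambda rows: [{k: v for k, v in r.items() if k != "qty"} for r in rows]
)

try:
    mine.run()
    broken = False
except KeyError:
    broken = True
```

Before the upstream change, `mine` runs normally. After it, the step in `mine` that reads `qty` fails, which is the kind of broken recipe the note warns about.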
In pages that list datasets, it is important to be able to identify an imported versus a wrangled dataset. Below, you can see these two types listed in the Datasets page.
Tip: In the Datasets page, pay attention to the filter that is being applied at the top of the page. You can select a different filter to simplify the view.
|Type||Name||In Flows||Source||Last Updated|
|Wrangled||Name includes a link, which opens the dataset in the Transformer page.||Since a wrangled dataset cannot be created outside of a flow, this value is at least 1. If a wrangled dataset has been shared into another flow via transform or source swapping, then this value can increase.||Identifies the flow where it appears.||Timestamp for the last time it was opened in the Transformer page or its metadata was changed. Typically, this timestamp is more recent than the timestamp for the imported dataset.|
|Imported||Name is plain text and may include a filename extension for single-file sources.||This value can be 0 or more. If the dataset is unused, this value is 0.||Indicates the datastore where the imported dataset's source is located.||After creation, this value is updated only if the name or similar metadata for this object is changed.|