To prevent overwhelming the client or significantly impacting performance, Cloud Dataprep generates one or more samples of the data for display and manipulation in the client application. Since Cloud Dataprep supports a variety of clients and use cases, you can change the size of samples, the scope of the sample, and the method by which the sample is created. This section provides background information on how the product manages dataset sampling.
How Sampling Works
When a dataset is first loaded into the Transformer page, a background job begins to generate a sample using the first set of rows of the dataset. This sample is very quick to generate, so that you can get to work right away on your transformations.
- The default sample is called the head sample or initial sample.
- By default, each sample is 10 MB in size, or the entire dataset if it is smaller than 10 MB. The sample size can be configured; additional information is below.
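The head-sample behavior described above can be pictured as collecting rows from the top of the dataset until a byte budget is exhausted. The following is a minimal Python sketch under that assumption, not Cloud Dataprep's implementation; the `head_sample` helper is hypothetical.

```python
def head_sample(rows, max_bytes=10 * 1024 * 1024):
    """Collect rows from the start of the dataset until the byte budget
    (default 10 MB) is reached, or the whole dataset if it is smaller."""
    sample, used = [], 0
    for row in rows:
        size = len(row.encode("utf-8"))
        if sample and used + size > max_bytes:
            break
        sample.append(row)
        used += size
    return sample

# Tiny demonstration with a 30-byte budget instead of 10 MB.
rows = ["id,name", "1,alpha", "2,beta", "3,gamma", "4,delta"]
small = head_sample(rows, max_bytes=30)
```

Because the dataset above is far smaller than 10 MB, `head_sample(rows)` with the default budget returns the entire dataset, matching the documented behavior.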
Additional samples can be generated from the context panel on the right side of the Transformer page. Sample jobs are independent executions on the designated running environment. When a sample job succeeds or fails, a notification is displayed for you.
As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create (in this case, an Anomaly-based sample) and initiate the job to create the sample. This sampling job occurs in the background.
Depending on the type of sample you select, it may be generated based on one of the following methods, in increasing order of time to create:
- on a specified set of rows (first rows)
- on a quick scan across the dataset
- on a full scan of the entire dataset
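The three scopes can be sketched as follows. These are hypothetical Python helpers for intuition only; Cloud Dataprep's actual scan logic runs on the backend and is not exposed as an API.

```python
import random

rows = list(range(10_000))  # stand-in for a large dataset

def first_rows(rows, n):
    # First rows: cheapest; reads only the head of the dataset.
    return rows[:n]

def quick_scan(rows, n, window=1_000):
    # Quick scan: samples from a limited portion of the dataset.
    scanned = rows[:window]
    return random.sample(scanned, min(n, len(scanned)))

def full_scan(rows, n):
    # Full scan: reads every row; slowest but most representative.
    return random.sample(rows, min(n, len(rows)))
```

The trade-off is visible in the amount of data each helper touches: `first_rows` and `quick_scan` never look past a bounded prefix, while `full_scan` must read everything.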
You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore. See User Profile Page.
NOTE: When a flow is shared, its samples are shared with other users. However, if those users do not have access to the underlying files that back a sample, they do not have access to the sample and must create their own.
For more information on creating samples, see Samples Panel.
Special Case Sampling
NOTE: A new sampling job is executed in Cloud Dataflow, which may incur costs.
NOTE: When sampling from compressed data, the data is decompressed before the sample is created. As a result, the sample size reflects the uncompressed data.
NOTE: If your source of data is a directory containing multiple files, the initial sample for the combined dataset is generated from the first set of rows in the first filename listed in the directory.
Changes to preceding steps that alter the number of rows or columns in your dataset can invalidate the current sample, which means that the sample is no longer a valid representation of the state of the dataset in the recipe. In this case, Cloud Dataprep automatically switches you back to the most recently collected sample that is currently valid. Details are below.
After you have collected multiple samples of multiple types on your dataset, you can choose the proper sample to use for your current task, based on:
- How well each sample represents the underlying dataset. Does the current sample reflect the likely statistics and outliers of the entire dataset at scale?
- How well each sample supports your next recipe step. If you're developing steps for managing bad data or outliers, for example, you may need to choose a different sample.
Tip: You can begin work on an outdated yet still valid sample while you generate a new one based on the current recipe.
- Some advanced sampling options are available only with execution across a scan of the full dataset.
- Undo/redo do not change the sample state, even if the sample becomes invalid.
Each time a step is added to or modified in your recipe, Cloud Dataprep checks whether the current sample is still valid. Steps that you add to your recipe can cause the currently active sample to be invalidated. For example, if you change the source of data, then the sample in the Transformer page no longer applies, and a new sample must be displayed.
Tip: After you have completed a step that significantly changes the number of rows, columns, or both in your dataset, you may need to generate a new sample, factoring in any costs associated with running the job. Performance costs may be displayed in the Transformer page.
NOTE: If you modify a SQL statement for an imported dataset, any samples based on the old SQL statement are invalidated.
- When the active sample is invalidated, the Transformer page reverts to displaying the most recently collected sample that is currently valid.
You can generate a new sample of the same type through the Samples panel. If no sample is valid, you must generate a new sample before you can open the dataset.
A sample that is invalidated is listed under the Unavailable tab. It cannot be selected for use. If subsequent steps make it valid again, it re-appears in the Available tab.
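One way to picture the fallback behavior is to record, for each sample, the recipe steps it was collected against, and treat the sample as valid while those steps remain unchanged. This is a deliberately simplified model (the real validity check considers changes to row and column counts, as described above); all names here are hypothetical.

```python
def latest_valid_sample(samples, current_recipe):
    """Return the most recently collected sample whose recorded recipe
    prefix is still an unchanged prefix of the current recipe."""
    for s in sorted(samples, key=lambda s: s["collected_at"], reverse=True):
        prefix = s["recipe_prefix"]
        if current_recipe[:len(prefix)] == prefix:
            return s
    return None  # no valid sample: a new one must be generated

samples = [
    {"name": "head",   "collected_at": 1, "recipe_prefix": []},
    {"name": "random", "collected_at": 2, "recipe_prefix": ["drop_nulls"]},
]
# Editing the first step invalidates the random sample; the application
# falls back to the head sample, which depends on no recipe steps.
current = ["drop_empty_rows", "uppercase_name"]
fallback = latest_valid_sample(samples, current)
```

If later edits restore the original first step, the random sample's prefix matches again, mirroring how an invalidated sample can re-appear in the Available tab.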
Cloud Dataprep currently supports the following sampling methods.
First rows samples
By default, the application loads the first N rows of the dataset as the sample. The number of rows depends on column count, data density, and other factors. If the dataset is small enough, the Full Dataset is used.
NOTE: When the first rows sample is collected, the steps in your recipe are then applied to the sample. So, the net size of your sample available to the final step of your recipe could be significantly smaller than the original sample.
Tip: As you add steps to your recipe, the rows in the first N rows sample are automatically updated. You can use these updates as one verification tool for your recipe steps.
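The note above can be made concrete by modeling each recipe step as a function from rows to rows; filter-like steps can leave the final step with far fewer rows than the sample originally held. This is a minimal illustrative sketch, not how Cloud Dataprep represents recipes internally.

```python
def apply_recipe(sample, steps):
    # Each step maps a list of rows to a new list of rows; filtering
    # steps shrink the sample that later steps can see.
    for step in steps:
        sample = step(sample)
    return sample

sample = list(range(100))
steps = [
    lambda rows: [r for r in rows if r % 2 == 0],  # drop odd rows
    lambda rows: [r for r in rows if r < 20],      # keep small values
]
result = apply_recipe(sample, steps)
```

Here a 100-row sample shrinks to 10 rows by the final step, which is why a recipe with aggressive filters may leave you working with very little data unless you generate a new sample.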
Random samples
Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate.
You can apply quick scan or full scan to determine the scope of the sample.
Filter-based samples
Find specific values in one or more columns. For the matching set of values, a random sample is generated.
You must define your filter in the Filter textbox.
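The filter-then-sample behavior can be sketched as follows: restrict the dataset to the rows matching a predicate, then draw a random sample from the matching set. A minimal Python illustration, with hypothetical names; it does not reflect Cloud Dataprep's internal execution.

```python
import random

def filter_sample(rows, predicate, n, seed=0):
    # Keep only the rows matching the filter, then draw a random
    # sample from the matching set.
    matches = [r for r in rows if predicate(r)]
    return random.Random(seed).sample(matches, min(n, len(matches)))

rows = [{"id": i, "state": "CA" if i % 3 == 0 else "NY"} for i in range(30)]
ca_sample = filter_sample(rows, lambda r: r["state"] == "CA", 5)
```

Every row in the result satisfies the filter, which is the defining property of this sample type.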
Anomaly-based samples
Find mismatched or missing data or both in one or more columns.
You specify one or more columns and whether the anomaly is:
- mismatched values
- missing values
- either of the above
Optionally, you can define an additional filter on other columns.
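The three anomaly modes can be sketched with a validity check per value. A minimal Python illustration under assumed names (`anomaly_sample`, the mode labels, and the `is_valid` callback are all hypothetical), not Cloud Dataprep's implementation.

```python
def anomaly_sample(rows, column, mode, is_valid, limit=10):
    """mode is 'mismatched', 'missing', or 'either'."""
    def anomalous(row):
        value = row.get(column)
        missing = value is None or value == ""
        mismatched = not missing and not is_valid(value)
        return {"missing": missing,
                "mismatched": mismatched,
                "either": missing or mismatched}[mode]
    return [r for r in rows if anomalous(r)][:limit]

# 'age' should be numeric; one row is missing it, one mismatches.
rows = [{"age": "34"}, {"age": ""}, {"age": "unknown"}, {"age": "51"}]
bad = anomaly_sample(rows, "age", "either", str.isdigit)
```

Selecting mode `"missing"` would keep only the empty value, and `"mismatched"` only the non-numeric one; `"either"` keeps both.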
Stratified samples
Find all unique values within a column and create a sample that contains the unique values, up to the sample size limit. The distribution of the column values in the sample reflects the distribution of the column values in the dataset. Sampled values are sorted by frequency, relative to the specified column.
Optionally, you can apply a filter to this sample type.
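One way to picture the frequency-ordered unique values is to count occurrences per value and keep one representative row per value, most frequent first. A simplified Python sketch with hypothetical names, not Cloud Dataprep's implementation.

```python
from collections import Counter

def stratified_sample(rows, column, limit):
    # One representative row per unique value in the column, ordered
    # by how often each value occurs (most frequent first).
    counts = Counter(r[column] for r in rows)
    first_seen = {}
    for r in rows:
        first_seen.setdefault(r[column], r)
    return [first_seen[v] for v, _ in counts.most_common(limit)]

rows = [{"fruit": f} for f in ["apple", "pear", "apple", "plum", "apple", "pear"]]
strat = stratified_sample(rows, "fruit", limit=2)
```

With a limit of 2, only the two most frequent values ("apple", then "pear") make it into the sample; "plum" is cut by the size limit.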
Cluster-based samples
Cluster sampling collects contiguous rows in the dataset that correspond to a random selection from the unique values in a column. All rows corresponding to the selected unique values appear in the sample, up to the maximum sample size. This sampling is useful for time-series analysis and advanced aggregations.
Optionally, you can apply an advanced filter to the column.
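The select-values-then-keep-all-their-rows behavior can be sketched as follows. A minimal Python illustration under assumed names; it preserves original row order so that rows sharing a value stay contiguous, but it is not Cloud Dataprep's implementation.

```python
import random

def cluster_sample(rows, column, n_values, seed=0):
    # Randomly select unique values from the column, then keep every
    # row carrying a selected value, preserving original row order.
    values = sorted({r[column] for r in rows})
    chosen = set(random.Random(seed).sample(values, min(n_values, len(values))))
    return [r for r in rows if r[column] in chosen]

rows = [{"day": d, "n": i}
        for i, d in enumerate(["mon", "mon", "tue", "wed", "tue"])]
clusters = cluster_sample(rows, "day", n_values=2)
```

Because every row of a selected value is kept, a group such as all "mon" rows arrives in the sample complete, which is what makes this sample type useful for time-series work and aggregations.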