Google Cloud Dataprep surfaces visual representations of your data for individual columns and the entire dataset. These visual profiles allow you to make quick assessments of problems, unusual patterns, and required changes to your data, and they are available throughout the development of your dataset.
When your data is first loaded into the Transformer page, a sample of the data in your dataset is displayed in the data grid. Statistical information for the sample is displayed at the top of the data grid.
Before your job is run, profiling information, such as column statistics, reflects exact counts from the currently loaded sample. After the job is run, profiled results in the Job Results page may include estimates for some metrics and counts in larger datasets.
Click the links to review the columns in your dataset or their data types. From the drop-down next to the sample name, you can also create or select a different sample to use in the data grid. See Sampling Menu.
Profile Source Data
Analyze Individual Columns
The Transformer page provides multiple mechanisms for profiling your dataset, which are accessible through the data grid.
The top of each column contains two profiling elements: a data quality bar, which identifies valid, mismatched, and missing values in the column when compared against the specified data type, and a column histogram, which identifies the range of values in the column.
Data Quality Bar - Missing and Mismatched Values
Below the name of the column, the multi-colored band indicates valid (green), mismatched (red), and missing (black) values in the column when matched against the column's data type. In the above image, the data type is set to URL.
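The three categories in the data quality bar can be illustrated with a minimal sketch. This is not how Cloud Dataprep validates values internally; the `is_valid_url` rule and the `quality_counts` helper are hypothetical names introduced here, and the URL check (an http or https scheme plus a host) is an assumption made only for illustration.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Hypothetical validity rule: an http(s) scheme plus a host name."""
    try:
        parts = urlparse(value)
    except ValueError:
        # urlparse can reject some malformed inputs; treat them as mismatched.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def quality_counts(values):
    """Classify each value as valid, mismatched, or missing."""
    counts = {"valid": 0, "mismatched": 0, "missing": 0}
    for value in values:
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
        elif is_valid_url(str(value)):
            counts["valid"] += 1
        else:
            counts["mismatched"] += 1
    return counts


sample = ["https://example.com/a", "http://example.org", "not a url", "", None]
print(quality_counts(sample))
# → {'valid': 2, 'mismatched': 1, 'missing': 2}
```

The relative sizes of the three counts correspond to the widths of the green, red, and black segments of the bar.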
Each column includes a histogram of the values in the column. In the above image, there are 394 different values in the column, and you can see how some values appear more frequently than others.
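For a column of text-like values, the information behind such a histogram can be approximated as a distinct-value count plus a frequency table. The sketch below assumes a simple value-frequency histogram; the `value_histogram` helper and the sample values are hypothetical and do not reflect how the product bins data.

```python
from collections import Counter


def value_histogram(values, top=5):
    """Return the number of distinct values and the most frequent ones."""
    # Skip missing values so they do not appear as a histogram bar.
    counts = Counter(v for v in values if v not in (None, ""))
    return len(counts), counts.most_common(top)


sample = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
distinct, most_common = value_histogram(sample)
print(distinct)      # → 3
print(most_common)   # → [('a.com', 3), ('b.com', 2), ('c.com', 1)]
```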
Column Details - Statistics and Outliers
In the Column Details window, you can review key statistical information associated with the values in a column. Displayed statistics are based on the column's data type.
- To explore the details for a column's data, select Column Details from the drop-down for the specific column in the data grid.
- You can also click the Column Browser icon. Select the column of interest. Click the Microscope icon.
- Visual profiles of these columns are displayed.
For the selected column, you can review key statistics depending on the data type. The above image shows statistics that apply to the URL data type, which is a variation of the String type.
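The idea that the displayed statistics depend on the data type can be sketched as follows. The specific statistics chosen here (length-based for string-like types, min/max/mean/median for numbers) are illustrative assumptions and not the product's actual list, and `column_statistics` is a hypothetical helper.

```python
import statistics


def column_statistics(values, data_type):
    """Compute a small set of summary statistics chosen by data type."""
    present = [v for v in values if v not in (None, "")]
    stats = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    if not present:
        return stats
    if data_type == "number":
        numbers = [float(v) for v in present]
        stats.update(
            minimum=min(numbers),
            maximum=max(numbers),
            mean=statistics.mean(numbers),
            median=statistics.median(numbers),
        )
    else:
        # String-like types (including URL) get length-based statistics.
        lengths = [len(str(v)) for v in present]
        stats.update(shortest=min(lengths), longest=max(lengths))
    return stats


print(column_statistics(["1", "2", "", "4"], "number"))
print(column_statistics(["http://a.io", "http://example.com", None], "url"))
```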
Column Details - Patterns
In the Column Details window, you can also review patterns that match values in the selected column. You can explore the counts of values in the column that match each pattern and sub-pattern and then select one as the basis for building a transform to modify the matched fields.
- Cloud Dataprep patterns are simple macro-like tools for matching patterns in your data. These proprietary patterns are built for matching the types of data supported in the Cloud Dataprep platform.
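Counting values per pattern can be sketched by reducing each value to a coarse shape and tallying the shapes. The `<digits>` and `<letters>` tokens below are hypothetical placeholders, not Cloud Dataprep pattern syntax, and `to_pattern` and `pattern_counts` are names introduced only for this illustration.

```python
import re
from collections import Counter

# Match a run of digits or a run of letters in a single pass, so that
# placeholder text inserted for one run is never rewritten by another rule.
TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def to_pattern(value: str) -> str:
    """Replace digit runs and letter runs with placeholder tokens."""
    return TOKEN.sub(
        lambda m: "<digits>" if m.group()[0].isdigit() else "<letters>",
        value,
    )


def pattern_counts(values):
    """Count how many values share each coarse pattern."""
    return Counter(to_pattern(v) for v in values).most_common()


sample = ["555-1234", "555-9876", "call me", "800-0000"]
print(pattern_counts(sample))
# → [('<digits>-<digits>', 3), ('<letters> <letters>', 1)]
```

A pattern with a low count next to a dominant one is often a useful starting point for a transform that standardizes the outlying values.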
For more information, see the Column Statistics Reference.
Profile Job Results
When you are ready to execute your job, you can generate a visual profile of the entire dataset as part of the job. Profiling of results takes extra time to generate, but you can use the generated profile to simplify iteration on your recipe.
- In the Transformer page, click Run Job.
- Click the Profile Results checkbox.
- Run the job.
- When the job finishes, click View Results in the job card.
This visual profile displays statistics across the entire dataset. Because the entire dataset can be quite large, these statistics are approximations. After a visual profile has been created, you can access it again through its job card.