Cloud Dataprep by TRIFACTA® surfaces visual representations of your data for individual columns and the entire dataset. These visual profiles enable you to make quick assessments of problems, unusual patterns, and required changes to your data, and they are available for use throughout the development of your dataset.
Tip: Visual profiling is especially important in recipe development. When you identify something of interest, you can select the visual representation of it, and the platform prompts you with a set of suggested transforms to add to your recipe. Examples are provided below.
For more background information, see Overview of Visual Profiling.
When your data is first loaded into the Transformer page, a sample of the data in your dataset is displayed in the data grid. You can see a reference to the sample at the top of the data grid, which includes the method by which the sample was generated.
NOTE: Before your job is run, profiling information such as column statistics reflects exact counts for the sample that is currently loaded. After the job is run, profiled results in the Job Results page might include estimates for some metrics and counts, depending on the scale of the dataset.
Figure: Sample Toolbar
From the drop-down next to the sample name, you can also create or select a different sample to use in the data grid. See Samples Panel.
Counts on the rows, columns, and data types in the current sample are displayed at the bottom of the page in the status bar:
Figure: Status Bar
Profile Source Data
Tip: When you first load your dataset into the application, you might want to run a job to profile your dataset before you build your recipe. The generated results and profile are accessible through the application, which can be useful for seeing how your dataset has changed during development. For more information, see Profile Your Source Data.
Analyze Individual Columns
The Transformer page provides multiple mechanisms for profiling your dataset, which are accessible through the data grid.
The top of each column contains a data quality bar, which identifies the valid, mismatched, and missing values in the column when compared against the specified data type, and a column histogram, which shows the distribution of values in the column.
Figure: Example Column
Data Quality Bar - Missing and Mismatched Values
Below the name of the column, the multi-colored band indicates the valid (green), mismatched (red), and missing (black) values in the column, when matched against the column's data type. In the above image, the data type is set to URL.
Tip: Click the missing or mismatched values in a column's data quality bar. You are prompted with suggestions of transforms to fix or remove these values.
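The three categories in the data quality bar can be illustrated with a minimal Python sketch. This is not how the product validates values: the function names, the rule that empty strings count as missing, and the simplified URL check (an http or https scheme plus a host) are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_url_value(value):
    """Return 'missing', 'mismatched', or 'valid' for one cell.

    Assumption: None and empty or whitespace-only strings count as missing.
    Assumption: a value is a valid URL if it has an http(s) scheme and a host.
    """
    if value is None or str(value).strip() == "":
        return "missing"
    try:
        parsed = urlparse(str(value).strip())
    except ValueError:
        # urlparse rejects a few malformed inputs; treat them as mismatched.
        return "mismatched"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "valid"
    return "mismatched"


def quality_bar(values):
    """Count the values in a column that fall into each category."""
    counts = {"valid": 0, "mismatched": 0, "missing": 0}
    for value in values:
        counts[classify_url_value(value)] += 1
    return counts


# Hypothetical sample column typed as URL.
sample = ["http://example.com", "https://example.org/a", "not a url", "", None]
print(quality_bar(sample))
```

The relative sizes of the three counts correspond to the widths of the green, red, and black segments of the bar.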
Each column includes a histogram of the values in the column. In the above image, there are 402 different values in the column, and you can see how some values appear more frequently than others.
- In the column histogram, you can select a column value and drag to select a range of values for suggestions on transformations.
- Null values are a special case of missing values. You can use the ISNULL function to identify null values in a column, which appear among the category of missing values. See Manage Null Values.
- When you select one or more values in the column histogram, you can see the corresponding values for the row values in the histograms for other columns.
See Data Grid Panel.
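As a rough illustration of what a column histogram counts, and of how selecting a value in one histogram narrows what is shown in the others, here is a small Python sketch. The rows, column names, and helper functions are hypothetical and are not part of the product.

```python
from collections import Counter

# Hypothetical sample rows; in the application these would be the loaded sample.
rows = [
    {"state": "CA", "plan": "basic"},
    {"state": "CA", "plan": "premium"},
    {"state": "NY", "plan": "basic"},
    {"state": "CA", "plan": "basic"},
    {"state": "TX", "plan": "premium"},
]


def histogram(rows, column):
    """Count how often each distinct value appears in a column."""
    return Counter(row[column] for row in rows)


def linked_histogram(rows, selected_column, selected_values, other_column):
    """Histogram of other_column restricted to rows matching the selection."""
    matching = [row for row in rows if row[selected_column] in selected_values]
    return histogram(matching, other_column)


state_hist = histogram(rows, "state")
# Selecting "CA" in the state histogram highlights these counts in "plan".
linked = linked_histogram(rows, "state", {"CA"}, "plan")
print(state_hist, linked)
```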
Column Details - Statistics and Outliers
In the Column Details window, you can review key statistical information on the values in a column. The statistics that are displayed depend on the column's data type.
- To explore the details for a column's data, select Column Details from the drop-down for the specific column in the data grid.
- Visual profiles of these columns are displayed.
Figure: Column Details
For the selected column, you can review key statistics depending on the data type. The above image shows statistics that apply to the URL data type, which is a variation on String type.
- Transform suggestions are updated based on your selection.
- Select a value from the lists of top values, mismatched values, or other values to be prompted with suggestions for how to modify the selected rows.
- Click the missing values in the data quality bar to prompt for suggestions to address those values in the column.
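For a numeric column, the kind of summary shown in Column Details can be approximated outside the application with a short Python sketch. The sample values and the `column_stats` helper are hypothetical, and the outlier rule used here (values more than 1.5 times the interquartile range beyond the quartiles) is an assumption chosen for illustration; the source does not state which rule the product applies.

```python
import statistics


def column_stats(values):
    """Summarize a numeric column and flag outliers.

    Assumption: outliers are values beyond 1.5 * IQR from the quartiles.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical sample column.
stats = column_stats([10, 12, 11, 13, 12, 11, 95])
print(stats)
```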
Column Details - Patterns
In the Column Details window, you can also review the Cloud Dataprep patterns that match the values in the selected column. You can explore the counts of values in the column that match each pattern and sub-pattern and then select one as the basis for building a transform to modify the matched fields.
- Cloud Dataprep patterns are simple macro-like tools for matching patterns in your data. These proprietary patterns are purpose-built for matching the types of data supported in Cloud Dataprep by TRIFACTA.
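The idea of counting the values that match each pattern can be sketched in Python. The shape notation below (`D+` for a run of digits, `A+` for a run of letters) is invented for this example and is not Cloud Dataprep pattern syntax; the helper names and sample values are also hypothetical.

```python
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse class: D for digit, A for letter, else itself."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "A"
    return ch


def shape(value):
    """Collapse runs of digits or letters into 'D+' or 'A+'; keep other characters."""
    parts = []
    for cls, run in groupby(char_class(ch) for ch in value):
        count = len(list(run))
        parts.append(cls + "+" if cls in ("D", "A") else cls * count)
    return "".join(parts)


def pattern_profile(values):
    """Count how many values share each shape."""
    return Counter(shape(value) for value in values)


# Hypothetical sample column.
profile = pattern_profile(["555-1234", "555-9876", "5551234", "n/a"])
print(profile.most_common())
```

The most frequent shape suggests the dominant format in the column, and the less frequent shapes point to values that a transform might need to standardize.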
For more information, see Column Details Panel.
For more information on the statistics, see Column Statistics Reference.
Column Browser - Profiles across columns
In the column browser, you can view visual histograms for each column in the dataset and make selections to identify correlations between values in multiple columns. To open the column browser, click the Columns icon in the Transformer bar.
For more information, see Column Browser Panel.
Profile Job Results
When you execute your job, you can generate a visual profile of the entire dataset as part of the job. You can use the generated profile to simplify iteration on your recipe. Depending on the running environment, the optional profiling of the results can take extra time to generate.
- In the Transformer page, click Run Job.
- Click the Profile Results checkbox.
- Run the job.
- When the job finishes, click View Results in the job card.
Figure: Job Results
This visual profile displays statistics across the entire dataset. Because the data volume of the entire dataset can be quite large, these statistics can be approximations. After a visual profile has been created, you can access it again through its job card. See Job Results Page.