The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides some tips and techniques for validating your dataset, from start to finish for your data wrangling efforts.
Data validation can be broken down into the following categories:
- Consistency - Does your data fit into expected values for it? Do field values match the data type for the column? Are values within acceptable ranges? Are rows unique? Duplicated?
- Completeness - Are all expected values included in your data? Are some fields missing values? Are there expected values that are not present in the dataset?
Before You Begin
Before you begin building your data pipeline, you should identify your standards for data quality.
NOTE: Depending on your source system, you might be able to generate data quality reports from within it. These reports can be used as the basis for validating your work in Cloud Dataprep.
If your source system does not enable generation of these reports, you should consider profiling your dataset as soon as you load your data into Cloud Dataprep.
Verify downstream requirements
Before you begin modifying your dataset, you should review the columns and ranges of values in those columns that are expected by the downstream consumer of your dataset. A quick review can provide guidance to identify the key areas of your dataset that require end-to-end validation.
Identify important fields
For datasets with many columns, it might be problematic to apply consistent validation across all columns. In these situations, you might need to decide the columns whose consistency, completeness, and accuracy are most important.
Profile your source data
Before you get started building your recipe on your dataset, it might be a good idea to create a visual profile of your source data. This process involves creating a minimal recipe on a dataset after you have loaded into the Transformer page. Then, you run a job to generate a profile of the data, which can be used as a baseline for validating the data and as an assistant in debugging the origin of any data problems you discover.
Visual profiling also generates statistics on the values in each column in the dataset. You can use this statistical information to assess overall data quality of the raw data. This visual profile information is part of the record for the job, which remains in the system after execution.
For more information, see Profile Your Source Data.
Generate a new random sample
When a dataset is first loaded into the Transformer, the default sampling collects the first N rows of data, depending on the size and density of each row. However, your dataset might contain variations in the data that are not present in this first sample. For more information, see Samples Panel .
Cloud Dataprep provides useful features for checking that your data is consistent across its rows. With a few recipe steps, you can create custom validation checks to verify values.
In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (black) values.
When you click the red bar:
- The rows that contain mismatched values are highlighted in the data grid.
- The application provides suggestions in the form of suggestion cards for ways that you can transform your data.
Maybe you are unsure of what to do with your data. If you would like to examine all of the rows together, you can insert a transform like the following in your recipe:
derive value:ISMISMATCHED(Primary_Website_or_URL, 'Url') as:'mismatched_Primary_Website_or_URL'
You can paste Wrangle steps into the Transform Builder.
The above checks the values in the
Primary_Website_or_URL column against the
Url data type. If the value in the source column is not a valid URL, then the new column value is
Through the Column Details panel, you can review statistical information about individual columns. To open, select Column Details... from a column's drop-down menu.
In the Summary area, you can review the count of Outlier values. In Cloud Dataprep, an outlier is defined as any value that is more than 4 standard deviations from the mean for the set of column values.
The Column Details panel also contains:
- Counts of valid, unique, mismatched, and missing values.
- Breakdowns by quartile and information on maximum, minimum, and mean values.
For more information, see Column Details Panel.
Available statistics depend on the data type for the column. For more information, see Locate Outliers.
Data range checks
Standard deviation ranges
For example, your range of values does not match the application's definition of an outlier, and you need to identify values that are more than 5 standard deviations from the mean.
You can create your custom transforms to evaluate standard deviations from mean for a specific column. For more information, see Locate Outliers.
Fixed value ranges
If you need to test a column of values compared to two fixed values, you can use the following transform. This transform tests evaluates a column value. If the value in
Rating column is less than 10 or greater than 90, then the generated column value is
derive value:((Rating < 10) || (Rating > 90)) as:'outlier_Rating'
Entire rows can be tested for duplication. The
deduplicate transform allows you to remove identical rows. Note that whitespace and case differences are evaluated as different rows. For more information, see Deduplicate Data.
For an individual column, the column details panel contains an indicator of the number of unique values in the column. If this value does not match the count of values and the count of rows in the sample, then some values are duplicated. Remember that these counts apply to just the sample in the Transformer page and may not be consistent measures across the entire dataset. See Column Details Panel.
You can perform ad-hoc tests for uniqueness of individual values. For more information, see Deduplicate Data.
Permitted character checks
You can test for the presence of permitted characters in individual columns by using a regular expression test. The following transform evaluates to
true if all of the characters in a column field are alphanumeric or the space character:
derive value:MATCHES(MarketName, /^[a-zA-Z0-9 ]*$/)
You can add additional permitted characters inside the square brackets. For more information, see Text Matching.
Cloud Dataprep provides easy methods for identifying if cells are missing values or contain null values. You can also create lookups to identify if values are not represented in your dataset.
At the top of each column, the data quality bar includes a black bar indicating the number of cells in the column that do not contain values. This set of values includes missing values.
Click the black bar to prompt for a set of suggestion cards for handling those values.
For more information, see Find Missing Data.
While null values are categorized with missing values, they are not the same thing. In some cases, it might be important to distinguish the actual null values within your dataset, and several Wrangle can assist in finding them. See Manage Null Values.
You can also test if your dataset contains at least one instance of a set of values.
For example, your dataset contains businesses throughout the United States. You might want to check to see if each state is represented in your dataset.
Create a reference dataset that contains a single instance of each item you are checking. In this example, it'd be a simple CSV file with the name of each state on a separate line.
Tip: To your second dataset, you might want to add a second column containing the value
true, which allows you to keep separate validation data from the columns that you join.
- Add this CSV file as a new dataset to your flow.
- Open your source dataset. In the Transformation textbox in the recipe panel, enter
- In the Join Page:
- Select the reference dataset you just created. Click Preview Selected Dataset.
- Click Join Keys. Select the two fields that you want to use to join. In the example, you would select the two fields that identify state values.
- You should see a preview that includes just the two key columns. In the Columns panel, select the fields that you want to include in the final dataset.
- Select the type of join to perform:
- Right outer join: Select this join type if you want to delete rows in your source dataset that do not have a key value in the reference dataset. In the example, all rows that do not have a value in the State column would be removed from the generated dataset.
- Full outer join: Select this type to preserve all data, including the rows in the source that do not contain key values.
- Click Add to Recipe.
- The generated dataset includes all of the fields you specified.
- For one of your key values, click the black bar and select the link for the number of affected rows, which loads them into the data grid. Review the missing values in each key column.
To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.
The generated command should look like the following:
delete row: ISMISSING([State])
For more information, see Join Page.
Generate output profile
After you have completed your recipe, you should generate a profile with your executed job. You can open this profile and the profile you created for the source data in separate browser tabs to evaluate how consistent and complete your data remains from beginning to end of the wrangling process.
NOTE: The statistical information in the generated profile should be compared to the statistics generated from the source, so that you can identify if your changes have introduced unwanted changes to these values.
After you have performed your data validation checks, you might need to make some decisions about how to address any issues you might have encountered:
- Some problems in the data might have been generated in the source system. If you plan to use additional sources from this system, you should try to get these issues corrected in the source and, if necessary, have your source data regenerated.
- Some data quality issues can be ignored. For the sake of downstream consumers of the data, you might want to annotate your dataset with information about possible issues. Be sure to inform consumers on how to identify this information.