As part of your data cleansing steps, you might need to remove duplicate rows of data from your dataset.
Validate Duplicate Data
In some cases, it might be acceptable to have duplicated data. For example, additional records using the same primary key might be included in a dataset as amendments or detail records.
NOTE: Before you remove duplicates from your dataset, verify that the dataset is not supposed to contain duplicates at all. If the data structure permits some duplicate elements, including key values, exercise care in how you identify what constitutes duplicate information.
Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset:
Tip: To determine whether duplicate rows exist, check the row count in your dataset before and after you apply this transform.
- This transform is case-sensitive. So, if a column contains the values Hello and HELLO, the rows containing those values are not considered duplicates and are not removed by this transform.
- Whitespace at the beginning and end of values is not ignored.
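To make the exact-match behavior above concrete, here is a sketch in plain Python (not Wrangle) of how this style of deduplication treats case and whitespace differences. The sample rows are hypothetical:

```python
# Sketch of exact-match deduplication: a row is removed only if it is
# byte-for-byte identical to a row already seen, so case and whitespace
# differences prevent rows from being treated as duplicates.
rows = [
    ("Hello", "widget"),
    ("HELLO", "widget"),   # different case: not a duplicate
    ("Hello ", "widget"),  # trailing space: not a duplicate
    ("Hello", "widget"),   # exact match: removed
]

seen = set()
deduplicated = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduplicated.append(row)

print(len(rows), len(deduplicated))  # 4 rows before, 3 after
```

Only the final row, an exact repeat of the first, is dropped; the case and whitespace variants survive, which is why the normalization steps below matter.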
Before applying the deduplicate transform, you should attempt to normalize your data. You can use the following techniques to normalize a few columns of data.
NOTE: If you have more than 20 columns of data, you might be better served by trying to identify a primary key method for de-duplicating your dataset. Details are below.
For individual columns, you can use the trim function to remove leading and trailing whitespace:
NOTE: To preserve the original column values, use the derive transform. The set transform replaces the original values.
derive col:Item value:TRIM(Item)
You can paste Wrangle steps into the Transform Builder.
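As an illustration of what the derive-with-TRIM step does, here is the equivalent logic in plain Python (the Item values are hypothetical):

```python
# Build a new, trimmed column from an existing one, leaving the original
# values intact -- the same effect as derive col:Item value:TRIM(Item).
items = ["  widget", "widget  ", "widget"]

# str.strip() plays the role of Wrangle's TRIM function here.
item_trimmed = [value.strip() for value in items]

print(items)         # original values are preserved
print(item_trimmed)  # ['widget', 'widget', 'widget']
```

After trimming, the three values are identical, so an exact-match deduplication can recognize them as duplicates.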
Because the deduplicate transform is case-sensitive, you can use the LOWER function to make the case of each entry in a column consistent:
derive col:Description value:LOWER(Description)
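The case-normalization step can be sketched the same way in plain Python (the Description values are hypothetical):

```python
# Normalize case so that an exact-match deduplication treats "Hello" and
# "HELLO" as the same value -- the same effect as
# derive col:Description value:LOWER(Description).
descriptions = ["Hello", "HELLO", "hello"]

# str.lower() plays the role of Wrangle's LOWER function here.
description_lower = [value.lower() for value in descriptions]

print(description_lower)            # ['hello', 'hello', 'hello']
print(len(set(description_lower)))  # 1 distinct value after normalization
```

The three case variants collapse to a single distinct value, so rows differing only by case can now be identified as duplicates.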
For more information, see Normalize Numeric Values.
While this form of duplicate data is rarer, you might want to check for duplicate data across your columns. To check whether two columns contain identical data, you can use a transform similar to the following:
derive value: (Column1 == Column2)
In the generated column, values that are true indicate duplicate data. If all values are true, then you can remove one of the columns.
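The column-comparison check can be sketched in plain Python as well (the column names and values are hypothetical):

```python
# Flag rows where two columns hold identical values -- the same effect as
# derive value: (Column1 == Column2). If every flag is true, one column
# is redundant and can be dropped.
column1 = ["a", "b", "c"]
column2 = ["a", "b", "c"]

matches = [v1 == v2 for v1, v2 in zip(column1, column2)]

print(matches)  # [True, True, True]
if all(matches):
    print("Column2 duplicates Column1; one column can be removed")
```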