This section describes techniques to standardize text values in your datasets. You can use the following techniques to address some common issues you might encounter in the standardization of text and other non-numeric values.
Standardize Using Patterns
From the Column Details panel, you can review and select patterns in the column's data. These selections can be used as the basis for converting all applicable values to the selected format.
NOTE: Pattern-based conversions can be applied to any data type.
In the Patterns tab, click a group of patterns and then review the Convert suggestion to define how the pattern matches can be converted to a single standardized format.
NOTE: The application does not suggest pattern-based conversions that add or remove alphanumeric characters.
Figure: Selecting Datetime patterns in the Patterns tab
In the above, the pattern block prompts suggestions for Convert tasks based on the selected patterns.
- Click Edit to modify the task.
- Click Add to add the task as a step to your recipe.
Example - Phone number patterns
For columns containing phone number data, you can use the Patterns tab to standardize formatting options. Consider the following values, which are valid phone numbers. Next to each value is a pattern representing the value:
Cloud Dataprep pattern
In the Patterns tab, you can select the patterns to which you would like the other patterns in the same pattern group to be converted. Below, the selected target pattern becomes the pattern to which other patterns in the column values are converted:
NOTE: You may have to modify the phone number values before attempting the conversion, as they may contain extra alphanumeric values. For example, international country codes (such as
044) or a preceding
1+ required in long-distance numbers, may need to extracted or removed from the column values prior to conversion.
Below are types of conversions that are supported and not supported.
|Example Source Value||Example Target Value||Notes|
|123.456.7890||123-456-7890||Changing symbolic characters|
|(123) 456-7890||123 456-7890||Removing symbolic characters|
|(123)456-7890||(123)-456-7890||Adding symbolic characters|
|1234567890||123-456-7890||Splitting a long character group and adding symbolic characters|
|123-456-7890||1234567890||Merging multiple character groups and removing symbolic characters|
|Example Source Value||Example Target Value||Notes|
|123.456.7890||+1.123.456.7890||Adding a new character group|
|+1.123.456.7890||123.456.7890||Deleting a character group (alphanumeric characters cannot be deleted through pattern standardization)|
|Adam Wilson||A Wilson||Partial deletion of data from a character group|
|+1 (123) 456-7890||+001 (123) 456-7890||Prepending or appending a character group with specified characters|
For columns of Datetime type, the available Convert mappings are based upon the supported date formats in the platform. Standardization of Datetime patterns is a specific implementation.
Notes on Datetime patterns:
Two-digit years (YY) do not yield four-digit year (YYYY) suggestions due to ambiguity. For example, it is unclear if
50 should map to
For performance reasons, a maximum of two semantic standardizations can be applied at once. Examples:
|Source Value||Possible Standardization||Semantic Mappings||Status|
| ||ok (2 mappings)|
| || ||Not suggested (3 mappings)|
For more information on supported formats, see Datetime Data Type.
For more information on converting Datetime values to a different format, see DATEFORMAT Function.
Standardize Using Functions
Convert to strings to modify
For manipulation of individual values, it is often easiest to work with the String data type, which is the most flexible. Depending on your approach, you may choose to convert some of your columns into String type:
settype col: col1,col2, col3 type:'String'
You can paste Wrangle steps into the Transformer Page.
For more information, see Settype Transform.
For more information, Valid Data Type Strings.
Trim whitespace from text
You can trim out whitespace from an individual column via transform. The
TRIM function applied to string values removes the leading and trailing whitespace:
set col: myCol value: TRIM(myCol)
To apply this function across all columns in the dataset, you can use the following:
set col: * value: TRIM($col)
- The asterisk (
*) is a wildcard, which represents all possible value. In this case, it matches with all column names in the dataset.
- You may need to move columns or use range values to apply this transform to only non-numeric column types.
$colentry denotes a reference to the current column. So for any column to which this transform is applied, the source values are pulled from the column itself and then trimmed.
- See Set Transform.
In some cases, you may wish to remove all spaces, including those in between words or digits, in your strings:
set col:* value:REMOVEWHITESPACE($col)
You can use the following steps to set all text values in a column to be the same case.
set col:myStrings value:LOWER(myStrings)
set col:myStrings value:UPPER(myStrings)
Proper (sentence) case:
set col:myStrings value:PROPER(myStrings)
Fixed length strings
You can limit the maximum size of a column or set of columns to a fixed string length. For example:
set col:col1,col2 value:IF(LENGTH($col)>32,LEFT($col,32),$col)
In the above, if the length of either column is longer than 32 characters, then the column value is set to the leftmost 32 characters. For shorter strings, the entire string is used.
After modifying non-text values as strings, remember to conver them back to their original types.