Extract data from fields

This page explains how to extract and transform data from a field (a cell) when you prepare data in the Wrangler workspace of the Cloud Data Fusion Studio.

To perform transformations on this data, you split it into separate columns. In Wrangler, you can extract data from a column and create new columns for the extracted data. You can extract values based on patterns, delimiters, or positions.

Extract data using patterns

You can extract data from fields in columns of the string data type with the following patterns:

Credit cards
Date
Date time
Email
URLs from HTML anchors
IPv4 address
ISBN codes
Mac address
N digits number
SSN
Start and End pattern
Time

To extract data based on a pattern, follow these steps:

Go to Wrangler workspace in Cloud Data Fusion.
On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
Select Extract fields > Using patterns and select an option—for example, URL.
Optional: click Show pattern to view the regular expression for the pattern.
Click Extract.

Wrangler extracts the fields based on the chosen pattern and adds the extract-regex-groups directive to the recipe. When you run the data pipeline, Cloud Data Fusion applies the transformation to all rows in the column.

In the following example, a column contains a number, followed by an email address:

Emails
1 222larabrown@gmail.com
2 cloudysanfrancisco@gmail.com

To extract the email address, select the Email pattern. When you click Extract, Wrangler retains the original column and creates a new column containing only the email addresses:

Emails	Emails_1
1 222larabrown@gmail.com	222larabrown@gmail.com
2 cloudysanfrancisco@gmail.com	cloudysanfrancisco@gmail.com

Extract data with delimiters

You can extract data into two or more columns based on the following delimiters:

Comma
Tab
Pipe
Whitespace
Custom separator

If a value doesn't have the delimiter, no value is added to corresponding field in the new column.

To extract values based on a delimiter:

Go to Wrangler workspace in Cloud Data Fusion.
On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
Select Extract fields > Using delimiters and select an option—for example, Comma.
Click Extract.

Wrangler extracts the fields based on the selected delimiter and adds the split-to-columns directive to the recipe. When you run the data pipeline, Cloud Data Fusion transforms all values in the column.

In the following example, a column contains multiple names separated by commas:

ID	Name
1	Lee,Lucian,Luka
2	Mahan,Noam,Nur

In this example, using the comma delimiters pattern extracts the values in the original Name column into three new columns:

ID	Name	Name_1	Name_2	Name_3
1	Lee,Lucian,Luka	Lee	Lucian	Luka
2	Mahan,Noam,Nur	Mahan	Noam	Nur

Extract data by position

You can extract part of a string based on its position in the string.

To extract data based on its position:

Go to Wrangler workspace in Cloud Data Fusion.
On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
Select Extract fields > Using positions. Column values you can extract appear with a blue background.
In any cell of the column, select the characters to extract.
In the Name of destination column field, enter a name.
Click Apply.

The chosen portion of the value is extracted from each row in the column.

Wrangler extracts the fields based on the selected pattern and adds the cut-character directive to the recipe. When you run the data pipeline, Cloud Data Fusion applies the transformation to all values in the column.

What's next

Learn more about Wrangler directives.