Extract data from fields

This page explains how to extract and transform data from a field (a cell) when you prepare data in the Wrangler workspace of the Cloud Data Fusion Studio.

To perform transformations on this data, you split it into separate columns. In Wrangler, you can extract data from a column and create new columns for the extracted data. You can extract values based on patterns, delimiters, or positions.

Extract data using patterns

You can extract data from fields in columns of the string data type with the following patterns:

  • Credit cards
  • Date
  • Date time
  • Email
  • URLs from HTML anchors
  • IPv4 address
  • ISBN codes
  • Mac address
  • N digits number
  • SSN
  • Start and End pattern
  • Time

To extract data based on a pattern, follow these steps:

  1. Go to Wrangler workspace in Cloud Data Fusion.
  2. On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
  3. Select Extract fields > Using patterns and select an option—for example, URL.
  4. Optional: click Show pattern to view the regular expression for the pattern.
  5. Click Extract.

Wrangler extracts the fields based on the chosen pattern and adds the extract-regex-groups directive to the recipe. When you run the data pipeline, Cloud Data Fusion applies the transformation to all rows in the column.

In the following example, a column contains a number, followed by an email address:

Emails
1 222larabrown@gmail.com
2 cloudysanfrancisco@gmail.com

To extract the email address, select the Email pattern. When you click Extract, Wrangler retains the original column and creates a new column containing only the email addresses:

Emails Emails_1
1 222larabrown@gmail.com 222larabrown@gmail.com
2 cloudysanfrancisco@gmail.com cloudysanfrancisco@gmail.com

Extract data with delimiters

You can extract data into two or more columns based on the following delimiters:

  • Comma
  • Tab
  • Pipe
  • Whitespace
  • Custom separator

If a value doesn't have the delimiter, no value is added to corresponding field in the new column.

To extract values based on a delimiter:

  1. Go to Wrangler workspace in Cloud Data Fusion.
  2. On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
  3. Select Extract fields > Using delimiters and select an option—for example, Comma.
  4. Click Extract.

Wrangler extracts the fields based on the selected delimiter and adds the split-to-columns directive to the recipe. When you run the data pipeline, Cloud Data Fusion transforms all values in the column.

In the following example, a column contains multiple names separated by commas:

ID Name
1 Lee,Lucian,Luka
2 Mahan,Noam,Nur

In this example, using the comma delimiters pattern extracts the values in the original Name column into three new columns:

ID Name Name_1 Name_2 Name_3
1 Lee,Lucian,Luka Lee Lucian Luka
2 Mahan,Noam,Nur Mahan Noam Nur

Extract data by position

You can extract part of a string based on its position in the string.

To extract data based on its position:

  1. Go to Wrangler workspace in Cloud Data Fusion.
  2. On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
  3. Select Extract fields > Using positions. Column values you can extract appear with a blue background.
  4. In any cell of the column, select the characters to extract.
  5. In the Name of destination column field, enter a name.
  6. Click Apply.

The chosen portion of the value is extracted from each row in the column.

Wrangler extracts the fields based on the selected pattern and adds the cut-character directive to the recipe. When you run the data pipeline, Cloud Data Fusion applies the transformation to all values in the column.

What's next