Create Dataset with Parameters

In some cases, you may need to perform the same transformations on data that is stored in parallel files or tables in the source. In these cases, you can parameterize the input paths to your data, so that all related data is processed in an identical manner.

When you create a dataset with parameters, you can replace segments of the input path with parameters. Suppose you have the following files that you'd like to capture through a parameterized dataset:

//source/user/me/datasets/month01/2017-01-31-file.csv
//source/user/me/datasets/month02/2017-02-28-file.csv
//source/user/me/datasets/month03/2017-03-31-file.csv
//source/user/me/datasets/month04/2017-04-30-file.csv
//source/user/me/datasets/month05/2017-05-31-file.csv
//source/user/me/datasets/month06/2017-06-30-file.csv
//source/user/me/datasets/month07/2017-07-31-file.csv
//source/user/me/datasets/month08/2017-08-31-file.csv
//source/user/me/datasets/month09/2017-09-30-file.csv
//source/user/me/datasets/month10/2017-10-31-file.csv
//source/user/me/datasets/month11/2017-11-30-file.csv
//source/user/me/datasets/month12/2017-12-31-file.csv

A parameterized reference to all of these files would look something like:

//source/user/me/datasets/month##/YYYY-MM-DD-file.csv

Through the application, you can specify parameters that match each of these segments:

  • ## - You can use a wildcard or (better) a pattern to replace these values.
  • YYYY-MM-DD - A formatted Datetime parameter can replace these values.
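As an illustrative sketch (not the application's implementation), the parameterized path above can be expressed as a regular expression over a subset of the listed files:

```python
import re

# Hypothetical subset of the monthly files listed above.
paths = [
    "//source/user/me/datasets/month01/2017-01-31-file.csv",
    "//source/user/me/datasets/month02/2017-02-28-file.csv",
    "//source/user/me/datasets/month12/2017-12-31-file.csv",
]

# month## becomes two digits; YYYY-MM-DD becomes a formatted date.
pattern = re.compile(
    r"//source/user/me/datasets/month(\d{2})/(\d{4})-(\d{2})-(\d{2})-file\.csv"
)

matches = [pattern.fullmatch(p) for p in paths]
```

Every listed path fits this parameterized shape, and the captured groups correspond to the month and date segments.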

Structuring Your Data

Each file that is included as part of the dataset with parameters should have an identical structure:

  • Matching file formats
  • Matching column order, naming, and data type
  • Within each column, the data format should be consistent.
    • For example, if date formats vary between files in the source system, your recipe may not be able to manage the differences, and data may be missing from the output.

NOTE: Avoid creating datasets with parameters where individual files or tables have differing schemas. Either import these sources separately and then correct in the application before performing a union on the datasets, or make corrections in the source application to standardize the schemas.
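One way to spot schema differences before import is to compare the header rows of the candidate files. A minimal sketch, using placeholder file contents in place of real source files:

```python
import csv
import io

# Placeholder contents standing in for files in the source system.
files = {
    "month01.csv": "id,date,amount\n1,2017-01-31,10.0\n",
    "month02.csv": "id,date,amount\n2,2017-02-28,12.5\n",
    "month03.csv": "id,amount,date\n3,9.9,2017-03-31\n",  # column order differs
}

def header(text):
    # Read only the first row of a CSV body.
    return next(csv.reader(io.StringIO(text)))

headers = {name: header(text) for name, text in files.items()}
reference = headers["month01.csv"]
mismatched = sorted(n for n, h in headers.items() if h != reference)
```

Here `month03.csv` would be flagged for correction before performing a union.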

If you expect the underlying files to be less than fully consistent with each other, the following approach may be useful when working with datasets with parameters:

  1. Recreate the dataset with parameters, except deselect the Detect Structure option during the import step.
  2. In the Transformer page, collect a Random Sample using a Full Scan. This step attempts to gather data from multiple individual files, which may illuminate problems across the data.

    Tip: If you suspect that there is a problem with a specific file or rows of data (e.g. from a specific date), you can create a static dataset from the file in question.

Steps

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

NOTE: Due to a limitation in Cloud Dataflow, when you run a job on a parameterized dataset containing more than 100 files, the input paths data must be compressed, which results in non-readable location values in the Cloud Dataflow console. Running jobs on datasets sourced from more than 6000 files may fail.

  1. In the Import Data page, navigate your environment to locate one of the files or tables that you wish to parameterize.
  2. Click Create Dataset with Parameters.


    Figure: Create Dataset with Parameters

  3. In the Define Parameterized Path section, select a segment of text. Then select one of the following options:
    1. Add Datetime Parameter
    2. Add Variable
    3. Add Pattern Parameter - wildcards and patterns

    For more information on limitations, see Overview of Parameterization. If you need to navigate elsewhere, select Browse.
  4. Specify the parameter. Click Save.
  5. Click Update matches. Verify that all of the datasets you intend to include are matched.

    NOTE: If you are matching with more datasets than you wish, you should review your parameters.

  6. Click Create.

  7. The parameterized dataset is loaded. See Import Data Page.

A flow containing a dataset with parameters has additional options for managing it. See Flow View Page.

Add Datetime Parameter

Datetime parameters require the following elements:

Format: You must specify the format of the matching date and/or time values using alphanumeric patterns. To review a list of example formats, click Browse Date/Timestamp Patterns.

Date range: Use these controls to specify the range that matching dates must fall within.

NOTE: Date range parameters are case-insensitive.

Time zone: The default time zone is the location of the host of the application. To change the current time zone, click Change.

For a list of supported time zone values, see Supported Time Zone Values.
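The format-plus-range behavior can be sketched as follows, using Python's strptime as a stand-in for the application's date matching (the format and range below are illustrative):

```python
from datetime import datetime

fmt = "%Y-%m-%d"              # stands in for a YYYY-MM-DD Datetime format
start = datetime(2017, 1, 1)  # illustrative date range
end = datetime(2017, 6, 30)

candidates = ["2017-01-31", "2017-06-30", "2017-12-31", "not-a-date"]

def in_range(value):
    try:
        d = datetime.strptime(value, fmt)
    except ValueError:
        return False  # value does not match the format at all
    return start <= d <= end

matching = [v for v in candidates if in_range(v)]
```

Values that match the format but fall outside the range, and values that do not match the format at all, are both excluded.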

Add Variable

A variable parameter is a key-value pair that can be inserted into the path.

  • At execution time, the default value is applied, or you can choose to override the value.
  • A variable can have an empty default value.

Name: The name of the variable is used to identify its purpose.

NOTE: If multiple datasets within the same flow share the same variable name, they are treated as the same variable.

Default Value: If the variable value is not overridden at execution time, this value is inserted in the variable location in the path.
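The default-versus-override behavior can be sketched as a simple substitution; the template, variable name, and values below are illustrative, not tied to any real source:

```python
def resolve_path(template, defaults, overrides=None):
    """Fill each {name} placeholder with its override, else its default."""
    values = dict(defaults)
    values.update(overrides or {})
    return template.format(**values)

# Hypothetical template and variable.
template = "//source/user/me/datasets/{region}/file.csv"
defaults = {"region": "us-east"}

default_path = resolve_path(template, defaults)                  # default applied
override_path = resolve_path(template, defaults, {"region": "eu"})  # run-time override
```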

Add Pattern Parameter

In the screen above, you can see an example of pattern-based parameterization. In this case, you are trying to parameterize the two digits that follow the value POS-r.

Wildcard

The easiest way is to add a wildcard: *

A wildcard matches any value of any length, including the empty string.

Tip: Wildcard matching is very broad. If you are using wildcards, you should constrain them to a very small part of the overall path. Some running environments may place limits on the number of files that you can match.
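The breadth of wildcard matching is easy to see with Python's fnmatch, used here purely for illustration with hypothetical file names echoing the POS-r example:

```python
from fnmatch import fnmatch

# Hypothetical file names.
names = ["POS-r01.csv", "POS-r02.csv", "POS-r.csv", "notes.txt"]

matched = [n for n in names if fnmatch(n, "POS-r*.csv")]
# "POS-r.csv" is matched too: the wildcard also matches the empty string
```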

Pattern - Regular expression

Instead of a wildcard match, you could specify a regular expression match. Regular expressions are a standardized means of expressing patterns.

Regular expressions are specified between forward slashes, as in the following:

/my_regular_expression/

NOTE: If regular expressions are poorly specified, they can create unexpected matches and results. Use them with care. For a list of limitations of regular expressions for parameterization, see Overview of Parameterization.

The following regular expression matches the same two sources in the previous screen:

/\_[0-9]*\_[0-9]*/

The above expression matches an underscore (_) followed by any number of digits, another underscore, and any number of digits.

Tip: In regular expressions, some characters have special meaning. To ensure that you are referencing the literal character, you can insert a backslash (\) before the character in question.

While the above matches the two sources, it also matches any of the following:

_2_1
__1
_1231231231231231235245234343_

These may not be proper matches. Instead, you can add some specificity to the expression to generate a better match:

/\_[0-9]{13}\_[0-9]{4}/

The above pattern matches an underscore, followed by exactly 13 digits, another underscore, and then exactly 4 digits. This pattern matches the above two sources exactly, without introducing the possibility of matching other numeric patterns.
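The difference between the broad and the specific expression can be checked directly; a quick sketch, where the 13-digit value is invented for illustration:

```python
import re

# The broad and the more specific expressions from the text
# (underscores written unescaped, which is equivalent).
broad = re.compile(r"_[0-9]*_[0-9]*")
specific = re.compile(r"_[0-9]{13}_[0-9]{4}")

# "_1234567890123_0001" is an invented value with 13 and then 4 digits.
samples = ["_2_1", "__1", "_1234567890123_0001"]

broad_hits = [s for s in samples if broad.fullmatch(s)]
specific_hits = [s for s in samples if specific.fullmatch(s)]
```

The broad expression accepts all three samples; the specific one accepts only the value with the expected digit counts.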

Pattern - Cloud Dataprep pattern match

A Cloud Dataprep pattern is a platform-specific mechanism for specifying patterns that is simpler to use than regular expressions. These patterns cover much of the same expressive range as regular expressions without the same risks and the sometimes unwieldy syntax. For more information on Cloud Dataprep patterns, see Text Matching.

Cloud Dataprep patterns are specified between back-ticks, as in the following:

`my_pattern`

In the previous example, the following regular expression was used to match the proper set of files:

/\_[0-9]{13}\_[0-9]{4}/

In a Cloud Dataprep pattern, the above can be expressed in a simpler format:

`\_{digit}{13}\_{digit}{4}`

This simpler syntax is easier to parse and performs the same match as the regular expression version.
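Though the application evaluates Cloud Dataprep patterns itself, the correspondence to regular expressions can be illustrated with a tiny, hypothetical translation of just the {digit} token:

```python
import re

def dataprep_to_regex(pattern):
    # Illustrative only: expands just the {digit} token and the escaped
    # underscore; real Cloud Dataprep patterns define many more tokens.
    return pattern.replace("{digit}", "[0-9]").replace(r"\_", "_")

dp = r"\_{digit}{13}\_{digit}{4}"
rx = dataprep_to_regex(dp)

ok = re.fullmatch(rx, "_1234567890123_0001") is not None
```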

Delete Parameter

Steps:

  1. In Flow View, select the dataset with parameters icon.
  2. From the context menu, select Edit parameters....
  3. In the Edit Dataset with Parameters screen, select the parameter that you wish to remove.
  4. In the popup, click Delete.
  5. Save your changes.
Google Cloud Dataprep Documentation