Overview of Parameterization

In Cloud Dataprep by TRIFACTA® INC., parameterization enables you to apply dynamic values to the data that you import and that you generate as part of job execution.

Parameter types:

  • Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters.
  • Flow Parameters: You can create parameters at the flow level, which can be referenced in any recipe in the flow.
  • Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.

These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transactions:

<file_system>:///source/transactions/2018/01/29/transactions.csv

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files. Based on the pattern, it'd be useful to be able to do the following:

  • Import data from parallel paths for other weeks' data.
  • Sample across all of the available datasets.
  • Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:

<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters.

Parameter Types

You can use the following types of parameters to create datasets with parameters:

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.
    • When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.
  • Variables: Define variable names and default values for a dataset with parameters. Modify these values at runtime to parameterize execution.
  • Pattern parameters:
    • Wildcards: Apply wildcards to replace path values.
    • Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.
    • Cloud Dataprep patterns: The platform supports a simplified means of expressing patterns.
      • For more information on Cloud Dataprep patterns, see Text Matching.

For more information, see Create Dataset with Parameters.

Guidelines for Sources

The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source.

NOTE: All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.

Tip: If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library Page.

Mismatched Schemas

Cloud Dataprep by TRIFACTA INC. expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.
  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.
  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.
    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.
  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:
    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.
    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas.
  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.
  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.
    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.

Limitations

  • You cannot create datasets with parameters from uploaded data.
  • You cannot create dataset with parameters from multiple file types.
    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.
    • You cannot create parameters across text and binary file types.
  • You cannot apply parameters to write or publishing operations.
  • Parameter and variable names can be up to 255 characters in length.
  • For regular expression patterns, the following reference types are not supported due to the length of time to evaluate:
    • Backreferences. The following example matches on axa, bxb, and cxc yet generates an error:

      ([a-c])x\1
    • Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:

      a(?=b)
  • For some source file types, such as Parquet, the schemas between source files must match exactly.

Creating Dataset with Parameters

From file system

When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.

NOTE: Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

For more information, see Create Dataset with Parameters.

From relational source

If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source.

NOTE: Avoid using parameters in places in the SQL statement that change the structure of the data. For example, within a SELECT statement, you should not add parameters between the SELECT and FROM keywords.

For more information, see Create Dataset with SQL.

Matching parameters

When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together.

NOTE: Sources for a dataset with parameters should have matching schemas.

The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects.

Managing Datasets with Parameters

Datasets with parameters in your flows

After you have imported a dataset with parameters into your flow:

  • You can review any parameters that have been applied to the dataset through the Parameterization in Flow view.
  • When the dataset with parameters is selected, you can use the right panel to review and edit the parameters that are applied to it.
  • You can override the default value applied to the parameter through Flow View. See Manage Parameters Dialog.

For more information, see Flow View Page.

Tip: You can review details on the parameters applied to your dataset. See Dataset Details Page.

Sampling from datasets with parameters

When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. If this match is a multi-sheet Excel file, the sample is taken from the first sheet in the file.

With parameters:

To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page. Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.

With variables:

If you have created a variable with your dataset, you can apply a variable value to override the default at sampling time. In this manner, you can specify sampling to occur from specific source files from your dataset with parameters.

For more information, see Overview of Sampling.

Scheduling for datasets with parameters

Schedules can been applied to a dataset with parameters. When resolving date range rules for scheduling a dataset with parameters, the schedule time is used.

For more information, see Add Schedule Dialog.

Sharing for datasets with parameters

By default, when a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow. To separate these parameters, you have the following options:

  1. Optionally, when the flow is copied, you can copy the underlying datasets.
  2. As a workaround, you can export and import the flow into the same system and replace the datasets in the imported flow.

NOTE: For copying flows using parameterized datasets, you should duplicate the datasets, which creates separate copies of parameters and their values in the new flow. If datasets are not copied, then parameter changes in the copied flow modify the values in the source flow.

For more information, see Overview of Sharing.

Housekeeping

Since Cloud Dataprep by TRIFACTA INC. never touches the source data, after a source that is matched for a dataset with parameters has been executed, you should consider removing it from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.

NOTE: Housekeeping of source data is outside the scope of Cloud Dataprep by TRIFACTA INC.. Please contact your IT staff to assist as needed.

Flow Parameters

You can specify flow parameters and their default values, which can be invoked in the recipe steps of your flow. Wherever the flow parameter is invoked, it is replaced by the value you set for the parameter. Uses:

  • Dynamically affect recipe steps
  • Improve flow usability; build fewer flows and recipes to maintain
  • Parameters are evaluated at design time in the Transformer page and at runtime during job execution
  • All parameter values can be overridden, as needed.

Flow parameter types:

  • Literal values: These values are always of String data type.

    Tip: You can wrap flow parameter references in your transformations with one of the PARSE functions. For more information, see Create Flow Parameter.

    NOTE: Wildcards are not supported.

  • Cloud Dataprep patterns. For more information, see Text Matching.
  • Regular expression patterns.

Limitations

  • Flow parameters are converted to constants in macros. Use of the macro in other recipes results in the constant value being applied.
  • A flow parameter cannot be used in some transformation steps or fields.

Example

Suppose you need to process your flow across several regions of your country. These regions are identified using a region ID value: pacific, mountain, central, eastern.

From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify the parameter name:

paramRegion

You must specify a default value. To verify that this critical parameter is properly specified before job execution, you set the default value to:

##UNSPECIFIED##

The above setting implies two things:

  • If the above value appears in the output, then an override value for the parameter was not specified when the job was executed, which prevents the default value being used erroneously.
  • Before the job is executed, you must specify an override value. You can specify an override:
    • At the flow level to assist in recipe development.
    • At run time to insert the proper region value for the job run.

After the flow parameter has been created, you can invoke it in a transformation step using the following syntax.

$paramRegion

Where the parameter is referenced, the default or applicable override value is applied. For more examples, see Create Flow Parameter.

Upstream flow parameters

If your flow references a recipe or dataset that is sourced from an upstream flow, the flow parameters from that flow are available in your current flow. That value of the parameter at time of execution is passed to the current flow.

NOTE: Downstream values and overrides of parameters that share the same name take precedence. When you execute the downstream flow, the parameter value is applied to the current flow and to all upstream objects. For more information, see "Order of Evaluation" below.

Creating flow parameters

Flow parameters are created at the flow level from the context menu in Flow View. See Manage Parameters Dialog.

Managing flow parameters

Flow parameters can be edited, deleted, and overridden through the Flow View context menu. See Manage Parameters Dialog.

Output Parameters

You can specify variable and timestamp parameters to apply to the file or table paths of your outputs.

NOTE: Output parameters are independent of dataset parameters.

Parameter Types

You can create the following types of output parameters:

  • Datetime parameters: Insert date and time values in output paths based on the job's start time.
  • Variables: Define variable names and default values for an output parameter. Modify these values at runtime to parameterize execution.

Tip: These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.

Example

Suppose you are generating a JSON file as the results of job execution.

/outputs/myFlow/myOutput.json

Since this job is scheduled and will be executed on a regular interval, you want to insert a timestamp as part of the output, so that your output filenames are unique and timestamped:

/outputs/myFlow/myOutput_<timestamp>.json

In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling.

Creating output parameters

When you are creating or editing a publishing action in the Run Jobs page, you can click the Parameterize destination link that appears in the right panel. See Run Job Page.

Using output parameters

Whenever you execute a job using the specified publishing action, the output parameters are applied.

After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.

For more information, see Run Job Page.

Parameter Overrides

For each type of parameter, you can apply override values as needed.

Override TypeDescription
dataset parametersWhen you run a job, you can apply override values to variables for your imported datasets. See Run Job Page.
flow parameters

At the flow level, you can apply override values to flow parameters. These values are passed into the recipe and the rest of the flow for evaluation during recipe development and job execution.

NOTE: Overrides applied at the flow level are passed into all recipes and other objects in the flow. Wherever there is case-sensitive match between the name of the overridden parameter and a parameter name in the flow, the override value is applied. These values can be overridden by ad-hoc values. See "Order of Precedence" below.

output parametersWhen you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page.

Order of Parameter Evaluation

Wherever a parameter value or override is specified in the following list, the value is applied to all matching parameters within the execution tree. Suppose you have created a parameter called varRegion, which is referenced in your imported dataset, recipe, and output object. If you specify an override value for varRegion in the Run Job page, that value is applied to the data you import (dataset parameter), the recipe during execution (flow parameter), and the path of the output that you generate (output parameter).

  • Name matches are case-sensitive.

NOTE: Override values are applied to upstream flows, as well. Any overrides specified in the current flow are passed to downstream flows, where they can be overridden as needed.

Parameter values are evaluated based on the following order of precedence (highest to lowest):

  1. Run-time overrides: Parameter values specified at run-time for jobs.

    NOTE: The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.

    NOTE: You can specify run-time override values when executing jobs through the APIs. See API Workflow - Run Job.

    See Run Job Page.

  2. Flow level overrides: At the flow level, you can specify override values, which are passed into the flow's objects. These values can be overridden by overrides set in the above locations. See Manage Parameters Dialog.
  3. Default values: If no overrides are specified, the default values are applied:
    1. Imported datasets: See Create Dataset with Parameters.
    2. Flow parameters: See Manage Parameters Dialog.
    3. Output parameters: See Run Job Page.
  4. Inherited (upstream) values: Any parameter values that are passed into a flow can be overridden by any matching override specified within the downstream flow.

Run Jobs with Parameters

When running a job based on datasets with parameters, results are written into separate folders for each parameterized path.

NOTE: During job execution, a canary file is written for each set of results to validate the path. For datasets with parameters, if the path includes folder-level parameterization, a separate folder is created for each parameterized path. During cleanup, only the the canary files and the original folder path are removed. The parameterized folders are not removed. This is a known issue.

NOTE: Due to a limitation in Cloud Dataflow, when you run a job on a parameterized dataset containing more than 100 files, the input paths data must be compressed, which results in non-readable location values in the Cloud Dataflow console. Running jobs on datasets sourced from more than 6000 files may fail.

Runtime Parameter Overrides

When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values.

Through the Run Job page, you can specify different values to apply to variables for the job.

NOTE: Applying runtime overrides to jobs through the APIs is not supported in your product.

NOTE: Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.

NOTE: When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.

For more information, see Run Job Page.

In the Job Details page, click the Parameters tab to view the parameter names and values that were used as part of the job, including the list of matching datasets. See Job Details Page.

Scheduling Jobs

You can schedule jobs for datasets with parameters. See Schedule a Job.