Run Job Page

In the Run Job page, you can specify transformation and profiling jobs for the currently loaded dataset. Available options include output formats and output destinations.

Tip: Columns that have been hidden in the Transformer page still appear in the generated output. Before you run a job, verify that all currently hidden columns are acceptable to include in the output.

Figure: Run Job Page

Options

Profile Results: Optionally, you can disable profiling of your output, which can improve the speed of overall job execution. When the profiling job finishes, details are available through the Job Details page, including links to download results.

NOTE: Percentages for valid, missing, or mismatched column values may not add up to 100% due to rounding.

See Job Details Page.
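As a small illustration of the rounding note above (a sketch only, using made-up counts rather than values from a real profile), three equal categories each round to 33.3%, so the displayed percentages sum to 99.9% rather than 100%:

    # Hypothetical column profile: one valid, one mismatched, and one missing value.
    counts = {"valid": 1, "mismatched": 1, "missing": 1}
    total = sum(counts.values())

    # Round each percentage to one decimal place, as a profile display might.
    percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
    print(percentages)                          # {'valid': 33.3, 'mismatched': 33.3, 'missing': 33.3}
    print(round(sum(percentages.values()), 1))  # 99.9, not 100.0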

Publishing Actions

You can add, remove, or edit the outputs generated from this job. By default, the list of destinations includes a CSV output written to your home directory on the selected datastore; you can remove this default output if needed. You must include at least one output destination.

Columns:

  • Actions: Lists the action and the format for the output.
  • Location: The directory and filename or table information where the output is to be written.
  • Settings: Identifies the output format and any compression, if applicable, for the publication.

Actions:

  • To change format, location, and settings of an output, click the Edit icon.
  • To delete an output, click the X icon.

Add Publishing Action

From the available datastores in the left column, select the target for your publication.

Figure: Add Publishing Action

NOTE: Do not create separate publishing actions that apply to the same file or database table.

Steps:

  1. Select the publishing target. Click an icon in the left column.
    1. BigQuery: You can publish your results to the current project or to a different project to which you have access. (A minimal access-check sketch follows these steps.)

      NOTE: You must have read and write access to any BigQuery database to which you are publishing. For more information, see Using BigQuery.

      To publish to a different project, click the BigQuery link at the front of the breadcrumb trail. Then, enter the identifier for the project where you wish to publish your job results.

      Tip: Your projects and their identifiers are available for review through the Cloud Dataprep by TRIFACTA INC. menu bar. See UI Reference.

      Click Go. Navigate to the database where you wish to write your BigQuery results. For more information, see BigQuery Browser.

  2. Locate a publishing destination: Do one of the following.

    1. Explore:

      NOTE: The publishing location must already exist before you can publish to it. The publishing user must have write permissions to the location.

      For Google Cloud Storage, you can create a new folder in an accessible location.
      1. To sort the listings in the current directory, click the carets next to any column name.
      2. For larger directories, browse using the paging controls.
      3. Use the breadcrumb trail to explore the target datastore. Navigate folders as needed.
    2. Search: Use the search bar to search for specific locations in the current folder only.
    3. Manual entry: Click the Edit icon to manually edit or paste in a destination.
  3. Create Folder: Depending on the storage destination, you can click Create Folder to create a new folder for the job inside the currently selected one. Do not include spaces in your folder name.
  4. Create a new file: Enter the filename under which to save the dataset.

    1. Select the Data Storage Format.
    2. Supported output formats:
      1. CSV
      2. JSON
      3. Avro
    3. You can also write the output as a BigQuery table, if you are connected to BigQuery.
  5. BigQuery: When publishing to BigQuery, you must specify the table to which to publish and related actions. See below.
  6. As needed, you can parameterize the outputs that you are creating. Click Parameterize destination in the right panel. See Parameterize destination settings below.

  7. To save the publishing destination, click Add.
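As referenced in step 1 above, the following is a minimal sketch, not part of Cloud Dataprep itself, of how you might confirm read and write access to a BigQuery dataset before publishing to it. It assumes the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders.

    from google.cloud import bigquery

    # Placeholder project ID for the project you intend to publish to.
    client = bigquery.Client(project="my-other-project")

    # Fetching the dataset metadata confirms that it exists and that you can read it.
    dataset = client.get_dataset("my_dataset")           # placeholder dataset ID

    # Creating and deleting a throwaway table confirms write access.
    table_ref = dataset.table("dataprep_write_check")    # placeholder table name
    client.create_table(bigquery.Table(table_ref))
    client.delete_table(table_ref)

    print("Read and write access confirmed for", dataset.full_dataset_id)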

To update a publishing action, hover over its entry. Then, click Edit.

To delete a publishing action, select Delete from its context menu.

Variables

If any variable parameters have been specified for the datasets or outputs of the flow, you can apply overrides to their default values. Click the listed default value and insert a new value. A variable can have an empty value.

NOTE: Unless this output is a scheduled destination, variable overrides apply only to this job. Subsequent jobs use the default variable values, unless specified again. No data validation is performed on entries for override values.

For more information on variables, see Overview of Parameterization.
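As a minimal sketch of the override behavior described above (the variable name and values are made up), an override entered on the Run Job page takes effect for this run only, while later runs fall back to the default:

    # Default value defined with the dataset or output parameter.
    defaults = {"region": "us"}

    # Override entered on the Run Job page; applies only to this job run.
    overrides = {"region": "emea"}

    resolved = {**defaults, **overrides}
    print(resolved["region"])   # "emea" for this run; later runs use "us" again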

File Settings

When you generate file-based results, you can configure the filename, storage format, compression, number of files, and the updating actions in the right-hand panel.

Figure: Output File Settings

Configure the following settings.

  1. Create a new file: Enter the filename to create. A filename extension is automatically added for you, so you should omit the extension from the filename.
  2. Output directory: Read-only value for the current directory.
    1. To change it, navigate to the proper directory.

  3. Data Storage Format: Select the output format you want to generate for the job.
    1. Avro:

      This format is preferred for importing a file into BigQuery.
    2. CSV and JSON: These formats are supported for all types of imported datasets and all running environments.

    3. For more information, see Supported File Formats.
  4. Publishing action: Select one of the following:

    NOTE: If multiple jobs are attempting to publish to the same filename, a numeric suffix (_N) is added to the end of subsequent filenames (e.g. filename_1.csv).

    1. Create new file every run: For each job run with the selected publishing destination, a new file is created with the same base name and the job number appended to it (e.g. myOutput_2.csv, myOutput_3.csv, and so on).
    2. Append to this file every run: For each job run with the selected publishing destination, the new results are appended to the same file, so the file grows until it is purged or trimmed.

      NOTE: Compression of published files is not supported for an append action.

    3. Replace this file every run: For each job run with the selected publishing destination, the existing file is overwritten by the contents of the new results.
  5. More Options:

    1. Include headers as first row on creation: For CSV outputs, you can choose to include the column headers as the first row in the output. For other formats, these headers are included automatically.

      NOTE: Headers cannot be applied to compressed outputs.

    2. Include quotes: For CSV outputs, you can choose to include double quote marks around all values, including headers.

    3. Delimiter: For CSV outputs, you can enter the delimiter that is used to separate fields in the output. The default value is the global delimiter, which you can override on a per-job basis in this field.

      Tip: If needed for your job, you can enter Unicode characters in the following format: \uXXXX. See the sketch after this list.

    4. Single File: Output is written to a single file.

    5. Multiple Files: Output is written to multiple files.
  6. To save the publishing action, click Add.
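As noted in the Delimiter option above, Unicode delimiters use the \uXXXX form, where XXXX is the 4-digit hexadecimal code point of the character. A small sketch (the characters chosen here are just examples):

    # "\u0009" is U+0009, the horizontal tab character.
    delimiter = "\u0009"
    print(delimiter == "\t")     # True

    # "\u007C" is U+007C, the vertical bar (pipe) character.
    print("\u007C")              # |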

BigQuery Table Settings

When publishing to BigQuery, complete the following steps to configure the table and the settings to apply to the publish action.

Steps:

  1. Select location: Navigate the BigQuery browser to select the database and table to which to publish.
    1. To create a new table, click Create a new table.
  2. Select table options:
    1. Table name:

      NOTE: BigQuery does not support destinations with a dot (.) in the name.

      1. New table: Enter a name for it. You can use a pre-existing table name; schema checks are performed against it.
      2. Existing table: You cannot modify the name.
    2. Output database: To change the database to which you are publishing, click the BigQuery icon in the sidebar. Select a different database.
    3. Publish actions: Select one of the following. (A rough client-library analogy appears after these steps.)
      1. Create new table every run: Each run generates a new table with a timestamp appended to the name.
      2. Append to this table every run: Each run adds any new results to the end of the table.
      3. Truncate the table every run: With each run, all data in the table is truncated and replaced with any new results.
      4. Drop the table every run: With each run, the table is dropped (deleted), and all data is deleted. A new table with the same name is created, and any new results are added to it.
  3. To save the publishing action, click Add.
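The append and truncate actions above are roughly analogous to the write dispositions in the BigQuery client libraries. The following sketch is only an analogy for loading results yourself, not how Cloud Dataprep publishes internally; the dataset, table, and file names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        # "Append to this table every run" ~ WRITE_APPEND;
        # "Truncate the table every run" ~ WRITE_TRUNCATE.
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    with open("results.avro", "rb") as f:                    # placeholder file
        load_job = client.load_table_from_file(
            f, "my_dataset.my_table", job_config=job_config  # placeholder table
        )
    load_job.result()   # wait for the load to complete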

Dataflow Execution Settings

By default, Cloud Dataprep by TRIFACTA INC. runs your job in the us-central1 region on an n1-standard-1 machine. As needed, you can change the geo location and the machine where your job is executed.

Tip: You can change the default values for these settings in your project settings. See Project Settings Page.

Making changes to these settings can affect the time it takes to execute your job.

  • Regional Endpoint: A regional endpoint handles execution details for your Cloud Dataflow job. Its location determines where the Cloud Dataflow job is executed.
  • Zone: A sub-section of a region, a zone contains specific resources for that region. Select Auto Zone to allow the platform to choose the zone for you.
  • Machine Type: The type of machine on which to run your job. The default is n1-standard-1.

    NOTE: Not all machine types are supported directly through Cloud Dataprep by TRIFACTA INC.

For more information on regional endpoints, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.

For more information on machine types, see https://cloud.google.com/compute/docs/machine-types.

  • VPC Network mode: As needed, you can override the default network settings configured for your project for this job. To apply overrides, set this value to Custom.

    NOTE: Avoid applying overrides unless necessary.

  • Network: To use a different VPC network, click Edit. Enter the name of the VPC network to use as an override for this job. Click Save to apply the override.
  • Subnetwork: To specify a different sub-network, click Edit. Enter the name of the sub-network. Click Save to apply the override.

For more information on these settings, see Project Settings Page.
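Cloud Dataprep manages these execution settings for you, but as a hedged point of reference, the same knobs exist as pipeline options when you launch a Cloud Dataflow job yourself with the Apache Beam Python SDK. This is only a sketch: all values below are placeholders, not recommendations, and it does not reflect how Cloud Dataprep submits jobs.

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder project ID
        region="us-central1",                 # Regional Endpoint
        machine_type="n1-standard-1",         # Machine Type
        network="my-vpc",                     # VPC Network override (placeholder)
        subnetwork="regions/us-central1/subnetworks/my-subnet",  # Subnetwork override
        temp_location="gs://my-bucket/tmp",   # placeholder staging bucket
        # The zone can be set similarly, or left unset so Dataflow chooses one
        # (comparable to selecting Auto Zone above).
    )
    print(options.get_all_options(drop_default=True))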

Parameterize destination settings

For file- or table-based publishing actions, you can parameterize elements of the output path. Whenever you execute a job, you can pass in parameter values through the Run Job page.

NOTE: Output parameters are independent of dataset parameters. However, two variables of different types with the same name should resolve to the same value.

Supported parameter types:

  • Timestamp
  • Variable

For more information, see Overview of Parameterization.
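As a hedged illustration of how a Timestamp output parameter resolves at run time, suppose the parameterized path were gs://my-bucket/output/orders_{timestamp}.csv with a yyyy-MM-dd format (both are assumptions, not values taken from this page). The substitution behaves roughly like this:

    from datetime import datetime, timezone

    # Placeholder parameterized path; {timestamp} marks the highlighted region.
    path_template = "gs://my-bucket/output/orders_{timestamp}.csv"

    # Resolve the parameter at run time using the chosen format and timezone.
    resolved = path_template.format(
        timestamp=datetime.now(timezone.utc).strftime("%Y-%m-%d")
    )
    print(resolved)   # e.g. gs://my-bucket/output/orders_2024-01-15.csv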

Figure: Define destination parameter

Steps:

  1. When you add or edit a publishing action, click Parameterize destination in the right panel.
  2. On the listed output path, highlight the part that you wish to parameterize. Then, choose the type of parameter.
  3. For Timestamp parameters:
    1. Timestamp format: Specify the format for the timestamp value.
    2. Timestamp value: You can choose to record the exact job start time or the time when the results are written relative to the job start time.
    3. Timezone: To change the timezone recorded in the timestamp, click Change.
  4. For Variable parameters:
    1. Name: Enter a display name for the variable.

      NOTE: Variable names do not have to be unique. Two variables with the same name should resolve to the same value.

    2. Default value: Enter a default value for the parameter.
  5. To save your output parameter, click Save.
  6. You can create multiple output parameters for the same output.
  7. To save all of your parameters for the output path, click Submit.
  8. The parameter or parameters that you have created are displayed at the bottom of the screen. You can change the value for each parameter whenever you run the job.

Run job

To execute the job as configured, click Run Job. The job is queued for execution.

Cloud Dataflow imposes a limit on the size of the job, as represented by the JSON job description that is passed to it.

Tip: If this limit is exceeded, the job may fail with a job graph too large error. The workaround is to split the job into smaller jobs, for example by splitting the recipe into multiple recipes. This is a known limitation of Cloud Dataflow.

After a job has been queued, you can track its progress toward completion. See Jobs Page.
