Run Job on Cloud Dataflow

After you have executed a job in Cloud Dataprep by TRIFACTA®, you can re-run the job with different parameters directly from Cloud Dataflow. Collect the input and output parameters from the Cloud Dataprep interface, then apply or modify those parameters through the Cloud Dataflow interface and execute the job.

NOTE: Cloud Dataflow templates allow you to specify Cloud Dataflow jobs that can be run at any time. For more information, see the Cloud Dataflow template docs.

Known Limitations

NOTE: If you are modifying the inputs to the job, the new input must be of the same type and schema as the original input. For example, you cannot change a BigQuery input from the original job to use a Google Cloud Storage input when the job is re-run. File formats and schemas must also match.

  • Cloud Dataflow imposes a limit on the size of the job as represented by the JSON passed in.

    Tip: If this limit is exceeded, the job may fail with a job graph too large error. The workaround is to split the job into smaller jobs, such as by splitting the recipe into multiple recipes. This is a known limitation of Cloud Dataflow.

  • Cloud Dataprep by TRIFACTA publishing options such as append, single file, and header settings are ignored for jobs started in Cloud Dataflow.

  • For the Cloud Dataprep by TRIFACTA job, output files are written to temporary locations, the paths of which contain global identifiers and temporary table names. You can use parts of these URIs to specify the output locations in the Cloud Dataflow job. See the Tips section below.

  • Profiling is not recommended. When profiling is enabled:

    • Profiling is also enabled for all jobs started from the Cloud Dataflow template.
    • Two additional output files are created, which must be specified as output locations in the template job definition.
    • See the Tips section below.

  • Cloud Dataflow templates generated from a Cloud Dataprep by TRIFACTA job are intended to be a static copy of the job as it was executed at that moment in time.
    • All relative functions are computed as of the moment of execution. Functions such as NOW() and TODAY() are not recomputed when the Cloud Dataflow template is executed.
    • To update the output values of these functions, re-run the job in Cloud Dataprep by TRIFACTA through the UI or as a scheduled job. Then execute the Cloud Dataflow template job.

Workflow

Steps:

  1. Run the job through the Flow View page in the Cloud Dataprep interface. For more information, see Flow View Page.
  2. When the job completes, click the job identifier in the Jobs tab.
  3. In the Job Details page, click the Overview tab. From the Job summary, click Copy to clipboard for the Cloud Dataflow template. For more information, see Job Details Page.
  4. Use this link to reference the template in Cloud Dataflow. For more information on executing a job using the Cloud Dataflow template, see the Cloud Dataflow template docs.
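
For example, the following is a minimal sketch of launching the generated template programmatically through the Cloud Dataflow API (projects.templates.launch) using the google-api-python-client library for Python. The project ID, template path, job name, and parameter names below are placeholders, not values from this document; substitute the template path copied from the Job Details page, and use the parameter names that the Cloud Dataflow job form shows for your template.

# A minimal sketch of launching a Dataprep-generated Cloud Dataflow template.
# Assumes the google-api-python-client package is installed and Application
# Default Credentials are configured. The project ID, template path, and
# parameter names/values are placeholders; substitute the values copied from
# the Cloud Dataprep Job Details page and the Cloud Dataflow job form.
import json

from googleapiclient.discovery import build

PROJECT_ID = "my-gcp-project"                                         # hypothetical project
TEMPLATE_PATH = "gs://dp-staging-b1/templates/my-dataprep-template"   # copied from Job Details

# Output locations in the same {"location1": ..., "location2": ...} form
# shown in the Tips section below.
output_locations = {
    "location1": "gs://dp-staging-b1/dptester02@example.com/jobrun/my-df-jobs/job02/POS-r01.json/file",
    "location2": "gs://dp-staging-b1/dptester02@example.com/jobrun/my-df-jobs/job02/POS-r01.csv/file",
}

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().templates().launch(
    projectId=PROJECT_ID,
    gcsPath=TEMPLATE_PATH,
    body={
        "jobName": "dataprep-template-rerun",
        "parameters": {
            # The parameter name is an assumption; use the names that your
            # template's Cloud Dataflow job form actually exposes.
            "outputLocations": json.dumps(output_locations),
        },
    },
).execute()

print(response["job"]["id"])

The same template path and parameters can also be entered in the Cloud Dataflow job form in the console.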

Tips

Disable profiling

You should disable profiling in the source Cloud Dataprep by TRIFACTA job. Since all outputs must be specified for the Cloud Dataflow job, you must specify the profiling outputs, even if you do not intend to use them. Profiling incurs additional cost for the Cloud Dataflow job, and its output is not usable.

Use source URIs

When you are specifying the URIs for the Cloud Dataflow job, you should copy the source URIs, paste them into a text editor, and modify them before pasting them back into the Cloud Dataflow job form. Example source output:

{"location1":"gs://dp-staging-b1/dpreptester02@example.com/jobrun/
.data_prep_temp/f5d299bc-0c56-42fa-858c-627c64d9d027/POS-r01.json/file",
"location2":"gs://dp-staging-b1/dptester02@example.com/jobrun/
.data_prep_temp/00c223dc-4f06-4a5b-84a9-e9f72078a1e1/POS-r01.csv/file"}

From the above, you might remove the parts of the URIs that point to temporary locations. Then, you can modify the URIs to point to your new locations:

NOTE: The user that executes the job must have read and write permissions to any new output location.

{"location1":"gs://dp-staging-b1/dptester02@example.com/jobrun/
my-df-jobs/job02/POS-r01.json/file","location2":"gs://dp-staging-b1/
dptester02@example.com/jobrun/my-df-jobs/job02/POS-r01.csv/file"}

NOTE: For the Cloud Dataflow job, output locations are permanent. If you are re-using the Cloud Dataflow job, you may need to specify a new location when you re-run it.
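
If you prefer to script this edit instead of using a text editor, the following is a small sketch in Python that strips the temporary .data_prep_temp/<uuid> segment from each URI and substitutes a permanent output directory. The source JSON and the my-df-jobs/job02 directory are simply the example values from above; adjust the replacement to match your own bucket layout.

# A small sketch that rewrites the temporary Dataprep output URIs into
# permanent output locations. The source JSON is the text copied from the
# Cloud Dataprep job; the replacement directory is the example used above.
import json
import re

source_output = (
    '{"location1":"gs://dp-staging-b1/dptester02@example.com/jobrun/'
    '.data_prep_temp/f5d299bc-0c56-42fa-858c-627c64d9d027/POS-r01.json/file",'
    '"location2":"gs://dp-staging-b1/dptester02@example.com/jobrun/'
    '.data_prep_temp/00c223dc-4f06-4a5b-84a9-e9f72078a1e1/POS-r01.csv/file"}'
)

new_directory = "my-df-jobs/job02"  # your permanent output directory

locations = json.loads(source_output)
for key, uri in locations.items():
    # Replace the ".data_prep_temp/<uuid>" segment with the new directory.
    locations[key] = re.sub(r"\.data_prep_temp/[0-9a-f\-]+", new_directory, uri)

# Paste this JSON into the Cloud Dataflow job form as the output locations.
print(json.dumps(locations))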
