Run Job on Dataflow

After you have executed a job in Cloud Dataprep, you can re-run the job with different parameters directly from Cloud Dataflow. After you collect the input and output parameters from the Cloud Dataprep interface, you can apply or modify those parameters through the Cloud Dataflow interface and then execute the job.

NOTE: Cloud Dataflow templates allow you to specify Cloud Dataflow jobs that can be run at any time. For more information, see the Dataflow template docs.

Known Limitations

NOTE: If you are modifying the inputs to the job, the new input must be of the same type and schema as the original input. For example, you cannot replace a BigQuery input from the original job with a Google Cloud Storage input when the job is re-run. File formats and schemas must also match.

  • Cloud Dataprep publishing options such as append, single file, and header settings are ignored for jobs started in Cloud Dataflow.

  • For the Cloud Dataprep job, output files are written to temporary locations, the paths of which contain global identifiers and temporary table names. You can use parts of these URIs to specify the output locations in the Cloud Dataflow job. See the Tips section below.

  • Profiling is not recommended. When profiling is enabled:

    • Profiling is also enabled for all jobs started from the Cloud Dataflow template.
    • Two additional output files are created, which must be specified as output locations in the template job definition.
    • See the Tips section below.


Workflow

Steps:

  1. Run the job through the Flow View page in the Cloud Dataprep interface. For more information, see Flow View Page.
  2. When the job completes, select the context menu next to the job in the Jobs tab. Select Export Results.
  3. In the Export Results window, copy the Dataflow Template URL value. This link is used to reference the template in Cloud Dataflow. For more information, see Export Results Window. Then, close the window.
  4. For more information on executing a job using the Cloud Dataflow template, see the Dataflow template docs.
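
If you prefer to launch the template programmatically rather than through the Cloud Dataflow console, the Dataflow REST API (v1b3) exposes a templates.launch method. The sketch below uses the Google API Python client; the project ID, template path, job name, and parameter names (inputLocations, outputLocations) are illustrative assumptions. Copy the actual template URL and parameter values from the Export Results window for your job.

# Minimal sketch: launch a Cloud Dataprep-generated Dataflow template through
# the Dataflow REST API (v1b3) using the Google API Python client.
# Requires: pip install google-api-python-client (and application default credentials).
from googleapiclient.discovery import build

PROJECT_ID = "my-project"                                   # assumption: your GCP project
TEMPLATE_PATH = "gs://my-bucket/dataprep/templates/job-01"  # assumption: use the Dataflow Template URL from Export Results

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().templates().launch(
    projectId=PROJECT_ID,
    gcsPath=TEMPLATE_PATH,
    body={
        "jobName": "dataprep-rerun-01",
        # Parameter names and values are template-specific; copy them from the
        # Export Results window. The keys and paths below are illustrative only.
        "parameters": {
            "inputLocations": '{"location1":"gs://my-bucket/input/POS-r01.csv"}',
            "outputLocations": '{"location1":"gs://my-bucket/output/POS-r01.json/file"}',
        },
    },
).execute()

print(response["job"]["id"])  # ID of the launched Cloud Dataflow job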

Tips

Disable profiling

You should disable profiling in the source Cloud Dataprep job. Because all outputs must be specified for the Cloud Dataflow job, you must also specify output locations for the profiling files, even if you do not intend to use them. Profiling incurs additional cost for the Cloud Dataflow job, and its output is not usable there.

Use source URIs

When you are specifying the URIs for the Cloud Dataflow job, copy the source URIs, paste them into a text editor, and modify them before pasting them back into the Cloud Dataflow job form. Example source output:

{"location1":"gs://dataprep-staging-b1/datapreptester02@example.com/jobrun/
.data_prep_temp/f5d299bc-0c56-42fa-858c-627c64d9d027/POS-r01.json/file",
"location2":"gs://dataprep-staging-b1/datapreptester02@example.com/jobrun/
.data_prep_temp/00c223dc-4f06-4a5b-84a9-e9f72078a1e1/POS-r01.csv/file"}

From the above, remove the parts of the URIs that point to temporary locations (the .data_prep_temp/<identifier> segments). Then, modify the URIs to point to your new locations:

NOTE: The user that executes the job must have read and write permissions to any new output location.

{"location1":"gs://dataprep-staging-b1/datapreptester02@example.com/jobrun/
my-dataflow-jobs/job02/POS-r01.json/file","location2":"gs://dataprep-staging-b1/
datapreptester02@example.com/jobrun/my-dataflow-jobs/job02/POS-r01.csv/file"}

NOTE: In the Cloud Dataflow job, output locations are permanent. If you re-use the Cloud Dataflow job, you may need to specify a new output location when you re-run it.
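
If you re-run the job regularly, editing these URIs by hand is error-prone. The following sketch shows one way to script the substitution. It assumes the temporary segment always has the form .data_prep_temp/<identifier>, as in the source output above; verify the pattern against your own output, and note that my-dataflow-jobs/job02 is just an illustrative target prefix.

import json
import re

# Source output copied from the original Cloud Dataprep job (see the example above).
source_output = (
    '{"location1":"gs://dataprep-staging-b1/datapreptester02@example.com/jobrun/'
    '.data_prep_temp/f5d299bc-0c56-42fa-858c-627c64d9d027/POS-r01.json/file",'
    '"location2":"gs://dataprep-staging-b1/datapreptester02@example.com/jobrun/'
    '.data_prep_temp/00c223dc-4f06-4a5b-84a9-e9f72078a1e1/POS-r01.csv/file"}'
)

# Replace the temporary ".data_prep_temp/<id>" segment with a permanent prefix.
# "my-dataflow-jobs/job02" is illustrative; the user running the job must have
# read and write permissions to whatever location you substitute here.
NEW_PREFIX = "my-dataflow-jobs/job02"
TEMP_SEGMENT = re.compile(r"\.data_prep_temp/[0-9a-f-]+")

locations = {
    key: TEMP_SEGMENT.sub(NEW_PREFIX, uri)
    for key, uri in json.loads(source_output).items()
}

# Compact JSON, ready to paste into the Cloud Dataflow job form or to pass
# as a template parameter value.
print(json.dumps(locations, separators=(",", ":")))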
