API Workflow - Run Job

This section describes how to run a job using the APIs available in Cloud Dataprep by TRIFACTA® INC.

A note about API URLs:

In the listed examples, URLs are referenced in the following manner:

<protocol>://<platform_base_url>/

In your product, these references map to the following:

https://www.api.clouddataprep.com/

For more information, see API Reference.

Prerequisites

Before you begin, you should verify the following:

  1. Get authentication credentials. As part of each request, you must pass in authentication credentials to the platform.

    Tip: The recommended method is to use an access token, which can be generated from the Cloud Dataprep application. For more information, see Access Tokens Page.

    For more information, see https://clouddataprep.com/documentation/api/#section/Authentication

  2. Verify job execution. Run the desired job through the Cloud Dataprep application and verify that the output objects are properly generated.
  3. Acquire recipe (wrangled dataset) identifier. In Flow View, click the icon for the recipe whose outputs you wish to generate. Acquire the numeric value for the recipe from the URL. In the following, the recipe Id is 28629:

    http://<platform_base_url>/flows/5479?recipe=28629&tab=recipe
  4. Create output object. A recipe must have at least one output object created for it before you can run a job via APIs. For more information, see Flow View Page.

If you wish to apply overrides to the inputs or outputs of the recipe, you should acquire those identifiers or paths now. For more information, see "Run Job with Parameter Overrides" below.
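The recipe identifier described in step 3 can also be read from the Flow View URL programmatically. The following is a minimal sketch using only the Python standard library; the function name recipe_id_from_flow_url and the example.com host are hypothetical and used for illustration only.

```python
from urllib.parse import urlparse, parse_qs


def recipe_id_from_flow_url(url: str) -> int:
    """Return the numeric recipe id from the `recipe` query parameter of a Flow View URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("recipe")
    if not values:
        raise ValueError("URL has no 'recipe' query parameter")
    return int(values[0])


# Hypothetical host; the path and query string follow the example above.
recipe_id = recipe_id_from_flow_url("http://example.com/flows/5479?recipe=28629&tab=recipe")
print(recipe_id)
```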

Step - Run Job

Through the APIs, you can specify and run a job. To run a job with all default settings, construct a request like the following:

NOTE: wrangledDataset is the internal object name for the recipe that you wish to run. See the previous section for how to acquire its id value.

Tip: You cannot apply overrides to the job definition through the API request. However, overrides can be specified through the Cloud Dataprep by TRIFACTA INC. interface.

Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
Authentication: Required
Method: POST
Request Body:
{
  "wrangledDataset": {
    "id": 28629
  }
}
Response Code: 201 - Created
Response Body:
{
    "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
    "reason": "JobStarted",
    "jobGraph": {
        "vertices": [
            21,
            22
        ],
        "edges": [
            {
                "source": 21,
                "target": 22
            }
        ]
    },
    "id": 961247,
    "jobs": {
        "data": [
            {
                "id": 21
            },
            {
                "id": 22
            }
        ]
    }
}

If the 201 response code is returned, then the job has been queued for execution.

Tip: Retain the id value in the response. In the above, 961247 is the internal identifier for the job group for the job. You will need this value to check on your job status.

For more information, see https://clouddataprep.com/documentation/api/#operation/runJobGroup

Checkpoint: You have queued your job for execution.
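As a minimal sketch, the request above could be assembled in Python as follows. The Bearer-style Authorization header and the helper name build_run_job_request are assumptions for illustration; confirm the authentication scheme against the API Reference. The request is only built here, not sent.

```python
import json
import urllib.request

# Base URL taken from the note on API URLs at the top of this section.
BASE_URL = "https://www.api.clouddataprep.com"


def build_run_job_request(recipe_id: int, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v4/jobGroups request for a recipe."""
    body = json.dumps({"wrangledDataset": {"id": recipe_id}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/jobGroups",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the access token is passed as a Bearer token.
            "Authorization": f"Bearer {access_token}",
        },
    )


request = build_run_job_request(28629, "<access_token>")
# To send it: with urllib.request.urlopen(request) as resp: job_group = json.load(resp)
```

After a 201 response, the id field of the parsed response body is the job group identifier to retain.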

Step - Monitoring Your Job

You can monitor the status of your job through the following endpoint:

Endpoint: <protocol>://<platform_base_url>/v4/jobGroups/<id>/
Authentication: Required
Method: GET
Request Body: None.
Response Code: 200 - Ok
Response Body:
{
    "id": 961247,
    "name": null,
    "description": null,
    "ranfrom": "ui",
    "ranfor": "recipe",
    "status": "Complete",
    "profilingEnabled": true,
    "runParameterReferenceDate": "2019-08-20T17:46:27.000Z",
    "createdAt": "2019-08-20T17:46:28.000Z",
    "updatedAt": "2019-08-20T17:53:17.000Z",
    "workspace": {
        "id": 22
    },
    "creator": {
        "id": 38
    },
    "updater": {
        "id": 38
    },
    "snapshot": {
        "id": 774476
    },
    "wrangledDataset": {
        "id": 28629
    },
    "flowRun": null
}

When the job has successfully completed, the returned status message includes the following:

"status": "Complete",

For more information, see https://clouddataprep.com/documentation/api/#operation/getJobGroup

Checkpoint: You have executed the job. Results have been delivered to the designated output locations.
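Monitoring usually means repeating the GET request until the status settles. The sketch below shows one way to structure that loop. Only the "Complete" status appears in this section; the other status names ("Failed", "Canceled", "InProgress") and the helper name wait_for_job_group are assumptions. The fetch argument stands in for whatever function performs the GET call and returns the parsed response body.

```python
import time
from typing import Callable, Dict

# "Complete" is documented above; "Failed" and "Canceled" are assumed terminal values.
TERMINAL_STATUSES = {"Complete", "Failed", "Canceled"}


def wait_for_job_group(fetch: Callable[[], Dict], interval_s: float = 10.0,
                       max_polls: int = 60) -> str:
    """Call fetch() until the returned job group reports a terminal status."""
    for _ in range(max_polls):
        status = fetch().get("status")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError("job group did not reach a terminal status")


# Demonstration with canned responses in place of live GET calls.
canned = iter([{"status": "InProgress"}, {"status": "Complete"}])
final_status = wait_for_job_group(lambda: next(canned), interval_s=0)
print(final_status)
```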

Step - Re-run Job

In the future, you can re-run the job using the same, simple request:

Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
Authentication: Required
Method: POST
Request Body:
{
  "wrangledDataset": {
    "id": 28629
  }
}

The job is re-run as it was previously specified.

For more information, see https://clouddataprep.com/documentation/api/#operation/createJobGroup

Step - Run Job with Overrides - Files

As needed, you can specify runtime overrides for any of the settings related to the job definition or its outputs. For file-based jobs, these overrides include:

  • Data sources
  • Execution environment
  • Profiling
  • Output file, format, and other settings

Input file overrides

You can override the file-based data sources for your job run. In the following example, two parameterized datasets are overridden with new files.

NOTE: Overrides for data sources apply only to file-based sources. File-based sources that are converted during ingestion, such as Microsoft Excel files, cannot be swapped in this manner.

Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
Authentication: Required
Method: POST
Request Body:
{
  "wrangledDataset": {
    "id": 28629
  },
  "overrides": {
    "datasources": {
      "airlines–2.csv parameterized": [
        "s3://my-new-bucket/test-override-input/airlines1.csv",
        "s3://my-new-bucket/test-override-input/airlines2.csv",
        "s3://my-new-bucket/test-override-input/airlines3.csv"
      ],
      "airlines–4.csv": [
        "s3://my-new-bucket/test-override-input/airlines1.csv",
        "s3://my-new-bucket/test-override-input/airlines2.csv"
      ]
    }
  }
}

The job specified for recipe 28629 is re-run using the new data sources.

Notes:

  • You can use this API method to overwrite the bucket name for your source, which is not possible through standard parameterization.
    • The parameterized list of files can be from different folders, too.
  • File type and size information is not displayed in the Job Details page for these overridden jobs.
  • No validation is performed on the existence of these files prior to execution. If the files do not exist, the job fails.

For more information, see https://clouddataprep.com/documentation/api/#operation/createJobGroup
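The data source override body can be generated from a mapping of dataset names to replacement file paths. This is a sketch only; the helper name datasource_override_body and the dataset name "my-dataset.csv" are hypothetical. Because the platform does not validate that the files exist, the only check made here is that each list is non-empty.

```python
import json
from typing import Dict, List


def datasource_override_body(recipe_id: int, sources: Dict[str, List[str]]) -> dict:
    """Build a jobGroups request body that swaps file-based data sources."""
    for name, paths in sources.items():
        if not paths:
            raise ValueError(f"no replacement files given for {name!r}")
    return {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {
            "datasources": {name: list(paths) for name, paths in sources.items()}
        },
    }


body = datasource_override_body(
    28629,
    {
        # Hypothetical dataset name; use the name shown for your imported dataset.
        "my-dataset.csv": [
            "s3://my-new-bucket/test-override-input/airlines1.csv",
            "s3://my-new-bucket/test-override-input/airlines2.csv",
        ]
    },
)
print(json.dumps(body, indent=2))
```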

Output file overrides

NOTE: Applying runtime overrides to jobs through the APIs is not supported in your product.

  1. Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
  2. Construct a request using the following:

    Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
    Authentication: Required
    Method: POST

    Request Body:

    {
      "wrangledDataset": {
        "id": 28629
      },
      "overrides": {
        "profiler": true,
        "execution": "spark",
        "writesettings": [
          {
            "path": "<new_path_to_output>",
            "format": "csv",
            "header": true,
            "asSingleFile": true
          }
        ]
      },
      "ranfrom": null
    }
    
  3. In the above example, the job has been launched with the following overrides:
    1. Job will be executed on the Spark cluster. Other supported values depend on your deployment:

      Value for overrides.execution    Description
      photon                           Running environment on Google Cloud node
      spark                            Spark on integrated cluster, with the following exceptions.
      databricksSpark                  Spark on Azure Databricks
      emrSpark                         Spark on AWS EMR

    2. Job will be executed with profiling enabled.
    3. Output is written to a new file path.
    4. Output is written in CSV format to the designated path.
    5. Output has a header and is generated as a single file.
  4. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 962221,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  5. Retain the id value, which is the job identifier, for monitoring.
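The request body in step 2 can be produced by a small helper so that only the output path varies between runs. This is a sketch under the assumption that the defaults mirror the example above; the function name file_output_override_body is hypothetical.

```python
import json


def file_output_override_body(recipe_id, output_path, fmt="csv", header=True,
                              as_single_file=True, execution="spark", profiler=True):
    """Build a jobGroups request body with file output overrides."""
    return {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {
            "profiler": profiler,
            "execution": execution,
            "writesettings": [
                {
                    "path": output_path,
                    "format": fmt,
                    "header": header,
                    "asSingleFile": as_single_file,
                }
            ],
        },
        "ranfrom": None,  # serialized as JSON null, as in the example above
    }


body = file_output_override_body(28629, "<new_path_to_output>")
print(json.dumps(body, indent=2))
```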

Step - Run Job with Overrides - Tables

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.

You can also pass job definition overrides for table-based outputs. For table outputs, overrides include:

  • Path to database to which to write (must have write access)
  • Connection to write to the target.

    Tip: This identifier is for the connection used to write to the target system. This connection must already exist. For more information on how to retrieve the identifier for a connection, see

    https://clouddataprep.com/documentation/api/#operation/listConnections

  • Name of output table
  • Target table type

    Tip: You can acquire the target type from the vendor value in the connection response. For more information, see

    https://clouddataprep.com/documentation/api/#operation/listConnections

  • action:

    Key value          Description
    create             Create a new table with each publication.
    createAndLoad      Append your data to the table.
    truncateAndLoad    Truncate the table and load it with your data.
    dropAndLoad        Drop the table and write the new table in its place.
  • Identifier of connection to use to write data.

  1. Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
  2. Construct a request using the following:

    Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
    Authentication: Required
    Method: POST

    Request Body:

    {
      "wrangledDataset": {
        "id": 28629
      },
      "overrides": {
        "publications": [
          {
            "path": ["prod_db"],
            "tableName": "Table_CaseFctn2",
            "action": "createAndLoad",
            "targetType": "postgres",
            "connectionId": 3
          }
        ]
      },
      "ranfrom": null
    }
    
  3. In the above example, the job has been launched with the following overrides:

    NOTE: When overrides are applied to publishing, any publications that are already attached to the recipe are ignored.

    1. Output path is to the prod_db database, using table name Table_CaseFctn2.
    2. Output action is "create and load." See above for definitions.
    3. Target table type is a PostgreSQL table.
  4. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 962222,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  5. Retain the id value, which is the job identifier, for monitoring.
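A helper can reject unsupported action values before the request is sent. The sketch below assumes that path is a JSON array containing the database name and that the four action keys listed above are the complete set; the function name table_output_override_body is hypothetical.

```python
import json

# Action keys taken from the list of table overrides above.
ALLOWED_ACTIONS = {"create", "createAndLoad", "truncateAndLoad", "dropAndLoad"}


def table_output_override_body(recipe_id, database, table_name, action,
                               target_type, connection_id):
    """Build a jobGroups request body with a single table publication override."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {
            "publications": [
                {
                    "path": [database],  # assumption: path is an array of segments
                    "tableName": table_name,
                    "action": action,
                    "targetType": target_type,
                    "connectionId": connection_id,
                }
            ]
        },
        "ranfrom": None,
    }


body = table_output_override_body(28629, "prod_db", "Table_CaseFctn2",
                                  "createAndLoad", "postgres", 3)
print(json.dumps(body, indent=2))
```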

Step - Run Job with Overrides - Webhooks

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.

When you execute a job, you can pass in a set of parameters as overrides to generate a webhook message to a third-party application, based on the success or failure of the job.

For more information on webhooks, see Create Flow Webhook Task.

  1. Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
  2. Construct a request using the following:

    Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
    Authentication: Required
    Method: POST

    Request Body:

    {
      "wrangledDataset": {
        "id": 28629
      },
      "overrides": {
        "webhooks": [{
          "name": "webhook override",
          "url": "http://example.com",
          "method": "post",
          "triggerEvent": "onJobFailure",
          "body": {
            "text": "override"
          },
          "headers": {
            "testHeader": "val1"
          },
          "sslVerification": true,
          "secretKey": "123"
        }]
      }
    }
  3. In the above example, the job has been launched with the following overrides:

    Override setting    Description
    name                Name of the webhook.
    url                 URL to which to send the webhook message.
    method              The HTTP method to use. Supported values: POST, PUT, PATCH, GET, or DELETE. Body is ignored for GET and DELETE methods.
    triggerEvent        Supported values:
                          onJobFailure - send webhook message if job fails
                          onJobSuccess - send webhook message if job completes successfully
                          onJobDone - send webhook message when job fails or finishes successfully
    body                (optional) The value of the text field is the message that is sent.
                        NOTE: Some special token values are supported. See Create Flow Webhook Task.
    headers             (optional) Key-value pairs of headers to include in the HTTP request.
    sslVerification     (optional) Set to true if SSL verification should be completed. If not specified, the value is true.
    secretKey           (optional) If enabled, this value should be set to the secret key to use.
  4. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 962222,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  5. Retain the id value, which is the job identifier, for monitoring.
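The webhook settings described in step 3 lend themselves to a small validating builder. This sketch checks the trigger event and HTTP method against the supported values listed above and omits the body for GET and DELETE, since it is ignored for those methods. The function name webhook_override is hypothetical.

```python
# Supported values as listed in the override settings above.
TRIGGER_EVENTS = {"onJobFailure", "onJobSuccess", "onJobDone"}
HTTP_METHODS = {"POST", "PUT", "PATCH", "GET", "DELETE"}


def webhook_override(name, url, method, trigger_event, text=None,
                     headers=None, ssl_verification=True, secret_key=None):
    """Build one entry for the overrides.webhooks array."""
    if method.upper() not in HTTP_METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    if trigger_event not in TRIGGER_EVENTS:
        raise ValueError(f"unsupported triggerEvent: {trigger_event!r}")
    hook = {
        "name": name,
        "url": url,
        "method": method,
        "triggerEvent": trigger_event,
        "sslVerification": ssl_verification,
    }
    # The body is ignored for GET and DELETE, so leave it out for those methods.
    if text is not None and method.upper() not in {"GET", "DELETE"}:
        hook["body"] = {"text": text}
    if headers:
        hook["headers"] = dict(headers)
    if secret_key is not None:
        hook["secretKey"] = secret_key
    return hook


hook = webhook_override("webhook override", "http://example.com", "post",
                        "onJobFailure", text="override",
                        headers={"testHeader": "val1"}, secret_key="123")
request_body = {"wrangledDataset": {"id": 28629}, "overrides": {"webhooks": [hook]}}
```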

Step - Run Job with Parameter Overrides

You can pass overrides of the default parameter values as part of the job definition. You can use the following mechanism to pass in parameter overrides of the following types:

  • Datasets with parameters (variable type)
  • Output object parameters
  • Flow parameters

The syntax is the same for each type.

  1. Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
  2. Construct a request using the following:

    Endpoint: <protocol>://<platform_base_url>/v4/jobGroups
    Authentication: Required
    Method: POST

    Request Body:

    {
      "wrangledDataset": {
        "id": 28629
      },
      "overrides": {
        "runParameters": {
          "overrides": {
            "data": [{
              "key": "varRegion",
              "value": "02"
            }]
          }
        }
      },
      "ranfrom": null
    }
    
  3. In the above example, the specified job has been launched for recipe 28629. The run parameter varRegion has been set to 02 for this specific job. Depending on how it's defined in the flow, this parameter could change any of the following:
    1. The source for the imported dataset.
    2. The path for the generated output.
    3. A flow parameter reference in the recipe.

    For more information, see Overview of Parameterization.
  4. A response code of 201 - Created is returned. The response body should look like the following:

    {
        "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1",
        "reason": "JobStarted",
        "jobGraph": {
            "vertices": [
                21,
                22
            ],
            "edges": [
                {
                    "source": 21,
                    "target": 22
                }
            ]
        },
        "id": 962223,
        "jobs": {
            "data": [
                {
                    "id": 21
                },
                {
                    "id": 22
                }
            ]
        }
    }
  5. Retain the id value, which is the job identifier, for monitoring.
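Because the syntax is the same for every parameter type, the key-value list can be generated from a plain dictionary. The following sketch assumes that values are passed as strings, as in the example above; the function name run_parameter_override_body is hypothetical.

```python
import json


def run_parameter_override_body(recipe_id, parameters):
    """Build a jobGroups request body that overrides run parameter values."""
    data = [{"key": key, "value": str(value)} for key, value in parameters.items()]
    return {
        "wrangledDataset": {"id": recipe_id},
        "overrides": {"runParameters": {"overrides": {"data": data}}},
        "ranfrom": None,
    }


body = run_parameter_override_body(28629, {"varRegion": "02"})
print(json.dumps(body, indent=2))
```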

Step - Dataflow Execution Overrides

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA® INC.

NOTE: Overrides applied to the jobGroup are merged with any overrides specified as part of the output objects associated with the wrangledDataset. For more information, see API Workflow - Manage Outputs.

If neither object has a specified override for a Cloud Dataflow property, the applicable project setting is used. See Project Settings Page.

General example

You can submit overrides to a specific set of Cloud Dataflow properties for your job execution.

NOTE: If you are using automatic VPC network mode, then network, subnetwork, and usePublicIps do not apply.

The following example shows how to run a job for a specified recipe with Cloud Dataflow property overrides applied to it:

Endpoint: https://www.api.clouddataprep.com/v4/jobGroups
Authentication: Required
Method: POST

Request Body:

{
  "wrangledDataset": {
    "id": 28629
  },
  "execution": "dataflow",
  "dataflowOptions": [
    {"region": "first-region"},
    {"zone": "second-zone"},
    {"machineType": "n1-standard-32"},
    {"network": ""},
    {"subnetwork": ""},
    {"autoscalingAlgorithm": "THROUGHPUT_BASED"},
    {"maxNumWorkers": "1000"},
    {"numWorkers": "10"}
  ]
}

Notes on properties:

  • You can submit empty or null values for property values in the payload. These values are submitted.
  • If you are not using auto-scaling on your job:
    • Set "autoscalingAlgorithm": "NONE".
    • Use "numWorkers" to specify the number of compute nodes to use for the job.
    • Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.
  • If you are using auto-scaling on your job:
    • Set "autoscalingAlgorithm": "THROUGHPUT_BASED".
    • Use "maxNumWorkers" and "numWorkers" to specify the number of compute nodes to use for the job.
      • Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.
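The two auto-scaling cases in the notes above can be captured in one helper that emits the matching dataflowOptions entries. This is a sketch only: the function name worker_options is hypothetical, and numeric values are passed as strings because the request examples in this section do so.

```python
def worker_options(num_workers, max_num_workers=None):
    """Return dataflowOptions entries for worker counts.

    Without max_num_workers, auto-scaling is turned off and only numWorkers
    is sent. With it, THROUGHPUT_BASED auto-scaling is requested.
    """
    if max_num_workers is None:
        return [
            {"autoscalingAlgorithm": "NONE"},
            {"numWorkers": str(num_workers)},
        ]
    return [
        {"autoscalingAlgorithm": "THROUGHPUT_BASED"},
        {"maxNumWorkers": str(max_num_workers)},
        {"numWorkers": str(num_workers)},
    ]


fixed = worker_options(10)
scaled = worker_options(10, max_num_workers=1000)

request_body = {
    "wrangledDataset": {"id": 28629},
    "execution": "dataflow",
    "dataflowOptions": [{"machineType": "n1-standard-32"}] + scaled,
}
```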

Example using VPC

By default, Cloud Dataflow expects that submitted jobs are executed across publicly available IP addresses (usePublicIps = true). As needed, you can use resources available through a VPC.

NOTE: Google Private Access must be enabled on your Virtual Private Cloud (VPC) for Cloud Dataprep by TRIFACTA INC. to access it.

If needed, you can override the default settings to execute the job on workers that are available through your VPC.

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.

The following example shows how to run a job for a specified recipe with Cloud Dataflow configured to use your specified VPC:

Endpoint: https://www.api.clouddataprep.com/v4/jobGroups
Authentication: Required
Method: POST

Request Body:

{
  "wrangledDataset": {
    "id": 28629
  },
  "execution": "dataflow",
  "dataflowOptions": [
    {"region": "first-region"},
    {"zone": "second-zone"},
    {"machineType": "n1-standard-32"},
    {"network": "my-network-name"},
    {"subnetwork": "my-subnetwork-name"},
    {"autoscalingAlgorithm": "THROUGHPUT_BASED"},
    {"serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com"},
    {"numWorkers": "1"},
    {"maxNumWorkers": "1000"},
    {"usePublicIps": "false"}
  ]
}

Example with labels

You can use labels to assign billing information for the job in your project.

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA INC.

The following example shows how to run a job for a specified recipe with Cloud Dataflow labels applied to it:

Endpoint: https://www.api.clouddataprep.com/v4/jobGroups
Authentication: Required
Method: POST

Request Body:

{
  "wrangledDataset": {
    "id": 28629
  },
  "execution": "dataflow",
  "dataflowOptions": [
    {"region": "first-region"},
    {"zone": "second-zone"},
    {"machineType": "n1-standard-32"},
    {"network": ""},
    {"subnetwork": ""},
    {"autoscalingAlgorithm": "THROUGHPUT_BASED"},
    {"maxNumWorkers": "1000"},
    {"numWorkers": "10"},
    {"labels": [
      {
        "key": "first-new-label-key",
        "value": "first-new-label-value"
      },
      {
        "key": "second-new-label-key",
        "value": "second-new-label-value"
      }
    ]
   }
  ]
}

Notes on labels:

You can apply up to 64 labels for a job. For more information on the available properties, see Dataflow Execution Settings.
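The 64-label limit can be checked on the client before the job is submitted. The sketch below builds the labels entry for dataflowOptions from a dictionary; the function name labels_option is hypothetical, and the check covers only the count, not any naming rules for label keys or values.

```python
MAX_LABELS = 64  # limit stated in the notes on labels above


def labels_option(labels):
    """Return the dataflowOptions entry for a dict of label keys and values."""
    if len(labels) > MAX_LABELS:
        raise ValueError(f"at most {MAX_LABELS} labels are allowed, got {len(labels)}")
    return {"labels": [{"key": key, "value": value} for key, value in labels.items()]}


option = labels_option({
    "first-new-label-key": "first-new-label-value",
    "second-new-label-key": "second-new-label-value",
})
print(option)
```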