API Workflow - Swap Datasets

Feature Availability: This feature is available in Cloud Dataprep Premium by TRIFACTA® INC.

Overview

After you have created a flow, imported a dataset, and created a recipe for that dataset, you may need to swap in a different dataset and run the recipe against that one. This workflow steps through that process via the APIs.

NOTE: If you are processing multiple parallel datasources in a single job, you should create a dataset with parameters and then run the job. For more information, see API Workflow - Run Job on Dataset with Parameters.

This workflow utilizes the following methods:

  1. Create an imported dataset. After the new file has been added to the backend datastore, you can import it into Cloud Dataprep by TRIFACTA® INC. as an imported dataset.

  2. Swap dataset. Using the ID of the imported dataset you created, you can now assign the dataset to the recipe in your flow.
  3. Run a job. Run the job against the dataset.
  4. Monitor progress. Monitor the progress of the job until it is complete.

Example Datasets

In this example, you are wrangling data from orders placed in different regions on a quarterly basis. When a new file drops, you want to be able to swap out the current dataset that is assigned to the recipe and swap in the new one. Then, run the job.


Example Files:

The following files are stored on your HDFS deployment:

Path and Filename                             Description
hdfs:///user/orders/MyCo-orders-west-Q1.txt   Orders from West region for Q1
hdfs:///user/orders/MyCo-orders-west-Q2.txt   Orders from West region for Q2
hdfs:///user/orders/MyCo-orders-north-Q1.txt  Orders from North region for Q1
hdfs:///user/orders/MyCo-orders-north-Q2.txt  Orders from North region for Q2
hdfs:///user/orders/MyCo-orders-east-Q1.txt   Orders from East region for Q1
hdfs:///user/orders/MyCo-orders-east-Q2.txt   Orders from East region for Q2

Assumptions

You have already created a flow, which contains the following imported dataset and recipe:

NOTE: When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain initial parsing steps required to structure the data.

Tip: Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your structural recipe off of this one. Dataset swapping should happen on the first recipe.

Object Type                Name                       Id
flow                       MyCo-Orders-Quarter        2
Imported Dataset           MyCo-orders-west-Q1.txt    8
Recipe (wrangledDataset)   n/a                        9
Job                        n/a                        3

Base URL:

For purposes of this example, the base URL for the platform is the following:

http://www.example.com:3005

Step - Import Dataset

NOTE: You cannot add datasets to the flow through the flows endpoint. Moving pre-existing datasets into a flow is not supported in this release. Create or locate the flow first and then when you create the datasets, associate them with the flow at the time of creation.

The following steps describe how to create the imported dataset that you will swap into the existing flow (flowId=2).

Steps:

  1. To create an imported dataset, you must acquire the following information about the source.

    1. path
    2. type
    3. name
    4. description
    5. bucket (if a file stored on S3)
  2. In this example, the file you are importing is MyCo-orders-west-Q2.txt. Since the files are similar in nature and are stored in the same directory, you can reuse the values from the imported dataset that is already part of the flow. Execute the following:

    Endpoint: http://www.example.com:3005/v4/importedDatasets
    Authentication: Required
    Method: POST
    Request Body:
    {
      "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
      "name": "MyCo-orders-west-Q2.txt",
      "description": "MyCo-orders-west-Q2"
    }
    
  3. The response should be a 201 - Created status code with something like the following:

    {
        "id": 12,
        "size": "281032",
        "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
        "dynamicPath": null,
        "workspaceId": 1,
        "isSchematized": false,
        "isDynamic": false,
        "disableTypeInference": false,
        "createdAt": "2018-10-29T23:15:01.831Z",
        "updatedAt": "2018-10-29T23:15:01.889Z",
        "parsingRecipe": {
            "id": 11
        },
        "runParameters": [],
        "name": "MyCo-orders-west-Q2.txt",
        "description": "MyCo-orders-west-Q2",
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "connection": null
    }
  4. You must retain the id value so that you can reference it when you swap the dataset into the recipe.

  5. See https://clouddataprep.com/documentation/api/#operation/createImportedDataset

Checkpoint: You have imported a dataset that is unstructured and is not associated with any flow.
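The import request above can be sketched in Python as follows. This is a minimal sketch using only the standard library, not an official client. The function names and the optional token parameter are hypothetical, and bearer-token authentication is an assumption; use whatever authentication scheme your deployment requires. The request is built but not sent, so you can inspect it first.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # base URL used in this example


def build_import_request(path, name, description, token=None):
    """Build (but do not send) the POST request that creates an imported dataset."""
    body = {"path": path, "name": name, "description": description}
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: bearer-token auth. Replace with your deployment's scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/v4/importedDatasets",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_dataset_id(response_text):
    """Return the id of the new imported dataset from the 201 response body."""
    return json.loads(response_text)["id"]


request = build_import_request(
    "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    "MyCo-orders-west-Q2.txt",
    "MyCo-orders-west-Q2",
)
# To send it: urllib.request.urlopen(request).read().decode("utf-8")
```

Pass the decoded response text to extract_dataset_id to obtain the id that the swap step needs.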

Step - Swap Dataset from Recipe

The next step is to swap the primary input dataset for the recipe to point at the newly imported dataset. This step automatically adds the imported dataset to the flow and drops the previous imported dataset from the flow.

  1. Use the following to swap the primary input dataset for the recipe:

    Endpoint: http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset
    Authentication: Required
    Method: PUT
    Request Body:
    {
      "importedDataset": {
        "id": 12
      }
    }
  2. The response should be a 200 - OK status code with something like the following:

    {
        "id": 9,
        "wrangled": true,
        "createdAt": "2019-03-03T17:58:53.979Z",
        "updatedAt": "2019-03-03T18:01:11.310Z",
        "recipe": {
            "id": 9,
            "name": "POS-r01",
            "description": null,
            "active": true,
            "nextPortId": 1,
            "createdAt": "2019-03-03T17:58:53.965Z",
            "updatedAt": "2019-03-03T18:01:11.308Z",
            "currentEdit": {
                "id": 8
            },
            "redoLeafEdit": {
                "id": 7
            },
            "creator": {
                "id": 1
            },
            "updater": {
                "id": 1
            }
        },
        "referenceInfo": null,
        "activeSample": {
            "id": 7
        },
        "creator": {
            "id": 1
        },
        "updater": {
            "id": 1
        },
        "referencedFlowNode": null,
        "flow": {
            "id": 2
        }
    }
  3. The new imported dataset is now the primary input for the recipe, and the old imported dataset has been removed from the flow.
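The swap request can be sketched the same way. This is a minimal sketch under stated assumptions: build_swap_request is a hypothetical helper name, and the optional bearer token stands in for whatever authentication your deployment requires. The recipe id (9) and imported dataset id (12) come from this example.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # base URL used in this example


def build_swap_request(recipe_id, imported_dataset_id, token=None):
    """Build (but do not send) the PUT request that swaps a recipe's primary input."""
    body = {"importedDataset": {"id": imported_dataset_id}}
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: bearer-token auth. Replace with your deployment's scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="PUT",
    )


# Recipe 9 receives imported dataset 12 as its new primary input.
request = build_swap_request(recipe_id=9, imported_dataset_id=12)
# To send it: urllib.request.urlopen(request)
```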

Step - Rerun Job

To execute a job on this recipe, you can simply re-run any job that was executed on the old imported dataset, since you reference the job by jobId and wrangledDataset (recipe) Id.

Endpoint: http://www.example.com:3005/v4/jobGroups
Authentication: Required
Method: POST
Request Body:
{
  "wrangledDataset": {
    "id": 9
  }
}

The job is re-run as it was previously specified.

If you need to modify any job parameters, you must create a new job definition.
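As an illustration, the job request can be built in Python like this. It is a minimal sketch: build_run_job_request is a hypothetical helper name, and the optional bearer token is an assumption about authentication. Only the recipe id is sent, as in the request body above.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # base URL used in this example


def build_run_job_request(recipe_id, token=None):
    """Build (but do not send) the POST request that queues a job for a recipe."""
    body = {"wrangledDataset": {"id": recipe_id}}
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: bearer-token auth. Replace with your deployment's scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{BASE_URL}/v4/jobGroups",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_run_job_request(recipe_id=9)
# To send it: urllib.request.urlopen(request)
```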

Step - Monitor Your Job

After the job has been queued, you can track it to completion. See API Workflow - Develop a Flow.
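A polling loop for tracking the job might look like the sketch below. Several details are assumptions that this page does not confirm: that job status is available from GET /v4/jobGroups/<id>/status as a JSON string, and that "Complete", "Failed", and "Canceled" are the terminal status names. Verify both against the API reference for your release. The loop takes the status fetcher as a parameter, so it can be exercised without a network connection.

```python
import json
import time
import urllib.request

BASE_URL = "http://www.example.com:3005"  # base URL used in this example
TERMINAL_STATUSES = {"Complete", "Failed", "Canceled"}  # assumed status names


def http_status_fetcher(job_group_id):
    """Return a function that fetches the job status over HTTP.

    Assumption: the status endpoint path and its JSON-string response.
    Add authentication headers as your deployment requires.
    """
    def fetch():
        url = f"{BASE_URL}/v4/jobGroups/{job_group_id}/status"
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))
    return fetch


def wait_for_job(fetch_status, interval_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return that status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError(f"job did not finish within {max_polls} polls")


# Example, given a job id returned when the job was queued:
# final_status = wait_for_job(http_status_fetcher(job_group_id))
```

Bounding the number of polls keeps a stuck job from blocking the caller indefinitely.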

Step - Schedule Your Job

When you are satisfied with how your flow is working, you can set up periodic schedules using a third-party tool to execute the job on a regular basis.

The tool must call the endpoints above, in order, to import the new file, swap it into the recipe, and run the job.
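The sequence a scheduled tool would follow can be sketched as a single function. This is a minimal sketch, not a definitive implementation: swap_and_run and the send callable are hypothetical names, and send(method, endpoint, body) is assumed to perform the authenticated HTTP call and return the parsed JSON response. The dry-run stand-in below only records the calls, so the ordering can be checked before anything is wired to a real server.

```python
def swap_and_run(send, path, name, description, recipe_id):
    """Import a new file, swap it into the recipe, then queue the job.

    `send(method, endpoint, body)` is a hypothetical callable that performs
    the HTTP call and returns the parsed JSON response.
    """
    imported = send(
        "POST",
        "/v4/importedDatasets",
        {"path": path, "name": name, "description": description},
    )
    send(
        "PUT",
        f"/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        {"importedDataset": {"id": imported["id"]}},
    )
    return send("POST", "/v4/jobGroups", {"wrangledDataset": {"id": recipe_id}})


calls = []


def dry_run_send(method, endpoint, body):
    """Stand-in for a real HTTP client: records the call and fakes a response."""
    calls.append((method, endpoint, body))
    return {"id": 12}  # placeholder id, as in the import response example


swap_and_run(
    dry_run_send,
    "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    "MyCo-orders-west-Q2.txt",
    "MyCo-orders-west-Q2",
    recipe_id=9,
)
for method, endpoint, _body in calls:
    print(method, endpoint)
```

Injecting send keeps the ordering logic separate from authentication and transport, which vary by deployment.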