CDAP reference

When using Cloud Data Fusion, you use both the Cloud Console and the Cloud Data Fusion UI. You use the Cloud Console to create a Cloud Data Fusion instance. You then use the Cloud Data Fusion UI to create and manage your pipelines.

Alternatively, you can use command-line tools to create and manage your Cloud Data Fusion instances and pipelines.

  • The REST reference describes the API for creating and managing your Cloud Data Fusion instances on Google Cloud.
  • This page describes the REST API for creating and managing pipelines and datasets. Throughout this page, there are links to the CDAP documentation site, where you can find more detailed information.

Before you begin

Before you use the REST API, download the Cloud SDK and set up environment variables for your gcloud command-line tool access credentials and CDAP API endpoint.

Download and log in to the Cloud SDK

  1. Install and initialize the Cloud SDK.

  2. Log in to the Cloud SDK, using the gcloud command-line tool. Run:

    $ gcloud auth login
    

Set up environment variables

Set up environment variables for your gcloud command-line tool access credentials and the CDAP apiEndpoint of your Cloud Data Fusion instance.

export AUTH_TOKEN=$(gcloud auth print-access-token)
export CDAP_ENDPOINT=apiEndpoint

You can get the CDAP_ENDPOINT value using either the gcloud command-line tool or the Cloud Data Fusion REST API. Make sure to use the value of the apiEndpoint field, not the serviceEndpoint field. The apiEndpoint has the format [hostname]/api.

  • To set the CDAP_ENDPOINT to the apiEndpoint, you can use the gcloud beta data-fusion command:

    1. In a local terminal window or in Cloud Shell, specify your-instance-name, then run the following gcloud commands to set the CDAP_ENDPOINT environment variable to your instance's apiEndpoint.

      export INSTANCE_ID=your-instance-name
      
      export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
          --location=us-central1 \
          --format="value(apiEndpoint)" \
        ${INSTANCE_ID})
      
  • As an alternative, you can get the apiEndpoint by calling the Cloud Data Fusion REST API, then manually set the CDAP_ENDPOINT.

    1. Use the Try this API panel to submit an instances.get request:
      1. Fill in the name request parameter. Provide your project-id, instance location (the region, for example, "us-central1"), and instance-name in the following format:
        projects/project-id/locations/location/instances/instance-name
      2. Click "EXECUTE" to submit the request.
      3. Expand the panel to view apiEndpoint listed in the HTTP response section.
      4. Run the following command (insert the value of your apiEndpoint).
        export CDAP_ENDPOINT=your-apiEndpoint
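
Because the apiEndpoint and serviceEndpoint values are easy to confuse, a quick format check can catch a wrong value before any request is sent. The following is a minimal sketch, not part of the Cloud SDK: the function name is_api_endpoint is hypothetical, and the check relies only on the [hostname]/api format described above.

```shell
# Hypothetical helper (not part of gcloud): succeed only if the value
# ends with "/api", the apiEndpoint format described on this page.
is_api_endpoint() {
  case "$1" in
    */api) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage:
#   is_api_endpoint "${CDAP_ENDPOINT}" || echo "unexpected CDAP_ENDPOINT" >&2
```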
        

Deploy a pipeline

To deploy a Cloud Data Fusion pipeline, submit the following HTTP PUT request.

PUT -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Create an Application on the CDAP documentation site.

The body of the HTTP PUT request is a JSON object in the following format:

{
  "name": "MyPipeline",
  "artifact": {
    "name": "cdap-data-pipeline",
    "version": "6.0.0",
    "scope": "system"
  },
  "config": {
    . . .
    "connections": [ . . . ],
    "engine": "mapreduce",
    "postActions": [ . . . ],
    "stages": [ . . . ],
    "schedule": "0 * * * *"
  },
  "__ui__": {
    . . .
  }
}

For more information, see Pipeline Configuration File Format and Creating a Batch Pipeline on the CDAP documentation site.
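
As one possible way to send the PUT request above, the following sketch wraps it in curl. The function names pipeline_url and deploy_pipeline and the file name in the usage comment are hypothetical, and the Content-Type header is an assumption rather than a documented requirement. The sketch expects AUTH_TOKEN and CDAP_ENDPOINT to be set as described in Before you begin.

```shell
# Hypothetical helper: print the URL for a pipeline.
#   $1 = namespace-id, $2 = pipeline-name
pipeline_url() {
  printf '%s/v3/namespaces/%s/apps/%s' "${CDAP_ENDPOINT}" "$1" "$2"
}

# Hypothetical helper: deploy the pipeline whose JSON body is stored in a file.
#   $1 = namespace-id, $2 = pipeline-name, $3 = path to the JSON file
deploy_pipeline() {
  curl -sS -X PUT \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    --data-binary "@$3" \
    "$(pipeline_url "$1" "$2")"
}

# Usage (the file name is a placeholder):
#   deploy_pipeline default MyPipeline ./my-pipeline.json
```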

Retrieve pipelines

Retrieve all pipelines

To list Cloud Data Fusion pipelines in the specified namespace, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see Deployed Applications on the CDAP documentation site.

Retrieve batch pipelines

To list Cloud Data Fusion batch pipelines in the specified namespace, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps?artifactName=cdap-data-pipeline"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see Deployed Applications on the CDAP documentation site.

Retrieve real-time pipelines

To list Cloud Data Fusion real-time pipelines in the specified namespace, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps?artifactName=cdap-data-streams"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see Deployed Applications on the CDAP documentation site.
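
The three list requests above differ only in the optional artifactName query parameter, so they can share one helper. This is a sketch under that observation; the function names list_pipelines_url and list_pipelines are hypothetical.

```shell
# Hypothetical helper: print the list URL.
#   $1 = namespace-id
#   $2 = optional artifact name (cdap-data-pipeline or cdap-data-streams)
list_pipelines_url() (
  url="${CDAP_ENDPOINT}/v3/namespaces/$1/apps"
  if [ -n "${2:-}" ]; then
    url="${url}?artifactName=$2"
  fi
  printf '%s' "${url}"
)

# Hypothetical helper: send the GET request with curl.
list_pipelines() {
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" "$(list_pipelines_url "$@")"
}

# Usage:
#   list_pipelines default                      # all pipelines
#   list_pipelines default cdap-data-pipeline   # batch pipelines only
#   list_pipelines default cdap-data-streams    # real-time pipelines only
```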

Retrieve pipeline details

To list the details of a pipeline in the specified namespace, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Details of a Deployed Application on the CDAP documentation site.

Batch pipelines

Start a batch pipeline

To start a batch pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/start"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Start a Program on the CDAP documentation site.

Stop a batch pipeline

To stop a batch pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/stop"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Stop a Program on the CDAP documentation site.
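
The start and stop requests share the same path apart from the final segment. The following sketch, with the hypothetical function names batch_action_url and run_batch_action, builds either URL and rejects any other action before calling curl.

```shell
# Hypothetical helper: print the start or stop URL for a batch pipeline.
#   $1 = namespace-id, $2 = pipeline-name, $3 = start | stop
batch_action_url() {
  case "$3" in
    start|stop) ;;
    *) echo "action must be start or stop" >&2; return 1 ;;
  esac
  printf '%s/v3/namespaces/%s/apps/%s/workflows/DataPipelineWorkflow/%s' \
    "${CDAP_ENDPOINT}" "$1" "$2" "$3"
}

# Hypothetical helper: send the POST request with curl.
run_batch_action() (
  url="$(batch_action_url "$1" "$2" "$3")" || exit 1
  curl -sS -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${url}"
)

# Usage:
#   run_batch_action default MyPipeline start
#   run_batch_action default MyPipeline stop
```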

Schedule a batch pipeline

Note: Scheduling is available only for batch pipelines.

By default, scheduling is disabled. To enable scheduling for your pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/schedules/dataPipelineSchedule/enable"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Schedule a Program on the CDAP documentation site.

Batch pipeline run records

To get the run records of a Cloud Data Fusion batch pipeline, submit the following HTTP GET requests.

Run records of a batch pipeline

The returned information includes the batch pipeline's run IDs.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/runs"

Records of a batch pipeline run

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/runs/run-id"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name
run-id To find the run ID, see Batch pipeline run records, which returns a list of run IDs.

For more information, see Retrieving Specific Run Information on the CDAP documentation site.
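
Both run-record requests can be expressed with one helper that appends the run ID only when it is given. This is an illustrative sketch; the function names batch_runs_url and get_batch_runs are hypothetical.

```shell
# Hypothetical helper: print the run records URL for a batch pipeline.
#   $1 = namespace-id, $2 = pipeline-name, $3 = optional run-id
batch_runs_url() {
  printf '%s/v3/namespaces/%s/apps/%s/workflows/DataPipelineWorkflow/runs' \
    "${CDAP_ENDPOINT}" "$1" "$2"
  if [ -n "${3:-}" ]; then
    printf '/%s' "$3"
  fi
}

# Hypothetical helper: send the GET request with curl.
get_batch_runs() {
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" "$(batch_runs_url "$@")"
}

# Usage:
#   get_batch_runs default MyPipeline           # all run records
#   get_batch_runs default MyPipeline run-id    # one run (replace run-id)
```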

Logs for a batch pipeline

You can view the logs of a pipeline or of a specific pipeline run.

  • To view the logs of a batch pipeline, submit the following HTTP GET request.

    GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/logs?start=start-ts&stop=stop-ts"
    
  • To view logs of a specific run of a batch pipeline, submit the following HTTP GET request.

    GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/runs/run-id/logs?start=start-ts&stop=stop-ts"
    
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name
run-id Relevant only if you want to view logs of a specific pipeline run. To find the run ID, see Batch pipeline run records, which returns a list of run IDs.

For more information, see Downloading Application Logs on the CDAP documentation site.
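
The log requests take a start and stop timestamp. The sketch below builds either log URL and shows one way to request the last hour of logs. It assumes that start-ts and stop-ts are expressed in seconds since the Unix epoch, which this page does not state; check the CDAP documentation before relying on it. The function names batch_logs_url and last_hour_batch_logs are hypothetical.

```shell
# Hypothetical helper: print the logs URL for a batch pipeline.
#   $1 = namespace-id, $2 = pipeline-name
#   $3 = start-ts, $4 = stop-ts, $5 = optional run-id
batch_logs_url() {
  printf '%s/v3/namespaces/%s/apps/%s/workflows/DataPipelineWorkflow' \
    "${CDAP_ENDPOINT}" "$1" "$2"
  if [ -n "${5:-}" ]; then
    printf '/runs/%s' "$5"
  fi
  printf '/logs?start=%s&stop=%s' "$3" "$4"
}

# Hypothetical helper: fetch logs for the last hour.
# Assumption: timestamps are epoch seconds.
#   $1 = namespace-id, $2 = pipeline-name, $3 = optional run-id
last_hour_batch_logs() (
  stop_ts="$(date +%s)"
  start_ts="$((stop_ts - 3600))"
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "$(batch_logs_url "$1" "$2" "${start_ts}" "${stop_ts}" "${3:-}")"
)
```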

Metrics for a batch pipeline

To view specific metrics for a batch pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/metrics/query"

The body of the HTTP POST request is a JSON object in the following format:

{
  "query": {
    "tags": {
      "namespace": "default",
      "app": "pipeline name",
      "workflow": "DataPipelineWorkflow",
      "run": "run-id"
    },
    "metrics": [
      "metric1 name",
      "metric2 name",
      ...
    ],
    "timeRange": {
      "aggregate": true
    }
  }
}
Query parameter Description / value
pipeline name Pipeline name
run-id To find the run ID, see Batch pipeline run records, which returns a list of run IDs.
metric name Metric names follow the format:
user.pipeline-stage.metric
  • pipeline-stage is any of the stage names in the body of the HTTP PUT request you configured when you created your pipeline. In the following example, BigQuery or GoogleCloudStorage are possible values for pipeline-stage.
    {
      "stages": [
        {
          "name": "BigQuery",
          ...
        },
        {
          "name": "GoogleCloudStorage",
          ...
        },
        ...
      ],
      ...
    }         
  • metric can be any of:
    • records.in
    • records.out
    • records.error
    • process.time.total
    • process.time.avg
    • process.time.max
    • process.time.min
    • process.time.stddev

For example, the following query gets the records.out and process.time.avg metrics for the BigQuery stage of the batch pipeline, batch-pipeline.

{
  "query": {
    "tags": {
      "namespace": "default",
      "app": "batch-pipeline",
      "workflow": "DataPipelineWorkflow",
      "run": "81e3d583-f68b-11e9-aba0-0242b9f29569"
    },
    "metrics": [
      "user.BigQuery.records.out",
      "user.BigQuery.process.time.avg"
    ],
    "timeRange": {
      "aggregate": true
    }
  }
}

For more information, see Metrics HTTP RESTful API on the CDAP documentation site.
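
The example query above can be generated from its four variable parts: namespace, pipeline name, run ID, and stage name. The following sketch does that and posts the result with curl. The function names are hypothetical, the Content-Type header is an assumption, and the helper does no JSON escaping, so it only suits values without quotes or backslashes.

```shell
# Hypothetical helper: print a metrics query body that requests
# records.out and process.time.avg for one stage of a batch pipeline.
#   $1 = namespace, $2 = pipeline name, $3 = run-id, $4 = stage name
# Note: values are inserted as-is, without JSON escaping.
batch_metrics_body() {
  printf '{"query":{"tags":{"namespace":"%s","app":"%s","workflow":"DataPipelineWorkflow","run":"%s"},"metrics":["user.%s.records.out","user.%s.process.time.avg"],"timeRange":{"aggregate":true}}}' \
    "$1" "$2" "$3" "$4" "$4"
}

# Hypothetical helper: send the POST request with curl.
query_batch_metrics() {
  curl -sS -X POST \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(batch_metrics_body "$@")" \
    "${CDAP_ENDPOINT}/v3/metrics/query"
}

# Usage:
#   query_batch_metrics default batch-pipeline run-id BigQuery
```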

Real-time pipelines

Start a real-time pipeline

To start a real-time pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/start"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Starting a Program on the CDAP documentation site.

Stop a real-time pipeline

To stop a real-time pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/stop"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name

For more information, see Stop a Program on the CDAP documentation site.

Real-time pipeline run records

To get the run records of a Cloud Data Fusion real-time pipeline, submit the following HTTP GET requests.

Run records of a real-time pipeline

The returned information includes the real-time pipeline's run IDs.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/runs"

Records of a real-time pipeline run

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/runs/run-id"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name
run-id To find the run ID, see Real-time pipeline run records, which returns a list of run IDs.

For more information, see Retrieving Specific Run Information on the CDAP documentation site.
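
For real-time pipelines, the program segment of the path is spark/DataStreamsSparkStreaming instead of workflows/DataPipelineWorkflow. The sketch below assumes that run records sit under a runs path segment, as they do in the batch pipeline requests earlier on this page. The function names realtime_runs_url and get_realtime_runs are hypothetical.

```shell
# Hypothetical helper: print the run records URL for a real-time pipeline.
# Assumption: run records are served under ".../runs", as for batch pipelines.
#   $1 = namespace-id, $2 = pipeline-name, $3 = optional run-id
realtime_runs_url() {
  printf '%s/v3/namespaces/%s/apps/%s/spark/DataStreamsSparkStreaming/runs' \
    "${CDAP_ENDPOINT}" "$1" "$2"
  if [ -n "${3:-}" ]; then
    printf '/%s' "$3"
  fi
}

# Hypothetical helper: send the GET request with curl.
get_realtime_runs() {
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" "$(realtime_runs_url "$@")"
}
```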

Logs for a real-time pipeline

You can view the logs of a pipeline or of a specific pipeline run.

  • To view the logs of a real-time pipeline, submit the following HTTP GET request.

    GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/logs?start=start-ts&stop=stop-ts"
    
  • To view logs of a specific run of a real-time pipeline, submit the following HTTP GET request.

    GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/apps/pipeline-name/spark/DataStreamsSparkStreaming/runs/run-id/logs?start=start-ts&stop=stop-ts"
    
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
pipeline-name Your pipeline name
run-id Relevant only if you want to view logs of a specific pipeline run. To find the run ID, call Real-time pipeline run records, which returns a list of run IDs.

For more information, see Downloading Application Logs on the CDAP documentation site.

Metrics for a real-time pipeline

To view specific metrics for a real-time pipeline, submit the following HTTP POST request.

POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/metrics/query"

The body of the HTTP POST request is a JSON object in the following format:

{
  "query": {
    "tags": {
      "namespace": "default",
      "app": "pipeline name",
      "spark": "DataStreamsSparkStreaming",
      "run": "run-id"
    },
    "metrics": [
      "metric1 name",
      "metric2 name",
      ...
    ],
    "timeRange": {
      "aggregate": true
    }
  }
}
Query parameter Description / value
pipeline name Pipeline name
run-id To find the run ID, call Real-time pipeline run records, which returns a list of run IDs.
metric name Metric names follow the format:
user.pipeline-stage.metric
  • pipeline-stage is any of the stage names in the body of the HTTP PUT request you configured when you created your pipeline. In the following example, BigQuery or GoogleCloudStorage are possible values for pipeline-stage.
    {
      "stages": [
        {
          "name": "BigQuery",
          ...
        },
        {
          "name": "GoogleCloudStorage",
          ...
        },
        ...
      ],
      ...
    }
  • metric can be any of:
    • records.in
    • records.out
    • records.error
    • process.time.total
    • process.time.avg
    • process.time.max
    • process.time.min
    • process.time.stddev

For example, the following query gets the records.out and process.time.avg metrics for the BigQuery stage of the real-time pipeline, rt-pipeline.

{
  "query": {
    "tags": {
      "namespace": "default",
      "app": "rt-pipeline",
      "spark": "DataStreamsSparkStreaming",
      "run": "81e3d583-f68b-11e9-aba0-0242b9f29570"
    },
    "metrics": [
      "user.BigQuery.records.out",
      "user.BigQuery.process.time.avg"
    ],
    "timeRange": {
      "aggregate": true
    }
  }
}

For more information, see Metrics HTTP RESTful API on the CDAP documentation site.

Dataset metadata

Metadata properties

To view metadata properties for your dataset, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/default/datasets/dataset-id/metadata/properties"
Parameter Description / value
dataset-id To get the dataset ID, submit an HTTP GET request that lists all datasets:
GET -H "Authorization: Bearer $(gcloud auth print-access-token)" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/data/datasets"
If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see Retrieving Properties on the CDAP documentation site.

Metadata tags

To view metadata tags for your dataset, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/default/datasets/dataset-id/metadata/tags"
Parameter Description / value
dataset-id To get the dataset ID, submit an HTTP GET request that lists all datasets:
GET -H "Authorization: Bearer $(gcloud auth print-access-token)" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/data/datasets"
If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see Retrieving Tags on the CDAP documentation site.
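
The properties and tags requests differ only in their last path segment, and both need a dataset ID from the dataset list. The following sketch covers the three URLs; the function names datasets_url, dataset_metadata_url, and get_dataset_metadata are hypothetical.

```shell
# Hypothetical helper: print the URL that lists all datasets in a namespace.
#   $1 = namespace-id
datasets_url() {
  printf '%s/v3/namespaces/%s/data/datasets' "${CDAP_ENDPOINT}" "$1"
}

# Hypothetical helper: print the metadata URL for a dataset.
#   $1 = namespace-id, $2 = dataset-id, $3 = properties | tags
dataset_metadata_url() {
  case "$3" in
    properties|tags) ;;
    *) echo "kind must be properties or tags" >&2; return 1 ;;
  esac
  printf '%s/v3/namespaces/%s/datasets/%s/metadata/%s' \
    "${CDAP_ENDPOINT}" "$1" "$2" "$3"
}

# Hypothetical helper: send the GET request with curl.
get_dataset_metadata() (
  url="$(dataset_metadata_url "$1" "$2" "$3")" || exit 1
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" "${url}"
)
```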

Lineage

Dataset lineage

To view the lineage of your dataset, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/datasets/dataset-id/lineage"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
dataset-id To get the dataset ID, submit an HTTP GET request that lists all datasets:
GET -H "Authorization: Bearer $(gcloud auth print-access-token)" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/data/datasets"

For more information, see Viewing Lineages on the CDAP documentation site.

Field level lineage

To view the lineage of fields in your dataset in a specified range of time, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/datasets/dataset-id/lineage/fields?start=start-ts&end=end-ts[&prefix=prefix]"
Parameter Description / value
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
dataset-id To get the dataset ID, submit an HTTP GET request that lists all datasets:
GET -H "Authorization: Bearer $(gcloud auth print-access-token)" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/data/datasets"

For more information, see Field Level Lineage on the CDAP documentation site.
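
In the request above, the square brackets mark prefix as an optional query parameter. The sketch below builds the URL with or without it. The function names field_lineage_url and get_field_lineage are hypothetical, and the start and end values are passed through unchanged because this page does not specify their format.

```shell
# Hypothetical helper: print the field-level lineage URL.
#   $1 = namespace-id, $2 = dataset-id
#   $3 = start-ts, $4 = end-ts, $5 = optional prefix
field_lineage_url() {
  printf '%s/v3/namespaces/%s/datasets/%s/lineage/fields?start=%s&end=%s' \
    "${CDAP_ENDPOINT}" "$1" "$2" "$3" "$4"
  if [ -n "${5:-}" ]; then
    printf '&prefix=%s' "$5"
  fi
}

# Hypothetical helper: send the GET request with curl.
get_field_lineage() {
  curl -sS -H "Authorization: Bearer ${AUTH_TOKEN}" "$(field_lineage_url "$@")"
}
```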

Secure storage

Use the CDAP Secure Storage HTTP RESTful API to add, retrieve, and delete secure keys.

Add a secure key

To add a secure key to secure storage, submit the following HTTP PUT request.

PUT -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/securekeys/secure-key-id"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
secure-key-id Name of the key to add to secure storage.

The body of the HTTP PUT request is a JSON object in the following format:

{
  "description": "Example Secure Key",
  "data": "secure-contents",
  "properties": {
    "property-key": "property-value"
  }
}

For more information, see Add a Secure Key and the Administration Manual: Secure Storage on the CDAP documentation site.
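
The following sketch builds the request body shown above and sends it with curl. The function names secure_key_body and add_secure_key are hypothetical, the Content-Type header is an assumption, and values are inserted without JSON escaping. Passing a secret as a command-line argument can expose it in shell history and process listings, so treat this as an illustration of the request shape only.

```shell
# Hypothetical helper: print the JSON body for a secure key.
#   $1 = description, $2 = secure contents
#   $3 = property key, $4 = property value
# Note: values are inserted as-is, without JSON escaping.
secure_key_body() {
  printf '{"description":"%s","data":"%s","properties":{"%s":"%s"}}' \
    "$1" "$2" "$3" "$4"
}

# Hypothetical helper: add the key with an HTTP PUT request.
#   $1 = namespace-id, $2 = secure-key-id, $3..$6 = secure_key_body arguments
add_secure_key() (
  url="${CDAP_ENDPOINT}/v3/namespaces/$1/securekeys/$2"
  shift 2
  curl -sS -X PUT \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(secure_key_body "$@")" \
    "${url}"
)
```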

Retrieve a secure key

To retrieve a secure key from secure storage, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/securekeys/secure-key-id"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
secure-key-id Name of the key to retrieve from secure storage.

For more information, see Retrieve a Secure Key and the Administration Manual: Secure Storage on the CDAP documentation site.

Retrieve the metadata for a secure key

To retrieve the metadata for a secure key from secure storage, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/securekeys/secure-key-id/metadata"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
secure-key-id Name of the key to retrieve from secure storage.

The response body is a JSON object that contains the metadata of the secure key: its name (the secure-key-id), description, creation timestamp, and map of properties.

Example response:

{
  "name": "secure-key-id",
  "description": "Example Secure Key",
  "createdEpochMs": 1471718010326,
  "properties": {
    "property-key": "property-value"
  }
}

For more information, see Retrieve a Secure Key, Retrieve the Metadata for a Secure Key, and the Administration Manual: Secure Storage on the CDAP documentation site.

List all secure keys

To list all the keys in a namespace from secure storage, submit the following HTTP GET request.

GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/securekeys"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.

For more information, see List all Secure Keys and the Administration Manual: Secure Storage on the CDAP documentation site.

Delete a secure key

To delete a secure key from secure storage, submit the following HTTP DELETE request.

DELETE -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/namespace-id/securekeys/secure-key-id"
Parameter Description
namespace-id If your pipeline belongs to a Basic edition instance, the namespace ID is always default. If your pipeline belongs to an Enterprise edition instance, you can create a namespace.
secure-key-id Name of the key to delete from secure storage.

For more information, see Remove a Secure Key and the Administration Manual: Secure Storage on the CDAP documentation site.
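
The list, retrieve, and delete requests for secure keys share one base path. As a sketch, the hypothetical helper secure_key_url below builds that path, and delete_secure_key refuses to run without a key ID so that the delete request is never sent to the list URL by mistake.

```shell
# Hypothetical helper: print the secure keys URL.
#   $1 = namespace-id, $2 = optional secure-key-id
secure_key_url() {
  printf '%s/v3/namespaces/%s/securekeys' "${CDAP_ENDPOINT}" "$1"
  if [ -n "${2:-}" ]; then
    printf '/%s' "$2"
  fi
}

# Hypothetical helper: delete one secure key with an HTTP DELETE request.
#   $1 = namespace-id, $2 = secure-key-id (required)
delete_secure_key() {
  if [ -z "${2:-}" ]; then
    echo "secure-key-id is required" >&2
    return 1
  fi
  curl -sS -X DELETE -H "Authorization: Bearer ${AUTH_TOKEN}" \
    "$(secure_key_url "$1" "$2")"
}
```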