Data Analytics

Automating Cloud Data Fusion deployments via Terraform

June 12, 2020

Svetak Sundhar

Cloud Technical Resident

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL data pipelines. It’s powered by the open source project CDAP.

How Cloud Data Fusion works

Many enterprises have data integration pipelines that take data from multiple sources and transform that data into a format useful for analytics. Cloud Data Fusion is a Google Cloud service that lets you build out these pipelines with little to no coding. One way of configuring these pipelines is via the UI. While using the UI is a visually intuitive way to architect pipelines, many organizations have requirements to automate deployments in production environments. A manageable way is to do this via infrastructure as code. Terraform, an infrastructure as code tool managed by Hashicorp, is an industry-standard way of spinning up infrastructure.

While CDAP makes authoring pipelines easy, automating deployments with the CDAP REST API requires some additional work. In this blog, we’ll explain how to automate deployments of various CDAP resources in infrastructure as code with Terraform, leveraging useful abstractions on the CDAP REST API built into the community-maintained Terraform CDAP provider. This post highlights further abstractions open-sourced in the Cloud Foundations toolkit modules.

Creating a Cloud Data Fusion instance

You can create a Cloud Data Fusion instance with the datafusion module. This shows the module’s basic usage:

The name of the instance, project ID, region, and subnetwork to create or reuse are all required inputs to the module. The instance type defaults to enterprise unless otherwise specified. The dataproc_subnet, labels, and options are also optional inputs.

Deploying prerequisites for a private IP CDF instance

Many use cases need to have a connection to Cloud Data Fusion established over a private VPC network, as traffic over the network does not go through the public internet. In order to create a private IP Cloud Data Fusion instance, you’ll need to deploy specific infrastructure. Specifically, you’ll need a VPC network, a custom subnet to deploy Dataproc clusters in, and IP allocation for peering with the Data Fusion tenant project. VPC can be deployed via the use of the private network module. Additionally, if you’re using Cloud Data Fusion version 6.1.2 or older, the module can create the SSH ingress rule to allow the Data Fusion instance to reach Dataproc clusters on port 22. Here’s a snippet showing the basic usage of the module:

The module requires several inputs: the Data Fusion service account, private Data Fusion instance ID, VPC network to be created with firewall rules for a private Data Fusion instance, the gcp project ID for private Data Fusion setup, and the private Data Fusion instance ID. Output from this module are the IP CIDR range reserved for the private Data Fusion instance, the VPC created for the private Data Fusion instance, the subnetwork created for Dataproc clusters controlled by the private Data Fusion instance, and the VPC created for the private Data Fusion instance.

Configuring a namespace

Namespaces are used to logically partition Cloud Data Fusion instances. They exist in order to achieve pipeline, plugin, metadata, and configuration isolation during runtime, as well as to provide multi-tenancy. This could be useful in cases where there are different sub-organizations all in the same instance. Thus, each namespace in an instance is separate from all other namespaces in the instance, with respect to pipelines, preferences, and plugins/artifacts in the instance. This CDAP REST API Document shows the API calls needed to perform different operations on namespaces. Check out a Terraform example of the namespace.

In this module, you have to provide the name of the namespace you’d like to create, as well as the preferences (any runtime arguments that are configured during runtime) to set in this namespace.

Deploying a compute profile

In Cloud Data Fusion, compute profiles represent the execution environment for pipelines. They allow users to specify the required resources to run a pipeline. Currently, Cloud Data Fusion pipelines primarily execute as Apache Spark or MapReduce programs on Dataproc clusters. Compute profiles can manage two types of Dataproc clusters: ephemeral clusters and persistent clusters. Ephemeral clusters are created for the duration of the pipeline, and destroyed when the pipeline ends. On the other hand, persistent clusters are pre-existing, and await requests for data processing jobs. Persistent clusters can accept multiple pipeline runs, while ephemeral clusters are job-scoped. Ephemeral clusters include a startup overhead for each run of the pipeline, since you need to provision a new cluster each time. Ephemeral clusters have the advantage of being a more managed experience, and don’t require you to provide SSH keys for communication with the cluster, since these keys are generated automatically by the CDAP service.

This CDAP REST API document shows the API calls needed to perform different operations on compute profiles. We have written a Terraform module so you can deploy a custom compute profile, allowing you to configure settings such as the network and compute service accounts—settings that are not configurable in the default compute profile.

In this example, the name of the profile and the label of the profile are required inputs.

The module allows for many more optional inputs, such as name of the network, name of the subnetwork, or name of the service account to run the Dataproc cluster on. It also allows you to provide the namespace to deploy the profile in, as well as the account key used for authentication.

Deploying and updating a pipeline

A pipeline can be deployed using the pipeline module. The name, namespace, and the path to an exported pipeline (the json_spec_path) are required as inputs. This can be obtained by clicking on Actions>Export after the pipeline is deployed on the Data Fusion UI.

These CDAP documents explain the nuances of a pipeline.

As mentioned earlier, the namespace is required for achieving application and data isolation. Finally, since the path to the exported JSON pipeline contains a hard-coded reference to the checkpoint directory of the instance on which it was authored, the json_spec_path must be without the checkpoint key. The checkpointDir key must be removed from the config block of exported pipeline JSONs. On a new instance, when this key is missing, it gets inferred to the correct checkpoint bucket. Checkpointing must be used in CDAP real-time apps, since they won’t start the Spark context because they do not respect the disableCheckpoints key of the pipeline config. Find more details on this. A common way to remove this checkpoint key is to use a jq command.

CheckpointDir keys are generated in a JSON file whenever Apache Spark is run. A challenge faced here is that the checkpointDir key must manually be removed from the JSON. The key must be removed, since it will be hard-coded to the checkpointDir from the Cloud Data Fusion instance from which it was exported. This could cause issues if the instances are different environments (i.e., prod and dev). This key must be absent to infer the correct checkpoint bucket in a new instance.

Here’s a snippet of the cdap_application resource:

To update an already deployed pipeline, simply make a pull request on the repository. This will stop this run on Terraform apply. Additionally, Terraform will add plugin resources for new versions of a plugin required by this pipeline. Since applications are immutable, when pipelines are updated, they should be treated as a new pipeline (with a versioned name).

Streaming program run

A program run is when the pipeline is passed in runtime arguments and run, after it is deployed. Streaming pipelines can be managed as infrastructure as code due to the fact that they are long-running infrastructure, as opposed to batch jobs, which are manually scheduled or triggered. These CDAP documents explain the relevant API calls to perform operations, such as starting and stopping programs, as well as starting and checking the status of multiple programs.

Here’s an example of the cdap_streaming_program_run resource:

The name of the app, name of the program, any runtime arguments, and type (mapreduce, spark, workers, etc.) are required. The namespace is optional (if none provided, default is used), and the cdap run_id is computed.

A challenge of the automation is that real-time sources do not support variable configurations at runtime, also known as macros. This means that an additional hard-coded application that achieves the functionality of a macro must be written for every run. This is achieved by rendering the JSON file as a template file (at Terraform apply time) to substitute runtime arguments there.

Try running Cloud Data Fusion via the modules provided. As a reminder, step number one is to create a Cloud Data Fusion instance with the datafusion module. Our team welcomes all feedback, suggestions, and comments. To get in touch, create an issue on the repository itself.

Posted in