Configuring clusters

This page describes when to use static Dataproc clusters in Cloud Data Fusion. It also describes compatible versions and recommended cluster configurations.

For more information, see Manage a cluster.

When to use static clusters

By default, Cloud Data Fusion creates an ephemeral cluster for each pipeline run: it creates the cluster when the run begins and deletes it after the run completes.

In the following scenarios, do not use the default. Instead, use a static cluster:

  • When the time it takes to create a new cluster for every pipeline run is prohibitive for your use case.

  • When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.

For more information, see Running a pipeline against an existing Dataproc cluster.

Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

The following Cloud Data Fusion versions support the corresponding Dataproc versions.

Cloud Data Fusion version    Dataproc version
6.1 to 6.3*                  1.3.x
6.4+                         1.3.x and 2.0.x

* Cloud Data Fusion versions 6.1 to 6.3 are compatible with Dataproc version 1.3 without additional components. Cloud Data Fusion uses HDFS and Spark, which come with the base Cloud Data Fusion version.
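
The compatibility table above can be expressed as a small lookup, which may be useful as a pre-flight check before you point a pipeline at a static cluster. The following Python sketch is illustrative only: the function names are hypothetical, and it encodes nothing beyond the two rows of the table, so versions outside that table are rejected rather than guessed.

```python
import re

# Matches the leading "major.minor" part of a version string such as
# "6.4.1" or "2.0.27-debian10".
_MAJOR_MINOR = re.compile(r"(\d+)\.(\d+)")


def supported_dataproc_lines(cdf_version):
    """Return the Dataproc release lines listed for a Cloud Data Fusion version.

    Hypothetical helper: it mirrors only the compatibility table above.
    """
    match = _MAJOR_MINOR.match(cdf_version)
    if not match:
        raise ValueError(f"Unrecognized version string: {cdf_version!r}")
    version = (int(match.group(1)), int(match.group(2)))
    if version < (6, 1):
        # The table does not cover versions earlier than 6.1.
        raise ValueError(f"Version {cdf_version} is not covered by the table")
    if version <= (6, 3):
        return ("1.3",)
    return ("1.3", "2.0")  # 6.4 and later


def is_compatible(cdf_version, dataproc_image_version):
    """Check a Dataproc image version against the table for a given environment."""
    match = _MAJOR_MINOR.match(dataproc_image_version)
    if not match:
        return False
    line = f"{int(match.group(1))}.{int(match.group(2))}"
    return line in supported_dataproc_lines(cdf_version)
```

For example, is_compatible("6.2.0", "2.0.27-debian10") returns False under this sketch, because the table lists only 1.3.x for versions 6.1 to 6.3.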

Best practices

Recommended: When you create a static cluster for your pipelines, use the following configurations.

yarn.nodemanager.delete.debug-delay-sec
Retains YARN logs for the given number of seconds after an application finishes.
Recommended value: 86400 (equivalent to one day)

yarn.nodemanager.pmem-check-enabled
Controls whether YARN checks physical memory limits and kills containers that exceed them.
Recommended value: false

yarn.nodemanager.vmem-check-enabled
Controls whether YARN checks virtual memory limits and kills containers that exceed them.
Recommended value: false
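
One way to apply these settings is at cluster creation time, through the --properties flag of gcloud dataproc clusters create, where the yarn: prefix targets yarn-site.xml. The following Python sketch only assembles the command line; it does not run it. The helper name and the cluster name, region, and image version in the usage note are placeholders, not values from this page.

```python
# Recommended YARN settings from the list above.
RECOMMENDED_YARN_PROPERTIES = {
    "yarn.nodemanager.delete.debug-delay-sec": "86400",  # one day
    "yarn.nodemanager.pmem-check-enabled": "false",
    "yarn.nodemanager.vmem-check-enabled": "false",
}


def build_create_command(cluster_name, region, image_version=None):
    """Assemble a gcloud command that creates a cluster with the settings above.

    Hypothetical helper: it returns the argument list and leaves execution
    (for example, with subprocess.run) to the caller.
    """
    # Dataproc expects each property as "<file prefix>:<key>=<value>".
    properties = ",".join(
        f"yarn:{key}={value}"
        for key, value in RECOMMENDED_YARN_PROPERTIES.items()
    )
    command = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--properties={properties}",
    ]
    if image_version:
        command.append(f"--image-version={image_version}")
    return command
```

Calling build_create_command("my-static-cluster", "us-central1", "2.0-debian10") returns an argument list that you can review before passing it to a shell or to subprocess.run. Sizing options such as worker count and machine type are left out and would need to be added for a real cluster.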

What's next