Cluster configuration

This page describes when to use static Dataproc clusters in Cloud Data Fusion. It also describes compatible versions and recommended cluster configurations.

When to use static clusters

By default, Cloud Data Fusion creates ephemeral clusters for each pipeline: it creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.

In the following scenarios, do not use the default. Instead, use a static cluster:

  • When the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

  • When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.

For more information, see Running a pipeline against an existing Dataproc cluster.

Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to Cloud Data Fusion version 6.4 or later and use one of the supported Dataproc versions.

Cloud Data Fusion versions before 6.4 are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates or support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with a new cluster created with a supported version.

Cloud Data Fusion version    Dataproc version
6.1 to 6.3*                  1.3.x
6.4+                         1.3.x and 2.0.x

* Cloud Data Fusion versions 6.1 to 6.3 are compatible with Dataproc version 1.3. You don't need additional components to make them compatible. Cloud Data Fusion uses HDFS and Spark, which come with the base Cloud Data Fusion version.
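
As an illustration, the compatibility table above can be encoded as a small lookup that you run before attaching a pipeline to a static cluster. This is a minimal sketch, not part of Cloud Data Fusion: the function name `compatible_dataproc_versions` is hypothetical, and the handling of versions earlier than 6.1 (which the table does not cover) is an assumption.

```python
from typing import List


def compatible_dataproc_versions(cdf_version: str) -> List[str]:
    """Return the Dataproc version lines the table lists for a Cloud Data Fusion version.

    Hypothetical helper: it only mirrors the table on this page.
    """
    # Compare on (major, minor) only, so "6.2.1" is treated as 6.2.
    major, minor = (int(part) for part in cdf_version.split(".")[:2])

    if (major, minor) < (6, 1):
        # Assumption: the table doesn't cover these versions, so report nothing.
        return []
    if (major, minor) <= (6, 3):
        return ["1.3"]
    # "6.4+" in the table: 6.4 and every later version.
    return ["1.3", "2.0"]


if __name__ == "__main__":
    for version in ("6.2.1", "6.4.0"):
        print(version, "->", compatible_dataproc_versions(version))
```

Comparing only the major and minor components keeps the check aligned with the granularity of the table, which does not distinguish patch releases.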

Best practices

Recommended: When you create a static cluster for your pipelines, use the following configurations.

yarn.nodemanager.delete.debug-delay-sec
    Retains YARN logs for the specified number of seconds after an application finishes.
    Recommended value: 86400 (equivalent to one day)

yarn.nodemanager.pmem-check-enabled
    Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory.
    Recommended value: false

yarn.nodemanager.vmem-check-enabled
    Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory.
    Recommended value: false
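
One way to apply the settings above is to pass them as cluster properties when you create the static cluster with the gcloud CLI. The following sketch builds such a command. It assumes the Dataproc `yarn:` property prefix, which targets yarn-site.xml. The cluster name, region, and image version are placeholders, and `build_create_command` is a hypothetical helper, not part of any Google library.

```python
import shlex
from typing import List

# Recommended YARN settings from this page.
RECOMMENDED_YARN_PROPERTIES = {
    "yarn.nodemanager.delete.debug-delay-sec": "86400",  # keep logs for one day
    "yarn.nodemanager.pmem-check-enabled": "false",
    "yarn.nodemanager.vmem-check-enabled": "false",
}


def build_create_command(cluster_name: str, region: str, image_version: str) -> List[str]:
    """Build a `gcloud dataproc clusters create` argument list (hypothetical helper)."""
    # Dataproc expects cluster properties as "prefix:key=value", comma-separated.
    properties = ",".join(
        f"yarn:{key}={value}" for key, value in RECOMMENDED_YARN_PROPERTIES.items()
    )
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
        f"--properties={properties}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own before running the command.
    command = build_create_command("my-static-cluster", "us-central1", "2.0")
    print(shlex.join(command))
```

The script only prints the command, so you can review it before running it. To create the cluster directly, you could pass the same list to `subprocess.run`.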