This page describes when to use static Dataproc clusters in Cloud Data Fusion. It also describes compatible versions and recommended cluster configurations.
When to use static clusters
By default, Cloud Data Fusion creates ephemeral clusters for each pipeline: it creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.
In the following scenarios, do not use the default. Instead, use a static cluster:
When the time it takes to create a new cluster for every pipeline is prohibitive for your use case.
When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.
For more information, see Running a pipeline against an existing Dataproc cluster.
Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to Cloud Data Fusion version 6.4 or later and use one of the supported Dataproc versions.
Cloud Data Fusion versions before 6.4 are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates or support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with a new cluster created with a supported version.
|Cloud Data Fusion version|Dataproc version|
|6.1 to 6.3*|1.3.x|
|6.4+|1.3.x and 2.0.x|
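The compatibility table can be encoded as a small lookup, which may be handy for a pre-flight check before attaching a pipeline to an existing cluster. The following is a minimal sketch only: the function names `supported_dataproc_lines` and `is_compatible` are hypothetical, and the sketch assumes that version strings start with `major.minor` (for example `6.4.0` or `2.0.27-debian10`).

```python
from typing import List


def supported_dataproc_lines(cdf_version: str) -> List[str]:
    """Return the Dataproc release lines listed in the table for a
    Cloud Data Fusion version. Hypothetical helper, not a product API."""
    major, minor = (int(part) for part in cdf_version.split(".")[:2])
    if (major, minor) >= (6, 4):
        return ["1.3", "2.0"]
    if (6, 1) <= (major, minor) <= (6, 3):
        return ["1.3"]
    # Versions outside the table: no claim is made about compatibility.
    return []


def is_compatible(cdf_version: str, dataproc_version: str) -> bool:
    """Check a Dataproc image version against the table rows."""
    # Reduce e.g. "2.0.27-debian10" to its "2.0" release line.
    release_line = ".".join(dataproc_version.split(".")[:2])
    return release_line in supported_dataproc_lines(cdf_version)


print(is_compatible("6.4.0", "2.0.27-debian10"))
print(is_compatible("6.2.3", "2.0.27-debian10"))
```

The lookup mirrors the table only. It does not tell you whether a given Dataproc version is still supported by Dataproc itself.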
Recommended: When you create a static cluster for your pipelines, use the following configurations.
||Retains YARN logs.
||Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory.
||Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory.
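The property column of the table above is not preserved on this page, so the following sketch rests on several assumptions. The three YARN property names are guesses chosen to match the descriptions. The values are placeholders, not recommendations from this page. The cluster name, region, and image version are hypothetical. The sketch only assembles a `gcloud dataproc clusters create` command line and prints it. It does not run the command.

```python
import shlex
from typing import Dict, List

# ASSUMPTION: property names inferred from the table descriptions,
# values are illustrative placeholders. Confirm both before use.
ASSUMED_YARN_PROPERTIES: Dict[str, str] = {
    # Assumed to match "Retains YARN logs" (delay in seconds).
    "yarn:yarn.nodemanager.delete.debug-delay-sec": "86400",
    # Assumed to match the physical memory check row.
    "yarn:yarn.nodemanager.pmem-check-enabled": "false",
    # Assumed to match the virtual memory check row.
    "yarn:yarn.nodemanager.vmem-check-enabled": "false",
}


def build_create_command(
    cluster_name: str,
    region: str,
    image_version: str,
    properties: Dict[str, str],
) -> List[str]:
    """Assemble the argument list for creating a static cluster."""
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
        f"--properties={joined}",
    ]


# Hypothetical cluster name, region, and image version.
command = build_create_command(
    "my-static-cluster", "us-central1", "2.0", ASSUMED_YARN_PROPERTIES
)
print(shlex.join(command))
```

Building the arguments as a list keeps quoting safe if the command is later passed to `subprocess.run`. Printing it first lets you review the properties before you create anything.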