Cluster configuration

This page describes when to reuse Dataproc clusters, when to use static clusters or autoscaling in Cloud Data Fusion, which versions are compatible, and the recommended cluster configurations.

When to reuse clusters

Starting in Cloud Data Fusion 6.5.0, you can reuse Dataproc clusters between runs to improve processing time. Reusing clusters decreases the time it takes to start jobs when you run multiple jobs one after another. For more information, see Reusing Dataproc clusters.

When to use static clusters

Recommended: Before you use static clusters, try reusing clusters to improve startup time.

By default, Cloud Data Fusion creates ephemeral clusters for each pipeline. It creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.

In the following scenarios, do not use the default. Instead, use a static cluster:

  • When the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

  • When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.

To use a static cluster, you must set the following property on the Dataproc cluster:

dataproc:dataproc.conscrypt.provider.enable=false
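For example, you can set this property when you create the cluster with the gcloud CLI. The following is a minimal sketch: the cluster name, region, and image version are placeholder assumptions, not required values.

# Create a static cluster with the Conscrypt provider disabled,
# as Cloud Data Fusion requires.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --image-version=2.1 \
    --properties=dataproc:dataproc.conscrypt.provider.enable=false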

For more information, see Running a pipeline against an existing Dataproc cluster.

When to use autoscaling

Recommended: Use the predefined Cloud Data Fusion autoscaling or your own autoscaling policy to automate cluster resource management for processing in batch pipelines.

Autoscaling is not recommended for scaling down. For information about reducing costs during periods of low activity, see the related Dataproc documentation.

Autoscaling is useful when the default compute profile can't run a large pipeline with optimal performance. It's also useful when you're unsure about the proper cluster size requirements for your pipeline.

In Cloud Data Fusion version 6.6 and later, you can use the predefined autoscaling of Cloud Data Fusion to automate cluster resource management. The out-of-the-box autoscaling compute profile might be sufficient for your pipelines, but if you need more control, you can define your own autoscaling policy.

In any supported version, you can create your own autoscaling policy to set the maximum number of workers. Starting in Cloud Data Fusion version 6.10.0, instances use the autoscaling Dataproc profile by default for ephemeral clusters. For more information on creating an autoscaling policy, see Autoscaling clusters.
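As a sketch, the following commands define a basic autoscaling policy in a YAML file and import it with the gcloud CLI. The policy ID, region, and worker limits are illustrative assumptions; tune them for your workload.

# Define an autoscaling policy that caps primary workers at 10
# and secondary workers at 20.
cat > autoscaling-policy.yaml <<EOF
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Import the policy so that clusters and compute profiles can reference it.
gcloud dataproc autoscaling-policies import my-autoscaling-policy \
    --source=autoscaling-policy.yaml \
    --region=us-central1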

Use predefined autoscaling in Cloud Data Fusion

In versions 6.6 and later, you can use predefined autoscaling to increase the number of worker nodes according to the workload of your pipeline. Predefined autoscaling is available only for batch pipelines.

Option 1: Use the autoscaling profile of Cloud Data Fusion

Cloud Data Fusion instances running version 6.6.0 and later come with a profile called Autoscaling Dataproc, an out-of-the-box Dataproc profile with predefined autoscaling already enabled. It's similar to the profile that you create in option 3, except that, in version 6.6.0 and in instances upgraded to 6.6.0, the worker machine type is n2.

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion web interface.
  3. Go to the deployed pipeline page by clicking List > Deployed.
  4. Click Configure > Compute.
  5. Select the profile named Autoscaling Dataproc.

Option 2: Enable a predefined autoscaling policy by customizing a profile

To enable predefined autoscaling on the default profile, follow these steps:

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion web interface.
  3. Go to the deployed pipeline page by clicking List > Deployed.
  4. Click Configure.
  5. Click Customize on the profile of your choice.
  6. Expand the Number of cluster workers section.
  7. Click the Use predefined Autoscaling toggle.
  8. Click Done and Save.

The next time your pipeline runs, the Dataproc job uses the predefined autoscaling policy from Cloud Data Fusion.

When you enable predefined autoscaling:

  • The Number of primary workers, Number of secondary workers, and Autoscaling policy properties are ignored.
  • The worker machine type configuration stays the same as the chosen profile's.
  • Turning off the Use predefined Autoscaling toggle disables predefined autoscaling and restores the profile's original behavior.

Runtime argument:

You can also enable this behavior by entering the following key and value in the runtime arguments:

system.profile.properties.enablePredefinedAutoScaling = true
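If you manage pipelines outside the web interface, one possibility is to store this key as a preference through the CDAP REST API that Cloud Data Fusion exposes. The following is a sketch under assumptions: the instance name, location, namespace, and pipeline name (my-instance, us-central1, default, my-pipeline) are placeholders for your own values.

# Look up the instance's CDAP API endpoint.
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe my-instance \
    --location=us-central1 \
    --format="value(apiEndpoint)")

# Store the key as an application-level preference, which CDAP passes
# to every run of the pipeline as a runtime argument.
curl -X PUT \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"system.profile.properties.enablePredefinedAutoScaling": "true"}' \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline/preferences"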

Option 3: Enable a predefined autoscaling policy in a new compute profile

You can click the Use predefined Autoscaling toggle when you create a new Dataproc provisioner profile. You can then use this profile across different pipelines and have more control over the worker machine type and other properties.

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion UI.
  3. You can create the profile from either the system scope or the user scope:

    • For the system scope: Click System Admin > Configuration > System Compute Profiles > Create New Profile.
    • For the user scope: Click Menu > Namespace Admin > Create profile.

    A page of provisioners opens.

  4. Click Dataproc.

  5. Expand the Number of cluster workers section.

  6. Click the Use predefined Autoscaling toggle.

  7. Enter the other details and click Create.

You can apply this profile to your pipeline by opening the pipeline on the Studio page, clicking Configure > Compute, and selecting the profile. You can set the profile as the default.
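You can also select the profile for individual runs without changing the pipeline's saved configuration. A sketch, assuming CDAP's system.profile.name runtime argument; the scope prefix and profile name here are illustrative:

system.profile.name = USER:my-autoscaling-profile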

Additional details

On the Compute config page, where you can see a list of profiles, there is a Total cores column, which shows the maximum number of vCPUs that the profile can scale up to, such as Up to 84.
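For example, a profile whose autoscaling policy allows at most 21 workers with 4 vCPUs each would show Up to 84 (21 × 4). These numbers are illustrative; the actual limit depends on the machine type and policy.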

Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to the latest Cloud Data Fusion version and use one of the supported Dataproc versions.

Earlier versions of Cloud Data Fusion are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.

Cloud Data Fusion version | Dataproc version
6.10 and later            | 2.1, 2.0 *
6.9                       | 2.1, 2.0, 1.5 *
6.7-6.8                   | 2.0, 1.5 *
6.4-6.6                   | 2.0 *, 1.3 **
6.1-6.3                   | 1.3 **

* Cloud Data Fusion versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
If you specify the OS version for your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion version in the preceding table.

** Cloud Data Fusion versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.

Best practices

Recommended: When you create a static cluster for your pipelines, use the following configurations.

Parameter | Description | Recommended value
yarn.nodemanager.delete.debug-delay-sec | Retains YARN logs. | 86400 (equivalent to one day)
yarn.nodemanager.pmem-check-enabled | Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. | false
yarn.nodemanager.vmem-check-enabled | Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory. | false
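As a sketch, you might apply these recommended values when you create the static cluster with the gcloud CLI. The cluster name and region are placeholders; the yarn: prefix routes each value to the cluster's YARN configuration, and the dataproc: property is the one required for static clusters, described earlier on this page.

# Create a static cluster with the recommended YARN settings.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --properties="dataproc:dataproc.conscrypt.provider.enable=false,\
yarn:yarn.nodemanager.delete.debug-delay-sec=86400,\
yarn:yarn.nodemanager.pmem-check-enabled=false,\
yarn:yarn.nodemanager.vmem-check-enabled=false"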