Cluster configuration

This page describes when to use static Dataproc clusters in Cloud Data Fusion. It also describes compatible versions and recommended cluster configurations.

When to reuse clusters

Starting in Cloud Data Fusion 6.5.0, you can reuse Dataproc clusters between runs to improve processing time. You reuse clusters to decrease the time it takes to start jobs when you have multiple jobs happening one after another. For more information, see Reusing Dataproc clusters.

When to use static clusters

Recommended: Try to reuse clusters to improve your startup time before you try to use static clusters.

By default, Cloud Data Fusion creates an ephemeral cluster for each pipeline run: it creates the cluster at the beginning of the run and deletes it after the run completes.

In the following scenarios, do not use the default. Instead, use a static cluster:

  • When the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

  • When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.

For more information, see Running a pipeline against an existing Dataproc cluster.

When to use autoscaling

Recommended: Use the predefined Cloud Data Fusion autoscaling or your own autoscaling policy to automate cluster resource management for processing in batch pipelines.

Autoscaling is not recommended for scaling down. To reduce costs during periods of low activity, see Dataproc's Enhanced Flexibility Mode and CDAP's autoscaling documentation instead.

If you use the default compute profile to execute a large pipeline, the pipeline might not run with optimal performance. Autoscaling is also useful when you are unsure about the proper cluster size requirements for your pipeline.

In Cloud Data Fusion version 6.6 and later, you can use Cloud Data Fusion's predefined autoscaling to automate cluster resource management. The out-of-the-box autoscaling compute profile might be sufficient for your pipelines, but if you need more control, you can instead define your own autoscaling policy.

In any supported version, you can create your own autoscaling policy to set the maximum number of workers. For more information on creating an autoscaling policy, see Autoscaling clusters.
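As a sketch of that workflow, the following commands define and import a custom autoscaling policy with gcloud. The policy name, region, and all numeric values are illustrative assumptions; adjust them for your workload before using the policy in a compute profile.

```shell
# Write an autoscaling policy file. The values below are example
# assumptions, not recommendations: primary workers stay fixed and
# secondary workers scale up to a maximum of 10.
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Import the policy so Dataproc clusters in this region can reference it.
gcloud dataproc autoscaling-policies import my-autoscaling-policy \
    --source=autoscaling-policy.yaml \
    --region=us-central1
```

After importing, you can reference the policy name in the Autoscaling policy field of a Dataproc compute profile.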

Use predefined autoscaling in Cloud Data Fusion

In versions 6.6 and later, you can use predefined autoscaling to increase the number of worker nodes according to your pipeline's workload. Predefined autoscaling is available only for batch pipelines.

Option 1: Use Cloud Data Fusion's autoscaling profile

Cloud Data Fusion instances running version 6.6 and later come with a profile called Autoscaling Dataproc, an out-of-the-box Dataproc profile with predefined autoscaling already enabled. This profile is similar to the pre-created profile in option 3, except that the worker machine type is n2.

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion UI.
  3. Go to the deployed pipeline page by clicking List > Deployed.
  4. Click Configure > Compute.
  5. Select the profile named Autoscaling Dataproc.

Option 2: Enable a predefined autoscaling policy by customizing a profile

To enable predefined autoscaling on an existing profile, follow these steps:

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion UI.
  3. Go to the deployed pipeline page by clicking List > Deployed.
  4. Click Configure.
  5. Click Customize on the profile of your choice.
  6. Expand the Number of cluster workers section.
  7. Click the Use predefined Autoscaling toggle.
  8. Click Done and Save.

The next time your pipeline runs, the Dataproc job uses the predefined autoscaling policy from Cloud Data Fusion.

When you enable predefined autoscaling:

  • The Number of primary workers, Number of secondary workers, and Autoscaling policy properties are ignored.
  • The worker machine type and configuration are the same as the chosen profile's.
  • Turning off the Use predefined Autoscaling toggle disables predefined autoscaling and restores the profile's original behavior.

Runtime argument

You can achieve the same behavior by setting the following key and value in the pipeline's runtime arguments: system.profile.properties.enablePredefinedAutoScaling = true.
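For example, because Cloud Data Fusion pipelines are CDAP applications, you can pass that runtime argument when starting a deployed batch pipeline through the CDAP REST API. This is a hedged sketch: the CDAP_ENDPOINT variable, namespace, and pipeline name MyPipeline are placeholders you must replace, and your instance may require different authentication.

```shell
# CDAP_ENDPOINT is assumed to hold your instance's API endpoint,
# e.g. from: gcloud beta data-fusion instances describe ...
# Batch pipelines run as the DataPipelineWorkflow workflow; runtime
# arguments are passed as a JSON object in the request body.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/MyPipeline/workflows/DataPipelineWorkflow/start" \
    -d '{"system.profile.properties.enablePredefinedAutoScaling": "true"}'
```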

Option 3: Enable a predefined autoscaling policy in a new compute profile

You can click the Use predefined Autoscaling toggle when you create a new Dataproc provisioner profile. You can then use this profile across different pipelines and have more control over the worker machine type and other properties.

  1. In the Google Cloud console, go to the Cloud Data Fusion Instances page.
  2. Click View Instance to open an instance in the Cloud Data Fusion UI.
  3. You can create the profile from either the system scope or the user scope:

    1. For the system scope: Click System Admin > Configuration > System Compute Profiles > Create New Profile.
    2. For the user scope: Click Menu > Namespace Admin > Create profile.

    A page of provisioners opens.

  4. Click Dataproc.

  5. Expand the Number of cluster workers section.

  6. Click the Use predefined Autoscaling toggle.

  7. Enter the other details and click Create.

You can apply this profile to your pipeline by opening the pipeline on the Studio page, clicking Configure > Compute, and selecting the profile. You can set the profile as the default.

Additional details

On the Compute config page, where you can see a list of profiles, the Total cores column shows the maximum number of vCPUs that the profile can scale up to, such as Up to 84.

Version compatibility

Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to Cloud Data Fusion version 6.4 or later and use one of the supported Dataproc versions.

Cloud Data Fusion versions before 6.4 are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, replacing the cluster with a new cluster that is created with a supported version is recommended.

Cloud Data Fusion version    Dataproc version
6.1 to 6.3*                  1.3.x
6.4 and later                1.3.x and 2.0.x

* Cloud Data Fusion versions 6.1 to 6.3 are compatible with Dataproc version 1.3. You don't need additional components to make them compatible. Cloud Data Fusion uses HDFS and Spark, which come with the base Dataproc image version.

Best practices

Configurations

Recommended: When you create a static cluster for your pipelines, use the following configurations.

Parameters

  • yarn.nodemanager.delete.debug-delay-sec: Retains YARN logs.
    Recommended value: 86400 (equivalent to one day)

  • yarn.nodemanager.pmem-check-enabled: Enables YARN to check for physical memory limits and kill containers if they exceed physical memory.
    Recommended value: false

  • yarn.nodemanager.vmem-check-enabled: Enables YARN to check for virtual memory limits and kill containers if they exceed virtual memory.
    Recommended value: false
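The recommended values above can be applied when creating the static cluster with gcloud by passing them as cluster properties with the yarn: prefix. This is a minimal sketch: the cluster name, region, image version, and sizing are illustrative assumptions.

```shell
# Create a static Dataproc cluster with the recommended YARN properties.
# Replace the cluster name, region, image version, and machine sizing
# with values appropriate for your environment.
gcloud dataproc clusters create my-static-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --num-workers=2 \
    --properties='yarn:yarn.nodemanager.delete.debug-delay-sec=86400,yarn:yarn.nodemanager.pmem-check-enabled=false,yarn:yarn.nodemanager.vmem-check-enabled=false'
```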