This page describes when to use static Dataproc clusters in Cloud Data Fusion, compatible versions, and the recommended cluster configurations.
When to reuse clusters
Starting in Cloud Data Fusion 6.5.0, you can reuse Dataproc clusters between runs to improve processing time. Reusing clusters decreases job startup time when you run multiple jobs back to back. For more information, see Reusing Dataproc clusters.
When to use static clusters
Recommended: Before you use static clusters, try reusing clusters to improve startup time.
By default, Cloud Data Fusion creates ephemeral clusters for each pipeline. It creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.
In the following scenarios, do not use the default. Instead, use a static cluster:
When the time it takes to create a new cluster for every pipeline is prohibitive for your use case.
When your organization requires cluster creation to be managed centrally. For example, when you want to enforce certain policies for all Dataproc clusters.
For more information, see Running a pipeline against an existing Dataproc cluster.
When to use autoscaling
Recommended: Use the predefined Cloud Data Fusion autoscaling or your own autoscaling policy to automate cluster resource management for processing in batch pipelines.
Autoscaling is not recommended for scaling down. For information about reducing costs during periods of low activity, see the following documents:
- Dataproc Enhanced Flexibility Mode in the Dataproc documentation
- Autoscale in the CDAP documentation
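For example, Enhanced Flexibility Mode can be turned on when you create a cluster by setting a Dataproc cluster property. The following is a minimal sketch, where CLUSTER_NAME and REGION are placeholders:

```sh
# Minimal sketch: create a Dataproc cluster with Enhanced Flexibility
# Mode (EFM) enabled, so Spark shuffle data is kept on primary workers
# and secondary workers can be scaled down more safely.
# CLUSTER_NAME and REGION are placeholders.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties=dataproc:efm.spark.shuffle=primary-worker
```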
If you use the default compute profile to execute a large pipeline, the pipeline might not run with optimal performance. Autoscaling is also useful when you're unsure about the proper cluster size requirements for your pipeline.
In Cloud Data Fusion version 6.6 and later, you can use the predefined autoscaling of Cloud Data Fusion to automate cluster resource management. The out-of-the-box autoscaling compute profile might be sufficient for your pipelines, but if you need more control, you can define your own autoscaling policy.
In any supported version, you can create your own autoscaling policy to set the maximum number of workers. For more information on creating an autoscaling policy, see Autoscaling clusters.
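For example, the following sketch defines a basic autoscaling policy that keeps the primary workers fixed and caps the secondary workers, then imports it with gcloud. The policy ID, limits, and region are illustrative values, not recommendations:

```sh
# Sketch: define a minimal Dataproc autoscaling policy and import it.
# The policy ID (my-policy), instance limits, and REGION are
# illustrative values for this example.
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 2          # keep primary workers fixed
secondaryWorkerConfig:
  maxInstances: 50         # cap on autoscaled secondary workers
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF
gcloud dataproc autoscaling-policies import my-policy \
    --source=autoscaling-policy.yaml \
    --region=REGION
```

You can then select this policy in the Autoscaling policy field of a Dataproc compute profile.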
Use predefined autoscaling in Cloud Data Fusion
In versions 6.6 and later, you can use predefined autoscaling to increase the number of worker nodes according to the workload of your pipeline. Predefined autoscaling is available only for batch pipelines.
Option 1: Use the autoscaling profile of Cloud Data Fusion
In Cloud Data Fusion instances running version 6.6 and later, your instance comes with a profile called Autoscaling Dataproc, an out-of-the-box Dataproc profile where predefined autoscaling is already enabled. Autoscaling Dataproc is similar to the pre-created profile in option 3, except the worker machine type is n2.
- In the Google Cloud console, go to the Cloud Data Fusion Instances page.
- Click View Instance to open an instance in the Cloud Data Fusion UI.
- Go to the deployed pipeline page by clicking List > Deployed.
- Click Configure > Compute.
- Select the profile named Autoscaling Dataproc.
Option 2: Enable a predefined autoscaling policy by customizing a profile
To enable predefined autoscaling on the default profile, follow these steps:
- In the Google Cloud console, go to the Cloud Data Fusion Instances page.
- Click View Instance to open an instance in the Cloud Data Fusion UI.
- Go to the deployed pipeline page by clicking List > Deployed.
- Click Configure.
- Click Customize on the profile of your choice.
- Expand the Number of cluster workers section.
- Click the Use predefined Autoscaling toggle.
- Click Done and Save.
The next time your pipeline runs, the Dataproc job uses the predefined autoscaling policy from Cloud Data Fusion.
When you enable predefined autoscaling:
- The Number of primary workers, Number of secondary workers, and Autoscaling policy properties are not considered.
- The worker machine type and configuration are the same as the chosen profile's.
- Turning off the Use predefined Autoscaling toggle disables predefined autoscaling and restores the profile's original behavior.
Runtime argument:
You can achieve the same behavior by entering the following key and value in the runtime arguments:
system.profile.properties.enablePredefinedAutoScaling = true
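If you start the pipeline programmatically instead of through the UI, you can pass the same runtime argument in the start request. The following is a sketch using the CDAP REST API, assuming a deployed batch pipeline named my-pipeline in the default namespace and an environment variable CDAP_ENDPOINT that holds your instance's API endpoint:

```sh
# Sketch: start a deployed pipeline with the runtime argument that
# enables predefined autoscaling. The pipeline name (my-pipeline),
# namespace (default), and CDAP_ENDPOINT are assumptions for this example.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"system.profile.properties.enablePredefinedAutoScaling": "true"}' \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline/workflows/DataPipelineWorkflow/start"
```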
Option 3: Enable a predefined autoscaling policy in a new compute profile
You can click the Use predefined Autoscaling toggle when you create a new Dataproc provisioner profile. You can then use this profile across different pipelines and have more control over the worker machine type and other properties.
- In the Google Cloud console, go to the Cloud Data Fusion Instances page.
- Click View Instance to open an instance in the Cloud Data Fusion UI.
You can create the profile from either the system scope or the user scope:
- For the system scope: Click System Admin > Configuration > System Compute Profiles > Create New Profile.
- For the user scope: Click Menu > Namespace Admin > Create profile.
A page of provisioners opens.
- Click Dataproc.
- Expand the Number of cluster workers section.
- Click the Use predefined Autoscaling toggle.
- Enter the other details and click Create.
You can apply this profile to your pipeline by opening the pipeline on the Studio page, clicking Configure > Compute, and selecting the profile. You can set the profile as the default.
Additional details
On the Compute config page, where you can see a list of profiles, there is a Total cores column, which shows the maximum number of vCPUs that the profile can scale up to, such as Up to 84.
Version compatibility
Problem: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to the latest Cloud Data Fusion version and use one of the supported Dataproc versions.
Earlier versions of Cloud Data Fusion are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
| Cloud Data Fusion version | Dataproc version |
| --- | --- |
| 6.9 and later | 2.1, 2.0, 1.5 * |
| 6.7 and later | 2.0, 1.5 * |
| 6.4-6.6 | 2.0 *, 1.3 ** |
| 6.1-6.3 | 1.3 ** |

* Specify the Dataproc version with its major.minor image version.

To specify the OS version used in your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion version in the preceding table.
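For example, to create a static cluster for Cloud Data Fusion 6.9, you could pin one of the supported image versions from the table. A sketch, where CLUSTER_NAME and REGION are placeholders and the OS suffix depends on the image you choose:

```sh
# Sketch: pin a static cluster to a supported Dataproc image version.
# CLUSTER_NAME and REGION are placeholders; 2.1-debian11 is one example
# of a major.minor image version with an OS suffix.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=2.1-debian11
```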
Best practices
Configurations
Recommended: When you create a static cluster for your pipelines, use the following configurations.
| Parameter | Description |
| --- | --- |
| yarn.nodemanager.delete.debug-delay-sec | Retains YARN logs. Recommended value: 86400 (equivalent to one day). |
| yarn.nodemanager.pmem-check-enabled | Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. Recommended value: false. |
| yarn.nodemanager.vmem-check-enabled | Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory. Recommended value: false. |
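As a sketch, these settings can be applied when you create the static cluster with the gcloud CLI. CLUSTER_NAME and REGION are placeholders, and the yarn: prefix maps each property into yarn-site.xml:

```sh
# Sketch: create a static cluster with the recommended YARN properties.
# The yarn: prefix writes each value into yarn-site.xml.
# CLUSTER_NAME and REGION are placeholders.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties='yarn:yarn.nodemanager.delete.debug-delay-sec=86400,yarn:yarn.nodemanager.pmem-check-enabled=false,yarn:yarn.nodemanager.vmem-check-enabled=false'
```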