Dataproc provisioner properties

The Dataproc provisioner in Cloud Data Fusion calls the Dataproc API to create and delete clusters in your Google Cloud projects. You can configure the clusters in the provisioner's settings.

For more information about compatibility between Cloud Data Fusion versions and Dataproc versions, see Version compatibility.

Properties

Project ID

The Google Cloud project where the Dataproc cluster gets created. The project must have the Dataproc API enabled.

Creator service account key

The service account key provided to the provisioner must have permission to access the Dataproc and Compute Engine APIs. Because your account key is sensitive, we recommend that you provide the account key using Secure Storage.

After you create the secure key, you can add it to a namespace or a system compute profile. For a namespace compute profile, click the shield icon and select the secure key. For a system compute profile, enter the name of the key in the Secure Account Key field.

Region

A geographical location where you can host your resources, such as the compute nodes for the Dataproc cluster.

Zone

An isolated deployment area within a region.

Network

The VPC network in your Google Cloud project that is used when creating a Dataproc cluster.

Network host project ID

If the network resides in another Google Cloud project, enter the ID of that project. For a Shared VPC, enter the host project ID where the network resides.

Subnet

The subnet to use when creating clusters. It must be within the given network and in the region that the zone is in. If left blank, a subnet is selected based on the network and zone.

Runner service account

The service account name of the Dataproc virtual machines (VMs) that are used for running programs. If left blank, the default Compute Engine service account is used.
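
For example, a runner service account is identified by its email form, where the account and project names here are placeholders:

  my-runner@my-project.iam.gserviceaccount.com
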
Number of masters

The number of master nodes in the cluster. These nodes contain the YARN Resource Manager, HDFS NameNode, and all drivers. Must be set to 1 or 3.

Default is 1.

Master machine type

The type of master machine to use. Select one of the following machine types:

  • n1
  • n2
  • n2d
  • e2

In Cloud Data Fusion version 6.7.2 and later, the default is e2.

In version 6.7.1, the default is n2.

In version 6.7.0 and earlier, the default is n1.

Master cores

Number of virtual cores allocated to a master node.

Default is 2.

Master memory (GB)

The amount of memory, in gigabytes, allocated to a master node.

Default is 8 GB.

Master disk size (GB)

Disk size, in gigabytes, allocated to a master node.

Default is 1000 GB.

Master disk type

Type of boot disk for a master node:

  • Standard Persistent Disk
  • SSD Persistent Disk

Default is Standard Persistent Disk.

Worker machine type

The type of worker machine to use. Select one of the following machine types:

  • n1
  • n2
  • n2d
  • e2

In Cloud Data Fusion version 6.7.2 and later, the default is e2.

In version 6.7.1, the default is n2.

In version 6.7.0 and earlier, the default is n1.

Worker cores

Number of virtual cores allocated to a worker node.

Default is 2.

Worker memory (GB)

The amount of memory, in gigabytes, allocated to a worker node.

Default is 8 GB.

Worker disk size (GB)

Disk size, in gigabytes, allocated to a worker node.

Default is 1000 GB.

Worker disk type

Type of boot disk for a worker node:

  • Standard Persistent Disk
  • SSD Persistent Disk

Default is Standard Persistent Disk.

Use predefined Autoscaling

Enables predefined Dataproc autoscaling.

Number of primary workers

Worker nodes contain a YARN NodeManager and an HDFS DataNode.

Default is 2.

Number of secondary workers

Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero, unless an autoscaling policy requires it to be higher.

Autoscaling policy

The autoscaling policy ID or the full resource URI of the policy.

For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see When to use autoscaling and Autoscale Dataproc clusters.
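
For example, a full resource URI takes the following form, where the project, region, and policy names are placeholders:

  projects/my-project/regions/us-central1/autoscalingPolicies/my-policy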

Metadata

Additional metadata for instances running in your cluster. You can typically use it for tracking billing and chargebacks. For more information, see Cluster metadata.

Network tags

Assign network tags to apply firewall rules to specific nodes of a cluster. Network tags must start with a lowercase letter and can contain lowercase letters, numbers, and hyphens. Tags must end with a lowercase letter or number.
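
For example, the following tag names are valid under these rules; the names themselves are only illustrative:

  dataproc-worker
  allow-internal-8080
  etl-cluster-1
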
Enable Secure Boot

Enables Secure Boot on the Dataproc VMs.

Default is False.

Enable vTPM

Enables virtual Trusted Platform Module (vTPM) on the Dataproc VMs.

Default is False.

Enable Integrity Monitoring

Enables virtual Integrity Monitoring on the Dataproc VMs.

Default is False.

Image version

The Dataproc image version. If left blank, one is automatically selected. If the Custom image URI property is set, this property is ignored.

Custom image URI

The Dataproc image URI. If left blank, it's inferred from the Image version property.
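
For example, a custom image can be referenced by a Compute Engine image path; this form is an assumption based on how Compute Engine images are addressed, and the names are placeholders:

  projects/my-project/global/images/my-dataproc-image
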
Staging bucket

Cloud Storage bucket used to stage job dependencies and configuration files for running pipelines in Dataproc.

Temp bucket

Cloud Storage bucket used to store ephemeral cluster and job data, such as Spark history files in Dataproc.

This property was introduced in Cloud Data Fusion version 6.9.2.

Encryption key name

The customer-managed encryption key (CMEK) used by Dataproc.
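
A CMEK is referenced by its Cloud KMS resource name, which takes the following form; all names here are placeholders:

  projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key
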
OAuth scopes

The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.

This property was introduced in Cloud Data Fusion version 6.9.2.
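
For example, scopes are specified as full scope URLs, such as the following; which scopes you need depends on the APIs your pipeline accesses:

  https://www.googleapis.com/auth/cloud-platform
  https://www.googleapis.com/auth/bigquery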

Initialization actions

A list of scripts to run during initialization of the cluster. Initialization actions must be stored in Cloud Storage.
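
For example, an initialization action is referenced by its Cloud Storage path; the bucket and script names here are hypothetical:

  gs://my-bucket/init-actions/install-deps.sh
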
Cluster properties

Cluster properties that override the default configuration properties of the Hadoop services. For more information about applicable key-value pairs, see Cluster properties.
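
As a sketch, Dataproc cluster properties are keyed by a file prefix followed by a property name. The following fragment uses the clusterProperties JSON name from the mapping table later on this page; the exact encoding inside a profile may differ, and the values are hypothetical:

  "clusterProperties": {
    "spark:spark.executor.memory": "4g",
    "yarn:yarn.nodemanager.vmem-check-enabled": "false"
  }
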
Common labels

Labels to organize the Dataproc clusters and jobs being created.

You can label each resource and then filter the resources by labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
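
As an illustration, using the clusterLabels JSON name from the mapping table later on this page, where the keys and values are hypothetical:

  "clusterLabels": {
    "team": "data-eng",
    "env": "dev"
  }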

Max idle time

Configure Dataproc to delete a cluster if it's idle longer than the specified number of minutes. Clusters are normally deleted directly after a run ends, but deletion can fail in rare situations. For more information, see Troubleshoot deleting clusters.

Default is 30 minutes.

Skip cluster delete

Whether to skip cluster deletion at the end of a run. When enabled, you must delete clusters manually. Use this setting only when debugging a failed run.

Default is False.

Enable Stackdriver Logging Integration

Enable the Stackdriver logging integration.

Default is True.

Enable Stackdriver Monitoring Integration

Enable the Stackdriver monitoring integration.

Default is True.

Enable Component Gateway

Enable the Component Gateway to allow access to the cluster's interfaces, such as the YARN ResourceManager and the Spark HistoryServer.

Default is False.

Prefer external IP

When the system is running on Google Cloud in the same network as the cluster, it normally uses the internal IP address when communicating with the cluster. To always use the external IP address, set this value to True.

Default is False.

Polling settings control how often cluster status is polled when creating and deleting clusters. If you have many pipelines scheduled to run at the same time, you might want to change these settings.

Create poll delay

The number of seconds to wait after creating a cluster before polling to see whether the cluster has been created.

Default is 60 seconds.

Create poll jitter

Maximum amount of random jitter, in seconds, added to the delay when creating a cluster. Use this property to prevent many simultaneous API calls in Google Cloud when many pipelines are scheduled to run at the same time.

Default is 20 seconds.

Delete poll delay

The number of seconds to wait after deleting a cluster before polling to see whether the cluster has been deleted.

Default is 30 seconds.

Poll interval

The number of seconds to wait between polls for cluster status.

Default is 2 seconds.

Dataproc profile web interface properties mapped to JSON properties

Dataproc profile UI property name | Dataproc profile JSON property name
Profile label | name
Profile name | label
Description | description
Project ID | projectId
Creator service account key | accountKey
Region | region
Zone | zone
Network | network
Network host project ID | networkHostProjectId
Subnet | subnet
Runner service account | serviceAccount
Number of masters | masterNumNodes
Master machine type | masterMachineType
Master cores | masterCPUs
Master memory (GB) | masterMemoryMB
Master disk size (GB) | masterDiskGB
Master disk type | masterDiskType
Number of primary workers | workerNumNodes
Number of secondary workers | secondaryWorkerNumNodes
Worker machine type | workerMachineType
Worker cores | workerCPUs
Worker memory (GB) | workerMemoryMB
Worker disk size (GB) | workerDiskGB
Worker disk type | workerDiskType
Metadata | clusterMetaData
Network tags | networkTags
Enable Secure Boot | secureBootEnabled
Enable vTPM | vTpmEnabled
Enable Integrity Monitoring | integrityMonitoringEnabled
Image version | imageVersion
Custom image URI | customImageUri
Cloud Storage bucket | gcsBucket
Encryption key name | encryptionKeyName
Autoscaling policy | autoScalingPolicy
Initialization actions | initActions
Cluster properties | clusterProperties
Labels | clusterLabels
Max idle time | idleTTL
Skip cluster delete | skipDelete
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled
Enable Component Gateway | componentGatewayEnabled
Prefer external IP | preferExternalIP
Create poll delay | pollCreateDelay
Create poll jitter | pollCreateJitter
Delete poll delay | pollDeleteDelay
Poll interval | pollInterval
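
To illustrate how the JSON property names fit together, the following is a minimal sketch of provisioner properties for a Dataproc profile. Only the property names come from the table above; the values, and the structure of the surrounding profile document, are assumptions for illustration:

  {
    "projectId": "my-project",
    "region": "us-central1",
    "zone": "us-central1-a",
    "masterNumNodes": "1",
    "workerNumNodes": "2",
    "masterMachineType": "e2",
    "workerMachineType": "e2",
    "idleTTL": "30",
    "stackdriverLoggingEnabled": "true"
  }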

Best practices

When you create a static cluster for your pipelines, refer to the cluster configuration best practices.

What's next