ClusterConfig

The cluster config.

JSON representation
{
  "configBucket": string,
  "gceClusterConfig": {
    object(GceClusterConfig)
  },
  "masterConfig": {
    object(InstanceGroupConfig)
  },
  "workerConfig": {
    object(InstanceGroupConfig)
  },
  "secondaryWorkerConfig": {
    object(InstanceGroupConfig)
  },
  "softwareConfig": {
    object(SoftwareConfig)
  },
  "lifecycleConfig": {
    object(LifecycleConfig)
  },
  "initializationActions": [
    {
      object(NodeInitializationAction)
    }
  ],
  "encryptionConfig": {
    object(EncryptionConfig)
  }
}
Fields
configBucket

string

Optional. A Cloud Storage staging bucket used for sharing generated SSH keys and config. If you do not specify a staging bucket, Cloud Dataproc will determine an appropriate Cloud Storage location (US, ASIA, or EU) for your cluster's staging bucket according to the Google Compute Engine zone where your cluster is deployed, and then it will create and manage this project-level, per-location bucket for you.

gceClusterConfig

object(GceClusterConfig)

Required. The shared Compute Engine config settings for all instances in a cluster.

masterConfig

object(InstanceGroupConfig)

Optional. The Compute Engine config settings for the master instance in a cluster.

workerConfig

object(InstanceGroupConfig)

Optional. The Compute Engine config settings for worker instances in a cluster.

secondaryWorkerConfig

object(InstanceGroupConfig)

Optional. The Compute Engine config settings for additional worker instances in a cluster.

softwareConfig

object(SoftwareConfig)

Optional. The config settings for software inside the cluster.

lifecycleConfig

object(LifecycleConfig)

Optional. The config settings for the cluster auto-delete schedule.

initializationActions[]

object(NodeInitializationAction)

Optional. Commands to execute on each node after config is completed. By default, executables are run on master and all worker nodes. You can test a node's role metadata to run an executable on a master or worker node, as shown below using curl (you can also use wget):

ROLE=$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1beta2/instance/attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # ... master specific actions ...
  :
else
  # ... worker specific actions ...
  :
fi

encryptionConfig

object(EncryptionConfig)

Optional. Encryption settings for the cluster.
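
As a rough illustration of how the fields above fit together, the following Python sketch assembles a ClusterConfig body as a plain dictionary and serializes it to JSON. The zone, machine types, instance counts, and bucket path are placeholder values chosen for illustration; they are not defaults taken from this reference.

```python
import json

# All concrete values below are illustrative placeholders.
cluster_config = {
    "gceClusterConfig": {"zoneUri": "us-central1-f"},
    "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-2"},
    "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-2"},
    "initializationActions": [
        # Hypothetical Cloud Storage path to an initialization script.
        {"executableFile": "gs://example-bucket/init.sh"}
    ],
}

# Serialize the dictionary into the JSON shape shown in the representation above.
body = json.dumps(cluster_config, indent=2)
print(body)
```

Fields that are left out, such as configBucket or softwareConfig, are simply omitted from the dictionary.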

GceClusterConfig

Common config settings for resources of Compute Engine cluster instances, applicable to all instances in the cluster.

JSON representation
{
  "zoneUri": string,
  "networkUri": string,
  "subnetworkUri": string,
  "internalIpOnly": boolean,
  "serviceAccount": string,
  "serviceAccountScopes": [
    string
  ],
  "tags": [
    string
  ],
  "metadata": {
    string: string,
    ...
  }
}
Fields
zoneUri

string

Optional. The zone where the Compute Engine cluster will be located. On a create request, it is required in the "global" region. If omitted in a non-global Cloud Dataproc region, the service will pick a zone in the corresponding Compute Engine region. On a get request, zone will always be present.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/zones/[zone]
  • projects/[projectId]/zones/[zone]
  • us-central1-f
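
The zone rule described above can be checked on the client before a create request is sent. This is a minimal sketch, assuming the caller knows the Cloud Dataproc region as a string; the function names are hypothetical and the service performs its own validation regardless.

```python
def zone_is_required(region: str) -> bool:
    """Return True when a create request must carry an explicit zoneUri."""
    return region == "global"


def check_zone(region: str, gce_cluster_config: dict) -> None:
    """Raise ValueError if zoneUri is missing in the "global" region."""
    if zone_is_required(region) and not gce_cluster_config.get("zoneUri"):
        raise ValueError('zoneUri is required in the "global" region')
```

In a non-global region the check passes without a zoneUri, since the service is documented to pick a zone itself.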

networkUri

string

Optional. The Compute Engine network to be used for machine communications. Cannot be specified with subnetworkUri. If neither networkUri nor subnetworkUri is specified, the "default" network of the project is used, if it exists. Cannot be a "Custom Subnet Network" (see Using Subnetworks for more information).

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/global/networks/default
  • projects/[projectId]/global/networks/default
  • default

subnetworkUri

string

Optional. The Compute Engine subnetwork to be used for machine communications. Cannot be specified with networkUri.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/regions/us-east1/subnetworks/sub0
  • projects/[projectId]/regions/us-east1/subnetworks/sub0
  • sub0
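
Because networkUri and subnetworkUri cannot be specified together, a client may want to reject that combination early. The sketch below is one possible check; the function name and the returned labels are hypothetical.

```python
def check_network_settings(gce_cluster_config: dict) -> str:
    """Return which network setting applies, rejecting the invalid combination."""
    has_network = bool(gce_cluster_config.get("networkUri"))
    has_subnetwork = bool(gce_cluster_config.get("subnetworkUri"))
    if has_network and has_subnetwork:
        raise ValueError("networkUri and subnetworkUri cannot both be specified")
    if has_subnetwork:
        return "subnetwork"
    if has_network:
        return "network"
    # Neither is set: the project's "default" network is used, if it exists.
    return "default"
```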

internalIpOnly

boolean

Optional. If true, all instances in the cluster will only have internal IP addresses. By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. This internalIpOnly restriction can only be enabled for subnetwork enabled networks, and all off-cluster dependencies must be configured to be accessible without external IP addresses.

serviceAccount

string

Optional. The service account of the instances. Defaults to the default Compute Engine service account. Custom service accounts need permissions equivalent to the following IAM roles:

  • roles/logging.logWriter
  • roles/storage.objectAdmin

(see https://cloud.google.com/compute/docs/access/service-accounts#custom_service_accounts for more information). Example: [account_id]@[projectId].iam.gserviceaccount.com

serviceAccountScopes[]

string

Optional. The URIs of service account scopes to be included in Compute Engine instances. The following base set of scopes is always included:

If no scopes are specified, the following defaults are also provided:

tags[]

string

The Compute Engine tags to add to all instances (see Tagging instances).

metadata

map (key: string, value: string)

The Compute Engine metadata entries to add to all instances (see Project and instance metadata).

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

InstanceGroupConfig

Optional. The config settings for Compute Engine resources in an instance group, such as a master or worker group.

JSON representation
{
  "numInstances": number,
  "instanceNames": [
    string
  ],
  "imageUri": string,
  "machineTypeUri": string,
  "diskConfig": {
    object(DiskConfig)
  },
  "isPreemptible": boolean,
  "managedGroupConfig": {
    object(ManagedGroupConfig)
  },
  "accelerators": [
    {
      object(AcceleratorConfig)
    }
  ],
  "minCpuPlatform": string
}
Fields
numInstances

number

Optional. The number of VM instances in the instance group. For master instance groups, must be set to 1.

instanceNames[]

string

Output only. The list of instance names. Cloud Dataproc derives the names from clusterName, numInstances, and the instance group.

imageUri

string

Optional. The Compute Engine image resource used for cluster instances. It can be specified explicitly or inferred from SoftwareConfig.imageVersion.

machineTypeUri

string

Optional. The Compute Engine machine type used for cluster instances.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/zones/us-east1-a/machineTypes/n1-standard-2
  • projects/[projectId]/zones/us-east1-a/machineTypes/n1-standard-2
  • n1-standard-2

Auto Zone Exception: If you are using the Cloud Dataproc Auto Zone Placement feature, you must use the short name of the machine type resource, for example, n1-standard-2.
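
Since Auto Zone Placement accepts only the short name, it can help to reduce any of the three accepted forms to its last path segment. A minimal sketch, assuming the short name is always the final segment as in the examples above (the helper name is hypothetical):

```python
def short_name(resource: str) -> str:
    """Return the last path segment of a full URL, partial URI, or short name."""
    return resource.rstrip("/").rsplit("/", 1)[-1]


# Example: a partial URI reduced to the form Auto Zone Placement expects.
machine_type = short_name("projects/example-project/zones/us-east1-a/machineTypes/n1-standard-2")
```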

diskConfig

object(DiskConfig)

Optional. Disk option config settings.

isPreemptible

boolean

Optional. Specifies that this instance group contains preemptible instances.

managedGroupConfig

object(ManagedGroupConfig)

Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups.

accelerators[]

object(AcceleratorConfig)

Optional. The Compute Engine accelerator configuration for these instances.

Beta Feature: This feature is still under development. It may be changed before final release.

minCpuPlatform

string

Optional. Specifies the minimum CPU platform for the instance group. See Cloud Dataproc Minimum CPU Platform.

DiskConfig

Specifies the config of disk options for a group of VM instances.

JSON representation
{
  "bootDiskType": string,
  "bootDiskSizeGb": number,
  "numLocalSsds": number
}
Fields
bootDiskType

string

Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-ssd" (Persistent Disk Solid State Drive) or "pd-standard" (Persistent Disk Hard Disk Drive).

bootDiskSizeGb

number

Optional. Size in GB of the boot disk (default is 500GB).

numLocalSsds

number

Optional. Number of attached SSDs, from 0 to 4 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and HDFS data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries.
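
The defaults and limits listed for DiskConfig can be applied on the client before sending a request. The following is a sketch only: the helper name is hypothetical, and the service may enforce additional constraints that this reference does not list.

```python
VALID_BOOT_DISK_TYPES = {"pd-standard", "pd-ssd"}


def normalize_disk_config(disk_config: dict) -> dict:
    """Fill in the documented defaults and check the documented limits."""
    result = {
        "bootDiskType": disk_config.get("bootDiskType", "pd-standard"),
        "bootDiskSizeGb": disk_config.get("bootDiskSizeGb", 500),
        "numLocalSsds": disk_config.get("numLocalSsds", 0),
    }
    if result["bootDiskType"] not in VALID_BOOT_DISK_TYPES:
        raise ValueError(f"unsupported bootDiskType: {result['bootDiskType']!r}")
    if not 0 <= result["numLocalSsds"] <= 4:
        raise ValueError("numLocalSsds must be between 0 and 4")
    return result
```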

ManagedGroupConfig

Specifies the resources used to actively manage an instance group.

JSON representation
{
  "instanceTemplateName": string,
  "instanceGroupManagerName": string
}
Fields
instanceTemplateName

string

Output only. The name of the Instance Template used for the Managed Instance Group.

instanceGroupManagerName

string

Output only. The name of the Instance Group Manager for this group.

AcceleratorConfig

Specifies the type and number of accelerator cards attached to the instances of an instance group (see GPUs on Compute Engine).

JSON representation
{
  "acceleratorTypeUri": string,
  "acceleratorCount": number
}
Fields
acceleratorTypeUri

string

Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See Compute Engine AcceleratorTypes.

Examples:

  • https://www.googleapis.com/compute/beta/projects/[projectId]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80
  • projects/[projectId]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80
  • nvidia-tesla-k80

Auto Zone Exception: If you are using the Cloud Dataproc Auto Zone Placement feature, you must use the short name of the accelerator type resource, for example, nvidia-tesla-k80.

acceleratorCount

number

The number of the accelerator cards of this type exposed to this instance.

SoftwareConfig

Specifies the selection and config of software inside the cluster.

JSON representation
{
  "imageVersion": string,
  "properties": {
    string: string,
    ...
  },
  "optionalComponents": [
    enum(Component)
  ]
}
Fields
imageVersion

string

Optional. The version of software inside the cluster. It must be one of the supported Cloud Dataproc Versions, such as "1.2" (including a subminor version, such as "1.2.29"), or the "preview" version. If unspecified, it defaults to the latest version.

properties

map (key: string, value: string)

Optional. The properties to set on daemon config files.

Property keys are specified in prefix:property format, such as core:fs.defaultFS. The following are supported prefixes and their mappings:

  • capacity-scheduler: capacity-scheduler.xml
  • core: core-site.xml
  • distcp: distcp-default.xml
  • hdfs: hdfs-site.xml
  • hive: hive-site.xml
  • mapred: mapred-site.xml
  • pig: pig.properties
  • spark: spark-defaults.conf
  • yarn: yarn-site.xml

For more information, see Cluster properties.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
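
To make the prefix:property convention concrete, the sketch below splits a property key and looks up the config file it maps to, using the prefix table listed above. The function name is hypothetical; only the first colon separates the prefix from the property name.

```python
from typing import Tuple

PREFIX_TO_FILE = {
    "capacity-scheduler": "capacity-scheduler.xml",
    "core": "core-site.xml",
    "distcp": "distcp-default.xml",
    "hdfs": "hdfs-site.xml",
    "hive": "hive-site.xml",
    "mapred": "mapred-site.xml",
    "pig": "pig.properties",
    "spark": "spark-defaults.conf",
    "yarn": "yarn-site.xml",
}


def split_property_key(key: str) -> Tuple[str, str]:
    """Return (config file, property name) for a prefix:property key."""
    prefix, sep, name = key.partition(":")
    if not sep or not name or prefix not in PREFIX_TO_FILE:
        raise ValueError(f"unsupported property key: {key!r}")
    return PREFIX_TO_FILE[prefix], name
```

For example, core:fs.defaultFS resolves to the fs.defaultFS property in core-site.xml.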

optionalComponents[]

enum(Component)

The set of optional components to activate on the cluster.

Component

Cluster components that can be activated.

Enums
COMPONENT_UNSPECIFIED Unspecified component.
JUPYTER The Jupyter Notebook.
HIVE_WEBHCAT The Hive Web HCatalog (the REST service for accessing HCatalog).
ZEPPELIN The Zeppelin notebook.
ANACONDA The Anaconda Python distribution.
PRESTO The Presto query engine.
KERBEROS The Kerberos security feature.

LifecycleConfig

Specifies the cluster auto-delete schedule configuration.

JSON representation
{
  "idleDeleteTtl": string,

  // Union field ttl can be only one of the following:
  "autoDeleteTime": string,
  "autoDeleteTtl": string
  // End of list of possible types for union field ttl.
}
Fields
idleDeleteTtl

string (Duration format)

Optional. The duration to keep the cluster alive while idling. Passing this threshold will cause the cluster to be deleted. Valid range: [10m, 14d].

Example: "10m", the minimum value, to delete the cluster when it has had no jobs running for 10 minutes.

Union field ttl. Optional. Either the exact time the cluster should be deleted at or the cluster maximum age. ttl can be only one of the following:
autoDeleteTime

string (Timestamp format)

Optional. The time when the cluster will be auto-deleted.

A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds. Example: "2014-10-02T15:01:23.045123456Z".
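
A timestamp in this format can be produced with the Python standard library. The sketch below formats to whole seconds, which is still valid RFC 3339 even though the format allows up to nanosecond precision; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def auto_delete_time(now: datetime, keep_for: timedelta) -> str:
    """Format now + keep_for as an RFC 3339 UTC "Zulu" timestamp."""
    if now.tzinfo is None:
        # A naive datetime would be interpreted as local time, so reject it.
        raise ValueError("now must be timezone-aware")
    when = (now + keep_for).astimezone(timezone.utc)
    return when.strftime("%Y-%m-%dT%H:%M:%SZ")
```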

autoDeleteTtl

string (Duration format)

Optional. The lifetime duration of the cluster. The cluster will be auto-deleted at the end of this period. Valid range: [10m, 14d].

Example: "1d", to delete the cluster 1 day after its creation.
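
The union constraint on ttl and the [10m, 14d] range can both be checked before a request is sent. This sketch parses the unit suffixes used in the examples above (s, m, h, d); that notation is an assumption here, and the JSON Duration encoding accepted on the wire may differ, so treat the parser as illustrative only. All names are hypothetical.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL = 10 * 60       # 10m
MAX_TTL = 14 * 86400    # 14d


def ttl_seconds(text: str) -> int:
    """Convert a value such as "10m" or "1d" to seconds and range-check it."""
    value, unit = text[:-1], text[-1:]
    if unit not in UNIT_SECONDS or not value.isdigit():
        raise ValueError(f"unrecognized duration: {text!r}")
    seconds = int(value) * UNIT_SECONDS[unit]
    if not MIN_TTL <= seconds <= MAX_TTL:
        raise ValueError(f"duration outside [10m, 14d]: {text!r}")
    return seconds


def check_lifecycle(config: dict) -> None:
    """Reject a LifecycleConfig that sets both members of the ttl union."""
    if "autoDeleteTime" in config and "autoDeleteTtl" in config:
        raise ValueError("autoDeleteTime and autoDeleteTtl are mutually exclusive")
    for field in ("idleDeleteTtl", "autoDeleteTtl"):
        if field in config:
            ttl_seconds(config[field])
```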

NodeInitializationAction

Specifies an executable to run on a fully configured node and a timeout period for executable completion.

JSON representation
{
  "executableFile": string,
  "executionTimeout": string
}
Fields
executableFile

string

Required. Cloud Storage URI of executable file.

executionTimeout

string (Duration format)

Optional. The amount of time the executable has to complete. Default is 10 minutes. If the executable has not completed by the end of the timeout period, cluster creation fails with an explanatory error message (the name of the executable that caused the error and the exceeded timeout period).
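
A NodeInitializationAction entry can be built with a small helper. This sketch assumes that the Duration format is written as a number of seconds followed by "s" (the protobuf JSON convention) and that Cloud Storage URIs start with gs://; the helper name and the example path are hypothetical.

```python
def init_action(executable_file: str, timeout_seconds: int = 600) -> dict:
    """Build one initializationActions entry; 600 seconds is the 10-minute default."""
    if not executable_file.startswith("gs://"):
        raise ValueError("executableFile must be a Cloud Storage URI (gs://...)")
    return {
        "executableFile": executable_file,
        # Assumed Duration encoding: seconds with an "s" suffix.
        "executionTimeout": f"{timeout_seconds}s",
    }


action = init_action("gs://example-bucket/install.sh", timeout_seconds=300)
```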

EncryptionConfig

Encryption settings for the cluster.

JSON representation
{
  "gcePdKmsKeyName": string
}
Fields
gcePdKmsKeyName

string

Optional. The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.
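
This reference does not spell out the shape of the key name. The sketch below assumes the usual Cloud KMS resource-name layout (project, location, key ring, key); the helper name and every identifier passed to it are placeholders.

```python
def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Assemble a key name in the assumed Cloud KMS resource-name layout."""
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


# Placeholder identifiers, for illustration only.
encryption_config = {
    "gcePdKmsKeyName": kms_key_name(
        "example-project", "us-central1", "example-ring", "example-key"
    )
}
```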
