ClusterConfig

The cluster config.

JSON representation
{
  "configBucket": string,
  "tempBucket": string,
  "gceClusterConfig": {
    object (GceClusterConfig)
  },
  "masterConfig": {
    object (InstanceGroupConfig)
  },
  "workerConfig": {
    object (InstanceGroupConfig)
  },
  "secondaryWorkerConfig": {
    object (InstanceGroupConfig)
  },
  "softwareConfig": {
    object (SoftwareConfig)
  },
  "initializationActions": [
    {
      object (NodeInitializationAction)
    }
  ],
  "encryptionConfig": {
    object (EncryptionConfig)
  },
  "autoscalingConfig": {
    object (AutoscalingConfig)
  },
  "securityConfig": {
    object (SecurityConfig)
  },
  "lifecycleConfig": {
    object (LifecycleConfig)
  },
  "endpointConfig": {
    object (EndpointConfig)
  }
}
Fields
configBucket

string

Optional. A Cloud Storage bucket used to stage job dependencies, config files, and job driver console output. If you do not specify a staging bucket, Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's staging bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket (see Dataproc staging bucket).

tempBucket

string

Optional. A Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. If you do not specify a temp bucket, Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's temp bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket. The default bucket has a TTL of 90 days, but you can use any TTL (or none) if you specify a bucket.

gceClusterConfig

object (GceClusterConfig)

Optional. The shared Compute Engine config settings for all instances in a cluster.

masterConfig

object (InstanceGroupConfig)

Optional. The Compute Engine config settings for the master instance in a cluster.

workerConfig

object (InstanceGroupConfig)

Optional. The Compute Engine config settings for worker instances in a cluster.

secondaryWorkerConfig

object (InstanceGroupConfig)

Optional. The Compute Engine config settings for additional worker instances in a cluster.

softwareConfig

object (SoftwareConfig)

Optional. The config settings for software inside the cluster.

initializationActions[]

object (NodeInitializationAction)

Optional. Commands to execute on each node after config is completed. By default, executables are run on master and all worker nodes. You can test a node's role metadata to run an executable on a master or worker node, as shown below using curl (you can also use wget):

ROLE=$(curl -H Metadata-Flavor:Google \
  http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi
encryptionConfig

object (EncryptionConfig)

Optional. Encryption settings for the cluster.

autoscalingConfig

object (AutoscalingConfig)

Optional. Autoscaling config for the policy associated with the cluster. The cluster does not autoscale if this field is unset.

securityConfig

object (SecurityConfig)

Optional. Security settings for the cluster.

lifecycleConfig

object (LifecycleConfig)

Optional. Lifecycle setting for the cluster.

endpointConfig

object (EndpointConfig)

Optional. Port/endpoint configuration for this cluster.
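
As a rough illustration of how the fields above fit together, the following Python sketch assembles a small ClusterConfig as a plain dictionary and serializes it to the JSON shape shown earlier. The bucket name, zone, machine type, image version, and instance counts are placeholder assumptions, not recommended values.

```python
import json

# Hypothetical values for illustration only; substitute your own.
cluster_config = {
    "configBucket": "example-staging-bucket",  # assumed bucket name
    "gceClusterConfig": {"zoneUri": "us-central1-f"},
    "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-2"},
    "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-2"},
    "softwareConfig": {"imageVersion": "1.2"},
}

# Every ClusterConfig field is marked Optional, so unused keys are left out.
body = json.dumps(cluster_config, indent=2, sort_keys=True)
print(body)
```

The resulting string is what would go in the config portion of a cluster resource; how it is sent to the API is outside the scope of this sketch.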

GceClusterConfig

Common config settings for resources of Compute Engine cluster instances, applicable to all instances in the cluster.

JSON representation
{
  "zoneUri": string,
  "networkUri": string,
  "subnetworkUri": string,
  "internalIpOnly": boolean,
  "privateIpv6GoogleAccess": enum (PrivateIpv6GoogleAccess),
  "serviceAccount": string,
  "serviceAccountScopes": [
    string
  ],
  "tags": [
    string
  ],
  "metadata": {
    string: string,
    ...
  },
  "reservationAffinity": {
    object (ReservationAffinity)
  },
  "nodeGroupAffinity": {
    object (NodeGroupAffinity)
  }
}
Fields
zoneUri

string

Optional. The zone where the Compute Engine cluster will be located. On a create request, it is required in the "global" region. If omitted in a non-global Dataproc region, the service will pick a zone in the corresponding Compute Engine region. On a get request, zone will always be present.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/zones/[zone]
  • projects/[projectId]/zones/[zone]
  • us-central1-f
networkUri

string

Optional. The Compute Engine network to be used for machine communications. Cannot be specified with subnetworkUri. If neither networkUri nor subnetworkUri is specified, the "default" network of the project is used, if it exists. Cannot be a "Custom Subnet Network" (see Using Subnetworks for more information).

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/global/networks/default
  • projects/[projectId]/global/networks/default
  • default
subnetworkUri

string

Optional. The Compute Engine subnetwork to be used for machine communications. Cannot be specified with networkUri.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/regions/us-east1/subnetworks/sub0
  • projects/[projectId]/regions/us-east1/subnetworks/sub0
  • sub0
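
The rule that networkUri and subnetworkUri cannot be specified together can be checked on the client before a request is sent. The helper below is a minimal sketch of such a check; the function name is hypothetical and it covers only this one rule.

```python
def check_network_fields(gce_cluster_config: dict) -> None:
    """Raise ValueError if both networkUri and subnetworkUri are set."""
    has_network = bool(gce_cluster_config.get("networkUri"))
    has_subnetwork = bool(gce_cluster_config.get("subnetworkUri"))
    if has_network and has_subnetwork:
        raise ValueError("networkUri and subnetworkUri cannot both be set")


# Either field alone (or neither) passes silently.
check_network_fields({"subnetworkUri": "sub0"})
```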
internalIpOnly

boolean

Optional. If true, all instances in the cluster will only have internal IP addresses. By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. This internalIpOnly restriction can only be enabled for subnetwork enabled networks, and all off-cluster dependencies must be configured to be accessible without external IP addresses.

privateIpv6GoogleAccess

enum (PrivateIpv6GoogleAccess)

Optional. The type of IPv6 access for a cluster.

serviceAccount

string

Optional. The Dataproc service account (also see VM Data Plane identity) used by Dataproc cluster VM instances to access Google Cloud Platform services.

If not specified, the Compute Engine default service account is used.

serviceAccountScopes[]

string

Optional. The URIs of service account scopes to be included in Compute Engine instances. A base set of scopes is always included, and a further set of default scopes is added if no scopes are specified.

tags[]

string

The Compute Engine tags to add to all instances (see Tagging instances).

metadata

map (key: string, value: string)

The Compute Engine metadata entries to add to all instances (see Project and instance metadata).

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

reservationAffinity

object (ReservationAffinity)

Optional. Reservation Affinity for consuming Zonal reservation.

nodeGroupAffinity

object (NodeGroupAffinity)

Optional. Node Group Affinity for sole-tenant clusters.

PrivateIpv6GoogleAccess

PrivateIpv6GoogleAccess controls whether and how Dataproc cluster nodes can communicate with Google Services through gRPC over IPv6. These values are directly mapped to corresponding values in the Compute Engine Instance fields.

Enums
PRIVATE_IPV6_GOOGLE_ACCESS_UNSPECIFIED If unspecified, Compute Engine default behavior will apply, which is the same as INHERIT_FROM_SUBNETWORK.
INHERIT_FROM_SUBNETWORK Private access to and from Google Services configuration inherited from the subnetwork configuration. This is the default Compute Engine behavior.
OUTBOUND Enables outbound private IPv6 access to Google Services from the Dataproc cluster.
BIDIRECTIONAL Enables bidirectional private IPv6 access between Google Services and the Dataproc cluster.

ReservationAffinity

Reservation Affinity for consuming Zonal reservation.

JSON representation
{
  "consumeReservationType": enum (Type),
  "key": string,
  "values": [
    string
  ]
}
Fields
consumeReservationType

enum (Type)

Optional. Type of reservation to consume.

key

string

Optional. Corresponds to the label key of reservation resource.

values[]

string

Optional. Corresponds to the label values of reservation resource.

Type

Indicates whether to consume capacity from a reservation or not.

Enums
TYPE_UNSPECIFIED
NO_RESERVATION Do not consume from any allocated capacity.
ANY_RESERVATION Consume any reservation available.
SPECIFIC_RESERVATION Must consume from a specific reservation. The key and values fields must be set to identify the reservation.
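
The dependency between SPECIFIC_RESERVATION and the key and values fields can be sketched as a small builder. The function name, the label key, and the reservation name below are hypothetical placeholders chosen for illustration.

```python
def make_reservation_affinity(consume_type: str, key: str = "", values=None) -> dict:
    """Build a ReservationAffinity dict, checking the SPECIFIC_RESERVATION rule."""
    values = list(values or [])
    if consume_type == "SPECIFIC_RESERVATION" and not (key and values):
        raise ValueError("SPECIFIC_RESERVATION needs both key and values")
    affinity = {"consumeReservationType": consume_type}
    if key:
        affinity["key"] = key
    if values:
        affinity["values"] = values
    return affinity


# "example-label-key" and "example-reservation" are placeholders.
affinity = make_reservation_affinity(
    "SPECIFIC_RESERVATION", key="example-label-key", values=["example-reservation"]
)
```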

NodeGroupAffinity

Node Group Affinity for clusters using sole-tenant node groups.

JSON representation
{
  "nodeGroupUri": string
}
Fields
nodeGroupUri

string

Required. The URI of a sole-tenant node group resource that the cluster will be created on.

A full URL, partial URI, or node group name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/zones/us-central1-a/nodeGroups/node-group-1
  • projects/[projectId]/zones/us-central1-a/nodeGroups/node-group-1
  • node-group-1

InstanceGroupConfig

The config settings for Compute Engine resources in an instance group, such as a master or worker group.

JSON representation
{
  "numInstances": integer,
  "instanceNames": [
    string
  ],
  "imageUri": string,
  "machineTypeUri": string,
  "diskConfig": {
    object (DiskConfig)
  },
  "isPreemptible": boolean,
  "preemptibility": enum (Preemptibility),
  "managedGroupConfig": {
    object (ManagedGroupConfig)
  },
  "accelerators": [
    {
      object (AcceleratorConfig)
    }
  ],
  "minCpuPlatform": string
}
Fields
numInstances

integer

Optional. The number of VM instances in the instance group. For master instance groups, must be set to 1.

instanceNames[]

string

Output only. The list of instance names. Dataproc derives the names from clusterName, numInstances, and the instance group.

imageUri

string

Optional. The Compute Engine image resource used for cluster instances.

The URI can represent an image or image family.

Image examples:

  • https://www.googleapis.com/compute/beta/projects/[projectId]/global/images/[image-id]
  • projects/[projectId]/global/images/[image-id]
  • image-id

Image family examples. Dataproc will use the most recent image from the family:

  • https://www.googleapis.com/compute/beta/projects/[projectId]/global/images/family/[custom-image-family-name]
  • projects/[projectId]/global/images/family/[custom-image-family-name]

If the URI is unspecified, it will be inferred from SoftwareConfig.image_version or the system default.

machineTypeUri

string

Optional. The Compute Engine machine type used for cluster instances.

A full URL, partial URI, or short name are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/zones/us-east1-a/machineTypes/n1-standard-2
  • projects/[projectId]/zones/us-east1-a/machineTypes/n1-standard-2
  • n1-standard-2

Auto Zone Exception: If you are using the Dataproc Auto Zone Placement feature, you must use the short name of the machine type resource, for example, n1-standard-2.
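
Because Auto Zone Placement accepts only the short name, a client that stores full URLs or partial URIs may need to reduce them first. A minimal sketch, assuming the short name is always the last path segment (as in the examples above); the helper name is hypothetical:

```python
def short_name(resource_uri: str) -> str:
    """Return the last path segment of a full URL, partial URI, or short name."""
    return resource_uri.rstrip("/").rsplit("/", 1)[-1]


# A value that is already a short name is returned unchanged.
machine_type = short_name("projects/example-project/zones/us-east1-a/machineTypes/n1-standard-2")
```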

diskConfig

object (DiskConfig)

Optional. Disk option config settings.

isPreemptible

boolean

Output only. Specifies that this instance group contains preemptible instances.

preemptibility

enum (Preemptibility)

Optional. Specifies the preemptibility of the instance group.

The default value for master and worker groups is NON_PREEMPTIBLE. This default cannot be changed.

The default value for secondary instances is PREEMPTIBLE.

managedGroupConfig

object (ManagedGroupConfig)

Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups.

accelerators[]

object (AcceleratorConfig)

Optional. The Compute Engine accelerator configuration for these instances.

minCpuPlatform

string

Optional. Specifies the minimum CPU platform for the instance group. See Dataproc -> Minimum CPU Platform.

DiskConfig

Specifies the config of disk options for a group of VM instances.

JSON representation
{
  "bootDiskType": string,
  "bootDiskSizeGb": integer,
  "numLocalSsds": integer
}
Fields
bootDiskType

string

Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-ssd" (Persistent Disk Solid State Drive) or "pd-standard" (Persistent Disk Hard Disk Drive).

bootDiskSizeGb

integer

Optional. Size in GB of the boot disk (default is 500GB).

numLocalSsds

integer

Optional. Number of attached SSDs, from 0 to 4 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and HDFS data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries.
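
The defaults and limits described for DiskConfig can be expressed as a small client-side normalizer. This is a sketch only: the function and constant names are hypothetical, and the service remains the authority on what it accepts.

```python
VALID_BOOT_DISK_TYPES = {"pd-standard", "pd-ssd"}


def normalize_disk_config(disk_config: dict) -> dict:
    """Apply the documented defaults and reject out-of-range values."""
    result = {
        "bootDiskType": disk_config.get("bootDiskType", "pd-standard"),
        "bootDiskSizeGb": disk_config.get("bootDiskSizeGb", 500),
        "numLocalSsds": disk_config.get("numLocalSsds", 0),
    }
    if result["bootDiskType"] not in VALID_BOOT_DISK_TYPES:
        raise ValueError(f"unsupported bootDiskType: {result['bootDiskType']}")
    if not 0 <= result["numLocalSsds"] <= 4:
        raise ValueError("numLocalSsds must be between 0 and 4")
    return result


# An empty config resolves to the documented defaults.
defaults = normalize_disk_config({})
```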

Preemptibility

Controls the use of preemptible instances within the group.

Enums
PREEMPTIBILITY_UNSPECIFIED Preemptibility is unspecified; the system will choose the appropriate setting for each instance group.
NON_PREEMPTIBLE

Instances are non-preemptible.

This option is allowed for all instance groups and is the only valid value for Master and Worker instance groups.

PREEMPTIBLE

Instances are preemptible.

This option is allowed only for secondary worker groups.
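
The per-group rules can be summarized in code. The sketch below assumes that an unspecified value resolves to the default documented for the preemptibility field (NON_PREEMPTIBLE for master and worker groups, PREEMPTIBLE for secondary workers); the function name and the group labels "master", "worker", and "secondary" are hypothetical.

```python
def resolve_preemptibility(group: str, requested: str = "PREEMPTIBILITY_UNSPECIFIED") -> str:
    """Return the effective preemptibility for 'master', 'worker', or 'secondary'."""
    if group in ("master", "worker"):
        # NON_PREEMPTIBLE is the only valid value for these groups.
        if requested not in ("PREEMPTIBILITY_UNSPECIFIED", "NON_PREEMPTIBLE"):
            raise ValueError(f"{group} groups must be NON_PREEMPTIBLE")
        return "NON_PREEMPTIBLE"
    if group == "secondary":
        # Assumption: an unspecified value resolves to the documented default.
        return "PREEMPTIBLE" if requested == "PREEMPTIBILITY_UNSPECIFIED" else requested
    raise ValueError(f"unknown group: {group}")
```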

ManagedGroupConfig

Specifies the resources used to actively manage an instance group.

JSON representation
{
  "instanceTemplateName": string,
  "instanceGroupManagerName": string
}
Fields
instanceTemplateName

string

Output only. The name of the Instance Template used for the Managed Instance Group.

instanceGroupManagerName

string

Output only. The name of the Instance Group Manager for this group.

AcceleratorConfig

Specifies the type and number of accelerator cards attached to the instances of an instance group. See GPUs on Compute Engine.

JSON representation
{
  "acceleratorTypeUri": string,
  "acceleratorCount": integer
}
Fields
acceleratorTypeUri

string

Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See Compute Engine AcceleratorTypes.

Examples:

  • https://www.googleapis.com/compute/beta/projects/[projectId]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80
  • projects/[projectId]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80
  • nvidia-tesla-k80

Auto Zone Exception: If you are using the Dataproc Auto Zone Placement feature, you must use the short name of the accelerator type resource, for example, nvidia-tesla-k80.

acceleratorCount

integer

The number of the accelerator cards of this type exposed to this instance.

SoftwareConfig

Specifies the selection and config of software inside the cluster.

JSON representation
{
  "imageVersion": string,
  "properties": {
    string: string,
    ...
  },
  "optionalComponents": [
    enum (Component)
  ]
}
Fields
imageVersion

string

Optional. The version of software inside the cluster. It must be one of the supported Dataproc Versions, such as "1.2" (including a subminor version, such as "1.2.29"), or the "preview" version. If unspecified, it defaults to the latest Debian version.

properties

map (key: string, value: string)

Optional. The properties to set on daemon config files.

Property keys are specified in prefix:property format, for example core:hadoop.tmp.dir. The following are supported prefixes and their mappings:

  • capacity-scheduler: capacity-scheduler.xml
  • core: core-site.xml
  • distcp: distcp-default.xml
  • hdfs: hdfs-site.xml
  • hive: hive-site.xml
  • mapred: mapred-site.xml
  • pig: pig.properties
  • spark: spark-defaults.conf
  • yarn: yarn-site.xml

For more information, see Cluster properties.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
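
To make the prefix:property format concrete, the following sketch splits a property key and looks up the config file it maps to. It covers only the prefixes listed above; the function and table names are hypothetical.

```python
PREFIX_TO_FILE = {
    "capacity-scheduler": "capacity-scheduler.xml",
    "core": "core-site.xml",
    "distcp": "distcp-default.xml",
    "hdfs": "hdfs-site.xml",
    "hive": "hive-site.xml",
    "mapred": "mapred-site.xml",
    "pig": "pig.properties",
    "spark": "spark-defaults.conf",
    "yarn": "yarn-site.xml",
}


def split_property_key(key: str) -> tuple:
    """Split 'prefix:property' and return (config file, property name)."""
    prefix, sep, name = key.partition(":")
    if not sep or prefix not in PREFIX_TO_FILE or not name:
        raise ValueError(f"not a supported prefix:property key: {key!r}")
    return PREFIX_TO_FILE[prefix], name


# The example key from the text maps to core-site.xml.
target_file, property_name = split_property_key("core:hadoop.tmp.dir")
```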

optionalComponents[]

enum (Component)

Optional. The set of components to activate on the cluster.

Component

Cluster components that can be activated.

Enums
COMPONENT_UNSPECIFIED Unspecified component. Specifying this will cause Cluster creation to fail.
ANACONDA

The Anaconda Python distribution. The Anaconda component is not supported in the Dataproc preview 2.0 image. The 2.0 preview image is pre-installed with Miniconda.

DOCKER Docker
HIVE_WEBHCAT The Hive Web HCatalog (the REST service for accessing HCatalog).
JUPYTER The Jupyter Notebook.
PRESTO The Presto query engine.
RANGER The Ranger service.
SOLR The Solr service.
ZEPPELIN The Zeppelin notebook.
ZOOKEEPER The Zookeeper service.

NodeInitializationAction

Specifies an executable to run on a fully configured node and a timeout period for executable completion.

JSON representation
{
  "executableFile": string,
  "executionTimeout": string
}
Fields
executableFile

string

Required. Cloud Storage URI of executable file.

executionTimeout

string (Duration format)

Optional. Amount of time executable has to complete. Default is 10 minutes (see JSON representation of Duration).

Cluster creation fails with an explanatory error message (the name of the executable that caused the error and the exceeded timeout period) if the executable has not completed by the end of the timeout period.
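
The executionTimeout value uses the JSON Duration format, which is a number of seconds followed by "s". The sketch below formats a Python timedelta that way and builds a NodeInitializationAction; the helper name and the Cloud Storage path are hypothetical.

```python
from datetime import timedelta


def to_duration_string(delta: timedelta) -> str:
    """Format a timedelta as a JSON Duration: seconds followed by 's'."""
    seconds = delta.total_seconds()
    # Whole seconds are written without a fractional part.
    if seconds == int(seconds):
        return f"{int(seconds)}s"
    # Otherwise keep up to nine fractional digits and drop trailing zeros.
    return f"{seconds:.9f}".rstrip("0") + "s"


action = {
    # Hypothetical Cloud Storage path to a script.
    "executableFile": "gs://example-bucket/init/setup.sh",
    # Ten minutes, matching the documented default.
    "executionTimeout": to_duration_string(timedelta(minutes=10)),
}
```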

EncryptionConfig

Encryption settings for the cluster.

JSON representation
{
  "gcePdKmsKeyName": string
}
Fields
gcePdKmsKeyName

string

Optional. The Cloud KMS key name to use for persistent disk (PD) encryption for all instances in the cluster.

AutoscalingConfig

Autoscaling Policy config associated with the cluster.

JSON representation
{
  "policyUri": string
}
Fields
policyUri

string

Optional. The autoscaling policy used by the cluster.

Only resource names including projectid and location (region) are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]
  • projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]

Note that the policy must be in the same project and Dataproc region.
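
A policyUri in the partial-URI form shown above can be assembled from its three parts. This is a minimal sketch; the function name and the project, region, and policy values are hypothetical.

```python
def autoscaling_policy_uri(project_id: str, region: str, policy_id: str) -> str:
    """Build the partial-URI form of an autoscaling policy resource name."""
    parts = (("project_id", project_id), ("region", region), ("policy_id", policy_id))
    for label, value in parts:
        # Each part is a single path segment, so it must be non-empty and slash-free.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"projects/{project_id}/locations/{region}/autoscalingPolicies/{policy_id}"


autoscaling_config = {
    "policyUri": autoscaling_policy_uri("example-project", "us-central1", "example-policy")
}
```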

SecurityConfig

Security related configuration, including Kerberos.

JSON representation
{
  "kerberosConfig": {
    object (KerberosConfig)
  }
}
Fields
kerberosConfig

object (KerberosConfig)

Kerberos related configuration.

KerberosConfig

Specifies Kerberos related configuration.

JSON representation
{
  "enableKerberos": boolean,
  "rootPrincipalPasswordUri": string,
  "kmsKeyUri": string,
  "keystoreUri": string,
  "truststoreUri": string,
  "keystorePasswordUri": string,
  "keyPasswordUri": string,
  "truststorePasswordUri": string,
  "crossRealmTrustRealm": string,
  "crossRealmTrustKdc": string,
  "crossRealmTrustAdminServer": string,
  "crossRealmTrustSharedPasswordUri": string,
  "kdcDbKeyUri": string,
  "tgtLifetimeHours": integer,
  "realm": string
}
Fields
enableKerberos

boolean

Optional. Flag to indicate whether to Kerberize the cluster (default: false). Set this field to true to enable Kerberos on a cluster.

rootPrincipalPasswordUri

string

Required. The Cloud Storage URI of a KMS encrypted file containing the root principal password.

kmsKeyUri

string

Required. The URI of the KMS key used to encrypt various sensitive files.

keystoreUri

string

Optional. The Cloud Storage URI of the keystore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

truststoreUri

string

Optional. The Cloud Storage URI of the truststore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

keystorePasswordUri

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided keystore. For the self-signed certificate, this password is generated by Dataproc.

keyPasswordUri

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided key. For the self-signed certificate, this password is generated by Dataproc.

truststorePasswordUri

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided truststore. For the self-signed certificate, this password is generated by Dataproc.

crossRealmTrustRealm

string

Optional. The remote realm the Dataproc on-cluster KDC will trust, should the user enable cross realm trust.

crossRealmTrustKdc

string

Optional. The KDC (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

crossRealmTrustAdminServer

string

Optional. The admin server (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

crossRealmTrustSharedPasswordUri

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the shared password between the on-cluster Kerberos realm and the remote trusted realm, in a cross realm trust relationship.

kdcDbKeyUri

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the master key of the KDC database.

tgtLifetimeHours

integer

Optional. The lifetime of the ticket granting ticket, in hours. If not specified, or if set to 0, the default value of 10 is used.

realm

string

Optional. The name of the on-cluster Kerberos realm. If not specified, the uppercased domain of hostnames will be the realm.
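
The two Required fields and the tgtLifetimeHours default can be checked before a request is sent. The sketch below assumes the Required fields only matter when enableKerberos is true, which the text above does not state explicitly; the function name and the example URIs are hypothetical.

```python
def check_kerberos_config(kerberos_config: dict) -> dict:
    """Return a copy with the TGT lifetime default applied; raise on missing fields."""
    config = dict(kerberos_config)
    # Assumption: the Required fields are enforced only when Kerberos is enabled.
    if config.get("enableKerberos"):
        for field in ("rootPrincipalPasswordUri", "kmsKeyUri"):
            if not config.get(field):
                raise ValueError(f"{field} is required")
    # Unset or 0 falls back to the documented default of 10 hours.
    if not config.get("tgtLifetimeHours"):
        config["tgtLifetimeHours"] = 10
    return config


checked = check_kerberos_config({
    "enableKerberos": True,
    "rootPrincipalPasswordUri": "gs://example-bucket/root-password.encrypted",
    "kmsKeyUri": "projects/example-project/locations/global/keyRings/example-ring/cryptoKeys/example-key",
})
```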

LifecycleConfig

Specifies the cluster auto-delete schedule configuration.

JSON representation
{
  "idleDeleteTtl": string,
  "idleStartTime": string,

  // Union field ttl can be only one of the following:
  "autoDeleteTime": string,
  "autoDeleteTtl": string
  // End of list of possible types for union field ttl.
}
Fields
idleDeleteTtl

string (Duration format)

Optional. The duration to keep the cluster alive while idling (when no jobs are running). Passing this threshold will cause the cluster to be deleted. Minimum value is 10 minutes; maximum value is 14 days (see JSON representation of Duration).

idleStartTime

string (Timestamp format)

Output only. The time when the cluster became idle (most recent job finished) and became eligible for deletion due to idleness (see JSON representation of Timestamp).

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".

Union field ttl. Either the exact time the cluster should be deleted at or the cluster maximum age. ttl can be only one of the following:
autoDeleteTime

string (Timestamp format)

Optional. The time when the cluster will be auto-deleted (see JSON representation of Timestamp).

A timestamp in RFC3339 UTC "Zulu" format, with nanosecond resolution and up to nine fractional digits. Examples: "2014-10-02T15:01:23Z" and "2014-10-02T15:01:23.045123456Z".

autoDeleteTtl

string (Duration format)

Optional. The lifetime duration of the cluster. The cluster will be auto-deleted at the end of this period. Minimum value is 10 minutes; maximum value is 14 days (see JSON representation of Duration).
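
The union field and the duration limits can be sketched as a builder that refuses invalid combinations. The function and constant names are hypothetical, durations are truncated to whole seconds for simplicity, and autoDeleteTime is passed through as an already formatted RFC 3339 string.

```python
from datetime import timedelta

MIN_TTL = timedelta(minutes=10)
MAX_TTL = timedelta(days=14)


def make_lifecycle_config(idle_delete_ttl=None, auto_delete_ttl=None, auto_delete_time=None) -> dict:
    """Build a LifecycleConfig dict from timedeltas and an RFC 3339 string."""
    # The ttl union allows only one of autoDeleteTime and autoDeleteTtl.
    if auto_delete_ttl is not None and auto_delete_time is not None:
        raise ValueError("set only one of autoDeleteTime and autoDeleteTtl")
    config = {}
    for name, ttl in (("idleDeleteTtl", idle_delete_ttl), ("autoDeleteTtl", auto_delete_ttl)):
        if ttl is None:
            continue
        if not MIN_TTL <= ttl <= MAX_TTL:
            raise ValueError(f"{name} must be between 10 minutes and 14 days")
        # JSON Duration format: whole seconds followed by "s".
        config[name] = f"{int(ttl.total_seconds())}s"
    if auto_delete_time is not None:
        config["autoDeleteTime"] = auto_delete_time
    return config


lifecycle = make_lifecycle_config(
    idle_delete_ttl=timedelta(hours=1), auto_delete_ttl=timedelta(days=1)
)
```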

EndpointConfig

Endpoint config for this cluster.

JSON representation
{
  "httpPorts": {
    string: string,
    ...
  },
  "enableHttpPortAccess": boolean
}
Fields
httpPorts

map (key: string, value: string)

Output only. The map of port descriptions to URLs. Will only be populated if enableHttpPortAccess is true.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

enableHttpPortAccess

boolean

Optional. If true, enable HTTP access to specific ports on the cluster from external sources. Defaults to false.