DataprocCluster

Property | Value
--- | ---
Google Cloud Service Name | Dataproc
Google Cloud Service Documentation | /dataproc/docs/
Google Cloud REST Resource Name | v1.projects.regions.clusters
Google Cloud REST Resource Documentation | /dataproc/docs/reference/rest/v1/projects.regions.clusters
Config Connector Resource Short Names | gcpdataproccluster, gcpdataprocclusters, dataproccluster
Config Connector Service Name | dataproc.googleapis.com
Config Connector Resource Fully Qualified Name | dataprocclusters.dataproc.cnrm.cloud.google.com
Can Be Referenced by IAMPolicy/IAMPolicyMember | No

Custom Resource Definition Properties

Spec

Schema

  config:
    autoscalingConfig:
      policyRef:
        external: string
        name: string
        namespace: string
    encryptionConfig:
      gcePdKmsKeyRef:
        external: string
        name: string
        namespace: string
    endpointConfig:
      enableHttpPortAccess: boolean
    gceClusterConfig:
      internalIPOnly: boolean
      metadata:
        string: string
      networkRef:
        external: string
        name: string
        namespace: string
      nodeGroupAffinity:
        nodeGroupRef:
          external: string
          name: string
          namespace: string
      privateIPv6GoogleAccess: string
      reservationAffinity:
        consumeReservationType: string
        key: string
        values:
        - string
      serviceAccountRef:
        external: string
        name: string
        namespace: string
      serviceAccountScopes:
      - string
      subnetworkRef:
        external: string
        name: string
        namespace: string
      tags:
      - string
      zone: string
    initializationActions:
    - executableFile: string
      executionTimeout: string
    lifecycleConfig:
      autoDeleteTime: string
      autoDeleteTtl: string
      idleDeleteTtl: string
    masterConfig:
      accelerators:
      - acceleratorCount: integer
        acceleratorType: string
      diskConfig:
        bootDiskSizeGb: integer
        bootDiskType: string
        numLocalSsds: integer
      imageRef:
        external: string
        name: string
        namespace: string
      machineType: string
      minCpuPlatform: string
      numInstances: integer
      preemptibility: string
    secondaryWorkerConfig:
      accelerators:
      - acceleratorCount: integer
        acceleratorType: string
      diskConfig:
        bootDiskSizeGb: integer
        bootDiskType: string
        numLocalSsds: integer
      imageRef:
        external: string
        name: string
        namespace: string
      machineType: string
      minCpuPlatform: string
      numInstances: integer
      preemptibility: string
    securityConfig:
      kerberosConfig:
        crossRealmTrustAdminServer: string
        crossRealmTrustKdc: string
        crossRealmTrustRealm: string
        crossRealmTrustSharedPassword: string
        enableKerberos: boolean
        kdcDbKey: string
        keyPassword: string
        keystore: string
        keystorePassword: string
        kmsKeyRef:
          external: string
          name: string
          namespace: string
        realm: string
        rootPrincipalPassword: string
        tgtLifetimeHours: integer
        truststore: string
        truststorePassword: string
    softwareConfig:
      imageVersion: string
      optionalComponents:
      - string
      properties:
        string: string
    stagingBucketRef:
      external: string
      name: string
      namespace: string
    tempBucketRef:
      external: string
      name: string
      namespace: string
    workerConfig:
      accelerators:
      - acceleratorCount: integer
        acceleratorType: string
      diskConfig:
        bootDiskSizeGb: integer
        bootDiskType: string
        numLocalSsds: integer
      imageRef:
        external: string
        name: string
        namespace: string
      machineType: string
      minCpuPlatform: string
      numInstances: integer
      preemptibility: string
  location: string
  projectRef:
    external: string
    name: string
    namespace: string
  resourceID: string
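
The schema above can be read as a template for a manifest. The following is a minimal sketch rather than a tested configuration: the `apiVersion` value is an assumption inferred from the resource's fully qualified name, and the resource name, region, and machine types are hypothetical placeholders.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-cluster  # hypothetical name
spec:
  location: us-central1  # Dataproc region (placeholder)
  config:
    masterConfig:
      numInstances: 1  # standard (non-HA) cluster
      machineType: n1-standard-2
    workerConfig:
      numInstances: 2
      machineType: n1-standard-2
```

Every field under `config` is optional in the schema, so a real manifest would usually set only the blocks it needs and let Dataproc supply defaults for the rest.
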
Fields

config

Optional

object

Required. The cluster config. Note that Dataproc may set default values, and values may change when clusters are updated.

config.autoscalingConfig

Optional

object

Optional. Autoscaling config for the policy associated with the cluster. Cluster does not autoscale if this field is unset.

config.autoscalingConfig.policyRef

Optional

object

config.autoscalingConfig.policyRef.external

Optional

string

Optional. The autoscaling policy used by the cluster. Only resource names including projectid and location (region) are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]` * `projects/[project_id]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]` Note that the policy must be in the same project and Dataproc region.

config.autoscalingConfig.policyRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.autoscalingConfig.policyRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
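
A sketch of how `policyRef` might be filled in. It assumes the usual Config Connector reference convention, in which a reference is given either as `name` (plus an optional `namespace`) or as `external`, but not both. The policy name, the `apiVersion`, and the region are hypothetical.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-autoscaling-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    autoscalingConfig:
      policyRef:
        # Option 1 (assumed convention): a policy managed as a Kubernetes resource.
        name: example-policy
        # Option 2: an existing policy, addressed by its full resource name.
        # external: projects/example-project/locations/us-central1/autoscalingPolicies/example-policy
```

Whichever form is used, the policy has to live in the same project and Dataproc region as the cluster, as noted in the `external` description.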

config.encryptionConfig

Optional

object

Optional. Encryption settings for the cluster.

config.encryptionConfig.gcePdKmsKeyRef

Optional

object

config.encryptionConfig.gcePdKmsKeyRef.external

Optional

string

Optional. The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.

config.encryptionConfig.gcePdKmsKeyRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.encryptionConfig.gcePdKmsKeyRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
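
As an illustration of the encryption settings, the sketch below points `gcePdKmsKeyRef` at an existing Cloud KMS key through `external`. The project, key ring, and key names are hypothetical, and the `apiVersion` is an assumption.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-cmek-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    encryptionConfig:
      gcePdKmsKeyRef:
        # Full Cloud KMS key name (placeholder values).
        external: projects/example-project/locations/us-central1/keyRings/example-ring/cryptoKeys/example-key
```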

config.endpointConfig

Optional

object

Optional. Port/endpoint configuration for this cluster.

config.endpointConfig.enableHttpPortAccess

Optional

boolean

Optional. If true, enables HTTP access to specific ports on the cluster from external sources. Defaults to false.

config.gceClusterConfig

Optional

object

Optional. The shared Compute Engine config settings for all instances in a cluster.

config.gceClusterConfig.internalIPOnly

Optional

boolean

Optional. If true, all instances in the cluster will only have internal IP addresses. By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. This `internal_ip_only` restriction can only be enabled for subnetwork enabled networks, and all off-cluster dependencies must be configured to be accessible without external IP addresses.

config.gceClusterConfig.metadata

Optional

map (key: string, value: string)

The Compute Engine metadata entries to add to all instances (see [Project and instance metadata](https://cloud.google.com/compute/docs/storing-retrieving-metadata#project_and_instance_metadata)).

config.gceClusterConfig.networkRef

Optional

object

config.gceClusterConfig.networkRef.external

Optional

string

Optional. The Compute Engine network to be used for machine communications. Cannot be specified together with `subnetworkRef` (`subnetwork_uri` in the Dataproc API). If neither a network (`network_uri`) nor a subnetwork (`subnetwork_uri`) is specified, the "default" network of the project is used, if it exists. Cannot be a "Custom Subnet Network" (see [Using Subnetworks](https://cloud.google.com/compute/docs/subnetworks) for more information). A full URL, partial URI, or short name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/global/networks/default` * `projects/[project_id]/global/networks/default` * `default`

config.gceClusterConfig.networkRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.gceClusterConfig.networkRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.gceClusterConfig.nodeGroupAffinity

Optional

object

Optional. Node Group Affinity for sole-tenant clusters.

config.gceClusterConfig.nodeGroupAffinity.nodeGroupRef

Required*

object

config.gceClusterConfig.nodeGroupAffinity.nodeGroupRef.external

Optional

string

Required. The URI of a sole-tenant [node group resource](https://cloud.google.com/compute/docs/reference/rest/v1/nodeGroups) that the cluster will be created on. A full URL, partial URI, or node group name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/zones/us-central1-a/nodeGroups/node-group-1` * `projects/[project_id]/zones/us-central1-a/nodeGroups/node-group-1` * `node-group-1`

config.gceClusterConfig.nodeGroupAffinity.nodeGroupRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.gceClusterConfig.nodeGroupAffinity.nodeGroupRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.gceClusterConfig.privateIPv6GoogleAccess

Optional

string

Optional. The type of IPv6 access for a cluster. Possible values: PRIVATE_IPV6_GOOGLE_ACCESS_UNSPECIFIED, INHERIT_FROM_SUBNETWORK, OUTBOUND, BIDIRECTIONAL

config.gceClusterConfig.reservationAffinity

Optional

object

Optional. Reservation Affinity for consuming Zonal reservation.

config.gceClusterConfig.reservationAffinity.consumeReservationType

Optional

string

Optional. Type of reservation to consume. Possible values: TYPE_UNSPECIFIED, NO_RESERVATION, ANY_RESERVATION, SPECIFIC_RESERVATION

config.gceClusterConfig.reservationAffinity.key

Optional

string

Optional. Corresponds to the label key of reservation resource.

config.gceClusterConfig.reservationAffinity.values

Optional

list (string)

Optional. Corresponds to the label values of reservation resource.

config.gceClusterConfig.reservationAffinity.values.[]

Optional

string

config.gceClusterConfig.serviceAccountRef

Optional

object

config.gceClusterConfig.serviceAccountRef.external

Optional

string

Optional. The [Dataproc service account](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#service_accounts_in_dataproc) (also see [VM Data Plane identity](https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals#vm_service_account_data_plane_identity)) used by Dataproc cluster VM instances to access Google Cloud Platform services. If not specified, the [Compute Engine default service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) is used.

config.gceClusterConfig.serviceAccountRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.gceClusterConfig.serviceAccountRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.gceClusterConfig.serviceAccountScopes

Optional

list (string)

Optional. The URIs of service account scopes to be included in Compute Engine instances. The following base set of scopes is always included:
* https://www.googleapis.com/auth/cloud.useraccounts.readonly
* https://www.googleapis.com/auth/devstorage.read_write
* https://www.googleapis.com/auth/logging.write

If no scopes are specified, the following defaults are also provided:
* https://www.googleapis.com/auth/bigquery
* https://www.googleapis.com/auth/bigtable.admin.table
* https://www.googleapis.com/auth/bigtable.data
* https://www.googleapis.com/auth/devstorage.full_control

config.gceClusterConfig.serviceAccountScopes.[]

Optional

string

config.gceClusterConfig.subnetworkRef

Optional

object

config.gceClusterConfig.subnetworkRef.external

Optional

string

Optional. The Compute Engine subnetwork to be used for machine communications. Cannot be specified together with `networkRef` (`network_uri` in the Dataproc API). A full URL, partial URI, or short name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/regions/us-east1/subnetworks/sub0` * `projects/[project_id]/regions/us-east1/subnetworks/sub0` * `sub0`

config.gceClusterConfig.subnetworkRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.gceClusterConfig.subnetworkRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.gceClusterConfig.tags

Optional

list (string)

The Compute Engine tags to add to all instances (see [Tagging instances](https://cloud.google.com/compute/docs/label-or-tag-resources#tags)).

config.gceClusterConfig.tags.[]

Optional

string

config.gceClusterConfig.zone

Optional

string

Optional. The zone where the Compute Engine cluster will be located. On a create request, it is required in the "global" region. If omitted in a non-global Dataproc region, the service will pick a zone in the corresponding Compute Engine region. On a get request, zone will always be present. A full URL, partial URI, or short name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/zones/[zone]` * `projects/[project_id]/zones/[zone]` * `us-central1-f`
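
The `gceClusterConfig` fields above combine roughly as follows. This is a sketch under stated assumptions: the subnetwork name, tag, metadata entry, and `apiVersion` are hypothetical, and `networkRef` is deliberately left out because it cannot be combined with a subnetwork.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-private-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    gceClusterConfig:
      internalIPOnly: true  # only valid on subnetwork-enabled networks
      subnetworkRef:
        name: example-subnet  # hypothetical subnetwork resource
      serviceAccountScopes:
      - https://www.googleapis.com/auth/bigquery
      tags:
      - dataproc-node  # hypothetical network tag
      metadata:
        team: data-platform  # hypothetical metadata entry
      zone: us-central1-f  # short name; may be omitted in a non-global region
```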

config.initializationActions

Optional

list (object)

Optional. Commands to execute on each node after config is completed. By default, executables are run on master and all worker nodes. You can test a node's `role` metadata to run an executable on a master or worker node, as shown below using `curl` (you can also use `wget`):

    ROLE=$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
      ... master specific actions ...
    else
      ... worker specific actions ...
    fi

config.initializationActions.[]

Optional

object

config.initializationActions.[].executableFile

Required*

string

Required. Cloud Storage URI of executable file.

config.initializationActions.[].executionTimeout

Optional

string

Optional. Amount of time executable has to complete. Default is 10 minutes (see JSON representation of [Duration](https://developers.google.com/protocol-buffers/docs/proto3#json)). Cluster creation fails with an explanatory error message (the name of the executable that caused the error and the exceeded timeout period) if the executable is not completed at end of the timeout period.
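
A possible `initializationActions` entry is sketched below. The bucket and script path are hypothetical, the `apiVersion` is an assumption, and the timeout uses the JSON Duration form (seconds with an `s` suffix).

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-init-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    initializationActions:
    - executableFile: gs://example-bucket/scripts/install-deps.sh  # hypothetical script
      executionTimeout: "300s"  # 5 minutes instead of the 10-minute default
```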

config.lifecycleConfig

Optional

object

Optional. Lifecycle setting for the cluster.

config.lifecycleConfig.autoDeleteTime

Optional

string

Optional. The time when cluster will be auto-deleted (see JSON representation of [Timestamp](https://developers.google.com/protocol-buffers/docs/proto3#json)).

config.lifecycleConfig.autoDeleteTtl

Optional

string

Optional. The lifetime duration of cluster. The cluster will be auto-deleted at the end of this period. Minimum value is 10 minutes; maximum value is 14 days (see JSON representation of [Duration](https://developers.google.com/protocol-buffers/docs/proto3#json)).

config.lifecycleConfig.idleDeleteTtl

Optional

string

Optional. The duration to keep the cluster alive while idling (when no jobs are running). Passing this threshold will cause the cluster to be deleted. Minimum value is 5 minutes; maximum value is 14 days (see JSON representation of [Duration](https://developers.google.com/protocol-buffers/docs/proto3#json)).
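
To make the duration format concrete, here is a sketch of a `lifecycleConfig` that stays within the documented bounds. The values are illustrative, the `apiVersion` is an assumption, and `autoDeleteTime` is omitted on the assumption that it is an alternative to `autoDeleteTtl`.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-ephemeral-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    lifecycleConfig:
      idleDeleteTtl: "1800s"   # delete after 30 idle minutes (minimum is 5 minutes)
      autoDeleteTtl: "86400s"  # delete after 1 day at most (maximum is 14 days)
```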

config.masterConfig

Optional

object

Optional. The Compute Engine config settings for the master instance in a cluster.

config.masterConfig.accelerators

Optional

list (object)

Optional. The Compute Engine accelerator configuration for these instances.

config.masterConfig.accelerators.[]

Optional

object

config.masterConfig.accelerators.[].acceleratorCount

Optional

integer

The number of the accelerator cards of this type exposed to this instance.

config.masterConfig.accelerators.[].acceleratorType

Optional

string

Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See [Compute Engine AcceleratorTypes](https://cloud.google.com/compute/docs/reference/beta/acceleratorTypes). Examples: * `https://www.googleapis.com/compute/beta/projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80` * `projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80` * `nvidia-tesla-k80` **Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the accelerator type resource, for example, `nvidia-tesla-k80`.

config.masterConfig.diskConfig

Optional

object

Optional. Disk option config settings.

config.masterConfig.diskConfig.bootDiskSizeGb

Optional

integer

Optional. Size in GB of the boot disk (default is 500GB).

config.masterConfig.diskConfig.bootDiskType

Optional

string

Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-balanced" (Persistent Disk Balanced Solid State Drive), "pd-ssd" (Persistent Disk Solid State Drive), or "pd-standard" (Persistent Disk Hard Disk Drive). See [Disk types](https://cloud.google.com/compute/docs/disks#disk-types).

config.masterConfig.diskConfig.numLocalSsds

Optional

integer

Optional. Number of attached SSDs, from 0 to 4 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html) data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries.

config.masterConfig.imageRef

Optional

object

config.masterConfig.imageRef.external

Optional

string

Optional. The Compute Engine image resource used for cluster instances. The URI can represent an image or image family. Image examples: * `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/[image-id]` * `projects/[project_id]/global/images/[image-id]` * `image-id` Image family examples. Dataproc will use the most recent image from the family: * `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/family/[custom-image-family-name]` * `projects/[project_id]/global/images/family/[custom-image-family-name]` If the URI is unspecified, it will be inferred from `SoftwareConfig.image_version` or the system default.

config.masterConfig.imageRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.masterConfig.imageRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.masterConfig.machineType

Optional

string

Optional. The Compute Engine machine type used for cluster instances. A full URL, partial URI, or short name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2` * `projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2` * `n1-standard-2` **Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the machine type resource, for example, `n1-standard-2`.

config.masterConfig.minCpuPlatform

Optional

string

Optional. Specifies the minimum CPU platform for the Instance Group. See [Dataproc -> Minimum CPU Platform](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-min-cpu).

config.masterConfig.numInstances

Optional

integer

Optional. The number of VM instances in the instance group. For [HA cluster](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/high-availability) `config.masterConfig` groups, **must be set to 3**. For standard cluster `config.masterConfig` groups, **must be set to 1**.

config.masterConfig.preemptibility

Optional

string

Optional. Specifies the preemptibility of the instance group. The default value for master and worker groups is `NON_PREEMPTIBLE`. This default cannot be changed. The default value for secondary instances is `PREEMPTIBLE`. Possible values: PREEMPTIBILITY_UNSPECIFIED, NON_PREEMPTIBLE, PREEMPTIBLE
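
The `masterConfig` fields might be combined as in the following sketch of a high-availability master group. The machine type and disk values are illustrative, and the `apiVersion` is an assumption.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-ha-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    masterConfig:
      numInstances: 3  # HA clusters require exactly 3 masters
      machineType: n1-standard-2
      diskConfig:
        bootDiskType: pd-ssd
        bootDiskSizeGb: 500
        numLocalSsds: 0
```

For a standard cluster, `numInstances` would be 1 instead.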

config.secondaryWorkerConfig

Optional

object

Optional. The Compute Engine config settings for a cluster's secondary worker instances.

config.secondaryWorkerConfig.accelerators

Optional

list (object)

Optional. The Compute Engine accelerator configuration for these instances.

config.secondaryWorkerConfig.accelerators.[]

Optional

object

config.secondaryWorkerConfig.accelerators.[].acceleratorCount

Optional

integer

The number of the accelerator cards of this type exposed to this instance.

config.secondaryWorkerConfig.accelerators.[].acceleratorType

Optional

string

Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See [Compute Engine AcceleratorTypes](https://cloud.google.com/compute/docs/reference/beta/acceleratorTypes). Examples: * `https://www.googleapis.com/compute/beta/projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80` * `projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80` * `nvidia-tesla-k80` **Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the accelerator type resource, for example, `nvidia-tesla-k80`.

config.secondaryWorkerConfig.diskConfig

Optional

object

Optional. Disk option config settings.

config.secondaryWorkerConfig.diskConfig.bootDiskSizeGb

Optional

integer

Optional. Size in GB of the boot disk (default is 500GB).

config.secondaryWorkerConfig.diskConfig.bootDiskType

Optional

string

Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-balanced" (Persistent Disk Balanced Solid State Drive), "pd-ssd" (Persistent Disk Solid State Drive), or "pd-standard" (Persistent Disk Hard Disk Drive). See [Disk types](https://cloud.google.com/compute/docs/disks#disk-types).

config.secondaryWorkerConfig.diskConfig.numLocalSsds

Optional

integer

Optional. Number of attached SSDs, from 0 to 4 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html) data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries.

config.secondaryWorkerConfig.imageRef

Optional

object

config.secondaryWorkerConfig.imageRef.external

Optional

string

Optional. The Compute Engine image resource used for cluster instances. The URI can represent an image or image family. Image examples: * `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/[image-id]` * `projects/[project_id]/global/images/[image-id]` * `image-id` Image family examples. Dataproc will use the most recent image from the family: * `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/family/[custom-image-family-name]` * `projects/[project_id]/global/images/family/[custom-image-family-name]` If the URI is unspecified, it will be inferred from `SoftwareConfig.image_version` or the system default.

config.secondaryWorkerConfig.imageRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.secondaryWorkerConfig.imageRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.secondaryWorkerConfig.machineType

Optional

string

Optional. The Compute Engine machine type used for cluster instances. A full URL, partial URI, or short name are valid. Examples: * `https://www.googleapis.com/compute/v1/projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2` * `projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2` * `n1-standard-2` **Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the machine type resource, for example, `n1-standard-2`.

config.secondaryWorkerConfig.minCpuPlatform

Optional

string

Optional. Specifies the minimum CPU platform for the Instance Group. See [Dataproc -> Minimum CPU Platform](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-min-cpu).

config.secondaryWorkerConfig.numInstances

Optional

integer

Optional. The number of VM instances in the instance group. For [HA cluster](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/high-availability) `config.masterConfig` groups, **must be set to 3**. For standard cluster `config.masterConfig` groups, **must be set to 1**.

config.secondaryWorkerConfig.preemptibility

Optional

string

Optional. Specifies the preemptibility of the instance group. The default value for master and worker groups is `NON_PREEMPTIBLE`. This default cannot be changed. The default value for secondary instances is `PREEMPTIBLE`. Possible values: PREEMPTIBILITY_UNSPECIFIED, NON_PREEMPTIBLE, PREEMPTIBLE
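
A secondary worker group could be declared roughly as follows. The instance count and disk size are illustrative, the `apiVersion` is an assumption, and `preemptibility` is spelled out even though `PREEMPTIBLE` is already the documented default for secondary instances.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-secondary-workers  # hypothetical name
spec:
  location: us-central1
  config:
    secondaryWorkerConfig:
      numInstances: 4
      preemptibility: PREEMPTIBLE  # documented default for secondary workers
      diskConfig:
        bootDiskSizeGb: 100
```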

config.securityConfig

Optional

object

Optional. Security settings for the cluster.

config.securityConfig.kerberosConfig

Optional

object

Optional. Kerberos related configuration.

config.securityConfig.kerberosConfig.crossRealmTrustAdminServer

Optional

string

Optional. The admin server (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

config.securityConfig.kerberosConfig.crossRealmTrustKdc

Optional

string

Optional. The KDC (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

config.securityConfig.kerberosConfig.crossRealmTrustRealm

Optional

string

Optional. The remote realm the Dataproc on-cluster KDC will trust, should the user enable cross realm trust.

config.securityConfig.kerberosConfig.crossRealmTrustSharedPassword

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the shared password between the on-cluster Kerberos realm and the remote trusted realm, in a cross realm trust relationship.

config.securityConfig.kerberosConfig.enableKerberos

Optional

boolean

Optional. Flag to indicate whether to Kerberize the cluster (default: false). Set this field to true to enable Kerberos on a cluster.

config.securityConfig.kerberosConfig.kdcDbKey

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the master key of the KDC database.

config.securityConfig.kerberosConfig.keyPassword

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided key. For the self-signed certificate, this password is generated by Dataproc.

config.securityConfig.kerberosConfig.keystore

Optional

string

Optional. The Cloud Storage URI of the keystore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

config.securityConfig.kerberosConfig.keystorePassword

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided keystore. For the self-signed certificate, this password is generated by Dataproc.

config.securityConfig.kerberosConfig.kmsKeyRef

Optional

object

config.securityConfig.kerberosConfig.kmsKeyRef.external

Optional

string

Optional. The URI of the KMS key used to encrypt various sensitive files.

config.securityConfig.kerberosConfig.kmsKeyRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.securityConfig.kerberosConfig.kmsKeyRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.securityConfig.kerberosConfig.realm

Optional

string

Optional. The name of the on-cluster Kerberos realm. If not specified, the uppercased domain of hostnames will be the realm.

config.securityConfig.kerberosConfig.rootPrincipalPassword

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the root principal password.

config.securityConfig.kerberosConfig.tgtLifetimeHours

Optional

integer

Optional. The lifetime of the ticket granting ticket, in hours. If not specified, or user specifies 0, then default value 10 will be used.

config.securityConfig.kerberosConfig.truststore

Optional

string

Optional. The Cloud Storage URI of the truststore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

config.securityConfig.kerberosConfig.truststorePassword

Optional

string

Optional. The Cloud Storage URI of a KMS encrypted file containing the password to the user provided truststore. For the self-signed certificate, this password is generated by Dataproc.
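
The sketch below shows one way the Kerberos fields could fit together for a cluster without cross-realm trust. The KMS key name, bucket, and file name are hypothetical, the `apiVersion` is an assumption, and whether these three fields are sufficient on their own is not stated in this reference.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-kerberos-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    securityConfig:
      kerberosConfig:
        enableKerberos: true
        kmsKeyRef:
          # Key used to encrypt the password file below (placeholder values).
          external: projects/example-project/locations/global/keyRings/example-ring/cryptoKeys/example-key
        # Cloud Storage URI of the KMS-encrypted root principal password.
        rootPrincipalPassword: gs://example-secrets-bucket/root-password.encrypted
        tgtLifetimeHours: 10  # same as the documented default
```

Omitting `keystore` and `truststore` leaves Dataproc to provide a self-signed certificate, as described in those fields.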

config.softwareConfig

Optional

object

Optional. The config settings for software inside the cluster.

config.softwareConfig.imageVersion

Optional

string

Optional. The version of software inside the cluster. It must be one of the supported [Dataproc Versions](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions), such as "1.2" (including a subminor version, such as "1.2.29"), or the ["preview" version](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions#other_versions). If unspecified, it defaults to the latest Debian version.

config.softwareConfig.optionalComponents

Optional

list (string)

Optional. The set of components to activate on the cluster.

config.softwareConfig.optionalComponents.[]

Optional

string

config.softwareConfig.properties

Optional

map (key: string, value: string)

Optional. The properties to set on daemon config files. Property keys are specified in `prefix:property` format, for example `core:hadoop.tmp.dir`. The following are supported prefixes and their mappings:
* capacity-scheduler: `capacity-scheduler.xml`
* core: `core-site.xml`
* distcp: `distcp-default.xml`
* hdfs: `hdfs-site.xml`
* hive: `hive-site.xml`
* mapred: `mapred-site.xml`
* pig: `pig.properties`
* spark: `spark-defaults.conf`
* yarn: `yarn-site.xml`

For more information, see [Cluster properties](https://cloud.google.com/dataproc/docs/concepts/cluster-properties).
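
A sketch of `softwareConfig` using the `prefix:property` key format. The image version, the component name, and the property values are assumptions chosen for illustration and should be checked against the supported Dataproc versions and components; the `apiVersion` is likewise assumed.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1  # assumed API version
kind: DataprocCluster
metadata:
  name: example-software-cluster  # hypothetical name
spec:
  location: us-central1
  config:
    softwareConfig:
      imageVersion: "2.0-debian10"  # assumed to be a supported version
      optionalComponents:
      - JUPYTER  # assumed component name
      properties:
        # Keys are quoted because they contain a colon.
        "spark:spark.executor.memory": "4g"    # written to spark-defaults.conf
        "core:hadoop.tmp.dir": "/hadoop/tmp"   # written to core-site.xml
```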

config.stagingBucketRef

Optional

object

config.stagingBucketRef.external

Optional

string

Optional. A Cloud Storage bucket used to stage job dependencies, config files, and job driver console output. If you do not specify a staging bucket, Cloud Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's staging bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket (see [Dataproc staging bucket](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/staging-bucket)). **This field requires a Cloud Storage bucket name, not a URI to a Cloud Storage bucket.**

config.stagingBucketRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.stagingBucketRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.tempBucketRef

Optional

object

config.tempBucketRef.external

Optional

string

Optional. A Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. If you do not specify a temp bucket, Dataproc will determine a Cloud Storage location (US, ASIA, or EU) for your cluster's temp bucket according to the Compute Engine zone where your cluster is deployed, and then create and manage this project-level, per-location bucket. The default bucket has a TTL of 90 days, but you can use any TTL (or none) if you specify a bucket. **This field requires a Cloud Storage bucket name, not a URI to a Cloud Storage bucket.**

config.tempBucketRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.tempBucketRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
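
The two reference styles can be sketched as follows. This fragment assumes a `StorageBucket` named `dataproccluster-dep-staging` exists in the same namespace, and uses a hypothetical bucket name, `my-existing-temp-bucket`, for a bucket that Config Connector does not manage.

```yaml
# Illustrative fragment of a DataprocCluster spec.
spec:
  config:
    # Reference a StorageBucket resource managed by Config Connector.
    stagingBucketRef:
      name: dataproccluster-dep-staging
    # Reference an existing bucket by its bucket name only (no gs:// URI).
    tempBucketRef:
      external: my-existing-temp-bucket  # hypothetical bucket name
```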

config.workerConfig

Optional

object

Optional. The Compute Engine config settings for worker instances in a cluster.

config.workerConfig.accelerators

Optional

list (object)

Optional. The Compute Engine accelerator configuration for these instances.

config.workerConfig.accelerators.[]

Optional

object

config.workerConfig.accelerators.[].acceleratorCount

Optional

integer

The number of the accelerator cards of this type exposed to this instance.

config.workerConfig.accelerators.[].acceleratorType

Optional

string

Full URL, partial URI, or short name of the accelerator type resource to expose to this instance. See [Compute Engine AcceleratorTypes](https://cloud.google.com/compute/docs/reference/beta/acceleratorTypes). Examples:
* `https://www.googleapis.com/compute/beta/projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80`
* `projects/[project_id]/zones/us-east1-a/acceleratorTypes/nvidia-tesla-k80`
* `nvidia-tesla-k80`
**Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the accelerator type resource, for example, `nvidia-tesla-k80`.

config.workerConfig.diskConfig

Optional

object

Optional. Disk option config settings.

config.workerConfig.diskConfig.bootDiskSizeGb

Optional

integer

Optional. Size in GB of the boot disk (default is 500GB).

config.workerConfig.diskConfig.bootDiskType

Optional

string

Optional. Type of the boot disk (default is "pd-standard"). Valid values: "pd-balanced" (Persistent Disk Balanced Solid State Drive), "pd-ssd" (Persistent Disk Solid State Drive), or "pd-standard" (Persistent Disk Hard Disk Drive). See [Disk types](https://cloud.google.com/compute/docs/disks#disk-types).

config.workerConfig.diskConfig.numLocalSsds

Optional

integer

Optional. Number of attached SSDs, from 0 to 4 (default is 0). If SSDs are not attached, the boot disk is used to store runtime logs and [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html) data. If one or more SSDs are attached, this runtime bulk data is spread across them, and the boot disk contains only basic config and installed binaries.

config.workerConfig.imageRef

Optional

object

config.workerConfig.imageRef.external

Optional

string

Optional. The Compute Engine image resource used for cluster instances. The URI can represent an image or image family. Image examples:
* `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/[image-id]`
* `projects/[project_id]/global/images/[image-id]`
* `image-id`
Image family examples. Dataproc will use the most recent image from the family:
* `https://www.googleapis.com/compute/beta/projects/[project_id]/global/images/family/[custom-image-family-name]`
* `projects/[project_id]/global/images/family/[custom-image-family-name]`
If the URI is unspecified, it will be inferred from `SoftwareConfig.image_version` or the system default.

config.workerConfig.imageRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

config.workerConfig.imageRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

config.workerConfig.machineType

Optional

string

Optional. The Compute Engine machine type used for cluster instances. A full URL, partial URI, or short name are valid. Examples:
* `https://www.googleapis.com/compute/v1/projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2`
* `projects/[project_id]/zones/us-east1-a/machineTypes/n1-standard-2`
* `n1-standard-2`
**Auto Zone Exception**: If you are using the Dataproc [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/auto-zone#using_auto_zone_placement) feature, you must use the short name of the machine type resource, for example, `n1-standard-2`.

config.workerConfig.minCpuPlatform

Optional

string

Optional. Specifies the minimum CPU platform for the instance group. See [Dataproc -> Minimum CPU Platform](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-min-cpu).

config.workerConfig.numInstances

Optional

integer

Optional. The number of VM instances in the instance group. For [HA cluster](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/high-availability) `config.masterConfig` groups, **must be set to 3**. For standard cluster `config.masterConfig` groups, **must be set to 1**.

config.workerConfig.preemptibility

Optional

string

Optional. Specifies the preemptibility of the instance group. The default value for master and worker groups is `NON_PREEMPTIBLE`. This default cannot be changed. The default value for secondary instances is `PREEMPTIBLE`. Possible values: PREEMPTIBILITY_UNSPECIFIED, NON_PREEMPTIBLE, PREEMPTIBLE
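
A worker group that combines several of the fields above might look like the following sketch. The machine type and accelerator short names are taken from the field descriptions; the instance count, disk sizes, and counts are assumptions for illustration, and whether a given accelerator type is available depends on the zone.

```yaml
# Illustrative fragment of a DataprocCluster spec; values are placeholders.
spec:
  config:
    workerConfig:
      numInstances: 2
      # Short names work with or without Auto Zone Placement.
      machineType: "n1-standard-2"
      accelerators:
      - acceleratorCount: 1
        acceleratorType: "nvidia-tesla-k80"
      diskConfig:
        bootDiskType: "pd-ssd"
        bootDiskSizeGb: 100
        # With local SSDs attached, runtime logs and HDFS data go to the SSDs.
        numLocalSsds: 1
```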

location

Required

string

The location for the resource, usually a GCP region.

projectRef

Optional

object

The Project that this resource belongs to.

projectRef.external

Optional

string

Required. The Google Cloud Platform project ID that the cluster belongs to.

projectRef.name

Optional

string

Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

projectRef.namespace

Optional

string

Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/

resourceID

Optional

string

Immutable. Optional. The name of the resource. Used for creation and acquisition. When unset, the value of `metadata.name` is used as the default.

* Field is required when parent field is specified
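
The top-level fields can be sketched together as shown here. The manifest gives the Dataproc cluster a name that differs from `metadata.name` and points at a project by ID. All names (`dataproccluster-named`, `my-project-id`, `my-cluster-name`) are hypothetical, and a usable cluster would still need a `config` block such as the one in the sample YAML.

```yaml
apiVersion: dataproc.cnrm.cloud.google.com/v1beta1
kind: DataprocCluster
metadata:
  name: dataproccluster-named  # hypothetical Kubernetes object name
spec:
  location: "us-central1"
  projectRef:
    external: my-project-id  # hypothetical project ID
  # Immutable. Overrides metadata.name as the Dataproc cluster name.
  resourceID: my-cluster-name  # hypothetical cluster name
```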

Status

Schema

  clusterUuid: string
  conditions:
  - lastTransitionTime: string
    message: string
    reason: string
    status: string
    type: string
  config:
    endpointConfig:
      httpPorts:
        string: string
    lifecycleConfig:
      idleStartTime: string
    masterConfig:
      instanceNames:
      - string
      isPreemptible: boolean
      managedGroupConfig:
        instanceGroupManagerName: string
        instanceTemplateName: string
    secondaryWorkerConfig:
      instanceNames:
      - string
      isPreemptible: boolean
      managedGroupConfig:
        instanceGroupManagerName: string
        instanceTemplateName: string
    workerConfig:
      instanceNames:
      - string
      isPreemptible: boolean
      managedGroupConfig:
        instanceGroupManagerName: string
        instanceTemplateName: string
  metrics:
    hdfsMetrics:
      string: string
    yarnMetrics:
      string: string
  observedGeneration: integer
  status:
    detail: string
    state: string
    stateStartTime: string
    substate: string
  statusHistory:
  - detail: string
    state: string
    stateStartTime: string
    substate: string

Fields

clusterUuid

string

Output only. A cluster UUID (Universally Unique Identifier). Dataproc generates this value when it creates the cluster.

conditions

list (object)

Conditions represent the latest available observation of the resource's current state.

conditions.[]

object

conditions.[].lastTransitionTime

string

Last time the condition transitioned from one status to another.

conditions.[].message

string

Human-readable message indicating details about last transition.

conditions.[].reason

string

Unique, one-word, CamelCase reason for the condition's last transition.

conditions.[].status

string

Status is the status of the condition. Can be True, False, Unknown.

conditions.[].type

string

Type is the type of the condition.

config

object

config.endpointConfig

object

config.endpointConfig.httpPorts

map (key: string, value: string)

Output only. The map of port descriptions to URLs. Populated only if `enable_http_port_access` (`config.endpointConfig.enableHttpPortAccess` in the spec) is true.

config.lifecycleConfig

object

config.lifecycleConfig.idleStartTime

string

Output only. The time when the cluster became idle (most recent job finished) and became eligible for deletion due to idleness (see JSON representation of [Timestamp](https://developers.google.com/protocol-buffers/docs/proto3#json)).

config.masterConfig

object

config.masterConfig.instanceNames

list (string)

Output only. The list of instance names. Dataproc derives the names from `cluster_name`, `num_instances`, and the instance group.

config.masterConfig.instanceNames.[]

string

config.masterConfig.isPreemptible

boolean

Output only. Specifies that this instance group contains preemptible instances.

config.masterConfig.managedGroupConfig

object

Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups.

config.masterConfig.managedGroupConfig.instanceGroupManagerName

string

Output only. The name of the Instance Group Manager for this group.

config.masterConfig.managedGroupConfig.instanceTemplateName

string

Output only. The name of the Instance Template used for the Managed Instance Group.

config.secondaryWorkerConfig

object

config.secondaryWorkerConfig.instanceNames

list (string)

Output only. The list of instance names. Dataproc derives the names from `cluster_name`, `num_instances`, and the instance group.

config.secondaryWorkerConfig.instanceNames.[]

string

config.secondaryWorkerConfig.isPreemptible

boolean

Output only. Specifies that this instance group contains preemptible instances.

config.secondaryWorkerConfig.managedGroupConfig

object

Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups.

config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName

string

Output only. The name of the Instance Group Manager for this group.

config.secondaryWorkerConfig.managedGroupConfig.instanceTemplateName

string

Output only. The name of the Instance Template used for the Managed Instance Group.

config.workerConfig

object

config.workerConfig.instanceNames

list (string)

Output only. The list of instance names. Dataproc derives the names from `cluster_name`, `num_instances`, and the instance group.

config.workerConfig.instanceNames.[]

string

config.workerConfig.isPreemptible

boolean

Output only. Specifies that this instance group contains preemptible instances.

config.workerConfig.managedGroupConfig

object

Output only. The config for Compute Engine Instance Group Manager that manages this group. This is only used for preemptible instance groups.

config.workerConfig.managedGroupConfig.instanceGroupManagerName

string

Output only. The name of the Instance Group Manager for this group.

config.workerConfig.managedGroupConfig.instanceTemplateName

string

Output only. The name of the Instance Template used for the Managed Instance Group.

metrics

object

Output only. Contains cluster daemon metrics such as HDFS and YARN stats. **Beta Feature**: This report is available for testing purposes only. It may be changed before final release.

metrics.hdfsMetrics

map (key: string, value: string)

The HDFS metrics.

metrics.yarnMetrics

map (key: string, value: string)

The YARN metrics.

observedGeneration

integer

ObservedGeneration is the generation of the resource that was most recently observed by the Config Connector controller. If this equals `metadata.generation`, the reported status reflects the most recent desired state of the resource.

status

object

Output only. Cluster status.

status.detail

string

Optional. Output only. Details of the cluster's state.

status.state

string

Output only. The cluster's state. Possible values: UNKNOWN, CREATING, RUNNING, ERROR, DELETING, UPDATING, STOPPING, STOPPED, STARTING

status.stateStartTime

string

Output only. Time when this state was entered (see JSON representation of [Timestamp](https://developers.google.com/protocol-buffers/docs/proto3#json)).

status.substate

string

Output only. Additional state information that includes status reported by the agent. Possible values: UNSPECIFIED, UNHEALTHY, STALE_STATUS

statusHistory

list (object)

Output only. The previous cluster status.

statusHistory.[]

object

statusHistory.[].detail

string

Optional. Output only. Details of the cluster's state.

statusHistory.[].state

string

Output only. The cluster's state. Possible values: UNKNOWN, CREATING, RUNNING, ERROR, DELETING, UPDATING, STOPPING, STOPPED, STARTING

statusHistory.[].stateStartTime

string

Output only. Time when this state was entered (see JSON representation of [Timestamp](https://developers.google.com/protocol-buffers/docs/proto3#json)).

statusHistory.[].substate

string

Output only. Additional state information that includes status reported by the agent. Possible values: UNSPECIFIED, UNHEALTHY, STALE_STATUS
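
To show how these status fields fit together, the fragment below sketches what a healthy, reconciled resource might report. Every value is a placeholder rather than output from a real cluster, and the `Ready` condition type with reason `UpToDate` is assumed from common Config Connector behavior rather than stated in this reference.

```yaml
# Illustrative only: placeholder values, not real output.
metadata:
  generation: 2
status:
  # Equals metadata.generation, so the status reflects the latest spec.
  observedGeneration: 2
  conditions:
  - type: Ready                 # assumed condition type
    status: "True"
    reason: UpToDate            # assumed reason
    message: The resource is up to date
    lastTransitionTime: "2024-01-01T00:00:00Z"
  status:
    state: RUNNING
    stateStartTime: "2024-01-01T00:00:00Z"
```

Note the nesting: the outer `status` is the Kubernetes status subresource, while the inner `status` object carries the Dataproc cluster state.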

Sample YAML(s)

Typical Use Case

  # Copyright 2020 Google LLC
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
  # You may obtain a copy of the License at
  #
  #     http://www.apache.org/licenses/LICENSE-2.0
  #
  # Unless required by applicable law or agreed to in writing, software
  # distributed under the License is distributed on an "AS IS" BASIS,
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
  
  apiVersion: dataproc.cnrm.cloud.google.com/v1beta1
  kind: DataprocCluster
  metadata:
    annotations:
      cnrm.cloud.google.com/management-conflict-prevention-policy: "none"
    name: dataproccluster-sample
    labels:
      label-one: "value-one"
  spec:
    location: "us-central1"
    config:
      autoscalingConfig:
        policyRef:
          name: dataproccluster-dep
      stagingBucketRef:
        name: dataproccluster-dep-staging
      masterConfig:
        diskConfig:
          bootDiskSizeGb: 30
          bootDiskType: pd-standard
        machineType: "n2-standard-2"
        numInstances: 1
      workerConfig:
        numInstances: 2
        machineType: "n2-standard-2"
        diskConfig:
          bootDiskSizeGb: 30
          numLocalSsds: 1
      softwareConfig:
        imageVersion: "2.0.2-debian10"
      gceClusterConfig:
        tags:
        - "foo"
        - "bar"
      initializationActions:
      - executableFile: "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
        executionTimeout: "500s"
  ---
  apiVersion: dataproc.cnrm.cloud.google.com/v1beta1
  kind: DataprocAutoscalingPolicy
  metadata:
    name: dataproccluster-dep
  spec:
    location: "us-central1"
    workerConfig:
      maxInstances: 5
    secondaryWorkerConfig:
      maxInstances: 2
    basicAlgorithm:
      yarnConfig:
        gracefulDecommissionTimeout: "30s"
        scaleDownFactor: 0.5
        scaleUpFactor: 0.5
  ---
  apiVersion: storage.cnrm.cloud.google.com/v1beta1
  kind: StorageBucket
  metadata:
    annotations:
      cnrm.cloud.google.com/force-destroy: "true"
    labels:
      label-one: "value-one"
    name: dataproccluster-dep-staging
  spec:
    # StorageBucket names must be globally unique. Replace ${PROJECT_ID?} with your project ID.
    resourceID: ${PROJECT_ID?}-dataproccluster-dep-staging
    bucketPolicyOnly: true