Dataproc clusters are built on Compute Engine instances. Machine types define the virtualized hardware resources available to an instance. Compute Engine offers both predefined machine types and custom machine types. Dataproc clusters can use either predefined or custom machine types for master and worker nodes.
Dataproc clusters support the following Compute Engine predefined machine types (machine type availability varies by region):
- General purpose machine types, which include N1, N2, N2D, E2, C3, C4, and N4 machine types (Dataproc also supports N1, N2, N2D, E2, C3, C4, and N4 custom machine types).
  Limitations:
  - The n1-standard-1 machine type is not supported for 2.0+ images (it is also not recommended for pre-2.0 images; instead, use a machine type with more memory).
  - Shared-core machine types are not supported. This includes the following machine types:
    - E2: e2-micro, e2-small, and e2-medium
    - N1: f1-micro and g1-small
  - Dataproc selects hyperdisk-balanced as the boot-disk type if the machine type is C3, C4, or N4.
- Compute-optimized machine types, which include C2 and C2D machine types.
- Memory-optimized machine types, which include M1 and M2 machine types.
- ARM machine types, which include T2A machine types.
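Machine type availability varies by region and zone. If you want to confirm what is offered before creating a cluster, you can list the machine types in a target zone with the Compute Engine CLI; the zone and series filter below are example values, not requirements:
# List the machine types offered in an example zone (replace the zone as needed).
gcloud compute machine-types list --zones=us-central1-a
# Optionally narrow the listing to a series, for example N2.
gcloud compute machine-types list --zones=us-central1-a --filter="name~'n2-'"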
Custom machine types
Dataproc supports N1 series custom machine types.
Custom machine types are ideal for the following workloads:
- Workloads that are not a good fit for the predefined machine types.
- Workloads that require more processing power or more memory, but don't need all of the upgrades that are provided by the next machine type level. For example, you might have a workload that needs more processing power than an n1-standard-4 instance provides, while the next step up, an n1-standard-8 instance, provides more capacity than you need. With custom machine types, you can create Dataproc clusters with master and/or worker nodes in the middle of that range, such as 6 virtual CPUs and 25 GB of memory.
Specify a custom machine type
Custom machine types use a special machine type specification and are subject to limitations. For example, the custom machine type specification for a custom VM with 6 virtual CPUs and 22.5 GB of memory is custom-6-23040.
The numbers in the machine type specification correspond to the number of virtual CPUs (vCPUs) in the machine (6) and the amount of memory (23040).
The amount of memory is calculated by multiplying the amount of memory in gigabytes by 1024 (see Expressing memory in GB or MB). In this example, 22.5 (GB) is multiplied by 1024: 22.5 * 1024 = 23040.
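As a quick sketch of that arithmetic, the following shell snippet builds the custom machine type string from a vCPU count and memory in GB; the variable names are illustrative and are not part of any Dataproc tooling:
# Build a custom machine type string: custom-<vCPUs>-<memory in MB>.
# Memory in MB is memory in GB multiplied by 1024.
VCPUS=6
MEMORY_GB=22.5
MEMORY_MB=$(awk "BEGIN {print int(${MEMORY_GB} * 1024)}")   # 22.5 * 1024 = 23040
echo "custom-${VCPUS}-${MEMORY_MB}"                         # custom-6-23040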
You use the above syntax to specify the custom machine type for your clusters. You can set the machine type for either master or worker nodes or both when you create a cluster. If you set both, the master node can use a custom machine type that is different from the custom machine type used by workers. The machine type used by any secondary workers follows the settings for primary workers and cannot be set separately (see Secondary workers - preemptible and non-preemptible VMs).
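For illustration, the following gcloud sketch shows this behavior; the cluster name, region, and machine type are example values. The --num-secondary-workers flag adds secondary workers, and they reuse the machine type given by --worker-machine-type because there is no separate machine-type flag for them.
# Example only: primary and secondary workers share --worker-machine-type.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --master-machine-type=custom-6-23040 \
    --worker-machine-type=custom-6-23040 \
    --num-workers=2 \
    --num-secondary-workers=2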
Custom machine type pricing
Custom machine type pricing is based on the resources used in a custom machine. Dataproc pricing is added to the cost of compute resources, and is based on the total number of virtual CPUs (vCPUs) used in a cluster.
Create a Dataproc cluster with a specified machine type
Console
From the Configure nodes panel of the Dataproc Create a cluster page in the Google Cloud console, select the machine family, series, and type for the cluster's master and worker nodes.
gcloud command
Run the gcloud dataproc clusters create command with the following flags to create a Dataproc cluster with master and/or worker machine types:
- The --master-machine-type machine-type flag allows you to set the predefined or custom machine type used by the master VM instance in your cluster (or master instances if you create an HA cluster).
- The --worker-machine-type machine-type flag allows you to set the predefined or custom machine type used by the worker VM instances in your cluster.
Example:
gcloud dataproc clusters create test-cluster \
    --master-machine-type custom-6-23040 \
    --worker-machine-type custom-6-23040 \
    other args
...
properties:
  distcp:mapreduce.map.java.opts: -Xmx1638m
  distcp:mapreduce.map.memory.mb: '2048'
  distcp:mapreduce.reduce.java.opts: -Xmx4915m
  distcp:mapreduce.reduce.memory.mb: '6144'
  mapred:mapreduce.map.cpu.vcores: '1'
  mapred:mapreduce.map.java.opts: -Xmx1638m
...
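To confirm the machine types used by an existing cluster, you can describe it; the cluster name and region below are placeholders:
# Inspect the created cluster; machineTypeUri appears under masterConfig and workerConfig.
gcloud dataproc clusters describe test-cluster \
    --region=us-central1 \
    --format="yaml(config.masterConfig.machineTypeUri, config.workerConfig.machineTypeUri)"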
REST API
To create a cluster with a specified machine type, set the machineTypeUri field in the masterConfig and/or workerConfig InstanceGroupConfig in the cluster.create API request.
Example:
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "test-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-a"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n1-highmem-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n1-highmem-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    }
  }
}
Create a Dataproc cluster with custom machine type with extended memory
Dataproc supports custom machine types with extended memory beyond the 6.5 GB per vCPU limit (see Extended Memory Pricing).
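As a sketch of when the -ext suffix is needed, compare the requested memory per vCPU against the 6.5 GB limit; the values below are illustrative:
# Illustrative check: extended memory is needed when memory per vCPU exceeds 6.5 GB.
VCPUS=1
MEMORY_GB=50
awk "BEGIN {print ${MEMORY_GB} / ${VCPUS}}"          # 50 GB per vCPU, which exceeds 6.5 GB
echo "custom-${VCPUS}-$(( MEMORY_GB * 1024 ))-ext"   # custom-1-51200-ext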
Console
Click Extend memory when customizing Machine type memory in the Master node and/or Worker nodes section from the Configure nodes panel on the Dataproc Create a cluster page in the Google Cloud console.
gcloud command
To create a cluster from the gcloud command line with a custom machine type with extended memory, add a -ext suffix to the --master-machine-type and/or --worker-machine-type flags.
Example
The following gcloud command-line sample creates a Dataproc cluster with 1 vCPU and 50 GB of memory (50 * 1024 = 51200) in each node:
gcloud dataproc clusters create test-cluster \
    --master-machine-type custom-1-51200-ext \
    --worker-machine-type custom-1-51200-ext \
    other args
API
The following sample clusters.create request snippet specifies extended-memory custom machine types:
... "masterConfig": { "numInstances": 1, "machineTypeUri": "custom-1-51200-ext", ... }, "workerConfig": { "numInstances": 2, "machineTypeUri": "custom-1-51200-ext", ... ...
ARM machine types
Dataproc supports creating a cluster with nodes that use ARM machine types, such as the T2A machine type.
Requirements and limitations:
- The Dataproc image must be compatible with the ARM chipset (currently, only the Dataproc 2.1-ubuntu20-arm image is ARM compatible). Note that this image does not support many optional components and initialization actions (see 2.1.x release versions).
- Since one image must be specified for a cluster, the master, worker, and secondary-worker nodes must use an ARM machine type that is compatible with the selected Dataproc ARM image.
- Dataproc features that are not compatible with ARM machine types are not available (for example, local SSDs are not supported by T2A machine types).
Create a Dataproc cluster with an ARM machine type
Console
Currently, the Google Cloud console does not support the creation of a Dataproc ARM machine type cluster.
gcloud
To create a Dataproc cluster that uses the ARM t2a-standard-4 machine type, run the following gcloud command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create cluster-name \
    --region=REGION \
    --image-version=2.1-ubuntu20-arm \
    --master-machine-type=t2a-standard-4 \
    --worker-machine-type=t2a-standard-4
Notes:
REGION: The region where the cluster will be located.
ARM images are available starting with 2.1.18-ubuntu20-arm.
See the gcloud dataproc clusters create reference documentation for information on additional command-line flags you can use to customize your cluster.
*-arm images support only the installed components and the following optional components listed on the 2.1.x release versions page (the remaining 2.1 optional components and all initialization actions listed on that page are unsupported); an example of enabling supported optional components follows this list:
- Apache Hive WebHCat
- Docker
- Zookeeper (installed in HA clusters; optional component in non-HA clusters)
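For illustration, the following gcloud sketch enables two of the supported optional components on an ARM cluster; the cluster name, region, and component choices are example values, not requirements:
# Example sketch: ARM cluster with the Docker and Zookeeper optional components.
gcloud dataproc clusters create arm-cluster \
    --region=us-central1 \
    --image-version=2.1-ubuntu20-arm \
    --master-machine-type=t2a-standard-4 \
    --worker-machine-type=t2a-standard-4 \
    --optional-components=DOCKER,ZOOKEEPER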
API
The following sample Dataproc REST API clusters.create request creates an ARM machine type cluster.
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "sample-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "default",
      "zoneUri": "us-central1-a"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "t2a-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500
      }
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "t2a-standard-4",
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "softwareConfig": {
      "imageVersion": "2.1-ubuntu20-arm"
    }
  }
}
For more information
- See Creating a VM Instance with a Custom Machine Type.
- See Creating and starting an Arm VM instance.