Google Cloud Dataproc clusters are built on Google Compute Engine instances. Machine types define the virtualized hardware resources available to an instance. Compute Engine offers both predefined machine types and custom machine types. Cloud Dataproc clusters can use either predefined or custom machine types for master nodes, worker nodes, or both.
Use cases for custom machine types
As noted in the custom machine type documentation, custom machine types are ideal for the following workloads:
- Workloads that are not a good fit for the predefined machine types.
- Workloads that require more processing power or more memory, but don't need all of the upgrades that are provided by the next machine type level.
As an example, assume that you have a workload that needs more processing power than an n1-standard-4 instance provides, but the next step up, an n1-standard-8 instance, provides too much capacity. With custom machine types, you can create Cloud Dataproc clusters with master and/or worker nodes in the middle range, with 6 virtual CPUs and 25 GB of memory.
Custom machine type pricing varies based on the resources used in a custom machine. Dataproc pricing is added to the cost of compute resources you use, and is based on the total number of virtual CPUs used in a cluster.
Using custom machine types with Cloud Dataproc
At present, creating clusters with custom machine types is supported only through the Google Cloud SDK gcloud dataproc command.
Understand custom machine types first
Before you create a cluster with custom machine types, we recommend you review the custom machine type documentation to understand important considerations, including custom type specifications and pricing.
Custom machine types use a customized machine type name. As an example, the custom machine type name for a custom VM with 6 virtual CPUs and 22.5 GB of memory is custom-6-23040.
The numbers in the machine type name correspond to the number of virtual CPUs in the machine (in this case 6) and the amount of memory in MB (in this case 23040).
The amount of memory is calculated by multiplying the amount of memory in GB by 1024. In this example we multiply 22.5 (GB) by 1024:
22.5 * 1024 = 23040
As long as you stay within the limits on CPU and memory combinations, you can use this method to find the name of the custom machine type you wish to use with your clusters.
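The naming rule above can be sketched as a small shell function. This is only an illustration: custom_machine_type is a hypothetical helper, not part of the Cloud SDK, and the two checks it performs (a vCPU count of 1 or an even number, and memory that is a multiple of 256 MB) are assumptions about the documented limits that you should confirm in the custom machine type documentation.

```shell
#!/bin/sh
# custom_machine_type VCPUS MEMORY_GB
# Hypothetical helper (not part of the Cloud SDK): prints the custom machine
# type name for the given vCPU count and memory size in GB.
custom_machine_type() {
  awk -v c="$1" -v g="$2" 'BEGIN {
    mb = g * 1024    # the name expresses memory in MB: GB * 1024
    # Assumed limits; confirm them in the custom machine type documentation.
    if (c != 1 && c % 2 != 0) {
      print "vCPU count must be 1 or an even number" > "/dev/stderr"
      exit 1
    }
    if (mb % 256 != 0) {
      print "memory must be a multiple of 256 MB" > "/dev/stderr"
      exit 1
    }
    printf "custom-%d-%d\n", c, mb
  }'
}
```

Calling custom_machine_type 6 22.5 prints custom-6-23040, which matches the name worked out by hand above.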
Create a Cloud Dataproc cluster with custom machine types
Once you know the machine type name you wish to use, you can use the gcloud dataproc command to create a cluster with that custom machine type. The gcloud dataproc clusters create command has two options that set the master and/or worker machine type. The --master-machine-type option sets the machine type used by the master node, and the --worker-machine-type option sets the machine type used by worker nodes.
For example, to create a cluster named test-cluster with the custom machine type created above for both the master and worker nodes, you can use the following command:
gcloud dataproc clusters create test-cluster \
    --worker-machine-type custom-6-23040 \
    --master-machine-type custom-6-23040
You can set the machine type for both master and worker nodes together or independently. If you set both, the master node can use a custom machine type that is different from the worker nodes' custom machine type.
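As a sketch of setting the two types independently, the wrapper below passes separate master and worker machine types to the same gcloud dataproc clusters create command. The function name create_custom_cluster and the machine type custom-2-7680 (2 virtual CPUs, 7.5 GB) are hypothetical, chosen only for illustration. The sketch also assumes the Cloud SDK is installed and that any settings your SDK version requires, such as a project or region, are already configured.

```shell
#!/bin/sh
# create_custom_cluster NAME MASTER_TYPE [WORKER_TYPE]
# Hypothetical wrapper: creates a cluster whose master and worker nodes may
# use different custom machine types. If WORKER_TYPE is omitted, the workers
# use the same type as the master.
create_custom_cluster() {
  gcloud dataproc clusters create "$1" \
    --master-machine-type "$2" \
    --worker-machine-type "${3:-$2}"
}

# Example (not run here): a smaller master with larger workers.
# create_custom_cluster test-cluster custom-2-7680 custom-6-23040
```

Defaulting the worker type to the master type keeps the common case, one type for every node, to a single argument.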
Once the Cloud Dataproc cluster starts, cluster details are displayed in the terminal window. This example shows a partial listing of cluster properties in the terminal window:
...
properties:
  distcp:mapreduce.map.java.opts: -Xmx1638m
  distcp:mapreduce.map.memory.mb: '2048'
  distcp:mapreduce.reduce.java.opts: -Xmx4915m
  distcp:mapreduce.reduce.memory.mb: '6144'
  mapred:mapreduce.map.cpu.vcores: '1'
  mapred:mapreduce.map.java.opts: -Xmx1638m
...
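If you want to confirm later which machine types a running cluster uses, one possible approach is to query the cluster with gcloud dataproc clusters describe. The sketch below is not taken from this guide: the function name cluster_machine_types is hypothetical, and the field paths config.masterConfig.machineTypeUri and config.workerConfig.machineTypeUri are assumed from the Dataproc cluster resource, so verify them against the output of a plain describe call.

```shell
#!/bin/sh
# cluster_machine_types NAME
# Hypothetical helper: prints the master and worker machine types recorded
# for a cluster. The field paths are assumptions; check them against the
# unformatted output of "gcloud dataproc clusters describe NAME".
cluster_machine_types() {
  gcloud dataproc clusters describe "$1" \
    --format "value(config.masterConfig.machineTypeUri,config.workerConfig.machineTypeUri)"
}

# Example (not run here):
# cluster_machine_types test-cluster
```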
Use instance templates to explore custom machine type settings
You can use the instance template feature to experiment with memory and CPU combinations and then find the machine type name for that combination. In this process you will go through the steps to create an instance template but will not actually create an instance template. This is entirely optional, as you can manually determine machine type names, as explained above.
Start by opening the Google Cloud Platform Console. From the Compute Engine→Instance groups page, click the Create instance template button.
In the "Create an instance template" form, click the Customize link to customize the resources of the virtual machine.
In the advanced form, you can now choose the number of virtual CPUs and the amount of memory dedicated to the instance.
Once you have adjusted the machine settings to your preferences, you can find the machine type name by clicking on the REST link at the bottom of the page.
A window will open showing you the REST code for programmatically creating this instance template. The machine type name appears in this code.
You can use this machine type name for your Cloud Dataproc clusters. Click on the Close button to close the window and then click on the Cancel button to leave the "Create an instance template" form.
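To sanity-check a name copied from the REST view, you can decode it back into a vCPU count and a memory size by reversing the naming rule (memory in MB divided by 1024). The helper below, describe_custom_type, is a hypothetical sketch and not part of the Cloud SDK.

```shell
#!/bin/sh
# describe_custom_type NAME
# Hypothetical helper: splits a name such as custom-6-23040 into its vCPU
# count and memory size, converting MB back to GB (MB / 1024).
describe_custom_type() {
  printf '%s\n' "$1" | awk -F- '
    /^custom-[0-9]+-[0-9]+$/ {
      printf "%d vCPUs, %g GB\n", $2, $3 / 1024
      ok = 1
    }
    END {
      if (!ok) {
        print "not a custom machine type name" > "/dev/stderr"
        exit 1
      }
    }'
}
```

For custom-6-23040 this prints "6 vCPUs, 22.5 GB", the combination used in the example cluster.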
For more information
For more information about custom machine types, take a look at the custom machine type documentation.