This document describes how to create a Dataproc zero-scale cluster.
Dataproc zero-scale clusters provide a cost-effective way to use Dataproc clusters. Unlike standard Dataproc clusters that require at least two primary workers, Dataproc zero-scale clusters use only secondary workers that can be scaled down to zero.
Dataproc zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.
Characteristics and limitations
A Dataproc zero-scale cluster shares similarities with a standard cluster, but has the following unique characteristics and limitations:
- Requires image version 2.2.53 or later.
- Supports only secondary workers, not primary workers.
- Includes services such as YARN, but doesn't support the HDFS file system.
  - To use Cloud Storage as the default file system, set the core:fs.defaultFS cluster property to a Cloud Storage bucket location (gs://BUCKET_NAME).
  - If you disable a component during cluster creation, also disable HDFS.
- Can't be converted to or from a standard cluster. 
- Requires an autoscaling policy for ZERO_SCALE cluster types.
- Requires selecting flexible VMs as the machine type.
- Doesn't support the Oozie component. 
- Can't be created from the Google Cloud console. 
Optional: Configure an autoscaling policy
You can configure an autoscaling policy to define secondary worker scaling for a zero-scale cluster. When doing so, note the following:
- Set the cluster type to ZERO_SCALE.
- Apply the autoscaling policy to the secondary worker configuration only.
For more information, see Create an autoscaling policy.
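The following is a minimal sketch of what such a policy file might look like, assuming the YAML format accepted by the gcloud dataproc autoscaling-policies import command. The clusterType field, the policy file name, and the instance counts and timings are illustrative assumptions based on the requirements above; check the autoscaling policy reference for the exact schema.

# zero-scale-policy.yaml (illustrative sketch, not a verified schema)
clusterType: ZERO_SCALE        # assumption: policy cluster type set to ZERO_SCALE
secondaryWorkerConfig:         # secondary worker config only; no primary workerConfig
  minInstances: 0              # allows the cluster to scale down to zero workers
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 30m

You can then import the file as a policy in your project:

gcloud dataproc autoscaling-policies import AUTOSCALING_POLICY \
    --region=REGION \
    --source=zero-scale-policy.yaml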
Create a Dataproc zero-scale cluster
Create a zero-scale cluster using the gcloud CLI or the Dataproc API.
gcloud
Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-type=zero-scale \
    --autoscaling-policy=AUTOSCALING_POLICY \
    --properties=core:fs.defaultFS=gs://BUCKET_NAME \
    --secondary-worker-machine-types="type=MACHINE_TYPE1[,type=MACHINE_TYPE2...][,rank=RANK]" \
    ...other args
Replace the following:
- CLUSTER_NAME: name of the Dataproc zero-scale cluster.
- REGION: an available Compute Engine region.
- AUTOSCALING_POLICY: the ID or resource URI of the autoscaling policy.
- BUCKET_NAME: name of your Cloud Storage bucket.
- MACHINE_TYPE: a specific Compute Engine machine type, such as n1-standard-4 or e2-standard-8.
- RANK: defines the priority of a list of machine types.
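For example, a filled-in version of the command might look like the following. The cluster name, region, policy ID, bucket, machine types, and image version are placeholder values for illustration only:

gcloud dataproc clusters create my-zero-scale-cluster \
    --region=us-central1 \
    --cluster-type=zero-scale \
    --autoscaling-policy=zero-scale-policy \
    --properties=core:fs.defaultFS=gs://my-dataproc-bucket \
    --secondary-worker-machine-types="type=n1-standard-4,type=e2-standard-8,rank=0" \
    --image-version=2.2.53-debian12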
REST
Create a zero-scale cluster using a Dataproc REST API clusters.create request:
- Set ClusterConfig.ClusterType for the secondaryWorkerConfig to ZERO_SCALE.
- Set AutoscalingConfig.policyUri with the ZERO_SCALE autoscaling policy ID.
- Add the core:fs.defaultFS property with the value gs://BUCKET_NAME to SoftwareConfig.properties. Replace BUCKET_NAME with the name of your Cloud Storage bucket.
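As an illustration, a clusters.create request body might look like the following sketch. The placement of the clusterType field, the instanceFlexibilityPolicy block for flexible VM machine types, and the sample values (machine types, image version, policy URI) are assumptions based on the steps above; verify them against the Dataproc API reference before use.

{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "clusterType": "ZERO_SCALE",
    "autoscalingConfig": {
      "policyUri": "projects/PROJECT_ID/regions/REGION/autoscalingPolicies/AUTOSCALING_POLICY"
    },
    "secondaryWorkerConfig": {
      "instanceFlexibilityPolicy": {
        "instanceSelectionList": [
          {
            "machineTypes": ["n1-standard-4", "e2-standard-8"],
            "rank": 0
          }
        ]
      }
    },
    "softwareConfig": {
      "imageVersion": "2.2.53-debian12",
      "properties": {
        "core:fs.defaultFS": "gs://BUCKET_NAME"
      }
    }
  }
}

If you save the body as request.json, you can send it with a command such as the following:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"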
What's next
- Learn more about Dataproc autoscaling.