Scale Dataproc on GKE clusters

To scale a Dataproc on GKE cluster, update the autoscaler configuration of the node pool(s) associated with the Spark driver or Spark executor roles. You specify Dataproc on GKE node pools and their associated roles when you create a Dataproc on GKE cluster.
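
For reference, the pool-to-role mapping is established at creation time with the --pools flag of gcloud dataproc clusters gke create. The following is a minimal sketch assuming the --pools key syntax shown in the Dataproc on GKE cluster-creation documentation; the cluster, bucket, and pool names, machine types, and engine version are hypothetical:

gcloud dataproc clusters gke create my-dp-cluster \
    --region=us-central1 \
    --gke-cluster=my-gke-cluster \
    --spark-engine-version=latest \
    --staging-bucket=my-bucket \
    --setup-workload-identity \
    --pools="name=dp-default-pool,roles=default,machineType=e2-standard-4,min=1,max=3" \
    --pools="name=dp-driver-pool,roles=spark-driver,machineType=e2-standard-4,min=1,max=3" \
    --pools="name=dp-executor-pool,roles=spark-executor,machineType=e2-standard-8,min=1,max=10"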

Set node pool autoscaling

You can set the bounds for Dataproc on GKE node pool autoscaling when you create a Dataproc on GKE virtual cluster. If you don't set them, the node pools are autoscaled with default values (at the Dataproc on GKE GA release, minimum = 1 and maximum = 10; these defaults are subject to change). To guarantee specific minimum and maximum autoscaling values, set them when you create the virtual cluster.
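
To check the bounds currently in effect for a pool, you can read them back with the standard GKE describe command; the pool and cluster names here are placeholders:

gcloud container node-pools describe dp-executor-pool \
    --cluster=my-gke-cluster \
    --region=us-central1 \
    --format="value(autoscaling.minNodeCount,autoscaling.maxNodeCount)"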

Update node pool autoscaling

To change the autoscaling configuration of a Dataproc on GKE node pool, use the GKE gcloud container node-pools update command:

gcloud container node-pools update NODE_POOL_NAME \
    --cluster=GKE_CLUSTER_NAME \
    --region=REGION \
    --enable-autoscaling \
    --min-nodes=MIN_NODES \
    --max-nodes=MAX_NODES

Set MIN_NODES to a value less than or equal to MAX_NODES.
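
For example, to let a hypothetical executor node pool scale between 0 and 20 nodes:

gcloud container node-pools update dp-executor-pool \
    --cluster=my-gke-cluster \
    --region=us-central1 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=20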

How Spark autoscaling works

  1. When a job is submitted, the driver pod is scheduled to run on the node pool associated with the Spark driver role.
  2. The driver pod requests executor pods from the Kubernetes API server.
  3. Executor pods are scheduled on the node pool associated with the Spark executor role.
  4. If the node pools have capacity for the pods, the pods start running immediately. If capacity is insufficient, the GKE cluster autoscaler scales up the node pool to provide the requested resources, up to the user-specified maximum, as illustrated by the example job submission below. When node pools have excess capacity, the cluster autoscaler scales them down, but not below the user-specified minimum.
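
One way to observe the scale-up described above is to submit a job that requests more executors than the executor node pool can currently hold. This sketch uses the SparkPi example from the Dataproc on GKE quickstart; the cluster name and property value are hypothetical, and the jar path may differ across Spark engine versions:

gcloud dataproc jobs submit spark \
    --cluster=my-dp-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties=spark.executor.instances=10 \
    -- 1000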