To mitigate the unavailability of user-specified VMs in specific regions at specific times, Dataproc lets you request the creation of a partial cluster by specifying a minimum number of primary workers that is acceptable for cluster creation to succeed.
| Standard cluster | Partial cluster |
|---|---|
| If one or more primary workers cannot be created and initialized, cluster creation fails. Workers that were created continue to run and incur charges until deleted by the user. | If the specified minimum number of workers can be created, the cluster is created. Failed (uninitialized) workers are deleted and do not incur charges. If the specified minimum number of workers cannot be created and initialized, the cluster is not created. Workers that were created are not deleted, to allow for debugging. |
| Cluster creation time is optimized. | Cluster creation can take longer, since all nodes must report provisioning status. |
| Single-node clusters are available for creation. | Single-node clusters are not available for creation. |
Use autoscaling with partial cluster creation to help ensure that the target (full) number of primary workers is created. Autoscaling will try to acquire failed workers in the background if the workload requires them.
The following is a sample autoscaling policy that retries until the total number of primary worker instances reaches a target size of 10. Setting minInstances and maxInstances in the policy to match the minimum and total number of primary workers specified at cluster creation time (see How to create a partial cluster) keeps the policy aligned with the cluster. Setting scaleDownFactor to 0 prevents the cluster from scaling down from 10 to 8, and helps keep the number of workers at the 10-worker maximum.
```
workerConfig:
  minInstances: 8
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1
    scaleDownFactor: 0
    gracefulDecommissionTimeout: 1h
```
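A policy like this can be registered with the gcloud CLI and attached at cluster creation time. In the sketch below, the policy ID and the local file name (`partial-cluster-policy.yaml`) are illustrative, not values from this page:

```shell
# Import the autoscaling policy from a local YAML file (file name is an example).
gcloud dataproc autoscaling-policies import partial-cluster-policy \
    --source=partial-cluster-policy.yaml \
    --region=REGION

# Attach the policy when creating the partial cluster, matching the
# policy's minInstances/maxInstances to --min-num-workers/--num-workers.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --autoscaling-policy=partial-cluster-policy \
    --num-workers=10 \
    --min-num-workers=8
```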
How to create a partial cluster
You can use the Google Cloud CLI or the Dataproc API to create a Dataproc partial cluster.
```
gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT \
    --region=REGION \
    --num-workers=NUM_WORKERS \
    --min-num-workers=MIN_NUM_WORKERS \
    other args ...
```
- CLUSTER_NAME: The cluster name must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
- PROJECT: Specify the project associated with the job cluster.
- REGION: Specify the Compute Engine region where the job cluster will be located.
- NUM_WORKERS: The total number of primary workers in the cluster to create if available.
- MIN_NUM_WORKERS: The minimum number of primary workers to create if the specified total number of workers (NUM_WORKERS) cannot be created. Cluster creation fails if this minimum number of primary workers cannot be created (workers that were created are not deleted, to allow for debugging). If this flag is omitted, standard cluster creation with the total number of primary workers (NUM_WORKERS) is attempted.
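As a concrete illustration of the flags above (the cluster name, project, region, and worker counts here are example values), the following requests a 10-worker cluster that is still created if at least 8 primary workers can be provisioned:

```shell
# Request 10 primary workers, but accept the cluster if 8 or more come up.
gcloud dataproc clusters create my-partial-cluster \
    --project=my-project \
    --region=us-central1 \
    --num-workers=10 \
    --min-num-workers=8
```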
Display the number of provisioned workers
After creating a cluster, you can run the following gcloud CLI command to list the number of workers, including any secondary workers, provisioned in your cluster.
```
gcloud dataproc clusters list \
    --project=PROJECT \
    --region=REGION \
    --filter=clusterName=CLUSTER_NAME
```
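If you only need the primary-worker count, a `describe` call with a `--format` projection can print it directly. The field path below is a sketch based on the Dataproc cluster resource schema; verify it against your gcloud version:

```shell
# Print just the number of provisioned primary workers.
gcloud dataproc clusters describe CLUSTER_NAME \
    --project=PROJECT \
    --region=REGION \
    --format="value(config.workerConfig.numInstances)"
```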