How to create a Dataproc cluster
Requirements:
Name: The cluster name must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as `us-east1` or `europe-west1`, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within the region.
- See Regional endpoints for more information on regional endpoints.
- See Available regions & zones for information on selecting a region. You can also run the `gcloud compute regions list` command to display a listing of available regions.
Connectivity: Compute Engine Virtual Machine instances (VMs) in a Dataproc cluster, consisting of master and worker VMs, require full internal IP networking cross connectivity. The `default` VPC network provides this connectivity (see Dataproc Cluster Network Configuration).
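Before calling any commands, you can sanity-check a candidate cluster name locally against the naming rule above. This is a sketch, not an official validator; the `is_valid_cluster_name` helper is hypothetical and only mirrors the documented pattern (a lowercase letter, then up to 51 lowercase letters, digits, or hyphens, with no trailing hyphen):

```shell
# Hypothetical helper that mirrors the documented cluster-name rule.
is_valid_cluster_name() {
  # One lowercase letter, optionally followed by up to 51 more characters
  # (lowercase letters, digits, hyphens) where the last one is not a hyphen.
  printf '%s\n' "$1" | grep -Eq '^[a-z]([a-z0-9-]{0,50}[a-z0-9])?$'
}

is_valid_cluster_name "my-cluster-1" && echo "ok"        # accepted
is_valid_cluster_name "Ends-With-"   || echo "rejected"  # uppercase and trailing hyphen
```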
gcloud
To create a Dataproc cluster on the command line, run the `gcloud dataproc clusters create` command locally in a terminal window or in Cloud Shell.

```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION
```

The command creates a cluster with default Dataproc service settings for your master and worker virtual machine instances, disk sizes and types, network type, region and zone where your cluster is deployed, and other cluster settings. See the `gcloud dataproc clusters create` command for information on using command-line flags to customize cluster settings.
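For example, a customized create command might look like the following sketch. The cluster name, region, and flag values here are illustrative, not recommendations:

```shell
# Sketch: override a few of the default settings at create time.
# Values are illustrative; adjust them for your workload.
gcloud dataproc clusters create my-cluster \
    --region=us-east1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2
```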
Create a cluster with a YAML file
- Run the following `gcloud` command to export the configuration of an existing Dataproc cluster into a cluster.yaml file.

  ```
  gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
      --region=REGION \
      --destination=cluster.yaml
  ```
- Create a new cluster by importing the YAML file configuration.

  ```
  gcloud dataproc clusters import NEW_CLUSTER_NAME \
      --region=REGION \
      --source=cluster.yaml
  ```
Note: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are disallowed in the imported YAML file used to create a cluster.
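An exported file therefore imports cleanly, but if you assemble a configuration file by hand you must leave the disallowed fields out yourself. A minimal sketch follows; the YAML keys shown are assumptions about the exported format, not a complete cluster config:

```shell
# Sketch: a hand-written config that wrongly includes a cluster-specific
# field (clusterName). The keys are illustrative assumptions.
cat > cluster.yaml <<'EOF'
clusterName: old-cluster
config:
  workerConfig:
    numInstances: 2
EOF

# Strip the disallowed field before using the file with
# `gcloud dataproc clusters import`.
grep -v '^clusterName:' cluster.yaml > cluster-import.yaml
```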
REST
This section shows how to create a cluster with required values and the default configuration (1 master, 2 workers).
Before using any of the request data, make the following replacements:
- CLUSTER_NAME: cluster name
- PROJECT: Google Cloud project ID
- REGION: An available Compute Engine region where the cluster will be created.
- ZONE: An optional zone within the selected region where the cluster will be created.
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters
Request JSON body:
```
{
  "project_id": "PROJECT",
  "cluster_name": "CLUSTER_NAME",
  "config": {
    "master_config": {
      "num_instances": 1,
      "machine_type_uri": "n1-standard-2",
      "image_uri": ""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config": {
      "num_instances": 2,
      "machine_type_uri": "n1-standard-2",
      "image_uri": ""
    },
    "gce_cluster_config": {
      "zone_uri": "ZONE"
    }
  }
}
```
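One way to send the request is with curl. The sketch below assumes the request body is saved in a local request.json file and that the gcloud CLI is installed and authenticated (it supplies the access token):

```shell
# Sketch: POST the request body to the Dataproc clusters.create endpoint.
# Replace PROJECT and REGION with your values; request.json holds the body.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"
```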
You should receive a JSON response similar to the following:
```
{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
```
Console
Open the Dataproc Create a cluster page in the Google Cloud console in your browser, then click Create in the Cluster on Compute Engine row on the Create a Dataproc cluster on Compute Engine page. The Set up cluster panel is selected, with fields filled in with default values. You can select each panel and confirm or change the default values to customize your cluster.
Click Create to create the cluster. The cluster name appears on the Clusters page, and its status updates to Running after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine jobs, instances, and configuration settings for your cluster and connect to web interfaces running on it.
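Whichever method you used, you can also watch the status from the command line. The following is a sketch (it assumes the gcloud CLI is installed and authenticated, and that CLUSTER_NAME and REGION are replaced with your values):

```shell
# Sketch: poll the cluster status until it reaches RUNNING.
while true; do
  STATE=$(gcloud dataproc clusters describe CLUSTER_NAME \
      --region=REGION \
      --format="value(status.state)")
  echo "Cluster state: ${STATE}"
  [ "${STATE}" = "RUNNING" ] && break
  sleep 10
done
```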