Create a cluster

You can create a Dataproc cluster using the Cloud SDK gcloud command-line tool, the Dataproc API, or the Google Cloud Console. You can also create clusters programmatically using Cloud Client Libraries.

Cluster Name: The cluster must start with a lowercase letter followed by up to 54 lowercase letters, numbers, or hyphens, and cannot end with a hyphen.

Cluster Region: You can specify a global region or a specific region for your cluster. The global region is a special multi-region endpoint that is capable of deploying instances into any user-specified Compute Engine zone. You can also specify distinct regions, such as us-east1 or europe-west1, to isolate resources (including VM instances and Cloud Storage) and metadata storage locations utilized by Dataproc within the user-specified region. See Regional endpoints to learn more about the difference between global and regional endpoints. See Available regions & zones for information on selecting a region. You can also run the gcloud compute regions list command to see a listing of available regions.

Compute Engine Virtual Machine instances (VMs) in a Dataproc cluster, consisting of master and worker VMs, require full internal IP networking access to each other. The default network, which is available to create a cluster, helps ensure this access. For information on creating your own network for your Dataproc cluster, see Dataproc Cluster Network Configuration.

Creating a Dataproc cluster

gcloud

To create a Dataproc cluster on the command line, run the Cloud SDK gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create cluster-name \
    --region=region

The above command creates a cluster with default Dataproc service settings for your master and worker virtual machine instances, disk sizes and types, network type, region and zone where your cluster is deployed, and other cluster settings. See the gcloud dataproc clusters create command for information on using command line flags to customize cluster settings.

Create a cluster with a YAML file

  1. Run the following gcloud command to export the configuration of an existing Dataproc cluster into a YAML file.
    gcloud dataproc clusters export my-existing-cluster --destination cluster.yaml
    
  2. Create a new cluster by importing the YAML file configuration.
    gcloud dataproc clusters import my-new-cluster --source cluster.yaml
    

Note: During the export operation, cluster-specific fields, such as cluster name, output-only fields, and automatically applied labels are filtered. These fields are disallowed in the imported YAML file used to create a cluster.

REST & CMD LINE

This section shows how to create a cluster with required values and the default configuration (1 master, 2 workers).

Before using any of the request data below, make the following replacements:

  • project-id: GCP project ID
  • region: cluster region
  • clusterName: cluster name

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

Request JSON body:

{
  "clusterName": "cluster-name",
  "config": {}
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:

{
"name": "projects/project-id/regions/region/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "cluster-name",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ...""
    ]
  }
}

Console

Open the Dataproc Create a cluster page in the Cloud Console in your browser. The Set up cluster panel is selected with fields filled in with default values. You can select each panel and confirm or change default values to customize you cluster.

Click CREATE to create the cluster. The cluster name appears in the Clusters page, and its status is updated to Running after the cluster is provisioned. Click the cluster name to open the cluster details page where you can examine jobs, instances, and configuration settings for your cluster and connect to web interfaces running on your cluster.

Go

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    import (
    	"context"
    	"fmt"
    	"io"
    
    	dataproc "cloud.google.com/go/dataproc/apiv1"
    	"google.golang.org/api/option"
    	dataprocpb "google.golang.org/genproto/googleapis/cloud/dataproc/v1"
    )
    
    func createCluster(w io.Writer, projectID, region, clusterName string) error {
    	// projectID := "your-project-id"
    	// region := "us-central1"
    	// clusterName := "your-cluster"
    	ctx := context.Background()
    
    	// Create the cluster client.
    	endpoint := region + "-dataproc.googleapis.com:443"
    	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
    	if err != nil {
    		return fmt.Errorf("dataproc.NewClusterControllerClient: %v", err)
    	}
    
    	// Create the cluster config.
    	req := &dataprocpb.CreateClusterRequest{
    		ProjectId: projectID,
    		Region:    region,
    		Cluster: &dataprocpb.Cluster{
    			ProjectId:   projectID,
    			ClusterName: clusterName,
    			Config: &dataprocpb.ClusterConfig{
    				MasterConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   1,
    					MachineTypeUri: "n1-standard-1",
    				},
    				WorkerConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   2,
    					MachineTypeUri: "n1-standard-1",
    				},
    			},
    		},
    	}
    
    	// Create the cluster.
    	op, err := clusterClient.CreateCluster(ctx, req)
    	if err != nil {
    		return fmt.Errorf("CreateCluster: %v", err)
    	}
    
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return fmt.Errorf("CreateCluster.Wait: %v", err)
    	}
    
    	// Output a success message.
    	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
    	return nil
    }
    

Java

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    import com.google.api.gax.longrunning.OperationFuture;
    import com.google.cloud.dataproc.v1.Cluster;
    import com.google.cloud.dataproc.v1.ClusterConfig;
    import com.google.cloud.dataproc.v1.ClusterControllerClient;
    import com.google.cloud.dataproc.v1.ClusterControllerSettings;
    import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
    import com.google.cloud.dataproc.v1.InstanceGroupConfig;
    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    
    public class CreateCluster {
    
      public static void createCluster() throws IOException, InterruptedException {
        // TODO(developer): Replace these variables before running the sample.
        String projectId = "your-project-id";
        String region = "your-project-region";
        String clusterName = "your-cluster-name";
        createCluster(projectId, region, clusterName);
      }
    
      public static void createCluster(String projectId, String region, String clusterName)
          throws IOException, InterruptedException {
        String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);
    
        // Configure the settings for the cluster controller client.
        ClusterControllerSettings clusterControllerSettings =
            ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();
    
        // Create a cluster controller client with the configured settings. The client only needs to be
        // created once and can be reused for multiple requests. Using a try-with-resources
        // closes the client, but this can also be done manually with the .close() method.
        try (ClusterControllerClient clusterControllerClient =
            ClusterControllerClient.create(clusterControllerSettings)) {
          // Configure the settings for our cluster.
          InstanceGroupConfig masterConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-1")
                  .setNumInstances(1)
                  .build();
          InstanceGroupConfig workerConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-1")
                  .setNumInstances(2)
                  .build();
          ClusterConfig clusterConfig =
              ClusterConfig.newBuilder()
                  .setMasterConfig(masterConfig)
                  .setWorkerConfig(workerConfig)
                  .build();
          // Create the cluster object with the desired cluster config.
          Cluster cluster =
              Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();
    
          // Create the Cloud Dataproc cluster.
          OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
              clusterControllerClient.createClusterAsync(projectId, region, cluster);
          Cluster response = createClusterAsyncRequest.get();
    
          // Print out a success message.
          System.out.printf("Cluster created successfully: %s", response.getClusterName());
    
        } catch (ExecutionException e) {
          System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
        }
      }
    }

Node.js

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// projectId = 'YOUR_PROJECT_ID'
// region = 'YOUR_CLUSTER_REGION'
// clusterName = 'YOUR_CLUSTER_NAME'

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-1',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-1',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);

Python

  1. Install the client library
  2. Set up application default credentials
  3. Run the code
    from google.cloud import dataproc_v1 as dataproc
    
    
    def create_cluster(project_id, region, cluster_name):
        """This sample walks a user through creating a Cloud Dataproc cluster
           using the Python client library.
    
           Args:
               project_id (string): Project to use for creating resources.
               region (string): Region where the resources should live.
               cluster_name (string): Name to use for creating a cluster.
        """
    
        # Create a client with the endpoint set to the desired cluster region.
        cluster_client = dataproc.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
    
        # Create the cluster config.
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
            },
        }
    
        # Create the cluster.
        operation = cluster_client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        result = operation.result()
    
        # Output a success message.
        print(f"Cluster created successfully: {result.cluster_name}")