Create a cluster

How to create a Dataproc cluster

Requirements:

  • Name: The cluster name must begin with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen (see the check sketched after this list).

  • Cluster region: You must specify a Compute Engine region for the cluster (for example, us-east1 or europe-west1) to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within that region.

    • For more information, see Regional endpoints.
    • For help selecting a region, see Available regions and zones. You can also run the gcloud compute regions list command to display a listing of available regions.
  • Connectivity: Compute Engine virtual machine (VM) instances in a Dataproc cluster, consisting of master and worker VMs, require full internal IP network cross connectivity. The default VPC network provides this connectivity (see Dataproc cluster network configuration).
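
The name rule above can be expressed as a regular expression. The following is a minimal bash sketch; the pattern and the sample value are illustrative only, not an official validator:

CLUSTER_NAME="my-cluster"   # illustrative value
NAME_PATTERN='^[a-z]([a-z0-9-]{0,50}[a-z0-9])?$'   # lowercase start, at most 51 more characters, no trailing hyphen
if [[ "$CLUSTER_NAME" =~ $NAME_PATTERN ]]; then
  echo "cluster name looks valid"
else
  echo "cluster name is invalid"
fi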

gcloud

To create a Dataproc cluster on the command line, run the following gcloud dataproc clusters create command in a local terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION

This command creates a cluster with default Dataproc service settings for your master and worker virtual machine instances, disk sizes and types, network type, the region and zone where the cluster is deployed, and other cluster settings. To learn how to customize cluster settings with command-line flags, see the gcloud dataproc clusters create command.
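
For example, the following sketch overrides several of those defaults. The flag values shown (machine types, worker count, image version, and zone) are placeholders chosen for illustration; consult the gcloud dataproc clusters create reference for the full flag list and current values:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --zone=ZONE \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2 \
    --image-version=2.2-debian12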

Create a cluster with a YAML file

  1. Run the following gcloud command to export the configuration of an existing Dataproc cluster into a cluster.yaml file.
    gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
        --region=REGION \
        --destination=cluster.yaml
    
  2. Create a new cluster by importing the YAML file configuration.
    gcloud dataproc clusters import NEW_CLUSTER_NAME \
        --region=REGION \
        --source=cluster.yaml
    

Note: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are disallowed in the imported YAML file used to create a cluster.
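
For orientation, here is a heavily trimmed sketch of what the exported cluster.yaml might contain; it assumes the file uses the Cluster resource's camelCase field names, and a real export contains many more fields. You can edit values such as numInstances or machineTypeUri before importing:

config:
  gceClusterConfig:
    zoneUri: ZONE
  masterConfig:
    machineTypeUri: n1-standard-2
    numInstances: 1
  workerConfig:
    machineTypeUri: n1-standard-2
    numInstances: 2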

REST

This section shows how to create a cluster with required values and the default configuration (1 master, 2 workers).

Before using any of the request data, make the following replacements:

  • CLUSTER_NAME: the cluster name
  • PROJECT: the Google Cloud project ID
  • REGION: an available Compute Engine region in which the cluster will be created
  • ZONE: an optional zone, within the selected region, in which the cluster will be created

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"n1-standard-2",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"n1-standard-2",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}

Send the request to the URL above with the request JSON body; a curl sketch follows.
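
One way to send the request is with curl, assuming the JSON body above has been saved to a local file named request.json (an assumed filename) and that gcloud can issue an access token for an account allowed to create clusters:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"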

You should receive a JSON response similar to the following:

{
"name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ...""
    ]
  }
}
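
The response describes a long-running operation rather than a finished cluster. As a sketch, you can poll the operation by issuing a GET on the name value returned in the response (OPERATION_ID below is a placeholder for the identifier at the end of that name field):

curl \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/operations/OPERATION_ID"

When the cluster has been created, the returned operation is marked done.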

Console

Open the Dataproc Create a cluster page in the Google Cloud console in your browser, then click Create in the Cluster on Compute Engine row on the Create a Dataproc cluster on Compute Engine page. The Set up cluster panel is selected, with its fields filled in with default values. You can select each panel and confirm or change the default values to customize your cluster.

Click Create to create the cluster. The cluster name appears on the Clusters page, and its status updates to Running after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine the jobs, instances, and configuration settings for the cluster and connect to web interfaces running on it.

Go

  1. Install the client library.
  2. Set up Application Default Credentials (see the note after the sample).
  3. Run the code.
    import (
    	"context"
    	"fmt"
    	"io"
    
    	dataproc "cloud.google.com/go/dataproc/apiv1"
    	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
    	"google.golang.org/api/option"
    )
    
    func createCluster(w io.Writer, projectID, region, clusterName string) error {
    	// projectID := "your-project-id"
    	// region := "us-central1"
    	// clusterName := "your-cluster"
    	ctx := context.Background()
    
    	// Create the cluster client.
    	endpoint := region + "-dataproc.googleapis.com:443"
    	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
    	if err != nil {
    		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
    	}
    	defer clusterClient.Close()
    
    	// Create the cluster config.
    	req := &dataprocpb.CreateClusterRequest{
    		ProjectId: projectID,
    		Region:    region,
    		Cluster: &dataprocpb.Cluster{
    			ProjectId:   projectID,
    			ClusterName: clusterName,
    			Config: &dataprocpb.ClusterConfig{
    				MasterConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   1,
    					MachineTypeUri: "n1-standard-2",
    				},
    				WorkerConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   2,
    					MachineTypeUri: "n1-standard-2",
    				},
    			},
    		},
    	}
    
    	// Create the cluster.
    	op, err := clusterClient.CreateCluster(ctx, req)
    	if err != nil {
    		return fmt.Errorf("CreateCluster: %w", err)
    	}
    
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return fmt.Errorf("CreateCluster.Wait: %w", err)
    	}
    
    	// Output a success message.
    	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
    	return nil
    }
    
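
Step 2 above refers to Application Default Credentials. A common way to set them up for local development, assuming you want the samples to run as your own user account, is:

gcloud auth application-default login

The same credential setup applies to the Java, Node.js, and Python samples below.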

Java

  1. Install the client library.
  2. Set up Application Default Credentials.
  3. Run the code.
    import com.google.api.gax.longrunning.OperationFuture;
    import com.google.cloud.dataproc.v1.Cluster;
    import com.google.cloud.dataproc.v1.ClusterConfig;
    import com.google.cloud.dataproc.v1.ClusterControllerClient;
    import com.google.cloud.dataproc.v1.ClusterControllerSettings;
    import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
    import com.google.cloud.dataproc.v1.InstanceGroupConfig;
    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    
    public class CreateCluster {
    
      public static void createCluster() throws IOException, InterruptedException {
        // TODO(developer): Replace these variables before running the sample.
        String projectId = "your-project-id";
        String region = "your-project-region";
        String clusterName = "your-cluster-name";
        createCluster(projectId, region, clusterName);
      }
    
      public static void createCluster(String projectId, String region, String clusterName)
          throws IOException, InterruptedException {
        String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);
    
        // Configure the settings for the cluster controller client.
        ClusterControllerSettings clusterControllerSettings =
            ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();
    
        // Create a cluster controller client with the configured settings. The client only needs to be
        // created once and can be reused for multiple requests. Using a try-with-resources
        // closes the client, but this can also be done manually with the .close() method.
        try (ClusterControllerClient clusterControllerClient =
            ClusterControllerClient.create(clusterControllerSettings)) {
          // Configure the settings for our cluster.
          InstanceGroupConfig masterConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(1)
                  .build();
          InstanceGroupConfig workerConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(2)
                  .build();
          ClusterConfig clusterConfig =
              ClusterConfig.newBuilder()
                  .setMasterConfig(masterConfig)
                  .setWorkerConfig(workerConfig)
                  .build();
          // Create the cluster object with the desired cluster config.
          Cluster cluster =
              Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();
    
          // Create the Cloud Dataproc cluster.
          OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
              clusterControllerClient.createClusterAsync(projectId, region, cluster);
          Cluster response = createClusterAsyncRequest.get();
    
          // Print out a success message.
          System.out.printf("Cluster created successfully: %s", response.getClusterName());
    
        } catch (ExecutionException e) {
          System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
        }
      }
    }

Node.js

  1. Install the client library.
  2. Set up Application Default Credentials.
  3. Run the code.
    const dataproc = require('@google-cloud/dataproc');
    
    // TODO(developer): Uncomment and set the following variables
    // projectId = 'YOUR_PROJECT_ID'
    // region = 'YOUR_CLUSTER_REGION'
    // clusterName = 'YOUR_CLUSTER_NAME'
    
    // Create a client with the endpoint set to the desired cluster region
    const client = new dataproc.v1.ClusterControllerClient({
      apiEndpoint: `${region}-dataproc.googleapis.com`,
      projectId: projectId,
    });
    
    async function createCluster() {
      // Create the cluster config
      const request = {
        projectId: projectId,
        region: region,
        cluster: {
          clusterName: clusterName,
          config: {
            masterConfig: {
              numInstances: 1,
              machineTypeUri: 'n1-standard-2',
            },
            workerConfig: {
              numInstances: 2,
              machineTypeUri: 'n1-standard-2',
            },
          },
        },
      };
    
      // Create the cluster
      const [operation] = await client.createCluster(request);
      const [response] = await operation.promise();
    
      // Output a success message
      console.log(`Cluster created successfully: ${response.clusterName}`);
    }

    createCluster();

Python

  1. Install the client library.
  2. Set up Application Default Credentials.
  3. Run the code.
    from google.cloud import dataproc_v1 as dataproc
    
    
    def create_cluster(project_id, region, cluster_name):
        """This sample walks a user through creating a Cloud Dataproc cluster
        using the Python client library.
    
        Args:
            project_id (string): Project to use for creating resources.
            region (string): Region where the resources should live.
            cluster_name (string): Name to use for creating a cluster.
        """
    
        # Create a client with the endpoint set to the desired cluster region.
        cluster_client = dataproc.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
    
        # Create the cluster config.
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        }
    
        # Create the cluster.
        operation = cluster_client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        result = operation.result()
    
        # Output a success message.
        print(f"Cluster created successfully: {result.cluster_name}")