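Create a Dataproc Spark-enabled instance

This page describes how to create a Dataproc Spark-enabled Vertex AI Workbench instance. It also describes the benefits of the Dataproc JupyterLab extension and gives an overview of how to use the extension with Serverless for Apache Spark and Dataproc on Compute Engine.

Overview of the Dataproc JupyterLab extension

Starting with version M113, Vertex AI Workbench instances come with the Dataproc JupyterLab extension preinstalled.

The Dataproc JupyterLab extension provides two ways to run Apache Spark notebook jobs: Dataproc clusters and Google Cloud Serverless for Apache Spark.

Dataproc clusters include a rich set of features for controlling the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, which lets you customize and control your environment. This approach is well suited to complex workloads, long-running jobs, and fine-grained resource management.
Serverless for Apache Spark eliminates infrastructure concerns. You submit your Spark jobs, and Google handles resource provisioning, scaling, and optimization behind the scenes. This serverless approach offers a cost-efficient option for data science and machine learning workloads.

Both options let you use Spark for data processing and analysis. The choice between Dataproc clusters and Serverless for Apache Spark depends on your specific workload requirements, the level of control you need, and your resource usage patterns.

Benefits of using Serverless for Apache Spark for data science and machine learning workloads include:

No cluster management: You don't need to provision, configure, or manage Spark clusters, which saves time and resources.
Autoscaling: Serverless for Apache Spark automatically scales up and down based on the workload, so you pay only for the resources that you use.
High performance: Serverless for Apache Spark is optimized for performance and takes advantage of Google Cloud's infrastructure.
Integration with other Google Cloud technologies: Serverless for Apache Spark integrates with other Google Cloud products, such as BigQuery and Dataplex Universal Catalog.

For more information, see the Google Cloud Serverless for Apache Spark documentation.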

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs
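
The same APIs can also be enabled from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The helper name enable_required_apis and the REQUIRED_APIS variable are illustrative, and the service names are assumed to be the identifiers for the three APIs listed above.

```shell
# Service names assumed for the Cloud Resource Manager, Dataproc, and Notebooks APIs.
REQUIRED_APIS="cloudresourcemanager.googleapis.com dataproc.googleapis.com notebooks.googleapis.com"

# Hypothetical helper: enable the required APIs in one project.
# $1 is the project ID.
enable_required_apis() {
  # Word splitting of $REQUIRED_APIS is intentional: one argument per service.
  # shellcheck disable=SC2086
  gcloud services enable $REQUIRED_APIS --project="$1"
}
```

To use it, call enable_required_apis PROJECT_ID with an account that has the Service Usage Admin role.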

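Required roles

To ensure that the service account has the necessary permissions to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:

Dataproc Worker (roles/dataproc.worker) on your project
Dataproc Editor (roles/dataproc.editor) on the cluster, for the dataproc.clusters.use permission

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a notebook file on a Serverless for Apache Spark cluster or a Dataproc cluster:

dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.update
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
dataproc.clusters.use

Your administrator might also be able to grant the service account these permissions with custom roles or other predefined roles.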
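
As an illustration of the project-level grant, an administrator could run something like the following. This is a sketch, not the only way to grant the role: the helper name grant_dataproc_worker is hypothetical, and the project ID and service account email are placeholders supplied by the caller. The cluster-level Dataproc Editor grant is not shown.

```shell
# Hypothetical helper: grant the Dataproc Worker role on a project
# to a service account.
# $1 is the project ID, $2 is the service account email.
grant_dataproc_worker() {
  gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$2" \
    --role="roles/dataproc.worker"
}
```

For example: grant_dataproc_worker PROJECT_ID SERVICE_ACCOUNT_EMAIL.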

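Create an instance with Dataproc enabled

To create a Vertex AI Workbench instance with Dataproc enabled, do the following:

In the Google Cloud console, go to the Instances page.

Go to Instances
Click Create new.
In the New instance dialog, click Advanced options.
In the Create instance dialog, in the Details section, make sure Enable Dataproc Serverless Interactive Sessions is selected.
Make sure Workbench type is set to Instance.
In the Environment section, make sure you use the latest version or a version numbered M113 or higher.
Click Create.

Vertex AI Workbench creates the instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.

Open JupyterLab

Next to your instance's name, click Open JupyterLab.

The JupyterLab Launcher tab opens in your browser. By default, it contains sections for Serverless for Apache Spark Notebooks and Dataproc Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, a section called Dataproc Cluster Notebooks also appears.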

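Use the extension with Serverless for Apache Spark

Serverless for Apache Spark runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Serverless for Apache Spark Notebooks section of the JupyterLab Launcher tab.

To create a runtime template, see Create a Serverless for Apache Spark runtime template.

To open a new serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding.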

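Use the extension with Dataproc on Compute Engine

If you created a Dataproc on Compute Engine Jupyter cluster, the Launcher tab has a Dataproc Cluster Notebooks section.

Four cards appear for each Jupyter-ready Dataproc cluster that you have access to in that region and project.

To change the region and project, do the following:

Select Settings > Cloud Dataproc Settings.
On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.

These changes don't take effect until you restart JupyterLab.
To restart JupyterLab, select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page.

To create a new notebook, click a card. After the remote kernel on the Dataproc cluster starts, you can write your code and then run it on your cluster.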

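Manage Dataproc on an instance using the gcloud CLI and the API

This section describes ways to manage Dataproc on a Vertex AI Workbench instance.

Change the region of your Dataproc cluster

Your Vertex AI Workbench instance's default kernels, such as Python and TensorFlow, are local kernels that run in the instance's VM. On a Dataproc Spark-enabled Vertex AI Workbench instance, your notebook runs on a Dataproc cluster through a remote kernel. The remote kernel runs on a service outside of your instance's VM, which lets you access any Dataproc cluster within the same project.

By default, Vertex AI Workbench uses Dataproc clusters in the same region as your instance. You can change the Dataproc region as long as the Component Gateway and the optional Jupyter component are enabled on the Dataproc cluster.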

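Test access

The Dataproc JupyterLab extension is enabled by default for Vertex AI Workbench instances. To test access to Dataproc, you can check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain: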

curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .

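If the curl command fails, check to make sure that:

Your DNS entries are configured correctly.
There is a cluster available in the same project. If there isn't one, you need to create it.
Your cluster has both the Component Gateway and the optional Jupyter component enabled.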
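
The access check above can be wrapped in a small script that reports success or failure through its exit status. This is a sketch under stated assumptions: the helper names kernelspecs_url and check_kernel_access are hypothetical, the URL pattern is taken from the curl command above, and curl's --fail option is used so that HTTP error responses produce a nonzero exit status.

```shell
# Build the kernelspecs URL for a project and region.
# $1 is the project ID, $2 is the region.
kernelspecs_url() {
  printf 'https://%s-dot-%s.kernels.googleusercontent.com/api/kernelspecs\n' "$1" "$2"
}

# Hypothetical helper: succeed only if the endpoint returns a
# non-error HTTP response. The response body is discarded.
check_kernel_access() {
  curl --silent --show-error --fail --output /dev/null \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "$(kernelspecs_url "$1" "$2")"
}
```

For example, check_kernel_access PROJECT_ID REGION && echo "remote kernels reachable" prints the message only when the request succeeds. If it fails, work through the checklist above.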

Turn off Dataproc

Vertex AI Workbench instances are created with Dataproc enabled by default. You can create a Vertex AI Workbench instance with Dataproc turned off by setting the disable-mixer metadata key to true.

gcloud workbench instances create INSTANCE_NAME --metadata=disable-mixer=true

Enable Dataproc

You can enable Dataproc on a stopped Vertex AI Workbench instance by updating the metadata value.

gcloud workbench instances update INSTANCE_NAME --metadata=disable-mixer=false
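
Because the metadata key is phrased as a negative (disable-mixer), it is easy to set the wrong value. The following sketch wraps the update command in a helper that takes on or off instead. The helper name set_dataproc_enabled is hypothetical. Two assumptions: the --location flag is passed explicitly in case no default location is configured, and turning Dataproc off through an update is assumed to mirror the Terraform behavior described in the next section.

```shell
# Hypothetical helper: turn Dataproc on or off for an existing instance.
# $1: instance name, $2: instance location (zone), $3: "on" or "off".
set_dataproc_enabled() {
  case "$3" in
    on)  disable_mixer=false ;;  # enabling Dataproc means disable-mixer=false
    off) disable_mixer=true ;;   # assumption: mirrors the Terraform setting
    *)
      echo "usage: set_dataproc_enabled INSTANCE LOCATION on|off" >&2
      return 2
      ;;
  esac
  gcloud workbench instances update "$1" \
    --location="$2" \
    --metadata="disable-mixer=$disable_mixer"
}
```

For example, set_dataproc_enabled INSTANCE_NAME LOCATION on runs the same update as the command above, with the location added. As noted above, stop the instance before updating it.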

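Manage Dataproc using Terraform

In Terraform, Dataproc for Vertex AI Workbench instances is managed using the disable-mixer key in the metadata field. To turn on Dataproc, set the disable-mixer metadata key to false. To turn off Dataproc, set the disable-mixer metadata key to true.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.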

resource "google_workbench_instance" "default" {
  name     = "workbench-instance-example"
  location = "us-central1-a"

  gce_setup {
    machine_type = "n1-standard-1"
    vm_image {
      project = "cloud-notebooks-managed"
      family  = "workbench-instances"
    }
    metadata = {
      disable-mixer = "false"
    }
  }
}

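Troubleshoot

To diagnose and resolve issues related to creating a Dataproc Spark-enabled instance, see Troubleshooting Vertex AI Workbench.

What's next

For more information about the Dataproc JupyterLab extension, see Use the JupyterLab extension to develop serverless Spark workloads.
For more information about Serverless for Apache Spark, see the Serverless for Apache Spark documentation.
Learn how to run Serverless for Apache Spark workloads without provisioning and managing clusters.
For more information about using Spark with Google Cloud products and services, see Spark on Google Cloud.
Browse the available Dataproc templates on GitHub.
Learn about Serverless Spark through the serverless-spark-workshop on GitHub.
Read the Apache Spark documentation.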