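Run a managed notebooks instance on a Dataproc cluster

This page shows you how to run a managed notebooks instance's notebook file on a Dataproc cluster.

Before you begin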

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Notebooks and Dataproc APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs
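
If you prefer the command line to the console, the same two APIs can also be enabled with the gcloud CLI. The following is a minimal sketch rather than the only way to do it. PROJECT_ID and its my-project default are placeholders, and the run helper with its DRY_RUN switch is a hypothetical convenience added for this example so that the script only prints the command unless you explicitly ask it to execute.

```shell
#!/usr/bin/env bash
# PROJECT_ID is a placeholder (assumption); replace it with your project ID.
PROJECT_ID="${PROJECT_ID:-my-project}"

# run: hypothetical helper for this sketch. It prints the command by default
# and executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Enable the Notebooks and Dataproc APIs on the project.
enable_apis() {
  run gcloud services enable notebooks.googleapis.com dataproc.googleapis.com \
    --project="${PROJECT_ID}"
}

enable_apis
```

Run the script once as is to review the printed command, then run it again with DRY_RUN=0 to apply it.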

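Required roles

To ensure that the service account has the necessary permissions to run a notebook file on a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:

Dataproc Worker (roles/dataproc.worker) on your project
Dataproc Editor (roles/dataproc.editor) on the cluster, for the dataproc.clusters.use permission

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a notebook file on a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a notebook file on a Dataproc cluster: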

dataproc.agents.create
dataproc.agents.delete
dataproc.agents.get
dataproc.agents.update
dataproc.tasks.lease
dataproc.tasks.listInvalidatedLeases
dataproc.tasks.reportStatus
dataproc.clusters.use

Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
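
As an illustration of the project-level grant, an administrator could bind the Dataproc Worker role with the gcloud CLI. This is a sketch under stated assumptions: PROJECT_ID and SA_EMAIL are placeholders, the run helper and DRY_RUN switch are hypothetical conveniences for this example, and the cluster-level Dataproc Editor grant is deliberately left out because it is set on the cluster resource rather than on the project.

```shell
#!/usr/bin/env bash
# PROJECT_ID and SA_EMAIL are placeholders (assumptions), not real values.
PROJECT_ID="${PROJECT_ID:-my-project}"
SA_EMAIL="${SA_EMAIL:-my-sa@my-project.iam.gserviceaccount.com}"

# run: hypothetical helper for this sketch. It prints the command by default
# and executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Bind one role to the service account at the project level.
grant_project_role() {
  run gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="$1"
}

grant_project_role roles/dataproc.worker
```

By default the script only prints the binding command, which makes it easy to review the member and role before anything changes.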

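Create a Dataproc cluster

To run a managed notebooks instance's notebook file in a Dataproc cluster, your cluster must meet the following criteria:

The cluster's component gateway must be enabled.
The cluster must have the Jupyter component.
The cluster must be in the same region as your managed notebooks instance.

To create your Dataproc cluster, enter the following command in either Cloud Shell or another environment where the Google Cloud CLI is installed.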

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --enable-component-gateway \
    --optional-components=JUPYTER

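Replace the following:

REGION: the Google Cloud location of your managed notebooks instance
CLUSTER_NAME: the name of your new cluster

After a few minutes, your Dataproc cluster is available for use. Learn more about creating a Dataproc cluster.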
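
The create command with its placeholders filled in, followed by a check that the cluster meets the criteria listed earlier, might look like the following sketch. The mycluster and us-central1 defaults are placeholders, the run helper and DRY_RUN switch are hypothetical conveniences for this example, and the field paths in the --format projection are assumed from the Dataproc cluster resource, so compare them against the output of a plain describe if they come back empty.

```shell
#!/usr/bin/env bash
# CLUSTER_NAME and REGION defaults are placeholders (assumptions).
CLUSTER_NAME="${CLUSTER_NAME:-mycluster}"
REGION="${REGION:-us-central1}"

# run: hypothetical helper for this sketch. It prints the command by default
# and executes it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Create the cluster with the component gateway and the Jupyter component.
create_cluster() {
  run gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --enable-component-gateway \
    --optional-components=JUPYTER
}

# Show the gateway setting and the optional components of the cluster.
# The field paths below are an assumption; adjust them if needed.
describe_cluster() {
  run gcloud dataproc clusters describe "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --format="yaml(config.endpointConfig.enableHttpPortAccess,config.softwareConfig.optionalComponents)"
}

create_cluster
describe_cluster
```

Set REGION to the location of your managed notebooks instance, because the cluster and the instance must be in the same region.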

Open JupyterLab

In the Google Cloud console, go to the Managed notebooks page.

Go to Managed notebooks
Next to your managed notebooks instance's name, click Open JupyterLab.

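Run a notebook file in your Dataproc cluster

You can run a notebook file in your Dataproc cluster from any managed notebooks instance in the same project and region.

Run a new notebook file

In your managed notebooks instance's JupyterLab interface, select File > New > Notebook.
Your Dataproc cluster's available kernels appear in the Select kernel menu. Select the kernel that you want to use, and then click Select.

Your new notebook file opens.
Add code to your new notebook file, and then run the code.

To change the kernel that you want to use after you've created your notebook file, see the following section.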

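Run an existing notebook file

In your managed notebooks instance's JupyterLab interface, click the File Browser button, navigate to the notebook file that you want to run, and open it.
To open the Select kernel dialog, click the kernel name of your notebook file, for example Python (Local).
To select a kernel from your Dataproc cluster, select a kernel name that ends with your cluster name. For example, a PySpark kernel on a Dataproc cluster named mycluster is named PySpark on mycluster.
Click Select to close the dialog.

You can now run your notebook file's code on your Dataproc cluster.

What's next

Learn more about Dataproc.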