Create a Dataproc-enabled instance
This page describes how to create a Dataproc-enabled Vertex AI Workbench instance. It also describes the benefits of the Dataproc JupyterLab plugin and provides an overview of how to use the plugin with Dataproc Serverless for Spark and Dataproc on Compute Engine.
Overview of the Dataproc JupyterLab plugin
Vertex AI Workbench instances have the Dataproc JupyterLab plugin preinstalled, starting with version M113.
The Dataproc JupyterLab plugin provides two ways to run Apache Spark notebook jobs: Dataproc clusters and Serverless Spark on Dataproc.
- Dataproc clusters include a rich set of features with control over the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, allowing for customization and control over your environment. This approach is ideal for complex workloads, long-running jobs, and fine-grained resource management.
- Serverless Spark powered by Dataproc eliminates infrastructure concerns. You submit your Spark jobs, and Google handles the provisioning, scaling, and optimization of resources behind the scenes. This serverless approach offers an easy and cost-efficient option for data science and ML workloads.
With both options, you can use Spark for data processing and analysis. The choice between Dataproc clusters and Serverless Spark depends on your specific workload requirements, desired level of control, and resource usage patterns.
Benefits of using Serverless Spark for data science and ML workloads include:
- No cluster management: You don't need to worry about provisioning, configuring, or managing Spark clusters. This saves you time and resources.
- Autoscaling: Serverless Spark automatically scales up and down based on the workload, so you only pay for the resources you use.
- High performance: Serverless Spark is optimized for performance and takes advantage of Google Cloud's infrastructure.
- Integration with other Google Cloud technologies: Serverless Spark integrates with other Google Cloud products, such as BigQuery and Dataplex.
For more information, see the Dataproc Serverless documentation.
Limitations
Consider the following limitations when planning your project:
- The Dataproc JupyterLab plugin doesn't support VPC Service Controls.
Dataproc limitations
The following Dataproc limitations apply:
- Spark jobs are executed with the service account identity, not the submitting user's identity.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs, as shown in the example command after this list.
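If you use the gcloud CLI, you can enable the required APIs with a single command. A minimal sketch; the service names are the standard API identifiers for Cloud Resource Manager, Dataproc, and Notebooks:

gcloud services enable cloudresourcemanager.googleapis.com dataproc.googleapis.com notebooks.googleapis.com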
Required roles
To ensure that the service account has the necessary permissions to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:
- Dataproc Worker (roles/dataproc.worker) on your project
- Dataproc Editor (roles/dataproc.editor) on the cluster for the dataproc.clusters.use permission
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster:
- dataproc.agents.create
- dataproc.agents.delete
- dataproc.agents.get
- dataproc.agents.update
- dataproc.tasks.lease
- dataproc.tasks.listInvalidatedLeases
- dataproc.tasks.reportStatus
- dataproc.clusters.use
Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
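For example, an administrator might grant these roles with the gcloud CLI. A hedged sketch; PROJECT_ID, CLUSTER_NAME, REGION, and SA_EMAIL are placeholders for your project, cluster, region, and the service account's email address:

gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SA_EMAIL" --role="roles/dataproc.worker"

gcloud dataproc clusters add-iam-policy-binding CLUSTER_NAME --region=REGION --member="serviceAccount:SA_EMAIL" --role="roles/dataproc.editor"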
Create an instance with Dataproc enabled
To create a Vertex AI Workbench instance with Dataproc enabled, do the following:
1. In the Google Cloud console, go to the Instances page.
2. Click Create new.
3. In the New instance dialog, click Advanced options.
4. In the Create instance dialog, in the Details section, make sure that Enable Dataproc Serverless Interactive Sessions is selected.
5. Make sure that Workbench type is set to Instance.
6. In the Environment section, make sure you use the latest version or a version numbered M113 or higher.
7. Click Create.
Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.
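You can also create the instance from the command line. A minimal sketch using the gcloud workbench command group; INSTANCE_NAME is a placeholder and the zone is an illustrative assumption (Dataproc is enabled on new instances by default):

gcloud workbench instances create INSTANCE_NAME --location=us-central1-a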
Open JupyterLab
Next to your instance's name, click Open JupyterLab.
The JupyterLab Launcher tab opens in your browser. By default, it contains sections for Dataproc Serverless Notebooks and Dataproc Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, the tab also contains a Dataproc Cluster Notebooks section.
Use the plugin with Dataproc Serverless for Spark
Serverless Spark runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Dataproc Serverless Notebooks section of the JupyterLab Launcher tab.
To create a runtime template, see Create a Dataproc Serverless runtime template.
To open a new Serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding. To run your code on Serverless Spark, run a code cell in your notebook.
Use the plugin with Dataproc on Compute Engine
If you created a Dataproc on Compute Engine Jupyter cluster, the Launcher tab has a Dataproc Cluster Notebooks section.
Four cards appear for each Jupyter-ready Dataproc cluster that you have access to in that region and project.
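If you don't yet have a Jupyter-ready cluster, you can create one with the gcloud CLI. A minimal sketch; CLUSTER_NAME and REGION are placeholders, and the two flags that make the cluster usable from the plugin are the Jupyter optional component and the Component Gateway:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=JUPYTER \
    --enable-component-gateway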
To change the region and project, do the following:
1. Select Settings > Cloud Dataproc Settings.
2. On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.
3. Restart JupyterLab: select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page. The changes don't take effect until JupyterLab restarts.
To create a new notebook, click a card. After the remote kernel on the Dataproc cluster starts, you can start writing your code and then run it on your cluster.
Manage Dataproc on an instance using the gcloud CLI and the API
This section describes ways to manage Dataproc on a Vertex AI Workbench instance.
Change the region of your Dataproc cluster
Your Vertex AI Workbench instance's default kernels, such as Python and TensorFlow, are local kernels that run in the instance's VM. On a Dataproc-enabled Vertex AI Workbench instance, your notebook runs on a Dataproc cluster through a remote kernel. The remote kernel runs on a service outside of your instance's VM, which lets you access any Dataproc cluster within the same project.
By default, Vertex AI Workbench uses Dataproc clusters within the same region as your instance, but you can change the Dataproc region as long as the Component Gateway and the optional Jupyter component are enabled on the Dataproc cluster.
To change the region of your instance's VM, use the following command:

gcloud config set compute/region REGION

Replace REGION with the region that you want, for example us-east4.

To change the region of your Dataproc cluster, use the following command:

gcloud config set dataproc/region REGION

Replace REGION with the region that you want, for example us-east4.
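To confirm which clusters are visible in the new region, you can list them. A quick check, assuming your gcloud CLI is authenticated against the same project:

gcloud dataproc clusters list --region=REGION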
Test access
The Dataproc JupyterLab plugin is enabled by default for Vertex AI Workbench instances. To test access to Dataproc, check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain:
curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .
If the curl command fails, check the following:
- Your DNS entries are configured correctly.
- A cluster is available in the same project; if one doesn't exist, create it.
- Your cluster has both the Component Gateway and the optional Jupyter component enabled. You can verify this with the command after this list.
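A hedged way to check those cluster settings from the command line; this sketch assumes the config.softwareConfig.optionalComponents and config.endpointConfig.enableHttpPortAccess field paths in the Dataproc cluster resource:

gcloud dataproc clusters describe CLUSTER_NAME --region=REGION --format="yaml(config.softwareConfig.optionalComponents,config.endpointConfig.enableHttpPortAccess)"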
Turn off Dataproc
Vertex AI Workbench instances are created with Dataproc enabled by default. You can create a Vertex AI Workbench instance with Dataproc turned off by setting the disable-mixer metadata key to true:
gcloud workbench instances create INSTANCE_NAME --metadata=disable-mixer=true
Enable Dataproc
You can enable Dataproc on a stopped Vertex AI Workbench instance by updating the metadata value:
gcloud workbench instances update INSTANCE_NAME --metadata=disable-mixer=false
Manage Dataproc using Terraform
Dataproc for Vertex AI Workbench instances is managed in Terraform through the disable-mixer key in the metadata field. Turn on Dataproc by setting the disable-mixer metadata key to false. Turn off Dataproc by setting it to true.
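The following is a minimal sketch of a Terraform configuration with Dataproc enabled, using the google_workbench_instance resource; the instance name, zone, and machine type are illustrative assumptions:

resource "google_workbench_instance" "default" {
  name     = "my-workbench-instance"
  location = "us-central1-a"

  gce_setup {
    machine_type = "e2-standard-4"

    # Assumption: "disable-mixer" = "false" keeps Dataproc enabled;
    # set it to "true" to turn Dataproc off.
    metadata = {
      "disable-mixer" = "false"
    }
  }
}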
To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.
Troubleshoot
To diagnose and resolve issues related to creating a Dataproc-enabled instance, see Troubleshooting Vertex AI Workbench.
What's next
- For more information about the Dataproc JupyterLab plugin, see Use JupyterLab for serverless batch and interactive notebook sessions.
- To learn more about Serverless Spark, see the Dataproc Serverless documentation.
- Learn how to run Serverless Spark workloads without provisioning and managing clusters.
- To learn more about using Spark with Google Cloud products and services, see Spark on Google Cloud.
- Browse the available Dataproc templates on GitHub.
- Learn about Serverless Spark through the serverless-spark-workshop on GitHub.
- Read the Apache Spark documentation.