Create a Dataproc-enabled instance
This page describes how to create a Dataproc-enabled Vertex AI Workbench instance. It also describes the benefits of the Dataproc JupyterLab plugin and provides an overview of how to use the plugin with Dataproc Serverless for Spark and Dataproc on Compute Engine.
Overview of the Dataproc JupyterLab plugin
Vertex AI Workbench instances have the Dataproc JupyterLab plugin preinstalled, starting with version M113.
The Dataproc JupyterLab plugin provides two ways to run Apache Spark notebook jobs: Dataproc clusters and Serverless Spark on Dataproc.
- Dataproc clusters include a rich set of features with control over the infrastructure that Spark runs on. You choose the size and configuration of your Spark cluster, allowing for customization and control over your environment. This approach is ideal for complex workloads, long-running jobs, and fine-grained resource management.
- Serverless Spark powered by Dataproc eliminates infrastructure concerns. You submit your Spark jobs, and Google handles the provisioning, scaling, and optimization of resources behind the scenes. This serverless approach offers an easy and cost-efficient option for data science and ML workloads.
With both options, you can use Spark for data processing and analysis. The choice between Dataproc clusters and Serverless Spark depends on your specific workload requirements, desired level of control, and resource usage patterns.
Benefits of using Serverless Spark for data science and ML workloads include:
- No cluster management: You don't need to worry about provisioning, configuring, or managing Spark clusters. This saves you time and resources.
- Autoscaling: Serverless Spark automatically scales up and down based on the workload, so you only pay for the resources you use.
- High performance: Serverless Spark is optimized for performance and takes advantage of Google Cloud's infrastructure.
- Integration with other Google Cloud technologies: Serverless Spark integrates with other Google Cloud products, such as BigQuery and Dataplex.
For more information, see the Dataproc Serverless documentation.
Limitations
Consider the following limitations when planning your project:
- The Dataproc JupyterLab plugin doesn't support VPC Service Controls.
Dataproc limitations
The following Dataproc limitations apply:
- Spark jobs are executed with the service account identity, not the submitting user's identity.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Enable the Cloud Resource Manager, Dataproc, and Notebooks APIs, as shown in the example command after this list.
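If you use the gcloud CLI, you can enable the required APIs with a single command. A minimal sketch; the service names are the standard API identifiers for Cloud Resource Manager, Dataproc, and Notebooks:

gcloud services enable cloudresourcemanager.googleapis.com dataproc.googleapis.com notebooks.googleapis.com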
Required roles
To ensure that the service account has the necessary permissions to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster, ask your administrator to grant the service account the following IAM roles:
- Dataproc Worker (roles/dataproc.worker) on your project
- Dataproc Editor (roles/dataproc.editor) on the cluster for the dataproc.clusters.use permission
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run a notebook file on a Dataproc Serverless cluster or a Dataproc cluster:
- dataproc.agents.create
- dataproc.agents.delete
- dataproc.agents.get
- dataproc.agents.update
- dataproc.tasks.lease
- dataproc.tasks.listInvalidatedLeases
- dataproc.tasks.reportStatus
- dataproc.clusters.use
Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
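For example, an administrator might grant these roles with the gcloud CLI. A hedged sketch; PROJECT_ID, CLUSTER_NAME, REGION, and SA_EMAIL are placeholders for your project, cluster, region, and the service account's email address:

gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SA_EMAIL" --role="roles/dataproc.worker"

gcloud dataproc clusters add-iam-policy-binding CLUSTER_NAME --region=REGION --member="serviceAccount:SA_EMAIL" --role="roles/dataproc.editor"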
Create an instance with Dataproc enabled
To create a Vertex AI Workbench instance with Dataproc enabled, do the following:
1. In the Google Cloud console, go to the Instances page.
2. Click Create new.
3. In the New instance dialog, click Advanced options.
4. In the Create instance dialog, in the Details section, make sure that Enable Dataproc Serverless Interactive Sessions is selected.
5. Make sure that Workbench type is set to Instance.
6. In the Environment section, make sure you use the latest version or a version numbered M113 or higher.
7. Click Create.
Vertex AI Workbench creates an instance and automatically starts it. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.
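You can also create the instance from the command line. A minimal sketch using the gcloud workbench command group; INSTANCE_NAME is a placeholder and the zone is an illustrative assumption (Dataproc is enabled on new instances by default):

gcloud workbench instances create INSTANCE_NAME --location=us-central1-a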
Open JupyterLab
Next to your instance's name, click Open JupyterLab.
The JupyterLab Launcher tab opens in your browser. By default, it contains sections for Dataproc Serverless Notebooks and Dataproc Jobs and Sessions. If there are Jupyter-ready clusters in the selected project and region, the tab also contains a Dataproc Cluster Notebooks section.
Use the plugin with Dataproc Serverless for Spark
Serverless Spark runtime templates that are in the same region and project as your Vertex AI Workbench instance appear in the Dataproc Serverless Notebooks section of the JupyterLab Launcher tab.
To create a runtime template, see Create a Dataproc Serverless runtime template.
To open a new Serverless Spark notebook, click a runtime template. It takes about a minute for the remote Spark kernel to start. After the kernel starts, you can start coding. To run your code on Serverless Spark, run a code cell in your notebook.
Use the plugin with Dataproc on Compute Engine
If you created a Dataproc on Compute Engine Jupyter cluster, the Launcher tab has a Dataproc Cluster Notebooks section.
Four cards appear for each Jupyter-ready Dataproc cluster that you have access to in that region and project.
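If you don't yet have a Jupyter-ready cluster, you can create one with the gcloud CLI. A minimal sketch; CLUSTER_NAME and REGION are placeholders, and the two flags that make the cluster usable from the plugin are the Jupyter optional component and the Component Gateway:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=JUPYTER \
    --enable-component-gateway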
To change the region and project, do the following:
1. Select Settings > Cloud Dataproc Settings.
2. On the Setup Config tab, under Project Info, change the Project ID and Region, and then click Save.
3. Restart JupyterLab: select File > Shut Down, and then click Open JupyterLab on the Vertex AI Workbench instances page. The changes don't take effect until JupyterLab restarts.
To create a new notebook, click a card. After the remote kernel on the Dataproc cluster starts, you can start writing your code and then run it on your cluster.
Manage Dataproc on an instance using the gcloud CLI and the API
This section describes ways to manage Dataproc on a Vertex AI Workbench instance.
Change the region of your Dataproc cluster
Your Vertex AI Workbench instance's default kernels, such as Python and TensorFlow, are local kernels that run in the instance's VM. On a Dataproc-enabled Vertex AI Workbench instance, your notebook runs on a Dataproc cluster through a remote kernel. The remote kernel runs on a service outside of your instance's VM, which lets you access any Dataproc cluster within the same project.
By default, Vertex AI Workbench uses Dataproc clusters within the same region as your instance, but you can change the Dataproc region as long as the Component Gateway and the optional Jupyter component are enabled on the Dataproc cluster.
To change the region of your instance's VM, use the following command:

gcloud config set compute/region REGION

Replace REGION with the region that you want, for example us-east4.

To change the region of your Dataproc cluster, use the following command:

gcloud config set dataproc/region REGION

Replace REGION with the region that you want, for example us-east4.
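To confirm which clusters are visible in the new region, you can list them. A quick check, assuming your gcloud CLI is authenticated against the same project:

gcloud dataproc clusters list --region=REGION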
Test access
The Dataproc JupyterLab plugin is enabled by default for Vertex AI Workbench instances. To test access to Dataproc, check access to your instance's remote kernels by sending the following curl request to the kernels.googleusercontent.com domain:
curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://PROJECT_ID-dot-REGION.kernels.googleusercontent.com/api/kernelspecs | jq .
If the curl command fails, check the following:
- Your DNS entries are configured correctly.
- A cluster is available in the same project; if one doesn't exist, create it.
- Your cluster has both the Component Gateway and the optional Jupyter component enabled. You can verify this with the command after this list.
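A hedged way to check those cluster settings from the command line; this sketch assumes the config.softwareConfig.optionalComponents and config.endpointConfig.enableHttpPortAccess field paths in the Dataproc cluster resource:

gcloud dataproc clusters describe CLUSTER_NAME --region=REGION --format="yaml(config.softwareConfig.optionalComponents,config.endpointConfig.enableHttpPortAccess)"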
Turn off Dataproc
Vertex AI Workbench instances are created with Dataproc enabled by default. You can create a Vertex AI Workbench instance with Dataproc turned off by setting the disable-mixer metadata key to true:
gcloud workbench instances create INSTANCE_NAME --metadata=disable-mixer=true
Enable Dataproc
You can enable Dataproc on a stopped Vertex AI Workbench instance by updating the metadata value:
gcloud workbench instances update INSTANCE_NAME --metadata=disable-mixer=false
Manage Dataproc using Terraform
Dataproc for Vertex AI Workbench instances is managed in Terraform through the disable-mixer key in the metadata field. Turn on Dataproc by setting the disable-mixer metadata key to false. Turn off Dataproc by setting it to true.
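The following is a minimal sketch of a Terraform configuration with Dataproc enabled, using the google_workbench_instance resource; the instance name, zone, and machine type are illustrative assumptions:

resource "google_workbench_instance" "default" {
  name     = "my-workbench-instance"
  location = "us-central1-a"

  gce_setup {
    machine_type = "e2-standard-4"

    # Assumption: "disable-mixer" = "false" keeps Dataproc enabled;
    # set it to "true" to turn Dataproc off.
    metadata = {
      "disable-mixer" = "false"
    }
  }
}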
To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.
Troubleshoot
To diagnose and resolve issues related to creating a Dataproc-enabled instance, see Troubleshooting Vertex AI Workbench.
What's next
- For more information about the Dataproc JupyterLab plugin, see Use JupyterLab for serverless batch and interactive notebook sessions.
- To learn more about Serverless Spark, see the Dataproc Serverless documentation.
- Learn how to run Serverless Spark workloads without provisioning and managing clusters.
- To learn more about using Spark with Google Cloud products and services, see Spark on Google Cloud.
- Browse the available Dataproc templates on GitHub.
- Learn about Serverless Spark through the serverless-spark-workshop on GitHub.
- Read the Apache Spark documentation.