Use Dataproc Serverless Spark with managed notebooks

This page shows you how to run a notebook file on serverless Spark in a Vertex AI Workbench managed notebooks instance by using Dataproc Serverless.

Your managed notebooks instance can submit a notebook file's code to run on the Dataproc Serverless service. The service runs the code on a managed compute infrastructure that automatically scales resources as needed. Therefore, you don't need to provision and manage your own cluster.

Dataproc Serverless charges apply only to the time when the workload is executing.
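
For example, when a notebook uses a serverless Spark kernel, a cell like the following minimal PySpark sketch executes on Dataproc Serverless rather than on the notebook instance itself. The data is illustrative.

```python
# Minimal PySpark sketch: the work in this cell runs on Dataproc Serverless,
# not on the managed notebooks instance.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a serverless Spark kernel, getOrCreate() returns the session that
# Dataproc Serverless already provisioned for the notebook.
spark = SparkSession.builder.getOrCreate()

# Build a small DataFrame and run a distributed aggregation over it.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().orderBy("bucket").show()
```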

Requirements

To run a notebook file on Dataproc Serverless Spark, your environment must meet the following requirements.

  • Your Dataproc Serverless session must run in the same region as your managed notebooks instance.

  • The Require OS Login (constraints/compute.requireOsLogin) constraint must not be enabled for your project. See Manage OS Login in an organization.

  • To run a notebook file on Dataproc Serverless, you must provide a service account that has specific permissions. You can grant these permissions to the default service account or provide a custom service account. See the Permissions section of this page.

  • Your Dataproc Serverless Spark session uses a Virtual Private Cloud (VPC) network to execute workloads. The VPC subnetwork must meet specific requirements. See the requirements in Dataproc Serverless for Spark network configuration.

Permissions

To ensure that the service account has the necessary permissions to run a notebook file on Dataproc Serverless, ask your administrator to grant the service account the Dataproc Editor (roles/dataproc.editor) IAM role on your project. For more information about granting roles, see Manage access.

This predefined role contains the following permissions, which are required to run a notebook file on Dataproc Serverless:

  • dataproc.agents.create
  • dataproc.agents.delete
  • dataproc.agents.get
  • dataproc.agents.update
  • dataproc.sessions.create
  • dataproc.sessions.get
  • dataproc.sessions.list
  • dataproc.sessions.terminate
  • dataproc.sessions.delete
  • dataproc.tasks.lease
  • dataproc.tasks.listInvalidatedLeases
  • dataproc.tasks.reportStatus

Your administrator might also be able to give the service account these permissions with custom roles or other predefined roles.
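
If you want to verify which of these permissions a set of credentials actually holds, the following is a minimal sketch that calls the Cloud Resource Manager testIamPermissions method through the google-api-python-client library. Run it while authenticated as the service account you want to check (for example, by pointing GOOGLE_APPLICATION_CREDENTIALS at that account's key file); the project ID is a placeholder.

```python
# Sketch: report which Dataproc Serverless permissions the active
# credentials hold on the project. Assumes google-api-python-client is
# installed and the environment is authenticated as the service account.
from googleapiclient.discovery import build

PROJECT_ID = "your-project-id"  # placeholder: replace with your project ID

REQUIRED = [
    "dataproc.agents.create", "dataproc.agents.delete", "dataproc.agents.get",
    "dataproc.agents.update", "dataproc.sessions.create",
    "dataproc.sessions.get", "dataproc.sessions.list",
    "dataproc.sessions.terminate", "dataproc.sessions.delete",
    "dataproc.tasks.lease", "dataproc.tasks.listInvalidatedLeases",
    "dataproc.tasks.reportStatus",
]

crm = build("cloudresourcemanager", "v1")
response = (
    crm.projects()
    .testIamPermissions(resource=PROJECT_ID, body={"permissions": REQUIRED})
    .execute()
)
granted = set(response.get("permissions", []))
for permission in REQUIRED:
    print(f"{permission}: {'OK' if permission in granted else 'MISSING'}")
```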

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Notebooks, Vertex AI, and Dataproc APIs.

    Enable the APIs

  5. If you haven't already, create a managed notebooks instance.
  6. If you haven't already, configure a VPC network that meets the requirements listed in Dataproc Serverless for Spark network configuration.
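
If you prefer to script the API enablement in step 4, the following is a sketch that calls the Service Usage API's batchEnable method through the google-api-python-client library. The project ID is a placeholder, and the caller needs permission to enable services (for example, the Service Usage Admin role).

```python
# Sketch: enable the Notebooks, Vertex AI, and Dataproc APIs in one call.
# Assumes google-api-python-client is installed and the caller is allowed
# to enable services on the project.
from googleapiclient.discovery import build

PROJECT_ID = "your-project-id"  # placeholder: replace with your project ID

serviceusage = build("serviceusage", "v1")
operation = (
    serviceusage.services()
    .batchEnable(
        parent=f"projects/{PROJECT_ID}",
        body={
            "serviceIds": [
                "notebooks.googleapis.com",
                "aiplatform.googleapis.com",
                "dataproc.googleapis.com",
            ]
        },
    )
    .execute()
)
print(operation.get("name", "done"))  # name of the long-running operation
```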

Open JupyterLab

  1. In the Google Cloud console, go to the Managed notebooks page.

    Go to Managed notebooks

  2. Next to your managed notebooks instance's name, click Open JupyterLab.

Start a Dataproc Serverless Spark session

To start a Dataproc Serverless Spark session, complete the following steps.

  1. In your managed notebooks instance's JupyterLab interface, select the Launcher tab, and then select Serverless Spark. If the Launcher tab is not open, select File > New Launcher to open it.

    The Create Serverless Spark session dialog appears.

  2. In the Session name field, enter a name for your session.

  3. In the Execution configuration section, enter the Service account that you want to use. If you don't enter a service account, the session uses the Compute Engine default service account.

  4. In the Network configuration section, select the Network and Subnetwork of a network that meets the requirements listed in Dataproc Serverless for Spark network configuration.

  5. Click Create.

    A new notebook file opens. The Dataproc Serverless Spark session that you created is the kernel that runs your notebook file's code.
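
To confirm that the notebook is attached to your serverless session, you can run a quick smoke-test cell such as the following sketch. It assumes a PySpark kernel, in which getOrCreate() returns the session that Dataproc Serverless provisioned.

```python
# Smoke test for the Dataproc Serverless Spark kernel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Report the Spark version and run a trivial distributed job.
print("Spark version:", spark.version)
print("Row count:", spark.range(100).count())  # expect 100
```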

Run your code on Dataproc Serverless Spark and other kernels

  1. Add code to your new notebook file, and run the code.

  2. To run code on a different kernel, change the kernel.

  3. When you want to run the code on your Dataproc Serverless Spark session again, change the kernel back to the Dataproc Serverless Spark kernel.

Terminate your Dataproc Serverless Spark session

You can terminate a Dataproc Serverless Spark session in the JupyterLab interface or in the Google Cloud console. The code in your notebook file is preserved.

JupyterLab

  1. In JupyterLab, close the notebook file that was created when you created your Dataproc Serverless Spark session.

  2. In the dialog that appears, click Terminate session.

Google Cloud console

  1. In the Google Cloud console, go to the Dataproc sessions page.

    Go to Dataproc sessions

  2. Select the session that you want to terminate, and then click Terminate.
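
You can also script termination against the Dataproc sessions API. The following is a minimal sketch, assuming a google-cloud-dataproc client library release that includes SessionControllerClient (recent versions do); the project ID, region, and session ID are placeholders.

```python
# Sketch: terminate a Dataproc Serverless Spark session programmatically.
# Assumes a google-cloud-dataproc release that provides SessionControllerClient.
from google.cloud import dataproc_v1

PROJECT_ID = "your-project-id"  # placeholder
REGION = "us-central1"          # placeholder: must match the session's region
SESSION_ID = "your-session-id"  # placeholder

# Dataproc clients use regional endpoints.
client = dataproc_v1.SessionControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

name = f"projects/{PROJECT_ID}/locations/{REGION}/sessions/{SESSION_ID}"
operation = client.terminate_session(request={"name": name})
operation.result()  # block until termination completes
print(f"Terminated {name}")
```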

Delete your Dataproc Serverless Spark session

You can delete a Dataproc Serverless Spark session by using the Google Cloud console. The code in your notebook file is preserved.

  1. In the Google Cloud console, go to the Dataproc sessions page.

    Go to Dataproc sessions

  2. Select the session that you want to delete, and then click Delete.
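
Deletion can be scripted the same way. Under the same assumption about SessionControllerClient, this sketch lists the sessions in a region and then deletes one by ID; all identifiers are placeholders, and depending on its state you might need to terminate a session before you can delete it.

```python
# Sketch: list the Dataproc Serverless sessions in a region, then delete one.
from google.cloud import dataproc_v1

PROJECT_ID = "your-project-id"  # placeholder
REGION = "us-central1"          # placeholder
SESSION_ID = "your-session-id"  # placeholder

client = dataproc_v1.SessionControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

parent = f"projects/{PROJECT_ID}/locations/{REGION}"
for session in client.list_sessions(request={"parent": parent}):
    print(session.name, session.state.name)

operation = client.delete_session(request={"name": f"{parent}/sessions/{SESSION_ID}"})
operation.result()  # block until deletion completes
```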

What's next