Use the Dataproc JupyterLab plugin for serverless batch and interactive notebook sessions

This document describes how to install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance. It also describes how to develop and deploy Spark notebook code.

Once you install the Dataproc JupyterLab plugin, you can use it to do the following tasks:

Launch Dataproc Serverless for Spark interactive notebook sessions
Submit Dataproc Serverless batch jobs

Dataproc Serverless limitations and considerations

Spark jobs are executed with the service account identity, not the submitting user's identity.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Enable the Dataproc API.

Enable the API

Install the Google Cloud CLI.

To initialize the gcloud CLI, run the following command:

gcloud init

Note: If you installed the gcloud CLI previously, make sure you have the latest version by running

gcloud components update

Make sure the regional VPC subnet where you will run your Dataproc Serverless interactive session has Private Google Access enabled. For more information, see Create a Dataproc Serverless runtime template.

Install the Dataproc JupyterLab plugin

You can install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance.

To install the plugin, follow these steps:

Download and install Python version 3.8 or higher from python.org/downloads.
- Verify the Python 3.8+ installation.
```
python3 --version
```
Install JupyterLab 3.6.3+ on your machine.
```
pip3 install --upgrade jupyterlab
```
- Verify the JupyterLab 3.6.3+ installation.
```
pip3 show jupyterlab
```
Install the Dataproc JupyterLab plugin.
```
pip3 install dataproc-jupyter-plugin
```
- If your JupyterLab version is earlier than 4.0.0, enable the plugin extension.
```
jupyter server extension enable dataproc_jupyter_plugin
```
Start JupyterLab.
```
jupyter lab
```
1. The JupyterLab Launcher page opens in your browser. It contains a Dataproc Jobs and Sessions section. It can also contain Dataproc Serverless Notebooks and Dataproc Cluster Notebooks sections if you have access to Dataproc serverless notebooks or Dataproc clusters with the Jupyter optional component running in your project.
  
  On macOS, if you receive an SSL: CERTIFICATE_VERIFY_FAILED error in your terminal when you launch JupyterLab, update your Python SSL certificate by executing /Applications/Python 3.11/Install Certificates.command. This file is located in the Python home directory.
2. By default, your Dataproc Serverless for Spark Interactive session runs in the project and region you set when you ran gcloud init in Before you begin. You can change the project and region settings for your sessions from the JupyterLab Settings > Dataproc Settings page.
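
The version requirements in the steps above can also be checked from Python. The following is a minimal sketch, not part of the plugin: the helper names (`parse_version`, `installed_version`, `needs_manual_enable`, `check_environment`) are hypothetical, and the thresholds are the ones stated in this section (Python 3.8, JupyterLab 3.6.3, manual extension enablement below JupyterLab 4.0.0).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '3.6.3' into a tuple of ints, (3, 6, 3)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad '3.6' to (3, 6, 0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_manual_enable(jupyterlab_version):
    """JupyterLab releases earlier than 4.0.0 need the server extension enabled."""
    return parse_version(jupyterlab_version) < (4, 0, 0)


def check_environment():
    """Summarize whether the local environment meets the stated requirements."""
    lab = installed_version("jupyterlab")
    return {
        "python_ok": sys.version_info >= (3, 8),
        "jupyterlab_ok": lab is not None and parse_version(lab) >= (3, 6, 3),
        "plugin_installed": installed_version("dataproc-jupyter-plugin") is not None,
        "enable_extension": lab is not None and needs_manual_enable(lab),
    }


if __name__ == "__main__":
    print(check_environment())
```

If `enable_extension` is reported as true, run the `jupyter server extension enable` command shown earlier before you start JupyterLab.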

Create a Dataproc Serverless runtime template

Dataproc Serverless runtime templates (also called session templates) contain configuration settings for executing Spark code in a session. You can create and manage runtime templates using JupyterLab or the gcloud CLI.

JupyterLab

Click the New runtime template card in the Dataproc Serverless Notebooks section on the JupyterLab Launcher page.
Fill in the Runtime template form.
Specify a Display name and Description, and then input or confirm the other settings.

Notes:
- Network Configuration: The subnetwork must have Private Google Access enabled and must allow subnet communication on all ports (see Dataproc Serverless for Spark network configuration).
  
  If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:
  - Enable it for Private Google Access, or
  - Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
- Metastore: To use a Dataproc Metastore service in your sessions, select the metastore project ID, region, and service.
- Max idle time: The maximum notebook idle time before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
- Max session time: The maximum lifetime of a session before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
- PHS: You can select an available Persistent Spark History Server to allow you to access session logs during and after sessions.
  
  The PHS must be set up in the location (region) where your sessions run. By default, Dataproc Serverless sessions run in the project and region set with the gcloud init command. You can change the project and region settings from the JupyterLab Settings > Dataproc Settings page.
- Spark properties: Click Add Property for each property to set for your serverless Spark sessions. See Spark properties for a listing of supported and unsupported Spark properties, including Spark runtime, resource, and autoscaling properties.
- Labels: Click Add Label for each label to set on your serverless Spark sessions.
View your runtime templates from the Settings > Dataproc Settings page.
- You can delete a template from the Action menu for the template.
Click Save.
Open and reload the JupyterLab Launcher page to view the saved notebook template card on the JupyterLab Launcher page.
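
The allowable range for Max idle time and Max session time (10 minutes to 336 hours) can be checked before you save a template. The following sketch is illustrative only: the function names are hypothetical, and it assumes durations written as a whole number of seconds followed by `s` (for example `3600s`), the form used for `idleTtl` and `ttl` in the gcloud YAML configuration. That the same range applies to those YAML fields is also an assumption.

```python
MIN_SECONDS = 10 * 60        # 10 minutes
MAX_SECONDS = 336 * 60 * 60  # 336 hours (14 days)


def parse_seconds(value):
    """Convert a duration such as '3600s' into an integer number of seconds."""
    if not value.endswith("s") or not value[:-1].isdigit():
        raise ValueError(f"expected a duration like '3600s', got {value!r}")
    return int(value[:-1])


def in_allowed_range(value):
    """Return True if the duration lies within 10 minutes to 336 hours."""
    return MIN_SECONDS <= parse_seconds(value) <= MAX_SECONDS


if __name__ == "__main__":
    for candidate in ("300s", "3600s", "14400s"):
        print(candidate, in_allowed_range(candidate))
```

Here `300s` (5 minutes) falls below the minimum, while `3600s` and `14400s` are within range.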

gcloud

Create a YAML file with your runtime template configuration.

Simple YAML

environmentConfig:
  executionConfig:
    networkUri: default
jupyterSession:
  kernel: PYTHON
  displayName: Team A
labels:
  purpose: testing
description: Team A Development Environment

Complex YAML

environmentConfig:
  executionConfig:
    serviceAccount: sa1
    # Choose either networkUri or subnetworkUri
    networkUri: default
    subnetworkUri: subnet
    networkTags:
     - tag1
    kmsKey: key1
    idleTtl: 3600s
    ttl: 14400s
    stagingBucket: staging-bucket
  peripheralsConfig:
    metastoreService: projects/my-project-id/locations/us-central1/services/my-metastore-id
    sparkHistoryServerConfig:
      dataprocCluster: projects/my-project-id/regions/us-central1/clusters/my-cluster-id
jupyterSession:
  kernel: PYTHON
  displayName: Team A
labels:
  purpose: testing
runtimeConfig:
  version: "1.1"
  containerImage: gcr.io/my-project-id/my-image:1.0.1
  properties:
    "p1": "v1"
description: Team A Development Environment

If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:

Enable it for Private Google Access, or
Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.

Create a session (runtime) template from your YAML file by running the following gcloud beta dataproc session-templates import command locally or in Cloud Shell:
```
gcloud beta dataproc session-templates import TEMPLATE_ID \
    --source=YAML_FILE \
    --project=PROJECT_ID \
    --location=REGION
```
- See gcloud beta dataproc session-templates for commands to describe, list, export, and delete session templates.
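
If you script template creation, the import command above can be assembled in Python. This is a sketch under stated assumptions: the helper name `import_template_command` and the sample values (`team-a-template`, `template.yaml`, `my-project-id`, `us-central1`) are hypothetical, and only the flags shown in the command above are used. The sketch prints the command rather than running it.

```python
import shlex


def import_template_command(template_id, yaml_file, project_id, region):
    """Build the argument list for `gcloud beta dataproc session-templates import`."""
    return [
        "gcloud", "beta", "dataproc", "session-templates", "import", template_id,
        f"--source={yaml_file}",
        f"--project={project_id}",
        f"--location={region}",
    ]


# Hypothetical sample values; replace them with your own.
command = import_template_command(
    "team-a-template", "template.yaml", "my-project-id", "us-central1"
)
print(shlex.join(command))

# To execute the command, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a file path contains spaces.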

Launch and manage notebooks

After installing the Dataproc JupyterLab plugin, you can click template cards on the JupyterLab Launcher page to:

Launch a Jupyter notebook on Dataproc Serverless.
Launch a Jupyter notebook on a Dataproc on Compute Engine cluster.

Launch a Jupyter notebook on Dataproc Serverless

The Dataproc Serverless Notebooks section on the JupyterLab Launcher page displays notebook template cards that map to Dataproc Serverless runtime templates (see Create a Dataproc Serverless runtime template).

Click a card to create a Dataproc Serverless session and launch a notebook. When session creation is complete and the notebook kernel is ready to use, the kernel status changes from Unknown to Idle.

Write and test notebook code.

Copy and paste the following PySpark Pi estimation code in the PySpark notebook cell, then press Shift+Return to run the code.

import random
    
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
    
count = sc.parallelize(range(0, 10000)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / 10000))

Notebook result:

After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.
- To reuse the session, create a new notebook by choosing Notebook from the File > New menu. After the new notebook is created, choose the existing session from the kernel selection dialog. The new notebook will reuse the session and retain the session context from the previous notebook.
If you don't terminate the session, Dataproc terminates the session when the session idle timer expires. You can configure the session idle time in the runtime template configuration. The default session idle time is one hour.
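
To see what the Pi estimation cell computes without starting a session, the same Monte Carlo logic can be run in plain Python. This is a local sketch, not the notebook code: the Spark call `sc.parallelize(...).filter(inside).count()` is replaced with an ordinary loop, and the `estimate_pi` function and its `seed` parameter are hypothetical additions.

```python
import random


def inside(_):
    """Return True if a random point in the unit square lies inside the unit circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1


def estimate_pi(samples=10000, seed=None):
    """Estimate Pi as 4 times the fraction of sampled points inside the circle."""
    if seed is not None:
        random.seed(seed)  # optional, for repeatable runs
    count = sum(1 for i in range(samples) if inside(i))
    return 4.0 * count / samples


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi())
```

In the notebook, Spark distributes the `filter` and `count` steps across executors; the local version performs the same sampling on a single machine.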

Launch a notebook on a Dataproc on Compute Engine cluster

If you created a Dataproc on Compute Engine Jupyter cluster, the JupyterLab Launcher page contains a Dataproc Cluster Notebook section with pre-installed kernel cards.

To launch a Jupyter notebook on your Dataproc on Compute Engine cluster:

Click a card in the Dataproc Cluster Notebook section.
When the kernel status changes from Unknown to Idle, you can start writing and executing notebook code.
After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

Manage input and output files in Cloud Storage

Analyzing exploratory data and building ML models often involves file-based inputs and outputs. Dataproc Serverless accesses these files on Cloud Storage.

To access the Cloud Storage browser, click the Cloud Storage browser icon in the JupyterLab Launcher page sidebar, then double-click a folder to view its contents.
You can click Jupyter-supported file types to open and edit them. When you save changes to the files, they are written to Cloud Storage.
To create a new Cloud Storage folder, click the new folder icon, then enter the name of the folder.
To upload files into a Cloud Storage bucket or a folder, click the upload icon, then select the files to upload.

Develop Spark notebook code

After installing the Dataproc JupyterLab plugin, you can launch Jupyter notebooks from the JupyterLab Launcher page to develop application code.

PySpark and Python code development

Dataproc Serverless and Dataproc on Compute Engine clusters support PySpark kernels. Dataproc on Compute Engine also supports Python kernels.

Click a PySpark card in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a PySpark notebook.
Click a Python kernel card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a Python notebook.

SQL code development

To open a PySpark notebook to write and execute SQL code, on the JupyterLab Launcher page, in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section, click the PySpark kernel card.

Spark SQL magic: Since the PySpark kernel that launches Dataproc Serverless Notebooks is preloaded with Spark SQL magic, instead of using spark.sql('SQL STATEMENT').show() to wrap your SQL statement, you can type %%sparksql magic at the top of a cell, then type your SQL statement in the cell.

BigQuery SQL: The BigQuery Spark connector allows your notebook code to load data from BigQuery tables, perform analysis in Spark, and then write the results to a BigQuery table.

The Dataproc Serverless 2.1 runtime includes the BigQuery Spark connector. If you use the Dataproc Serverless 2.0 or earlier runtime to launch Dataproc Serverless notebooks, you can install the BigQuery Spark connector by adding the following Spark property to your Dataproc Serverless runtime template:

spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
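
The runtime rule above can be written as a small helper that returns the Spark properties to add to a runtime template. This is an illustrative sketch: the function name `bigquery_connector_properties` is hypothetical, and treating runtimes later than 2.1 the same as 2.1 is an assumption that this document does not confirm.

```python
CONNECTOR_JAR = (
    "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar"
)


def bigquery_connector_properties(runtime_version):
    """Return the extra Spark properties needed for the given runtime version."""
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    if (major, minor) >= (2, 1):
        # The 2.1 runtime already includes the connector.
        # Assumption: later runtimes do as well.
        return {}
    # Runtime 2.0 or earlier: add the connector JAR through spark.jars.
    return {"spark.jars": CONNECTOR_JAR}


if __name__ == "__main__":
    print(bigquery_connector_properties("2.0"))
    print(bigquery_connector_properties("2.1"))
```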

Scala code development

Dataproc on Compute Engine clusters created with image versions 2.0+, 2.1+, and later include Apache Toree, a Scala kernel for the Jupyter Notebook platform that provides interactive access to Spark.

Click the Apache Toree card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a notebook for Scala code development.

Figure 1. Apache Toree kernel card in the JupyterLab Launcher page.

Metadata explorer

If a Dataproc Metastore (DPMS) instance is attached to a Dataproc Serverless runtime template or a Dataproc on Compute Engine cluster, the DPMS instance schema is displayed in the JupyterLab Metadata Explorer when a notebook is opened. DPMS is a fully-managed and horizontally-scalable Hive Metastore (HMS) service on Google Cloud.

To view HMS metadata in the Metadata Explorer:

Enable the Data Catalog API in your project.
Enable Data Catalog sync in your DPMS service.
Specify a DPMS instance when you create the Dataproc Serverless runtime template or create the Dataproc on Compute Engine cluster.

To open the JupyterLab Metadata Explorer, click its icon in the sidebar.

You can search for a database, table, or column in the Metadata Explorer. Click a database, table, or column name to view the associated metadata.

Deploy your code

After installing the Dataproc JupyterLab plugin, you can use JupyterLab to:

Execute your notebook code on the Dataproc Serverless infrastructure
Submit batch jobs to the Dataproc Serverless infrastructure or to your Dataproc on Compute Engine cluster.

Run notebook code on Dataproc Serverless

To run code in a notebook cell, click Run or press Shift+Return.
To run code in one or more notebook cells, use the Run menu.

Submit a batch job to Dataproc Serverless

Click the Serverless card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.
Click the Batch tab, then click Create Batch and fill in the Batch Info fields.
Click Submit to submit the job.

Submit a batch job to a Dataproc on Compute Engine cluster

Click the Clusters card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.
Click the Jobs tab, then click Submit Job.
Select a Cluster, then fill in the Job fields.
Click Submit to submit the job.
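
The plugin forms are the route this document describes for both kinds of submission. For comparison, the same submissions can be made with the gcloud CLI, which this document does not otherwise cover. The sketch below only builds and prints the commands; the helper names and the sample values (`gs://my-bucket/job.py`, `my-cluster-id`, `us-central1`) are hypothetical.

```python
import shlex


def serverless_batch_command(main_python_file, region):
    """Build a `gcloud dataproc batches submit pyspark` command (Dataproc Serverless)."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_python_file,
        f"--region={region}",
    ]


def cluster_job_command(main_python_file, cluster, region):
    """Build a `gcloud dataproc jobs submit pyspark` command (Compute Engine cluster)."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_python_file,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


# Hypothetical sample values; replace them with your own.
batch = serverless_batch_command("gs://my-bucket/job.py", "us-central1")
job = cluster_job_command("gs://my-bucket/job.py", "my-cluster-id", "us-central1")
print(shlex.join(batch))
print(shlex.join(job))
```

The difference between the two commands mirrors the forms: a cluster job needs a target cluster, while a serverless batch needs only a region.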

View and manage resources

After installing the Dataproc JupyterLab plugin, you can view and manage Dataproc Serverless and Dataproc on Compute Engine from the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

Click the Dataproc Jobs and Sessions section to show the Clusters and Serverless cards.

To view and manage Dataproc Serverless sessions:

Click the Serverless card.
Click the Sessions tab, then a session ID to open the Session details page to view session properties, view Google Cloud logs in Logs Explorer, and terminate a session. Note: A unique Dataproc Serverless session is created to launch each Dataproc Serverless notebook.

To view and manage Dataproc Serverless batches:

Click the Batches tab to view the list of Dataproc Serverless batches in the current project and region. Click a batch ID to view batch details.

To view and manage Dataproc on Compute Engine clusters:

Click the Clusters card. The Clusters tab is selected to list active Dataproc on Compute Engine clusters in the current project and region. You can click the icons in the Actions column to start, stop, or restart a cluster. Click a cluster name to view cluster details. You can click the icons in the Actions column to clone, stop, or delete a job.

To view and manage Dataproc on Compute Engine jobs:

Click the Jobs card to view the list of jobs in the current project. Click a job ID to view job details.