Use the Dataproc JupyterLab plugin for serverless batch and interactive notebook sessions

This document describes how to install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance. It also describes how to develop and deploy Spark notebook code.

Once you install the Dataproc JupyterLab plugin, you can use it to do the following tasks:

  • Launch Dataproc Serverless for Spark interactive notebook sessions
  • Submit Dataproc Serverless batch jobs

Dataproc Serverless limitations and considerations

  • Spark jobs are executed with the service account identity, not the submitting user's identity.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Enable the Dataproc API.

    Enable the API

  4. Install the Google Cloud CLI.
  5. To initialize the gcloud CLI, run the following command:

    gcloud init

Install the Dataproc JupyterLab plugin

You can install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance.

To install the plugin, follow these steps:

  1. Download and install Python version 3.8 or higher from python.org/downloads.

    • Verify the Python 3.8+ installation.

      python3 --version
  2. Install JupyterLab 3.6.3+ on your machine.

    pip3 install --upgrade jupyterlab
    • Verify the JupyterLab 3.6.3+ installation.

       pip3 show jupyterlab
  3. Install the Dataproc JupyterLab plugin.

    pip3 install dataproc-jupyter-plugin
    • If your JupyterLab version is earlier than 4.0.0, enable the plugin extension.

       jupyter server extension enable dataproc_jupyter_plugin
  4. Start JupyterLab.

    jupyter lab
    1. The JupyterLab Launcher page opens in your browser. It contains a Dataproc Jobs and Sessions section. It can also contain Dataproc Serverless Notebooks and Dataproc Cluster Notebooks sections if you have access to Dataproc serverless notebooks or Dataproc clusters with the Jupyter optional component running in your project.

    2. By default, your Dataproc Serverless for Spark Interactive session runs in the project and region you set when you ran gcloud init in Before you begin. You can change the project and region settings for your sessions from the JupyterLab Settings > Dataproc Settings page.
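The version prerequisites in steps 1 and 2 can also be checked from Python itself. The helper below is an illustrative sketch, not part of the plugin: the function name is hypothetical, and it assumes JupyterLab was installed with pip so its version is visible to `importlib.metadata`.

```python
import sys
from importlib import metadata

def setup_problems():
    """Return human-readable problems with the local prerequisites
    (Python 3.8+ and JupyterLab 3.6.3+), or an empty list if none found."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python %d.%d is older than 3.8" % sys.version_info[:2])
    try:
        version = metadata.version("jupyterlab")
        # Compare only the leading numeric components of the version string.
        parts = tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
        if parts < (3, 6, 3):
            problems.append("JupyterLab %s is older than 3.6.3" % version)
    except metadata.PackageNotFoundError:
        problems.append("JupyterLab is not installed")
    return problems
```

An empty return value means both checks passed; otherwise each entry describes one failed prerequisite.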

Create a Dataproc Serverless runtime template

Dataproc Serverless runtime templates (also called session templates) contain configuration settings for executing Spark code in a session. You can create and manage runtime templates using JupyterLab or the gcloud CLI.

JupyterLab

  1. Click the New runtime template card in the Dataproc Serverless Notebooks section on the JupyterLab Launcher page.

  2. Fill in the Runtime template form.

  3. Specify a Display name and Description, and then input or confirm the other settings.

    Notes:

    • Network Configuration: The subnetwork must have Private Google Access enabled and must allow subnet communication on all ports (see Dataproc Serverless for Spark network configuration).

      If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:

      • Enable it for Private Google Access, or
      • Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
    • Metastore: To use a Dataproc Metastore service in your sessions, select the metastore project ID, region, and service.

    • Max idle time: The maximum notebook idle time before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).

    • Max session time: The maximum lifetime of a session before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).

    • PHS: You can select an available Persistent Spark History Server (PHS) to access session logs during and after sessions.

    • Spark properties: Click Add Property for each property to set for your serverless Spark sessions. See Spark properties for a listing of supported and unsupported Spark properties, including Spark runtime, resource, and autoscaling properties.

    • Labels: Click Add Label for each label to set on your serverless Spark sessions.

  4. View your runtime templates from the Settings > Dataproc Settings page.

    • You can delete a template from the Action menu for the template.
  5. Click Save.

  6. Open and reload the JupyterLab Launcher page to view the saved notebook template card on the JupyterLab Launcher page.

gcloud

  1. Create a YAML file with your runtime template configuration.

    Simple YAML

    environmentConfig:
      executionConfig:
        networkUri: default
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    description: Team A Development Environment
    

    Complex YAML

    environmentConfig:
      executionConfig:
        serviceAccount: sa1
        # Choose either networkUri or subnetworkUri
        networkUri: default
        subnetworkUri: subnet
        networkTags:
         - tag1
        kmsKey: key1
        idleTtl: 3600s
        ttl: 14400s
        stagingBucket: staging-bucket
      peripheralsConfig:
        metastoreService: projects/my-project-id/locations/us-central1/services/my-metastore-id
        sparkHistoryServerConfig:
          dataprocCluster: projects/my-project-id/regions/us-central1/clusters/my-cluster-id
    jupyterSession:
      kernel: PYTHON
      displayName: Team A
    labels:
      purpose: testing
    runtimeConfig:
      version: "1.1"
      containerImage: gcr.io/my-project-id/my-image:1.0.1
      properties:
        "p1": "v1"
    description: Team A Development Environment
    

    If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:

    • Enable it for Private Google Access, or
    • Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
  2. Create a session (runtime) template from your YAML file by running the following gcloud beta dataproc session-templates import command locally or in Cloud Shell:

    gcloud beta dataproc session-templates import TEMPLATE_ID \
        --source=YAML_FILE \
        --project=PROJECT_ID \
        --location=REGION
    
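Before running the import, you can sanity-check a few fields from the YAML above. The sketch below is a hypothetical pre-check (not part of the plugin or the gcloud CLI): it verifies that `idleTtl` and `ttl` fall within the documented range of 10 minutes to 336 hours, and that `networkUri` and `subnetworkUri` are not both set, as noted in the complex YAML example. The service performs its own authoritative validation on import.

```python
def ttl_in_range(ttl):
    """True if a whole-seconds string such as '3600s' falls in the
    allowed range of 10 minutes to 336 hours (14 days)."""
    if not ttl.endswith("s"):
        raise ValueError("expected a whole-seconds value such as '3600s'")
    seconds = int(ttl[:-1])
    return 10 * 60 <= seconds <= 336 * 3600

def execution_config_errors(cfg):
    """Flag common executionConfig mistakes before importing a template."""
    errors = []
    # networkUri and subnetworkUri are mutually exclusive.
    if "networkUri" in cfg and "subnetworkUri" in cfg:
        errors.append("set networkUri or subnetworkUri, not both")
    for field in ("idleTtl", "ttl"):
        if field in cfg and not ttl_in_range(cfg[field]):
            errors.append("%s outside 10 minutes - 336 hours" % field)
    return errors
```

For example, `execution_config_errors({"networkUri": "default", "idleTtl": "3600s"})` returns an empty list, while a config that sets both `networkUri` and `subnetworkUri` is flagged.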

Launch and manage notebooks

After installing the Dataproc JupyterLab plugin, you can click template cards on the JupyterLab Launcher page to launch and manage notebooks on Dataproc Serverless or on a Dataproc on Compute Engine cluster.

Launch a Jupyter notebook on Dataproc Serverless

The Dataproc Serverless Notebooks section on the JupyterLab Launcher page displays notebook template cards that map to Dataproc Serverless runtime templates (see Create a Dataproc Serverless runtime template).

  1. Click a card to create a Dataproc Serverless session and launch a notebook. When session creation is complete and the notebook kernel is ready to use, the kernel status changes from Unknown to Idle.

  2. Write and test notebook code.

    1. Copy and paste the following PySpark Pi estimation code in the PySpark notebook cell, then press Shift+Return to run the code.

      import random
          
      def inside(p):
          x, y = random.random(), random.random()
          return x*x + y*y < 1
          
      count = sc.parallelize(range(0, 10000)).filter(inside).count()
      print("Pi is roughly %f" % (4.0 * count / 10000))


  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

    • If you don't terminate the session, Dataproc terminates the session when the session idle timer expires. You can configure the session idle time in the runtime template configuration. The default session idle time is one hour.
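The estimator in the example cell can be sanity-checked locally with plain Python before you run it on Spark. This sketch mirrors the PySpark cell, substituting a generator for `sc.parallelize()`; the fixed seed is added here for repeatability and is not part of the original example.

```python
import random

def inside(_):
    # Sample a point in the unit square; True if it lands inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

def estimate_pi(samples=10000, seed=42):
    """Monte Carlo Pi estimate, mirroring the PySpark notebook cell
    but without a Spark session."""
    random.seed(seed)
    count = sum(1 for i in range(samples) if inside(i))
    return 4.0 * count / samples

print("Pi is roughly %f" % estimate_pi())
```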

Launch a notebook on a Dataproc on Compute Engine cluster

If you created a Dataproc on Compute Engine Jupyter cluster, the JupyterLab Launcher page contains a Dataproc Cluster Notebook section with pre-installed kernel cards.

To launch a Jupyter notebook on your Dataproc on Compute Engine cluster:

  1. Click a card in the Dataproc Cluster Notebook section.

  2. When the kernel status changes from Unknown to Idle, you can start writing and executing notebook code.

  3. After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.

Manage input and output files in Cloud Storage

Analyzing exploratory data and building ML models often involves file-based inputs and outputs. Dataproc Serverless accesses these files on Cloud Storage.

  • To access the Cloud Storage browser, click the Cloud Storage browser icon in the JupyterLab Launcher page sidebar, then double-click a folder to view its contents.

  • You can click Jupyter-supported file types to open and edit them. When you save changes to the files, they are written to Cloud Storage.

  • To create a new Cloud Storage folder, click the new folder icon, then enter the name of the folder.

  • To upload files into a Cloud Storage bucket or a folder, click the upload icon, then select the files to upload.

Develop Spark notebook code

After installing the Dataproc JupyterLab plugin, you can launch Jupyter notebooks from the JupyterLab Launcher page to develop application code.

PySpark and Python code development

Dataproc Serverless and Dataproc on Compute Engine clusters support PySpark kernels. Dataproc on Compute Engine also supports Python kernels.

SQL code development

To open a PySpark notebook to write and execute SQL code, on the JupyterLab Launcher page, in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section, click the PySpark kernel card.

Spark SQL magic: The PySpark kernel that launches Dataproc Serverless notebooks is preloaded with Spark SQL magic. Instead of wrapping your SQL statement in spark.sql('SQL STATEMENT').show(), you can type %%sparksql at the top of a cell, then type your SQL statement in the cell.

BigQuery SQL: The BigQuery Spark connector allows your notebook code to load data from BigQuery tables, perform analysis in Spark, and then write the results to a BigQuery table.

The Dataproc Serverless 2.1 runtime includes the BigQuery Spark connector. If you use the Dataproc Serverless 2.0 or earlier runtime to launch Dataproc Serverless notebooks, you can install the Spark BigQuery connector by adding the following Spark property to your Dataproc Serverless runtime template:

spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
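If you need to pin a different connector release in the spark.jars property, the jar path can be composed programmatically. This helper is illustrative (the function name is not from any library); only the 2.12/0.25.2 default shown above is taken from this guide, and other version combinations are assumed to follow the same naming pattern on gs://spark-lib.

```python
def bigquery_connector_jar(scala_version="2.12", connector_version="0.25.2"):
    """Build the Cloud Storage URI for a Spark BigQuery connector jar,
    following the naming pattern of the spark.jars value shown above."""
    return ("gs://spark-lib/bigquery/"
            "spark-bigquery-with-dependencies_%s-%s.jar"
            % (scala_version, connector_version))
```

Calling `bigquery_connector_jar()` with no arguments reproduces the exact URI listed in the spark.jars property above.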

Scala code development

Dataproc on Compute Engine clusters created with image version 2.0 or later include Apache Toree, a Scala kernel for the Jupyter Notebook platform that provides interactive access to Spark.

  • Click the Apache Toree card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a notebook for Scala code development.

    Figure 1. Apache Toree kernel card in the JupyterLab Launcher page.

Metadata explorer

If a Dataproc Metastore (DPMS) instance is attached to a Dataproc Serverless runtime template or a Dataproc on Compute Engine cluster, the DPMS instance schema is displayed in the JupyterLab Metadata Explorer when a notebook is opened. DPMS is a fully-managed and horizontally-scalable Hive Metastore (HMS) service on Google Cloud.

To view HMS metadata in the Metadata Explorer:

  1. Click the Metadata Explorer icon in the sidebar to open it.
  2. Search for a database, table, or column, then click a database, table, or column name to view the associated metadata.

Deploy your code

After installing the Dataproc JupyterLab plugin, you can use JupyterLab to:

  • Execute your notebook code on the Dataproc Serverless infrastructure

  • Submit batch jobs to the Dataproc Serverless infrastructure or to your Dataproc on Compute Engine cluster.

Run notebook code on Dataproc Serverless

  • To run code in a notebook cell, click Run or press Shift+Return.

  • To run code in one or more notebook cells, use the Run menu.

Submit a batch job to Dataproc Serverless

  • Click the Serverless card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Batch tab, then click Create Batch and fill in the Batch Info fields.

  • Click Submit to submit the job.

Submit a batch job to a Dataproc on Compute Engine cluster

  • Click the Clusters card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

  • Click the Jobs tab, then click Submit Job.

  • Select a Cluster, then fill in the Job fields.

  • Click Submit to submit the job.

View and manage resources

After installing the Dataproc JupyterLab plugin, you can view and manage Dataproc Serverless and Dataproc on Compute Engine from the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.

Click the Dataproc Jobs and Sessions section to show the Clusters and Serverless cards.

To view and manage Dataproc Serverless sessions:

  1. Click the Serverless card.
  2. Click the Sessions tab, then click a session ID to open the Session details page, where you can view session properties, view Google Cloud logs in Logs Explorer, and terminate a session. Note: A unique Dataproc Serverless session is created to launch each Dataproc Serverless notebook.

To view and manage Dataproc Serverless batches:

  1. Click the Batches tab to view the list of Dataproc Serverless batches in the current project and region. Click a batch ID to view batch details.

To view and manage Dataproc on Compute Engine clusters:

  1. Click the Clusters card. The Clusters tab is selected to list active Dataproc on Compute Engine clusters in the current project and region. You can click the icons in the Actions column to start, stop, or restart a cluster. Click a cluster name to view cluster details.

To view and manage Dataproc on Compute Engine jobs:

  1. Click the Jobs card to view the list of jobs in the current project. Click a job ID to view job details. You can click the icons in the Actions column to clone, stop, or delete a job.