This document describes how to install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance. It also describes how to develop and deploy Spark notebook code.
Once you install the Dataproc JupyterLab plugin, you can use it to do the following tasks:
- Launch Dataproc Serverless for Spark interactive notebook sessions
- Submit Dataproc Serverless batch jobs
Dataproc Serverless limitations and considerations
- Spark jobs are executed with the service account identity, not the submitting user's identity.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Enable the Dataproc API.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:

  gcloud init
- Make sure the regional VPC subnet where you will run your Dataproc Serverless interactive session has Private Google Access enabled. For more information, see Create a Dataproc Serverless runtime template.
Install the Dataproc JupyterLab plugin
You can install and use the Dataproc JupyterLab plugin on a machine or VM that has access to Google services, such as your local machine or a Compute Engine VM instance.
To install the plugin, follow these steps:
1. Download and install Python version 3.8 or higher from python.org/downloads.

2. Verify the Python 3.8+ installation.

   python3 --version

3. Install JupyterLab 3.6.3+ on your machine.

   pip3 install --upgrade jupyterlab

4. Verify the JupyterLab 3.6.3+ installation.

   pip3 show jupyterlab

5. Install the Dataproc JupyterLab plugin.

   pip3 install dataproc-jupyter-plugin

6. If your JupyterLab version is earlier than 4.0.0, enable the plugin extension.

   jupyter server extension enable dataproc_jupyter_plugin

7. Launch JupyterLab.

   jupyter lab
The JupyterLab Launcher page opens in your browser. It contains a Dataproc Jobs and Sessions section. It can also contain Dataproc Serverless Notebooks and Dataproc Cluster Notebooks sections if you have access to Dataproc serverless notebooks or Dataproc clusters with the Jupyter optional component running in your project.
By default, your Dataproc Serverless for Spark interactive session runs in the project and region you set when you ran gcloud init in Before you begin. You can change the project and region settings for your sessions from the JupyterLab Settings > Dataproc Settings page.
Create a Dataproc Serverless runtime template
Dataproc Serverless runtime templates (also called session templates) contain configuration settings for executing Spark code in a session. You can create and manage runtime templates using JupyterLab or the gcloud CLI.
JupyterLab
Click the New runtime template card in the Dataproc Serverless Notebooks section on the JupyterLab Launcher page.

Fill in the Runtime template form.
Specify a Display name and Description, and then input or confirm the other settings.
Notes:
Network Configuration: The subnetwork must have Private Google Access enabled and must allow subnet communication on all ports (see Dataproc Serverless for Spark network configuration).
If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:
- Enable it for Private Google Access, or
- Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
Metastore: To use a Dataproc Metastore service in your sessions, select the metastore project ID, region, and service.
Max idle time: The maximum notebook idle time before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
Max session time: The maximum lifetime of a session before the session is terminated. Allowable range: 10 minutes to 336 hours (14 days).
PHS: You can select an available Persistent Spark History Server to allow you to access session logs during and after sessions.
Spark properties: Click Add Property for each property to set for your serverless Spark sessions. See Spark properties for a listing of supported and unsupported Spark properties, including Spark runtime, resource, and autoscaling properties.
Labels: Click Add Label for each label to set on your serverless Spark sessions.
View your runtime templates from the Settings > Dataproc Settings page.
- You can delete a template from the Action menu for the template.
Click Save.
Reload the JupyterLab Launcher page to view the saved notebook template card on the JupyterLab Launcher page.
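The Max idle time and Max session time limits above can be sanity-checked before you save a template. A minimal sketch (the helper name and constants are illustrative, not part of the plugin; the 10-minute to 336-hour range comes from the notes above):

```python
# Hypothetical helper: validates an idle or session lifetime value and
# converts it to the "<seconds>s" string form used in session templates.
MIN_TTL_S = 10 * 60          # 10 minutes, the documented minimum
MAX_TTL_S = 336 * 60 * 60    # 336 hours (14 days), the documented maximum

def ttl_seconds(hours: float = 0, minutes: float = 0) -> str:
    """Return a duration string such as "3600s" if it is in the allowed range."""
    total = int(hours * 3600 + minutes * 60)
    if not MIN_TTL_S <= total <= MAX_TTL_S:
        raise ValueError(
            f"TTL {total}s is outside the allowed range "
            f"[{MIN_TTL_S}s, {MAX_TTL_S}s]"
        )
    return f"{total}s"
```

For example, `ttl_seconds(hours=1)` returns `"3600s"`, a valid value for the template's idle-time field.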
gcloud
Create a YAML file with your runtime template configuration.
Simple YAML
```
environmentConfig:
  executionConfig:
    networkUri: default
jupyterSession:
  kernel: PYTHON
  displayName: Team A
labels:
  purpose: testing
description: Team A Development Environment
```
Complex YAML
```
environmentConfig:
  executionConfig:
    serviceAccount: sa1
    # Choose either networkUri or subnetworkUri
    networkUri: default
    subnetworkUri: subnet
    networkTags:
    - tag1
    kmsKey: key1
    idleTtl: 3600s
    ttl: 14400s
    stagingBucket: staging-bucket
  peripheralsConfig:
    metastoreService: projects/my-project-id/locations/us-central1/services/my-metastore-id
    sparkHistoryServerConfig:
      dataprocCluster: projects/my-project-id/regions/us-central1/clusters/my-cluster-id
jupyterSession:
  kernel: PYTHON
  displayName: Team A
labels:
  purpose: testing
runtimeConfig:
  version: "1.1"
  containerImage: gcr.io/my-project-id/my-image:1.0.1
  properties:
    "p1": "v1"
description: Team A Development Environment
```
If the default network's subnet for the region you configured when you ran gcloud init in Before you begin is not enabled for Private Google Access:
- Enable it for Private Google Access, or
- Select another network with a regional subnetwork that has Private Google Access enabled. You can change the region that Dataproc Serverless uses from the JupyterLab Settings > Dataproc Settings page.
Create a session (runtime) template from your YAML file by running the following gcloud beta dataproc session-templates import command locally or in Cloud Shell:
```
gcloud beta dataproc session-templates import TEMPLATE_ID \
    --source=YAML_FILE \
    --project=PROJECT_ID \
    --location=REGION
```
- See gcloud beta dataproc session-templates for commands to describe, list, export, and delete session templates.
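The YAML passed to the import command can also be generated programmatically. A minimal sketch, assuming only the field names shown in the sample templates above (the `make_template` helper is illustrative; convert the resulting dictionary to YAML before importing it):

```python
import json

# Illustrative builder for a session-template payload. Field names mirror
# the sample YAML shown earlier in this section.
def make_template(display_name, network="default", idle_ttl_s=3600):
    return {
        "environmentConfig": {
            "executionConfig": {
                "networkUri": network,
                "idleTtl": f"{idle_ttl_s}s",
            }
        },
        "jupyterSession": {
            "kernel": "PYTHON",
            "displayName": display_name,
        },
        "labels": {"purpose": "testing"},
        "description": f"{display_name} Development Environment",
    }

template = make_template("Team A")
print(json.dumps(template, indent=2))
```

Keeping templates in source control and generating them this way makes it easier to stamp out per-team variants with consistent idle and network settings.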
Launch and manage notebooks
After installing the Dataproc JupyterLab plugin, you can click template cards on the JupyterLab Launcher page to launch a Jupyter notebook on Dataproc Serverless or on a Dataproc on Compute Engine cluster.
Launch a Jupyter notebook on Dataproc Serverless
The Dataproc Serverless Notebooks section on the JupyterLab Launcher page displays notebook template cards that map to Dataproc Serverless runtime templates (see Create a Dataproc Serverless runtime template).
Click a card to create a Dataproc Serverless session and launch a notebook. When session creation is complete and the notebook kernel is ready to use, the kernel status changes from Unknown to Idle.

Write and test notebook code. Copy and paste the following PySpark Pi estimation code into the PySpark notebook cell, then press Shift+Return to run the code.

```
import random

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, 10000)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / 10000))
```
Notebook result: the cell prints an approximate value of Pi.
After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.
- If you don't terminate the session, Dataproc terminates the session when the session idle timer expires. You can configure the session idle time in the runtime template configuration. The default session idle time is one hour.
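If you want to check the Monte Carlo logic before running it in a session, the same estimate can be computed locally without Spark. A plain-Python sketch (the function name and seed are illustrative):

```python
import random

# Local, single-process version of the PySpark Pi estimation above:
# sample random points in the unit square and count how many fall
# inside the quarter circle of radius 1.
def estimate_pi(samples: int = 10000, seed: int = 42) -> float:
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / samples

print("Pi is roughly %f" % estimate_pi(100000))
```

The Spark version distributes exactly this loop across executors with `sc.parallelize(...).filter(...).count()`.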
Launch a notebook on a Dataproc on Compute Engine cluster
If you created a Dataproc on Compute Engine Jupyter cluster, the JupyterLab Launcher page contains a Dataproc Cluster Notebook section with pre-installed kernel cards.
To launch a Jupyter notebook on your Dataproc on Compute Engine cluster:
Click a card in the Dataproc Cluster Notebook section.
When the kernel status changes from Unknown to Idle, you can start writing and executing notebook code.

After creating and using a notebook, you can terminate the notebook session by clicking Shut Down Kernel from the Kernel tab.
Manage input and output files in Cloud Storage
Exploratory data analysis and ML model building often involve file-based inputs and outputs. Dataproc Serverless accesses these files on Cloud Storage.
To access the Cloud Storage browser, click the Cloud Storage browser icon in the JupyterLab Launcher page sidebar, then double-click a folder to view its contents.
You can click Jupyter-supported file types to open and edit them. When you save changes to the files, they are written to Cloud Storage.
To create a new Cloud Storage folder, click the new folder icon, then enter the name of the folder.
To upload files into a Cloud Storage bucket or a folder, click the upload icon, then select the files to upload.
Develop Spark notebook code
After installing the Dataproc JupyterLab plugin, you can launch Jupyter notebooks from the JupyterLab Launcher page to develop application code.
PySpark and Python code development
Dataproc Serverless and Dataproc on Compute Engine clusters support PySpark kernels. Dataproc on Compute Engine also supports Python kernels.
Click a PySpark card in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a PySpark notebook.
Click a Python kernel card in the Dataproc Cluster Notebook section on the JupyterLab Launcher page to open a Python notebook.
SQL code development
To open a PySpark notebook to write and execute SQL code, on the JupyterLab Launcher page, in the Dataproc Serverless Notebooks or Dataproc Cluster Notebook section, click the PySpark kernel card.
Spark SQL magic: The PySpark kernel that launches Dataproc Serverless notebooks is preloaded with Spark SQL magic. Instead of wrapping your SQL statement in spark.sql('SQL STATEMENT').show(), you can type %%sparksql at the top of a cell, then type your SQL statement in the cell.
BigQuery SQL: The BigQuery Spark connector allows your notebook code to load data from BigQuery tables, perform analysis in Spark, and then write the results to a BigQuery table.
The Dataproc Serverless 2.1 runtime includes the BigQuery Spark connector. If you use the Dataproc Serverless 2.0 or earlier runtime to launch Dataproc Serverless notebooks, you can install Spark BigQuery Connector by adding the following Spark property to your Dataproc Serverless runtime template:
spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar
Scala code development
Dataproc on Compute Engine clusters created with image version 2.0 or later include Apache Toree, a Scala kernel for the Jupyter Notebook platform that provides interactive access to Spark.
Click the Apache Toree card in the Dataproc cluster Notebook section on the JupyterLab Launcher page to open a notebook for Scala code development.
Figure 1. Apache Toree kernel card in the JupyterLab Launcher page.
Metadata explorer
If a Dataproc Metastore (DPMS) instance is attached to a Dataproc Serverless runtime template or a Dataproc on Compute Engine cluster, the DPMS instance schema is displayed in the JupyterLab Metadata Explorer when a notebook is opened. DPMS is a fully-managed and horizontally-scalable Hive Metastore (HMS) service on Google Cloud.
To view HMS metadata in the Metadata Explorer:
Enable the Data Catalog API in your project.
Enable Data Catalog sync in your DPMS service.
Specify a DPMS instance when you create the Dataproc Serverless runtime template or create the Dataproc on Compute Engine cluster.
To open the JupyterLab Metadata Explorer, click its icon in the sidebar.
You can search for a database, table, or column in the Metadata Explorer. Click a database, table, or column name to view the associated metadata.
Deploy your code
After installing the Dataproc JupyterLab plugin, you can use JupyterLab to:
Execute your notebook code on the Dataproc Serverless infrastructure
Submit batch jobs to the Dataproc Serverless infrastructure or to your Dataproc on Compute Engine cluster.
Run notebook code on Dataproc Serverless
To run code in a notebook cell, click Run or press Shift+Return.
To run code in one or more notebook cells, use the Run menu.
Submit a batch job to Dataproc Serverless
Click the Serverless card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.
Click the Batch tab, then click Create Batch and fill in the Batch Info fields.
Click Submit to submit the job.
Submit a batch job to a Dataproc on Compute Engine cluster
Click the Clusters card in the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.
Click the Jobs tab, then click Submit Job.
Select a Cluster, then fill in the Job fields.
Click Submit to submit the job.
View and manage resources
After installing the Dataproc JupyterLab plugin, you can view and manage Dataproc Serverless and Dataproc on Compute Engine resources from the Dataproc Jobs and Sessions section on the JupyterLab Launcher page.
Click the Dataproc Jobs and Sessions section to show the Clusters and Serverless cards.
To view and manage Dataproc Serverless sessions:
- Click the Serverless card.
- Click the Sessions tab, then a session ID to open the Session details page to view session properties, view Google Cloud logs in Logs Explorer, and terminate a session. Note: A unique Dataproc Serverless session is created to launch each Dataproc Serverless notebook.
To view and manage Dataproc Serverless batches:
- Click the Batches tab to view the list of Dataproc Serverless batches in the current project and region. Click a batch ID to view batch details.
To view and manage Dataproc on Compute Engine clusters:
- Click the Clusters card. The Clusters tab is selected to list active Dataproc on Compute Engine clusters in the current project and region. You can click the icons in the Actions column to start, stop, or restart a cluster. Click a cluster name to view cluster details.
To view and manage Dataproc on Compute Engine jobs:
- Click the Jobs card to view the list of jobs in the current project. Click a job ID to view job details. You can click the icons in the Actions column to clone, stop, or delete a job.