You can install additional components like Jupyter when you create a Dataproc cluster using the Optional components feature. This page describes the Jupyter component.
The Jupyter component is a Web-based single-user notebook for interactive data analytics and supports the JupyterLab Web UI. The Jupyter Web UI is available on port 8123 on the cluster's first master node.
Launch notebooks for multiple users. You can create a Dataproc-enabled Vertex AI Workbench instance or install the Dataproc JupyterLab plugin on a VM to serve notebooks to multiple users.
Configure Jupyter. Jupyter can be configured by providing dataproc:jupyter cluster properties.
To reduce the risk of remote code execution over unsecured notebook server APIs, the default dataproc:jupyter.listen.all.interfaces cluster property setting is false, which restricts connections to localhost (127.0.0.1) when the Component Gateway is enabled (Component Gateway activation is required when installing the Jupyter component).
The Jupyter notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. The location can be changed at cluster creation time using the dataproc:jupyter.notebook.gcs.dir cluster property.
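For example, a cluster creation command along the following lines sets a custom notebook location with the --properties flag (the bucket and path shown are placeholders):
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --properties="dataproc:jupyter.notebook.gcs.dir=gs://your-bucket/notebooks"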
Work with data files. You can use a Jupyter notebook to work with data files that have been uploaded to Cloud Storage. Since the Cloud Storage connector is pre-installed on a Dataproc cluster, you can reference the files directly in your notebook. Here's an example that accesses CSV files in Cloud Storage:
df = spark.read.csv("gs://bucket/path/file.csv")
df.show()
See Generic Load and Save Functions for PySpark examples.
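As a related sketch, the Cloud Storage connector also lets you write results back to Cloud Storage from the notebook; the destination path below is a placeholder:
# Write the DataFrame back to Cloud Storage in Parquet format.
df.write.parquet("gs://bucket/path/output/")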
Install Jupyter
Install the component when you create a Dataproc cluster. The Jupyter component requires activation of the Dataproc Component Gateway.
Console
- Enable the component.
- In the Google Cloud console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
- In the Components section:
- Under Optional components, select the Jupyter component.
- Under Component Gateway, select Enable component gateway (see Viewing and Accessing Component Gateway URLs).
gcloud CLI
To create a Dataproc cluster that includes the Jupyter component, use the gcloud dataproc clusters create cluster-name command with the --optional-components flag.
Latest default image version example
The following example installs the Jupyter component on a cluster that uses the latest default image version.
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --region=region \
    --enable-component-gateway \
    ... other flags
REST API
The Jupyter component can be installed through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
- Set the EndpointConfig.enableHttpPortAccess property to true as part of the clusters.create request to enable connecting to the Jupyter notebook Web UI using the Component Gateway.
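For example, the relevant portion of a clusters.create request body might look like the following sketch (the cluster name is a placeholder, and other required cluster fields are omitted):
{
  "clusterName": "cluster-name",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["JUPYTER"]
    },
    "endpointConfig": {
      "enableHttpPortAccess": true
    }
  }
}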
Open the Jupyter and JupyterLab UIs
Click the Google Cloud console Component Gateway links to open the Jupyter notebook or JupyterLab UI running on the cluster master node in your local browser.
Select "GCS" or "Local Disk" to create a new Jupyter Notebook in either location.
Attach GPUs to master and worker nodes
You can add GPUs to your cluster's master and worker nodes when using a Jupyter notebook to:
- Preprocess data in Spark, then collect a DataFrame onto the master and run TensorFlow
- Use Spark to orchestrate TensorFlow runs in parallel
- Run TensorFlow on YARN
- Support other machine learning scenarios that use GPUs
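As an illustrative sketch, GPUs can be attached when you create the cluster by using the gcloud accelerator flags; the accelerator type and count below are placeholders, your project needs GPU quota in the selected region, and you typically also need to install GPU drivers (for example, with an initialization action):
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --worker-accelerator type=nvidia-tesla-t4,count=1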