You can install additional components like Jupyter when you create a Dataproc cluster using the Optional components feature. This page describes the Jupyter component.
The Jupyter notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. The location can be changed at cluster creation time via the
dataproc:jupyter.notebook.gcs.dir cluster property.
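For example, to save notebooks under a different Cloud Storage path, you can set this property when you create the cluster (the cluster name, region, and bucket path below are placeholders):

```shell
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --properties=dataproc:jupyter.notebook.gcs.dir=gs://bucket-name/notebooks
```

Notebooks created in the Cloud Storage location from the Jupyter UI are then stored under that path.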
Install the component when you create a Dataproc cluster. The Jupyter component requires activation of the Dataproc Component Gateway. When using image version 1.5, installation of the Jupyter component also requires installation of the Anaconda component.
- Enable the component.
- In the Google Cloud console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
- In the Components section:
- Under Optional components, select the Jupyter component, and, if using image version 1.5, the Anaconda component.
- Under Component Gateway, select Enable component gateway (see Viewing and Accessing Component Gateway URLs).
To create a Dataproc cluster that includes the Jupyter component, use the
gcloud dataproc clusters create cluster-name command with the --optional-components flag.
Latest default image version example
The following example installs the Jupyter component on a cluster that uses the latest default image version.
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --region=region \
    --enable-component-gateway \
    ... other flags
1.5 image version example
The following 1.5 image version example installs both the Jupyter and Anaconda components (Anaconda component installation is required when using image version 1.5).
gcloud dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --region=region \
    --image-version=1.5 \
    --enable-component-gateway \
    ... other flags
The Jupyter component
can be installed through the Dataproc API by specifying the component in
SoftwareConfig.Component
as part of a
clusters.create
request (Anaconda component installation is also required when using image version 1.5).
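As a sketch, a clusters.create request body that enables the Jupyter component and the Component Gateway on a 1.5 image cluster might look like the following (the cluster name is a placeholder; field names follow the Dataproc v1 REST API):

```json
{
  "clusterName": "cluster-name",
  "config": {
    "softwareConfig": {
      "imageVersion": "1.5",
      "optionalComponents": ["ANACONDA", "JUPYTER"]
    },
    "endpointConfig": {
      "enableHttpPortAccess": true
    }
  }
}
```

When using the latest default image, omit the imageVersion field and the ANACONDA component, since Anaconda is only required for image version 1.5.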
Open the Jupyter and JupyterLab UIs
Click the Component Gateway links in the Google Cloud console to open, in your local browser, the Jupyter notebook or JupyterLab UI running on the cluster's master node.
Select "GCS" or "Local Disk" to create a new Jupyter Notebook in either location.
Attaching GPUs to master and worker nodes
You can add GPUs to your cluster's master and worker nodes when using a Jupyter notebook to:
- Preprocess data in Spark, then collect a DataFrame onto the master and run TensorFlow
- Use Spark to orchestrate TensorFlow runs in parallel
- Run TensorFlow-on-YARN
- Use with other machine learning scenarios that use GPUs
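As an illustration, GPUs can be attached at cluster creation with the accelerator flags of gcloud dataproc clusters create (the accelerator type and counts below are placeholders; available GPU types vary by region):

```shell
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --region=region \
    --master-accelerator type=nvidia-tesla-t4,count=1 \
    --worker-accelerator type=nvidia-tesla-t4,count=1
```

Note that GPU drivers must also be installed on the nodes, for example via an initialization action, before the attached GPUs can be used.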