Dataproc Jupyter Component

You can install additional components when you create a Dataproc cluster using the Optional Components feature. This page describes the Jupyter component.

The Jupyter component is a Web-based notebook for interactive data analytics and supports the JupyterLab Web UI. The Jupyter Web UI is available on port 8123 on the cluster's first master node.

The Jupyter notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Dataproc staging bucket, which is specified by the user or auto-created when the cluster is created. The location can be changed at cluster creation time via the dataproc:jupyter.notebook.gcs.dir property.
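As a minimal sketch of setting this property, the following command passes it through the --properties flag at cluster creation. The bucket path gs://my-bucket/notebooks and the shell variable names are hypothetical placeholders, and the leading echo only prints the command so that it can be reviewed before it is run.

```shell
# Hypothetical placeholder values; replace them with your own.
CLUSTER_NAME="cluster-name"
REGION="region"
NOTEBOOK_DIR="gs://my-bucket/notebooks"   # assumed Cloud Storage folder for notebooks

# Cluster properties use the form prefix:property=value.
PROPERTIES="dataproc:jupyter.notebook.gcs.dir=${NOTEBOOK_DIR}"

# Dry run: prints the command. Remove the leading "echo" to create the cluster.
# Omit ANACONDA when using the preview 2.0 image.
echo gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --properties="${PROPERTIES}"
```

Because the property is applied at cluster creation time, it has to be included in the original create command.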

Install Jupyter

Install the component when you create a Dataproc cluster. Components can be added to clusters created with Dataproc version 1.3 and later. On every Dataproc image version except the preview 2.0 image, the Jupyter component requires the Anaconda component. On the preview 2.0 image, the Anaconda component is neither needed nor available.

See Supported Dataproc versions for the component version included in each Dataproc image release.

gcloud command

To create a Dataproc cluster that includes the Jupyter component, use the gcloud dataproc clusters create cluster-name command with the --optional-components flag. The example below installs both the Jupyter and Anaconda components (Anaconda component installation is not needed or available when using the preview 2.0 image).

gcloud dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --region=region \
    --enable-component-gateway \
    ... other flags

REST API

The Jupyter and Anaconda components can be specified through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request (Anaconda component installation is not needed or available when using the preview 2.0 image).
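A clusters.create request body along these lines might look like the sketch below. It assumes the v1 REST endpoint and uses hypothetical placeholder values for the project, region, and cluster name. The enableHttpPortAccess field is included on the assumption that you also want the Component Gateway. The leading echo only prints the curl command for review.

```shell
# Hypothetical placeholder values; replace them with your own.
PROJECT_ID="project-id"
REGION="region"
CLUSTER_NAME="cluster-name"

# Request body: optionalComponents lists the components to install, and
# enableHttpPortAccess turns on the Component Gateway.
# Omit "ANACONDA" when using the preview 2.0 image.
cat > request.json <<EOF
{
  "projectId": "${PROJECT_ID}",
  "clusterName": "${CLUSTER_NAME}",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["ANACONDA", "JUPYTER"]
    },
    "endpointConfig": {
      "enableHttpPortAccess": true
    }
  }
}
EOF

# clusters.create endpoint for the chosen project and region.
URL="https://dataproc.googleapis.com/v1/projects/${PROJECT_ID}/regions/${REGION}/clusters"

# Dry run: prints the request. To send it, remove the leading "echo" and replace
# ACCESS_TOKEN with the output of: gcloud auth print-access-token
echo curl -X POST \
    -H "Authorization: Bearer ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "${URL}"
```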

Console

  1. Enable the component and component gateway.
    • In the Cloud Console, open the Dataproc Create a cluster page. The Set up cluster panel is selected.
    • In the Components section:
      • Under Optional components, select Anaconda, Jupyter, and other optional components to install on your cluster. NOTE: If using the preview 2.0 image, Anaconda component installation is not needed or available.
      • Under Component Gateway, select Enable component gateway (see Viewing and Accessing Component Gateway URLs).

Open the Jupyter and JupyterLab UIs

In the Cloud Console, click the Component Gateway links to open the Jupyter notebook or JupyterLab UI in your local browser. Both UIs run on your cluster's master node.

Select "GCS" or "Local Disk" to create a new Jupyter Notebook in either location.

Attach GPUs to master and worker nodes

You can add GPUs to your cluster's master and worker nodes when using a Jupyter notebook to:

  1. Preprocess data in Spark, then collect a DataFrame onto the master and run TensorFlow
  2. Use Spark to orchestrate TensorFlow runs in parallel
  3. Run TensorFlow-on-YARN
  4. Run other machine learning workloads that use GPUs
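For scenarios like these, one way to attach GPUs is with the --master-accelerator and --worker-accelerator flags of gcloud dataproc clusters create. The sketch below is illustrative only: the accelerator type nvidia-tesla-t4 and the count of 1 are assumptions, since availability depends on the zone, and the variable names are hypothetical. Attaching an accelerator does not by itself install GPU drivers, and that step is outside this sketch.

```shell
# Hypothetical placeholder values; replace them with your own.
CLUSTER_NAME="cluster-name"
REGION="region"
GPU_TYPE="nvidia-tesla-t4"   # assumed accelerator type; availability varies by zone
GPU_COUNT=1                  # assumed number of GPUs per node

# Accelerator flags take the form type=TYPE,count=COUNT.
ACCELERATOR="type=${GPU_TYPE},count=${GPU_COUNT}"

# Dry run: prints the command. Remove the leading "echo" to create the cluster.
# Omit ANACONDA when using the preview 2.0 image.
echo gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway \
    --master-accelerator="${ACCELERATOR}" \
    --worker-accelerator="${ACCELERATOR}"
```

Drop --master-accelerator or --worker-accelerator to attach GPUs to only one kind of node.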