Use Dataproc Hub

Use Dataproc Hub to open the JupyterLab UI on a single-user Dataproc cluster.

Objectives

  1. Use Dataproc Hub to create a JupyterLab notebook environment running on a single-user Dataproc cluster.

  2. Create a notebook and run a Spark job on the Dataproc cluster.

  3. Delete your cluster and preserve your notebook in Cloud Storage.

Before you begin

  1. You must have access to a Dataproc Hub instance created by an administrator. Make sure the admin has granted you the roles/iam.serviceAccountUser role for the service account running on the Hub VM.
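If you are the administrator setting up that access, one way to make the grant is with the gcloud CLI; the sketch below uses SERVICE_ACCOUNT_EMAIL and USER_EMAIL as placeholders for the Hub VM's service account and the user's account.

    gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
        --member="user:USER_EMAIL" \
        --role="roles/iam.serviceAccountUser"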

Open a JupyterLab Notebook UI on a Dataproc cluster

  1. On the Dataproc→Notebooks instances page in the Cloud Console, click OPEN JUPYTERLAB in the row that lists the Dataproc Hub instance created by an administrator.

  2. On the JupyterHub page, select a cluster configuration and zone, specify any customizations, then click Start.

    After the cluster is created, you are redirected to the JupyterLab UI running on the Dataproc cluster.
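If you want to confirm the single-user cluster from a terminal, you can describe it with the gcloud CLI; CLUSTER_NAME and REGION below are placeholders that depend on the configuration and zone you selected.

    gcloud dataproc clusters describe CLUSTER_NAME --region=REGION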

Create a notebook and run a Spark job

  1. From the JupyterLab UI, create a PySpark notebook.

  2. The PySpark kernel initializes a SparkContext (using the sc variable). You can examine the SparkContext and run a Spark job from the notebook; a short example of inspecting the context appears after these steps.

    # Word-count job: map each word to (word, 1), then sum the counts per word.
    rdd = (sc.parallelize(['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'lorem'])
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b))
    print(rdd.collect())
    
  3. Name and save the notebook. The notebook is saved and remains in Cloud Storage after the Dataproc cluster is deleted.
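As noted in step 2, the sc variable is available as soon as the PySpark kernel starts. The following cell is a minimal sketch of inspecting it; the attributes shown are standard PySpark properties, not Dataproc-specific.

    # Inspect the SparkContext created by the PySpark kernel.
    print(sc.version)             # Spark version installed on the cluster
    print(sc.master)              # cluster manager (YARN on Dataproc)
    print(sc.applicationId)       # ID of the running Spark application
    print(sc.defaultParallelism)  # default number of partitions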

Delete the Dataproc cluster

  1. Select File→Hub Control Panel to navigate to the Dataproc Hub UI.

  2. Click Stop My Server to shut down the notebook server and delete the Dataproc cluster.
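To confirm the cluster was deleted, you can list clusters from a terminal with the gcloud CLI; REGION below is a placeholder for the region the cluster ran in.

    gcloud dataproc clusters list --region=REGION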

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, delete the Dataproc cluster as described in the preceding section. The notebook saved in Cloud Storage is not deleted with the cluster; delete it if you no longer need it.
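If you decide to remove the saved notebook, one way is with gsutil; the bucket and path below are placeholders, since the actual location depends on where your Dataproc Hub configuration stores notebooks.

    gsutil rm gs://BUCKET_NAME/NOTEBOOK_PATH/notebook-name.ipynb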
