Use Dataproc Hub

Use Dataproc Hub to open the JupyterLab UI on a single-user Dataproc cluster.

Objectives

  1. Use Dataproc Hub to create a JupyterLab notebook environment running on a single-user Dataproc cluster.

  2. Create a notebook and run a Spark job on the Dataproc cluster.

  3. Delete your cluster and preserve your notebook in Cloud Storage.

Before you begin

  1. You must have access to a Dataproc Hub instance created by an administrator. Make sure the administrator has granted you the roles/iam.serviceAccountUser role on the service account that the Hub VM runs as.

Open a JupyterLab Notebook UI on a Dataproc cluster

  1. On the Dataproc→Notebooks instances page in the Cloud Console, click OPEN JUPYTERLAB in the row that lists the Dataproc Hub instance created by an administrator.

  2. On the JupyterHub page, select a cluster configuration and zone, specify any customizations, then click Start.

    After the cluster is created, you are redirected to the JupyterLab UI running on the Dataproc cluster.

Create a notebook and run a Spark job

  1. From the JupyterLab UI, create a PySpark notebook.

  2. The PySpark kernel initializes a SparkContext (using the sc variable). You can examine the SparkContext and run a Spark job from the notebook.

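    # Run a simple word count: map each word to (word, 1), then sum the counts per word.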
    rdd = (sc.parallelize(['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'lorem'])
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b))
    print(rdd.collect())
    
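    You can also inspect the SparkContext itself. The following is a minimal sketch; sc.version, sc.master, and sc.defaultParallelism are standard SparkContext attributes, and the output depends on your cluster.

    # Inspect the SparkContext created by the PySpark kernel.
    print(sc.version)               # Spark version installed on the cluster
    print(sc.master)                # cluster manager the context is connected to (typically YARN on Dataproc)
    print(sc.defaultParallelism)    # default number of partitions for RDD operations
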
  3. Name and save the notebook. The notebook remains in Cloud Storage after the Dataproc cluster is deleted.

Delete the Dataproc cluster

  1. Select File→Hub Control Panel to navigate to the Dataproc Hub UI.

  2. Click Stop My Server to delete the Dataproc cluster.

What's next