Use Dataproc Hub to open the JupyterLab UI on a single-user Dataproc cluster.
Objectives
Use Dataproc Hub to create a JupyterLab notebook environment running on a single-user Dataproc cluster.
Create a notebook and run a Spark job on the Dataproc cluster.
Delete your cluster and preserve your notebook in Cloud Storage.
Before you begin
- The administrator must have granted you the notebooks.instances.use permission (see Set Identity and Access Management (IAM) roles).
Open a JupyterLab Notebook UI on a Dataproc cluster
Open the Dataproc Hub UI:
- If you have access to the Google Cloud console, on the
Dataproc→Notebooks instances
page in the Google Cloud console, click OPEN JUPYTERLAB in the row that
lists the Dataproc Hub instance created by an administrator.
- If you do not have access to the Google Cloud console, from your web browser enter the Dataproc Hub instance URL that the administrator shared with you.
On the JupyterHub page, select a cluster configuration and zone. If enabled, specify any customizations, then click Start.
The cluster takes a few minutes to create. After the cluster is created, you are redirected to the JupyterLab UI running on the Dataproc cluster.
Create a notebook and run a Spark job
On the left panel of the JupyterLab UI, click GCS or local.
Create a PySpark notebook.
The PySpark kernel initializes a SparkContext (using the sc variable). You can examine the SparkContext and run a Spark job from the notebook.

rdd = (sc.parallelize(['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'lorem'])
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b))
print(rdd.collect())
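This job counts word occurrences and prints pairs such as ('lorem', 2); the order of the pairs in the output can vary between runs. Before running a job, you can also inspect the SparkContext itself. A minimal sketch using standard PySpark attributes (the values reported depend on your cluster):

# Inspect the SparkContext created by the PySpark kernel as `sc`.
print(sc.version)             # Spark version running on the cluster
print(sc.master)              # cluster manager (YARN on Dataproc)
print(sc.appName)             # name of this Spark application
print(sc.defaultParallelism)  # default partition count for RDD operations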
Name and save the notebook. The notebook is saved and remains in Cloud Storage after the Dataproc cluster is deleted.
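To verify that a notebook persists after the cluster is deleted, you can list the notebooks location from any environment with the google-cloud-storage client. A minimal sketch, assuming a hypothetical bucket name and prefix; substitute the Cloud Storage location your administrator configured for Dataproc Hub:

from google.cloud import storage

# List saved notebook files. The bucket name and prefix below are
# placeholders, not values defined by Dataproc Hub itself.
client = storage.Client()
for blob in client.list_blobs("your-notebooks-bucket", prefix="notebooks/jupyter/"):
    print(blob.name)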
Shut down the Dataproc cluster
From the JupyterLab UI, select File→Hub Control Panel to open the Dataproc Hub UI.
Click Stop My Cluster to shut down (delete) the Jupyter server, which deletes the Dataproc cluster.
To restart the cluster:
- Go to the Dataproc→Notebooks instances page in the Google Cloud console, then click OPEN JUPYTERLAB in the row that lists the Dataproc Hub instance to open the Dataproc Hub UI.
- Select a cluster configuration and zone to spawn a new cluster.
What's next
- Explore Spark and Jupyter Notebooks on Dataproc on GitHub.