Use Dataproc Hub to open the JupyterLab UI on a single-user Dataproc cluster.
Use Dataproc Hub to create a JupyterLab notebook environment running on a single-user Dataproc cluster.
Create a notebook and run a Spark job on the Dataproc cluster.
Delete your cluster and preserve your notebook in Cloud Storage.
Before you begin
Open a JupyterLab Notebook UI on a Dataproc cluster
On the Dataproc→Notebooks instances page in the Cloud Console, click OPEN JUPYTERLAB in the row that lists the Dataproc Hub instance created by an administrator.
On the JupyterHub page, select a cluster configuration and zone, specify any customizations, then click Start.
After the cluster is created, you are redirected to the JupyterLab UI running on the Dataproc cluster.
Create a notebook and run a Spark job
From the JupyterLab UI, create a PySpark notebook.
The PySpark kernel initializes a SparkContext (using the sc variable). You can examine the SparkContext and run a Spark job from the notebook.
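For example, you can print a few standard SparkContext attributes to confirm that the kernel is connected to the cluster (these attributes are part of the public PySpark API):

print(sc.version)             # Spark version running on the cluster
print(sc.master)              # cluster manager, typically yarn on Dataproc
print(sc.defaultParallelism)  # default number of partitions for jobs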
rdd = (sc.parallelize(['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'lorem'])
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b))
print(rdd.collect())
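collect() returns the word counts to the driver as a list of (word, count) tuples, for example [('lorem', 2), ('ipsum', 1), ('dolor', 1), ('sit', 1), ('amet', 1)]; the order of the tuples can vary between runs.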
Name and save the notebook. The notebook remains in Cloud Storage after the Dataproc cluster is deleted.
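Only the notebook file is preserved this way; anything on the cluster's local disks is removed with the cluster. If you also want job output to persist, write it to Cloud Storage from the notebook. A minimal sketch, where gs://your-bucket is a placeholder for a bucket you own:

rdd.saveAsTextFile('gs://your-bucket/wordcount-output')  # placeholder bucket and path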
Delete the Dataproc cluster
Select File→Hub Control Panel to navigate to the Dataproc Hub UI.
Click Stop My Server to shut down the Dataproc cluster.
What's next
- Explore Spark and Jupyter Notebooks on Dataproc on GitHub.