Install and run a Jupyter notebook on a Dataproc cluster

Objectives

This tutorial shows you how to install the Dataproc Jupyter and Anaconda components on a new cluster, and then connect to the Jupyter notebook UI running on the cluster from your local browser using the Dataproc Component Gateway.

Costs

This tutorial uses billable components of Google Cloud, including Dataproc, Compute Engine, and Cloud Storage.

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users may be eligible for a free trial.

Before you begin

If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket.

  1. Set up your project.

    1. Sign in to your Google Account.

      If you don't already have one, sign up for a new account.

    2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

    4. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

    5. Install and initialize the Cloud SDK.

  2. Create a Cloud Storage bucket in your project to store the notebooks you create in this tutorial.

    1. In the Cloud Console, go to the Cloud Storage Browser page.

    2. Click Create bucket.
    3. In the Create bucket dialog, specify the following attributes:
    4. Click Create.
    5. Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/jupyter.
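
As an alternative to the Cloud Console steps above, the bucket can also be created from the command line with gsutil. The following is a minimal sketch under stated assumptions: the bucket name and region are placeholder values, and the `run` helper and `EXECUTE` variable are hypothetical conveniences (not part of the Cloud SDK) that print the command and run it only when `EXECUTE=1` is set.

```shell
# Placeholder values (assumptions); override them in your environment.
BUCKET_NAME="${BUCKET_NAME:-bucket-name}"
REGION="${REGION:-us-central1}"

# Hypothetical helper: print the command, and execute it only when EXECUTE=1.
run() {
  echo "+ $*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# Make a new bucket in the chosen location.
create_bucket() {
  run gsutil mb -l "${REGION}" "gs://${BUCKET_NAME}"
}

create_bucket
```

Review the printed command first, then rerun with `EXECUTE=1` to create the bucket.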

Create a cluster and install the Jupyter component

Create a cluster with the Jupyter and Anaconda components installed.
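
The cluster creation step can be sketched with the gcloud CLI as follows. This is one possible invocation, not the only valid one. The cluster name, bucket name, and region are placeholder values, and the `run` helper and `EXECUTE` variable are hypothetical conveniences that print the command and run it only when `EXECUTE=1` is set. The sketch also assumes that the image version used by your cluster offers both the Anaconda and Jupyter optional components; component availability depends on the image version.

```shell
# Placeholder values (assumptions); override them in your environment.
CLUSTER_NAME="${CLUSTER_NAME:-cluster-name}"
BUCKET_NAME="${BUCKET_NAME:-bucket-name}"
REGION="${REGION:-us-central1}"

# Hypothetical helper: print the command, and execute it only when EXECUTE=1.
run() {
  echo "+ $*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# Create a cluster with the Anaconda and Jupyter optional components and
# the Component Gateway enabled, using the named bucket for cluster storage.
create_cluster() {
  run gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --bucket="${BUCKET_NAME}" \
    --optional-components=ANACONDA,JUPYTER \
    --enable-component-gateway
}

create_cluster
```

The `--enable-component-gateway` flag is what makes the Jupyter links available in the Cloud Console, and `--bucket` names the Cloud Storage bucket where the notebooks are saved.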

Open the Jupyter and JupyterLab UIs

In the Cloud Console, click the Component Gateway links to open the Jupyter notebook or JupyterLab UI running on your cluster's master node.

The top-level directory displayed by your Jupyter instance is a virtual directory that lets you browse either your Cloud Storage bucket or the local file system of your cluster's master node. Click the GCS link to view Cloud Storage, or the Local Disk link to view the local file system.

  1. Click the GCS link. The Jupyter notebook web UI displays notebooks stored in your Cloud Storage bucket, including any notebooks you create in this tutorial.
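
To confirm from the command line that notebooks saved through the GCS link land in your bucket, you can list the notebook folder with gsutil. This is a small sketch that assumes the gs://bucket-name/notebooks/jupyter location mentioned in Before you begin. The bucket name is a placeholder, and the `run` helper and `EXECUTE` variable are hypothetical conveniences that print the command and run it only when `EXECUTE=1` is set.

```shell
# Placeholder value (assumption); override it in your environment.
BUCKET_NAME="${BUCKET_NAME:-bucket-name}"

# Hypothetical helper: print the command, and execute it only when EXECUTE=1.
run() {
  echo "+ $*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# List the notebooks stored under the bucket's Jupyter folder.
list_notebooks() {
  run gsutil ls "gs://${BUCKET_NAME}/notebooks/jupyter"
}

list_notebooks
```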

Cleaning up

After you finish this tutorial, you can clean up the resources that you created on Google Cloud so that they don't take up quota and you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
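
The same deletion can be sketched with the gcloud CLI instead of the Cloud Console. The project ID below is a placeholder, and the `run` helper and `EXECUTE` variable are hypothetical conveniences that print the command and run it only when `EXECUTE=1` is set. Deleting a project removes all resources in it, so check the printed command carefully before executing it.

```shell
# Placeholder value (assumption); set this to the project you want to delete.
PROJECT_ID="${PROJECT_ID:-project-id}"

# Hypothetical helper: print the command, and execute it only when EXECUTE=1.
run() {
  echo "+ $*"
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# Shut down the project; gcloud asks for confirmation before proceeding.
delete_project() {
  run gcloud projects delete "${PROJECT_ID}"
}

delete_project
```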

Deleting the cluster

  • To delete your cluster:
    gcloud dataproc clusters delete ${CLUSTER_NAME} \
        --region=${REGION}

Deleting the bucket

  • To delete the Cloud Storage bucket you created in Before you begin, step 2, including the notebooks stored in the bucket:
    gsutil -m rm -r gs://${BUCKET_NAME}
    

What's next