Install and run a Jupyter notebook on a Dataproc cluster

Objectives

This tutorial shows you how to install the Dataproc Jupyter and Anaconda components on a new cluster, and then connect to the Jupyter notebook UI running on the cluster from your local browser using the Dataproc Component Gateway.

Costs

This tutorial uses billable components of Google Cloud, including:

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users may be eligible for a free trial.

Before you begin

If you haven't already done so, create a Google Cloud Platform project and a Cloud Storage bucket.

  1. Setting up your project

    1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
    2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

      Go to project selector

    3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

    4. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

      Enable the APIs

    5. Install and initialize the Cloud SDK.
    6. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

      Go to project selector

    7. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

    8. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

      Enable the APIs

    9. Install and initialize the Cloud SDK.

  2. Creating a Cloud Storage bucket in your project to store any notebooks you create in this tutorial.

    1. In the Cloud Console, go to the Cloud Storage Browser page.

      Go to Browser

    2. Click Create bucket.
    3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
      • For Name your bucket, enter a name that meets the bucket naming requirements.
      • For Choose where to store your data, do the following:
        • Select a Location type option.
        • Select a Location option.
      • For Choose a default storage class for your data, select a storage class.
      • For Choose how to control access to objects, select an Access control option.
      • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
    4. Click Create.
    5. Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/jupyter.

Create a cluster and install the Jupyter component

Create a cluster with the installed Jupyter component.

Open the Jupyter and JupyterLab UIs

Click the Cloud Console Component Gateway links in the Cloud Console to open the Jupyter notebook or JupyterLab UIs running on your cluster's master node.

The top-level directory displayed by your Jupyter instance is a virtual directory that allows you to see the contents of either your Cloud Storage bucket or your local file system. You can choose either location by clicking on the GCS link for Cloud Storage or Local Disk for the local filesystem of the master node in your cluster.

  1. Click the GCS link. The Jupyter notebook web UI displays notebooks stored in your Cloud Storage bucket, including any notebooks you create in this tutorial.

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting the cluster

  • To delete your cluster:
    gcloud dataproc clusters delete cluster-name \
        --region=${REGION}
    

Deleting the bucket

  • To delete the Cloud Storage bucket you created in Before you begin, step 2, including the notebooks stored in the bucket:
    gsutil -m rm -r gs://${BUCKET_NAME}
    

What's next