Install and run a Jupyter notebook on a Cloud Dataproc cluster

Before you begin

If you haven't already done so, create a Google Cloud Platform project and a Cloud Storage bucket.

Set up your project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataproc and Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
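
    For example, after you install the SDK, gcloud init walks you through authorization and default project selection; you can also set the default project directly (a minimal sketch, with project-id as a placeholder):

    gcloud init
    gcloud config set project project-id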

Create a Cloud Storage bucket in your project

  1. In the GCP Console, go to the Cloud Storage browser.

    Go to the Cloud Storage browser

  2. Click Create bucket.
  3. In the Create bucket dialog, specify a globally unique bucket name, a default storage class, and a location where the bucket data will be stored.
  4. Click Create.
  5. Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/.
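
You can also create the bucket from a terminal with the gsutil tool that ships with the Cloud SDK (a sketch; replace bucket-name and the location with your own values):

    gsutil mb -l us-central1 gs://bucket-name/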

Overview

This tutorial shows you how to create a Cloud Dataproc cluster with a bash initialization script that installs and runs a Jupyter notebook on the cluster. The script is located in Cloud Storage at gs://dataproc-initialization-actions/jupyter/jupyter.sh, alongside the other initialization scripts in the GoogleCloudPlatform/dataproc-initialization-actions repository.

After you create the cluster with this initialization script, this tutorial shows you how to connect to the notebook from your local browser.
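
If you want to review the script before using it, you can print it with gsutil:

    gsutil cat gs://dataproc-initialization-actions/jupyter/jupyter.sh

The script is reproduced below: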

#!/bin/bash

# Installs a Jupyter notebook on a Cloud Dataproc cluster and launches it on
# the cluster's master node.

set -exo pipefail

# Read configuration from instance metadata. The optional attributes below can
# be set at cluster creation time with the gcloud --metadata flag.
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
readonly INIT_ACTIONS_REPO="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_REPO \
  || echo 'https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git')"
readonly INIT_ACTIONS_BRANCH="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_BRANCH \
  || echo 'master')"

readonly JUPYTER_CONDA_CHANNELS="$(/usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_CHANNELS)"

readonly JUPYTER_CONDA_PACKAGES="$(/usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_PACKAGES)"

echo "Cloning fresh dataproc-initialization-actions from repo ${INIT_ACTIONS_REPO} and branch ${INIT_ACTIONS_BRANCH}..."
git clone -b "${INIT_ACTIONS_BRANCH}" --single-branch "${INIT_ACTIONS_REPO}"

./dataproc-initialization-actions/conda/bootstrap-conda.sh

source /etc/profile.d/conda.sh

if [ -n "${JUPYTER_CONDA_CHANNELS}" ]; then
  echo "Adding custom conda channels '${JUPYTER_CONDA_CHANNELS//:/ }'"
  conda config --add channels "${JUPYTER_CONDA_CHANNELS//:/,}"
fi

if [ -n "${JUPYTER_CONDA_PACKAGES}" ]; then
  echo "Installing custom conda packages '${JUPYTER_CONDA_PACKAGES/:/ }'"
  conda install ${JUPYTER_CONDA_PACKAGES//:/ }
fi

if [[ "${ROLE}" == 'Master' ]]; then
  conda install jupyter matplotlib

  pip install jgscm==0.1.7

  ./dataproc-initialization-actions/jupyter/internal/setup-jupyter-kernel.sh
  ./dataproc-initialization-actions/jupyter/internal/launch-jupyter-kernel.sh
  echo "Completed installing Jupyter!"
fi

# The -v test takes a variable name, not its expansion. Default to not
# installing notebook extensions unless the variable was set.
if [[ ! -v INSTALL_JUPYTER_EXT ]]; then
  INSTALL_JUPYTER_EXT=false
fi
if [[ "${INSTALL_JUPYTER_EXT}" = true ]]; then
  echo "Installing Jupyter Notebook extensions..."
  ./dataproc-initialization-actions/jupyter/internal/bootstrap-jupyter-ext.sh
  echo "Jupyter Notebook extensions installed!"
fi

Create a cluster and install a Jupyter notebook

gcloud Command

  1. Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell (Cloud Shell runs Linux) to create your cluster and install the Jupyter notebook on the cluster's master node. Replace the cluster-name, project-id, and bucket-name placeholders with your values. For bucket-name, specify only the name of the bucket you created in Create a Cloud Storage bucket in your project; your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/. The --initialization-actions flag specifies the Cloud Storage location of the initialization script (see Initialization actions for more information). An optional metadata example follows the command blocks.

    Linux/Mac OS X

    gcloud dataproc clusters create cluster-name \
        --project project-id \
        --bucket bucket-name \
        --initialization-actions \
            gs://dataproc-initialization-actions/jupyter/jupyter.sh
    

    Windows

    gcloud dataproc clusters create cluster-name ^
        --project project-id ^
        --bucket bucket-name ^
        --initialization-actions ^
            gs://dataproc-initialization-actions/jupyter/jupyter.sh
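
    The initialization script also reads the optional metadata attributes JUPYTER_CONDA_CHANNELS, JUPYTER_CONDA_PACKAGES, and INSTALL_JUPYTER_EXT (see the script above). As a sketch, with illustrative package names and Linux-style line continuations, you could set them at cluster creation time with the --metadata flag:

    gcloud dataproc clusters create cluster-name \
        --project project-id \
        --bucket bucket-name \
        --metadata "JUPYTER_CONDA_PACKAGES=numpy:pandas,INSTALL_JUPYTER_EXT=true" \
        --initialization-actions \
            gs://dataproc-initialization-actions/jupyter/jupyter.sh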
    

Console

  1. Go to the GCP Console Cloud Dataproc Clusters page.
  2. Click Create cluster to open the Create a cluster page.
  3. Enter the name of your cluster in the Name field.
  4. Select a region and zone for the cluster from the Region and Zone drop-down menus (see Available regions & zones). The default is the global region, a special multi-region namespace that can deploy instances into all Compute Engine zones worldwide. You can also specify a distinct region and select "No preference" for the zone to let Cloud Dataproc pick a zone within the selected region for your cluster (see Cloud Dataproc Auto Zone Placement). The equivalent gcloud flags are sketched after these steps.
  5. Expand the Preemptible workers, bucket, network, version, initialization, & access options panel.

  6. Enter the name of the bucket you created in Create a Cloud Storage bucket in your project in the Cloud Storage staging bucket field (only specify the name of the bucket). Your notebooks will be stored in Cloud Storage under gs://bucket-name/notebooks/.
  7. Enter gs://dataproc-initialization-actions/jupyter/jupyter.sh in the Initialization actions field. This script, which sets up and runs a Jupyter notebook in the master instance of the cluster, will be run immediately after the new cluster is created.

    You can copy and paste the URI from the code block below.
    gs://dataproc-initialization-actions/jupyter/jupyter.sh
    
  8. You can use the provided defaults for all the other options.

  9. Click Create to create the cluster and install the Jupyter notebook on the cluster's master node.
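
If you use the gcloud command instead of the Console, the region and zone choices above correspond to the --region and --zone flags. A sketch, assuming the us-central1 region (omitting --zone lets Cloud Dataproc Auto Zone Placement pick a zone within the region):

    gcloud dataproc clusters create cluster-name \
        --project project-id \
        --region us-central1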

Open the Jupyter notebook in your browser

After your cluster is running, perform the steps below to open the Jupyter notebook in your browser.
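
For example, you can confirm the cluster's status from a terminal before connecting (add --region if you did not create the cluster in the default global region):

    gcloud dataproc clusters describe cluster-name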

  1. Create an SSH tunnel. If you use Cloud Shell with local port forwarding to create the SSH tunnel, specify port 8123 as the web interface port on the cluster master node.

  2. Configure your browser to use the SSH tunnel's proxy.

  3. Connect to the notebook interface. If you used a SOCKS proxy with dynamic port forwarding to create the SSH tunnel, enter the following URL in your browser to connect to your notebook: http://cluster-name-m:8123 (a combined example follows these steps).
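
As a sketch that combines these steps (the SOCKS port, browser path, and zone are illustrative; cluster-name-m is the default master hostname), dynamic port forwarding might look like this on Linux:

    # Open a SOCKS tunnel to the cluster's master node on local port 1080.
    gcloud compute ssh cluster-name-m \
        --project project-id \
        --zone zone \
        -- -D 1080 -N

    # Start a browser that sends its traffic through the tunnel.
    /usr/bin/google-chrome \
        --proxy-server="socks5://localhost:1080" \
        --user-data-dir=/tmp/cluster-name-m \
        http://cluster-name-m:8123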

The opening page of the Jupyter notebook displays in your local browser.
