How to install and run a Jupyter notebook in a Cloud Dataproc cluster

This tutorial shows the steps to install and run a Jupyter notebook on a Cloud Dataproc cluster. The initialization action used in this tutorial will install and configure Jupyter and the PySpark kernel.

Before you begin

If you haven't already done so, create a Google Cloud Platform project and a Google Cloud Storage bucket.

Set up your project

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Projects page

  3. Enable billing for your project.

    Enable billing

  4. Enable the Cloud Dataproc and Google Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
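
After installing the SDK, you can initialize it and set your default project from the command line. A minimal sketch, using the <project-id> placeholder for your project ID:

# Authenticate and create a default SDK configuration (opens a browser for sign-in).
gcloud init

# Or, if the SDK is already initialized, set your default project directly.
gcloud config set project <project-id>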

Create a Cloud Storage bucket in your project

  1. In the Cloud Platform Console, go to the Cloud Storage browser.

    Go to the Cloud Storage browser

  2. Click Create bucket.
  3. In the Create bucket dialog, specify a unique bucket name, a default storage class, and a location for your bucket.
  4. Click Create.
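
Alternatively, you can create the bucket from the command line with gsutil. A minimal sketch, assuming the us-central1 region (pick the region where you plan to create your cluster):

gsutil mb -p <project-id> -l us-central1 gs://<bucket-name>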

Overview

This tutorial demonstrates the use of the Cloud SDK gcloud dataproc clusters create command to create a cluster and run a Jupyter notebook initialization script on the cluster.

The following --help command shows how to specify an initialization script to pass to the Cloud SDK gcloud dataproc clusters create command.

gcloud dataproc clusters create --help
...
SYNOPSIS
    gcloud dataproc clusters create NAME [--bucket BUCKET]
        ...
        [--initialization-action-timeout TIMEOUT; default="10m"]
        [--initialization-actions CLOUD_STORAGE_URI,[CLOUD_STORAGE_URI,...]]

The --initialization-actions flag takes one or more comma-separated Cloud Storage URIs that point to the initialization scripts or executables to run on cluster nodes (see Initialization actions for more information on this Cloud Dataproc feature).
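
For example, you can pass multiple comma-separated script URIs and override the default 10-minute timeout. A sketch, in which the bucket and script names are hypothetical placeholders:

gcloud dataproc clusters create <cluster-name> \
    --initialization-action-timeout 20m \
    --initialization-actions \
        gs://<bucket-name>/first-action.sh,gs://<bucket-name>/second-action.sh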

This tutorial uses a bash initialization script that has been uploaded to Cloud Storage at gs://dataproc-initialization-actions/jupyter/jupyter.sh. This script, along with other initialization scripts, is maintained in the GoogleCloudPlatform/dataproc-initialization-actions repository on GitHub.

#!/usr/bin/env bash
set -e

# Read cluster configuration from the Compute Engine instance metadata server.
ROLE=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-role)
INIT_ACTIONS_REPO=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_REPO || true)
INIT_ACTIONS_REPO="${INIT_ACTIONS_REPO:-https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git}"
INIT_ACTIONS_BRANCH=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/INIT_ACTIONS_BRANCH || true)
INIT_ACTIONS_BRANCH="${INIT_ACTIONS_BRANCH:-master}"
DATAPROC_BUCKET=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/dataproc-bucket)

# Optional colon-separated list of extra conda packages to install (may be unset).
JUPYTER_CONDA_PACKAGES=$(curl -f -s -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/attributes/JUPYTER_CONDA_PACKAGES || true)

echo "Cloning fresh dataproc-initialization-actions from repo $INIT_ACTIONS_REPO and branch $INIT_ACTIONS_BRANCH..."
git clone -b "$INIT_ACTIONS_BRANCH" --single-branch "$INIT_ACTIONS_REPO"
./dataproc-initialization-actions/conda/bootstrap-conda.sh

source /etc/profile.d/conda.sh

if [ -n "${JUPYTER_CONDA_PACKAGES}" ]; then
  echo "Installing custom conda packages '$(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ')'"
  # Install non-interactively (-y) so the init action does not stall on a prompt.
  conda install -y $(echo ${JUPYTER_CONDA_PACKAGES} | tr ':' ' ')
fi

if [[ "${ROLE}" == 'Master' ]]; then
    # Install non-interactively (-y) so the init action does not stall on a prompt.
    conda install -y jupyter
    # Restore any notebooks previously saved to the cluster's staging bucket.
    if gsutil -q stat "gs://$DATAPROC_BUCKET/notebooks/**"; then
        echo "Pulling notebooks directory to cluster master node..."
        gsutil -m cp -r "gs://$DATAPROC_BUCKET/notebooks" /root/
    fi
    ./dataproc-initialization-actions/jupyter/internal/setup-jupyter-kernel.sh
    ./dataproc-initialization-actions/jupyter/internal/launch-jupyter-kernel.sh
fi
echo "Completed installing Jupyter!"

# Default INSTALL_JUPYTER_EXT to false if it has not been set.
if [[ ! -v INSTALL_JUPYTER_EXT ]]; then
    INSTALL_JUPYTER_EXT=false
fi
if [[ "$INSTALL_JUPYTER_EXT" = true ]]; then
    echo "Installing Jupyter Notebook extensions..."
    ./dataproc-initialization-actions/jupyter/internal/bootstrap-jupyter-ext.sh
    echo "Jupyter Notebook extensions installed!"
fi
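
Note that the script reads optional settings from instance metadata. For example, to have it install extra conda packages, pass a colon-separated list in the JUPYTER_CONDA_PACKAGES metadata key when you create the cluster. A minimal sketch; the package list shown is only illustrative:

gcloud dataproc clusters create <cluster-name> \
    --metadata "JUPYTER_CONDA_PACKAGES=numpy:pandas:scikit-learn" \
    --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh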

Create your cluster

Run the following gcloud dataproc clusters create command to create your cluster and install the Jupyter notebook on your cluster's master node. Insert your values in the <cluster-name>, <project-id>, and <bucket-name> placeholders.

gcloud dataproc clusters create <cluster-name> \
    --project <project-id> \
    --bucket <bucket-name> \
    --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh
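
Cluster creation can take several minutes. When the command completes, you can confirm that the cluster is up; a quick check, using the same placeholders:

gcloud dataproc clusters describe <cluster-name> --project <project-id>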

Open the Jupyter notebook in your browser

This section explains how to use dynamic port forwarding (via an SSH tunnel using the SOCKS protocol) to connect your browser to the Jupyter notebook running on your cluster's master node. This method is more secure than opening a publicly reachable port on the master node. See Connecting to the web interfaces for more information on the dynamic port forwarding commands and command options used in this section.

Create an SSH tunnel

Run the following command to create an SSH tunnel from port 10000 on your local machine to your cluster's master node. Insert your cluster's zone and name in the <cluster-zone> and <cluster-name> placeholders, respectively.

gcloud compute ssh --zone=<cluster-zone> \
  --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "<cluster-name>-m"

This command runs the SSH tunnel in the foreground, so issue it in a separate terminal tab or window and leave it running for as long as you use the tunnel. Alternatively, you can run it as a background process if you add the -n flag to redirect stdin from /dev/null; the tunnel then terminates automatically when you delete the cluster. Note that the -n flag may not be supported on all operating systems.

gcloud compute ssh --zone=<cluster-zone> \
  --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" --ssh-flag="-n" "<cluster-name>-m" &
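
Before configuring your browser, you can verify that the tunnel works by fetching the Jupyter server's response headers through the SOCKS proxy with curl; a quick check, assuming curl is installed on your local machine:

# Should return HTTP response headers from the Jupyter server if the tunnel is up.
curl -sI --socks5-hostname localhost:10000 http://<cluster-name>-m:8123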

Configure your browser

Your SSH tunnel supports traffic proxying using the SOCKS protocol. Run the following command in a separate terminal window to configure your browser to use the proxy when connecting to your cluster:

  • Insert the path to your browser's executable in the <browser executable path> placeholder. You can use the table below to determine your browser's path.

  • Replace the <cluster-name> placeholder with the name of your cluster.
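
The following table lists typical executable paths for Google Chrome (an assumption: the proxy flags in the command below are Chrome command-line flags, and your installation path may differ):

Operating system    Browser executable path (Google Chrome)
Mac OS X            /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Linux               /usr/bin/google-chrome
Windows             C:\Program Files (x86)\Google\Chrome\Application\chrome.exe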

This command launches a new browser that connects through the SSH tunnel to the Jupyter notebook application running on your cluster's master node.

<browser executable path> \
  "http://<cluster-name>-m:8123" \
  --proxy-server="socks5://localhost:10000" \
  --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
  --user-data-dir=/tmp/

The Jupyter notebook home page opens in your browser.
