Install and run a Cloud Datalab notebook on a Cloud Dataproc cluster

Set up your project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Dataproc and Compute Engine APIs (a command-line alternative is sketched after this list).

    Enable the APIs

  5. Install and initialize the Cloud SDK.
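
If you prefer to work from a terminal, steps 2 and 4 can also be performed with the gcloud command-line tool once the Cloud SDK is installed and initialized. A minimal sketch, using project-id as a placeholder:

    # Select the project to use for this tutorial.
    gcloud config set project project-id

    # Enable the APIs the tutorial requires.
    gcloud services enable dataproc.googleapis.com compute.googleapis.com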

Overview

This tutorial walks through creating a Cloud Dataproc cluster with a bash initialization script that installs and runs a Cloud Datalab notebook on the cluster. The script is stored in Cloud Storage at gs://dataproc-initialization-actions/datalab/datalab.sh, alongside the other initialization scripts in the GitHub GoogleCloudPlatform/dataproc-initialization-actions repository.

After the cluster is created and Cloud Datalab is installed, the tutorial shows you how to connect to the notebook from your local browser.
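
The full contents of the initialization script are reproduced below. If you want to review the script yourself, one way to fetch it is with the gsutil tool that ships with the Cloud SDK:

    gsutil cat gs://dataproc-initialization-actions/datalab/datalab.sh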

#!/bin/bash


set -exo pipefail

# Cluster and environment configuration, read from instance metadata where available.
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
readonly PROJECT="$(/usr/share/google/get_metadata_value ../project/project-id)"
readonly SPARK_PACKAGES="$(/usr/share/google/get_metadata_value attributes/spark-packages || true)"
readonly SPARK_CONF='/etc/spark/conf/spark-defaults.conf'
readonly DATALAB_DIR="${HOME}/datalab"
readonly PYTHONPATH="/env/python:$(find /usr/lib/spark/python/lib -name '*.zip' | paste -sd:)"
readonly DOCKER_IMAGE="$(/usr/share/google/get_metadata_value attributes/docker-image || \
  echo 'gcr.io/cloud-datalab/datalab:local')"

readonly INIT_ACTIONS_REPO="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_REPO \
  || echo 'https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git')"
readonly INIT_ACTIONS_BRANCH="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_BRANCH \
  || echo 'master')"

# Host configuration directories and connector jars to mount into the Datalab container.
readonly VOLUMES="$(echo /etc/{hadoop*,hive*,*spark*} /usr/lib/hadoop/lib/{gcs,bigquery}*)"
readonly VOLUME_FLAGS="$(echo "${VOLUMES}" | sed 's/\S*/-v &:&/g')"

# Log a timestamped error message to stderr and return a non-zero status.
function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
  return 1
}

# Clone the initialization actions repository and run its Docker install script.
function install_docker() {
  git clone -b "${INIT_ACTIONS_BRANCH}" --single-branch "${INIT_ACTIONS_REPO}"
  chmod +x ./dataproc-initialization-actions/docker/docker.sh
  ./dataproc-initialization-actions/docker/docker.sh
}

# Pull a Docker image, retrying up to 10 times before giving up.
function docker_pull() {
  for ((i = 0; i < 10; i++)); do
    if gcloud docker -- pull "$1"; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# On the master node: pull the Datalab image and build a derived image with Spark and Hive support.
function configure_master(){
  mkdir -p "${DATALAB_DIR}"

  docker_pull ${DOCKER_IMAGE} || err "Failed to pull ${DOCKER_IMAGE}"

  if ! grep -q '^spark\.sql\.warehouse\.dir=' "${SPARK_CONF}"; then
    echo 'spark.sql.warehouse.dir=/root/spark-warehouse' >> "${SPARK_CONF}"
  fi

  touch ${VOLUMES}

  pyspark_submit_args=''
  for package in ${SPARK_PACKAGES//','/' '}; do
    pyspark_submit_args+="--packages ${package} "
  done
  pyspark_submit_args+='pyspark-shell'

  mkdir -p datalab-pyspark
  pushd datalab-pyspark
  cp /etc/apt/trusted.gpg .
  cp /etc/apt/sources.list.d/backports.list .
  cp /etc/apt/sources.list.d/dataproc.list .
  cat << EOF > Dockerfile
FROM ${DOCKER_IMAGE}
ADD backports.list /etc/apt/sources.list.d/
ADD dataproc.list /etc/apt/sources.list.d/
ADD trusted.gpg /tmp/vm_trusted.gpg

RUN apt-key add /tmp/vm_trusted.gpg
RUN apt-get update
RUN apt-get install -y hive spark-python openjdk-8-jre-headless

ENV PYSPARK_PYTHON=$(ls /opt/conda/bin/python || which python)

ENV SPARK_HOME='/usr/lib/spark'
ENV JAVA_HOME='${JAVA_HOME}'
ENV PYTHONPATH='${PYTHONPATH}'
ENV PYTHONSTARTUP='/usr/lib/spark/python/pyspark/shell.py'
ENV PYSPARK_SUBMIT_ARGS='${pyspark_submit_args}'
ENV DATALAB_ENV='GCE'
EOF
  docker build -t datalab-pyspark .
  popd

}

# Run the Datalab container with host networking and the Hadoop/Spark configuration mounted.
function run_datalab(){
  if docker run -d --restart always --net=host \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}

# Docker is installed on every node; Datalab is configured and run on the master only.
function main(){
  install_docker

  if [[ "${ROLE}" == 'Master' ]]; then
    configure_master
    run_datalab
  fi
}

main
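
The script reads several optional metadata attributes: docker-image (the Datalab container image to run), spark-packages (a comma-separated list of Spark packages to preload), and INIT_ACTIONS_REPO and INIT_ACTIONS_BRANCH (where the Docker install script is cloned from). If you need to override a default, one possible approach is to pass cluster metadata when you create the cluster, as sketched below; the image shown is the script's default and the Spark package is only an illustration (the basic create command is covered in the next section).

# Illustrative only: set the Datalab image explicitly and preload a Spark package.
gcloud dataproc clusters create cluster-name \
    --project project-id \
    --metadata docker-image=gcr.io/cloud-datalab/datalab:local,spark-packages=com.databricks:spark-avro_2.11:4.0.0 \
    --initialization-actions \
        gs://dataproc-initialization-actions/datalab/datalab.sh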

Create a cluster and install a Cloud Datalab notebook

gcloud Command

  1. Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell (Cloud Shell runs Linux) to create your cluster and install the Cloud Datalab notebook on the cluster's master node. Replace the cluster-name and project-id placeholders with your values. The --initialization-actions flag specifies the Cloud Storage location of the initialization script (see Initialization actions for more information).

    Linux/macOS

    gcloud dataproc clusters create cluster-name \
        --project project-id \
        --initialization-actions \
            gs://dataproc-initialization-actions/datalab/datalab.sh
    

    Windows

    gcloud dataproc clusters create cluster-name ^
        --project project-id ^
        --initialization-actions ^
            gs://dataproc-initialization-actions/datalab/datalab.sh
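
  2. Optionally, confirm that the cluster was created and is running before you connect to it. A minimal check, using the same placeholders and your default gcloud region settings:

    gcloud dataproc clusters describe cluster-name \
        --project project-id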
    

Console

  1. Go to the GCP Console Cloud Dataproc Clusters page.
  2. Click Create cluster to open the Create a cluster page.
  3. Enter the name of your cluster in the Name field.
  4. Select a region and zone for the cluster from the Region and Zone drop-down menus (see Available regions & zones). The default is the global region, a special multi-region namespace that can deploy instances into all Compute Engine zones worldwide. You can also specify a distinct region and select "No preference" for the zone to let Cloud Dataproc pick a zone within the selected region for your cluster (see Cloud Dataproc Auto Zone Placement).
  5. Expand the Advanced options panel.
  6. Enter gs://dataproc-initialization-actions/datalab/datalab.sh in the Initialization actions field. This script, which installs and runs a Cloud Datalab notebook on the master instance of the cluster, runs immediately after the new cluster is created.
    You can copy and paste the URI from the code block below.
    gs://dataproc-initialization-actions/datalab/datalab.sh
    
  7. You can use the provided defaults for all the other options.
  8. Click Create to create the cluster and install the Cloud Datalab notebook on the cluster's master node.

Open the Cloud Datalab notebook in your browser

After your cluster is running, perform the following steps to open the Cloud Datalab notebook in your browser:

  1. Create an SSH tunnel. If you use Cloud Shell with local port forwarding to create the SSH tunnel, specify port 8080 as the web interface port on the cluster master node. (Example commands for the SOCKS proxy approach are sketched after these steps.)

  2. Configure your browser.

  3. Connect to the notebook interface. If you used a SOCKS proxy with dynamic port forwarding to create an SSH tunnel, enter the following URL in your browser to connect to your notebook: http://cluster-name-m:8080.
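
For reference, with a SOCKS proxy and dynamic port forwarding the tunnel and browser commands look roughly like the following. The zone, proxy port, and browser path are illustrative and depend on your cluster and operating system:

# Open an SSH tunnel to the master node with dynamic port forwarding (SOCKS).
# Replace the zone with your cluster's zone; cluster-name-m is the master node.
gcloud compute ssh cluster-name-m \
    --project project-id \
    --zone us-central1-a -- -D 1080 -N

# In a second terminal, start a browser that routes traffic through the proxy
# (the path shown is for Chrome on Linux; adjust for your OS).
/usr/bin/google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/cluster-name-m \
    http://cluster-name-m:8080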

The Cloud Datalab notebook opens in your browser window.
