Install and run a Datalab notebook on a Dataproc cluster

Set up your project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Dataproc, Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
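  Once the SDK is installed, initializing it typically looks like the following sketch. The project ID shown is a placeholder; substitute your own:

  ```shell
  # Interactive setup: authenticates and selects a default project.
  gcloud init

  # Or configure the pieces individually:
  gcloud auth login
  gcloud config set project my-project-id
  ```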


This tutorial demonstrates the creation of a Dataproc cluster with a bash shell initialization script that installs and runs a Datalab notebook on the cluster. The script is located in a regional Cloud Storage bucket at gs://goog-dataproc-initialization-actions-<REGION>/datalab/ and co-located with other initialization scripts in the GitHub GoogleCloudDataproc/initialization-actions repository.

After creating the cluster with an initialization script that installs the Datalab notebook on the cluster, this tutorial shows you how to connect to the notebook with your local browser.


#!/bin/bash

set -exo pipefail

readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
readonly PROJECT="$(/usr/share/google/get_metadata_value ../project/project-id)"
readonly SPARK_PACKAGES="$(/usr/share/google/get_metadata_value attributes/spark-packages || true)"
readonly SPARK_CONF='/etc/spark/conf/spark-defaults.conf'
readonly DATALAB_DIR="${HOME}/datalab"
readonly PYTHONPATH="/env/python:$(find /usr/lib/spark/python/lib -name '*.zip' | paste -sd:)"
readonly DOCKER_IMAGE="$(/usr/share/google/get_metadata_value attributes/docker-image ||
  echo '')"

readonly DEFAULT_INIT_ACTIONS_REPO=gs://dataproc-initialization-actions
readonly INIT_ACTIONS_REPO="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_REPO ||
  echo "${DEFAULT_INIT_ACTIONS_REPO}")"
readonly INIT_ACTIONS_BRANCH="$(/usr/share/google/get_metadata_value attributes/INIT_ACTIONS_BRANCH ||
  echo 'master')"

VOLUMES="$(echo /etc/{hadoop*,hive*,*spark*})"

readonly CONNECTORS_LIB=/usr/local/share/google/dataproc/lib
if [[ -d ${CONNECTORS_LIB} ]]; then
  if [[ -L ${CONNECTORS_LIB}/gcs-connector.jar ]]; then
    VOLUMES+=" ${CONNECTORS_LIB}/gcs-connector.jar"
  elif compgen -G "${CONNECTORS_LIB}/gcs*" >/dev/null; then
    VOLUMES+=" $(compgen -G "${CONNECTORS_LIB}/gcs*")"
  fi
  if [[ -L ${CONNECTORS_LIB}/bigquery-connector.jar ]]; then
    VOLUMES+=" ${CONNECTORS_LIB}/bigquery-connector.jar"
  elif compgen -G "${CONNECTORS_LIB}/bigquery*" >/dev/null; then
    VOLUMES+=" $(compgen -G "${CONNECTORS_LIB}/bigquery*")"
  fi
fi

readonly VOLUMES
readonly VOLUME_FLAGS="$(echo "${VOLUMES}" | sed 's/\S*/-v &:&/g')"

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*" >&2
  return 1
}

function install_docker() {
  local init_actions_dir
  init_actions_dir=$(mktemp -d -t dataproc-init-actions-XXXX)
  if [[ ${INIT_ACTIONS_REPO} == gs://* ]]; then
    gsutil -m rsync -r "${INIT_ACTIONS_REPO}" "${init_actions_dir}"
  else
    git clone -b "${INIT_ACTIONS_BRANCH}" --single-branch "${INIT_ACTIONS_REPO}" "${init_actions_dir}"
  fi
  find "${init_actions_dir}" -name '*.sh' -exec chmod +x {} \;
  "${init_actions_dir}/docker/docker.sh"
}

function docker_pull() {
  for ((i = 0; i < 10; i++)); do
    if (gcloud docker -- pull "$1"); then
      return 0
    fi
    sleep 5
  done
  return 1
}

function configure_master() {
  mkdir -p "${DATALAB_DIR}"

  docker_pull "${DOCKER_IMAGE}" || err "Failed to pull ${DOCKER_IMAGE}"

  if ! grep -q '^spark\.sql\.warehouse\.dir=' "${SPARK_CONF}"; then
    echo 'spark.sql.warehouse.dir=/root/spark-warehouse' >>"${SPARK_CONF}"
  fi

  touch ${VOLUMES}

  local pyspark_submit_args=''
  for package in ${SPARK_PACKAGES//','/' '}; do
    pyspark_submit_args+="--packages ${package} "
  done
  pyspark_submit_args+='pyspark-shell'

  mkdir -p datalab-pyspark
  pushd datalab-pyspark
  cp /etc/apt/trusted.gpg .
  cp /etc/apt/sources.list.d/dataproc.list .
  cat <<EOF >Dockerfile
FROM ${DOCKER_IMAGE}
ADD dataproc.list /etc/apt/sources.list.d/
ADD trusted.gpg /tmp/vm_trusted.gpg
RUN apt-key add /tmp/vm_trusted.gpg

RUN apt-get update
RUN apt-get install -y software-properties-common
RUN add-apt-repository 'deb bionic main'

RUN apt-get update
RUN apt-get install -y hive spark-python openjdk-8-jre-headless

ENV PYSPARK_PYTHON=$(ls /opt/conda/bin/python || command -v python)

ENV SPARK_HOME='/usr/lib/spark'
ENV JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'
ENV PYTHONSTARTUP='/usr/lib/spark/python/pyspark/shell.py'
ENV PYSPARK_SUBMIT_ARGS='${pyspark_submit_args}'
EOF
  docker build -t datalab-pyspark .
  popd
}

function run_datalab() {
  if docker run -d --restart always --net=host \
    -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}

function main() {
  if [[ "${ROLE}" == 'Master' ]]; then
    install_docker
    configure_master
    run_datalab
  fi
}

main

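The VOLUME_FLAGS construction in the script may be easier to read in isolation: the sed expression wraps every whitespace-separated path in VOLUMES into a Docker `-v path:path` bind-mount flag, so each host config directory appears at the same path inside the container. A standalone sketch, with illustrative paths:

```shell
# Each whitespace-separated token in VOLUMES becomes "-v token:token".
VOLUMES="/etc/hadoop/conf /etc/spark/conf"
VOLUME_FLAGS="$(echo "${VOLUMES}" | sed 's/\S*/-v &:&/g')"
echo "${VOLUME_FLAGS}"
# -v /etc/hadoop/conf:/etc/hadoop/conf -v /etc/spark/conf:/etc/spark/conf
```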

Create a cluster and install a Datalab notebook

gcloud Command

  1. Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell (Cloud Shell runs Linux) to create your cluster and install your Datalab notebook on the cluster's master node. Insert your values in the cluster-name and project-id placeholders. The --initialization-actions flag specifies the location in Cloud Storage where the initialization script is located (see Initialization actions for more information).


    Linux/macOS and Cloud Shell:

    gcloud dataproc clusters create cluster-name \
        --project project-id \
        --region region \
        --initialization-actions \
        gs://goog-dataproc-initialization-actions-region/datalab/datalab.sh

    Windows command prompt:

    gcloud dataproc clusters create cluster-name ^
        --project project-id ^
        --region region ^
        --initialization-actions ^
        gs://goog-dataproc-initialization-actions-region/datalab/datalab.sh

Console

  1. Go to the Cloud Console Dataproc Clusters page.
  2. Click Create cluster to open the Create a cluster page.
  3. Enter the name of your cluster in the Name field.
  4. Select a region and zone for the cluster from the Region and Zone drop-down menus (see Available regions and zones). You can specify a distinct region and select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto Zone Placement). You can instead select a global region, which is a special multi-region namespace that is capable of deploying instances into all Compute Engine zones globally (when selecting a global region, you must also select a zone).
  5. Expand the Advanced options panel.
  6. Enter gs://goog-dataproc-initialization-actions-<REGION>/datalab/datalab.sh in the Initialization actions field. This script, which installs and runs a Datalab notebook on the master instance of the cluster, runs immediately after the new cluster is created.
  7. You can use the provided defaults for all the other options.
  8. Click Create to create the cluster and install the Cloud Datalab notebook on the cluster's master node.
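After the cluster is created by either method, you can confirm that it is running; the master instance, where the notebook is installed, is named after the cluster with an -m suffix. A sketch, where cluster-name, project-id, and region are placeholders:

```shell
# List clusters in the region and check the new cluster's status.
gcloud dataproc clusters list --project project-id --region region

# Show details for the cluster; its master node is named cluster-name-m.
gcloud dataproc clusters describe cluster-name \
    --project project-id --region region
```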

Open the Datalab notebook in your browser

After your cluster is running, perform the following steps to open the Datalab notebook in your browser:

  1. Create an SSH tunnel. If you use Cloud Shell with local port forwarding to create the SSH tunnel, specify port 8080 as the web interface port on the cluster master node.

  2. Configure your browser.

  3. Connect to the notebook interface. If you used a SOCKS proxy with dynamic port forwarding to create an SSH tunnel, enter the following URL in your browser to connect to your notebook: http://cluster-name-m:8080.
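As a sketch of the SOCKS-proxy approach in steps 1–3: the tunnel can be opened with gcloud compute ssh, passing SSH flags after the `--` separator, and the browser then routes its traffic through the local proxy. cluster-name, project-id, and zone are placeholders, and local port 1080 is an arbitrary choice:

```shell
# Open an SSH tunnel to the master node with dynamic port forwarding.
# -D 1080 starts a SOCKS proxy on local port 1080; -N runs no remote command.
gcloud compute ssh cluster-name-m \
    --project=project-id --zone=zone \
    -- -D 1080 -N

# In a second terminal, launch a browser that uses the proxy
# (a fresh user-data-dir keeps this profile separate), then it can
# resolve and reach http://cluster-name-m:8080 through the tunnel.
/usr/bin/google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir="$(mktemp -d)" \
    http://cluster-name-m:8080
```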

The Datalab notebook opens in your browser window.