Tutorial: Customizing the Dataproc notebook spawner for Dataproc Hub using Terraform

This tutorial shows how to customize Dataproc Hub by modifying the open source JupyterHub code that it's based on. The tutorial is intended for administrators who build Jupyter notebook environments for data science research. It assumes that you are familiar with Jupyter notebooks and with Dataproc Hub.

This tutorial is the second part of a series that discusses how to choose a notebook platform on Google Cloud, how to customize the JupyterHub open source code, and how to run the modified version on one of the Google Cloud infrastructure options.

This tutorial is relevant if all of the following apply:

  • You are an IT administrator in your company.
  • Your company uses Apache Spark.
  • You plan to use Dataproc as a managed way to run Spark.
  • You need to centrally manage notebook profiles for your end users (data scientists).
  • Data scientists need to run interactive Spark jobs and use their preferred ML library from the same Jupyter notebooks, using a hardware configuration that meets their needs.
  • You want to launch Dataproc Hub using Terraform.

This tutorial explains how to adapt the existing open source Dataproc Jupyter spawner to your needs and how to deploy it using Dataproc Hub.

For background information about extending Dataproc Hub, see the introduction to this series. For definitions of the terms used in this series, see the terminology section of the introduction. The introduction document also explains the following architecture, which is included here as a reminder of the infrastructure that you build in this tutorial.

Architecture of the solution created in this tutorial.

This tutorial focuses on the following:

  • Customizing the Dataproc spawner used by the JupyterHub service in the diagram.
  • Deploying a new version of the JupyterHub service.

Objectives

  • Launch Dataproc Hub with Terraform.
  • Customize the container image of the JupyterHub component in the architecture illustrated earlier by adding a custom field in the spawner code.
  • Spawn a notebook server on Dataproc by using a custom service account.
  • Spawn another cluster by using the same custom service account.

Costs

This tutorial uses billable components of Google Cloud, including Dataproc, Compute Engine, Cloud Build, and Container Registry.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

Setting up your environment

  1. In Cloud Shell, export your Google Cloud project ID to an environment variable, replacing YOUR_PROJECT_ID with the ID of your project:

    export PROJECT_ID=YOUR_PROJECT_ID
    
  2. Set your project ID for the Cloud SDK client:

    gcloud config set project ${PROJECT_ID}
    
  3. Enable the APIs that you need for this tutorial:

    gcloud services enable \
        compute.googleapis.com \
        dataproc.googleapis.com \
        cloudbuild.googleapis.com \
        containerregistry.googleapis.com
    
  4. Set environment variables for values that you need later in the tutorial:

    # Image location of the modified dataproc spawner image.
    export DOCKER_UPDATED_SPAWNER=gcr.io/${PROJECT_ID}/dataprocspawner:sa
    # Top-level folder where you download git repositories.
    export WORKING_FOLDER=$(pwd)
    # Name of the instance that hosts JupyterHub and its modified spawner.
    export HUB_INSTANCE_NAME=hubsa
    

Modifying the spawner

By default, Dataproc Hub sets the identity of the cluster that hosts the spawned Jupyter notebook to the default service account for your project. The jupyterhub-dataprocspawner repository that you clone in this section contains a Python script (customize_cluster.py) that generates an HTML form where users can specify custom values before the notebook is spawned. In this section, you modify that Python code to add an <input> element to the spawning form so that users can specify their own service account.
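
This customization relies on JupyterHub's standard spawner options mechanism: a spawner exposes an HTML form through its options_form trait, and JupyterHub passes the submitted values back to the spawner through self.user_options. The following minimal sketch illustrates the pattern with a simplified, hypothetical spawner; the real Dataproc spawner code is more involved:

    from traitlets import Unicode
    from jupyterhub.spawner import Spawner

    class ExampleSpawner(Spawner):
        # JupyterHub renders this HTML on its spawn page.
        options_form = Unicode("""
            <label for="cluster_sa">Service Account for Dataproc</label>
            <input name="cluster_sa" class="form-control">
        """)

        def options_from_form(self, formdata):
            # Each submitted field arrives as a list of strings; keep the first value.
            return {'cluster_sa': formdata.get('cluster_sa', [''])[0]}

        async def start(self):
            # The spawner reads the submitted value when it creates the server.
            cluster_sa = self.user_options.get('cluster_sa')
            # ... create the notebook server and return its (ip, port) ...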

  1. In Cloud Shell, clone the jupyterhub-dataprocspawner repository:

    git clone https://github.com/GoogleCloudDataproc/jupyterhub-dataprocspawner.git
    
  2. Go to the directory that has the spawner code:

    cd ${WORKING_FOLDER}/jupyterhub-dataprocspawner/dataprocspawner
    
  3. Add an <input> element in the HTML form so that data scientists can use a custom service account as the identity for the Dataproc cluster:

    # Creates the new input field as a new HTML string.
    sed -i '/html_zone += f'\'''\'''\''<\/select><\/div><\/section>'\'''\'''\''/c \
      html_zone += "</select></div>"\
      html_sa = """<div class="form-group">\
          <label for="cluster_sa">Service Account for Dataproc</label>\
          <input name="cluster_sa" class="form-control" \
            placeholder="Example: '\''sa@[PROJECT_ID].iam.gserviceaccount.com'\''" \
            value=""></input></div>"""\
      html_sa += "</section>"' customize_cluster.py
    
    # Adds the new HTML string to the returned string.
    sed -i 's~return html_config + "\\n" + html_zone~return html_config + "\\n" + html_zone + "\\n" + html_sa~g' customize_cluster.py
    

    These commands do the following:

    1. Creates a string variable (html_sa) in the Python code that stores HTML markup; the variable contains the markup for the <input> element. The <input> element name is cluster_sa.
    2. Adds the html_sa string variable to the returned string variable that stores the HTML form.
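
    After these two commands run, the affected section of customize_cluster.py looks approximately like the following sketch (indentation and the surrounding code in the repository may differ):

    html_zone += "</select></div>"
    html_sa = """<div class="form-group">
        <label for="cluster_sa">Service Account for Dataproc</label>
        <input name="cluster_sa" class="form-control"
          placeholder="Example: 'sa@[PROJECT_ID].iam.gserviceaccount.com'"
          value=""></input></div>"""
    html_sa += "</section>"
    # ... rest of the form-building code ...
    return html_config + "\n" + html_zone + "\n" + html_sa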

    The following images show the form that's created by default, and what the form looks like after the <input> element has been added. Before you made this change, the form rendered as follows:

    Form showing options for launching a cluster.

    After this change, the form renders as follows:

    Form showing options for launching a cluster, but with an added textbox for the service account.

  4. Update the spawner to read the value of the new <input> element by using its cluster_sa reference name:

    sed -i '/cluster_zone = self.user_options.get('\''cluster_zone'\'')/a \
          cluster_sa = self.user_options.get('\''cluster_sa'\'')' spawner.py
    
  5. Update the logic that decides which service account to use for the Dataproc instances:

    sed -i '/if self.dataproc_service_account:/i \
        self.dataproc_service_account = self.dataproc_service_account or cluster_sa' spawner.py
    

    Before you made this change, the code supported only two options: either the administrator provided a service account or the system used the default Compute Engine service account. After this change, a user can pass their own service account name if the administrator has not provided one.
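
    Combined with the previous step, the resolution logic in spawner.py becomes approximately the following (a sketch that shows only the relevant lines):

    # Read the user-submitted value from the spawn form.
    cluster_sa = self.user_options.get('cluster_sa')
    # Prefer an administrator-configured service account; otherwise, fall
    # back to the account that the user submitted in the form.
    self.dataproc_service_account = self.dataproc_service_account or cluster_sa
    if self.dataproc_service_account:
        # Use the resolved account as the identity of the Dataproc cluster.
        # ...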

  6. Return to the working folder:

    cd ${WORKING_FOLDER}
    

Creating a new spawner image

By default, Dataproc Hub uses the content of the jupyterhub-dataprocspawner repository to create the Dataproc Hub Docker image. This section shows how to modify the Dockerfile so that it uses your local copy of the spawner source instead.

  1. In Cloud Shell, clone the repository that contains the Dataproc Hub builder:

    git clone https://github.com/GoogleCloudPlatform/ai-notebooks-extended
    
  2. Go to the folder where you can create a Dataproc Hub Docker image:

    cd ${WORKING_FOLDER}/ai-notebooks-extended/dataproc-hub-example/build/dataprochub-builder
    
  3. Copy the folder with the updated spawner code to the current folder:

    cp -rf ${WORKING_FOLDER}/jupyterhub-dataprocspawner jupyterhub-dataprocspawner
    
  4. Update the spawner reference in the Dockerfile:

    sed -i '/RUN pip install git+https:\/\/github.com\/GoogleCloudDataproc\/jupyterhub-dataprocspawner.git/c \
    COPY jupyterhub-dataprocspawner/ jupyterhub-dataprocspawner\
    RUN cd jupyterhub-dataprocspawner && pip install .' Dockerfile
    
  5. Create a new Docker container image for your spawner and host it in Container Registry:

    gcloud builds submit -t ${DOCKER_UPDATED_SPAWNER} .
    
  6. Wait for the container to build and deploy.

    The build takes about 5 minutes. While it runs, the process streams output to Cloud Shell; when the build is done, the output stops.

Deploying using Terraform

  1. Go to the folder where you can use Terraform to deploy Dataproc Hub:

    cd ${WORKING_FOLDER}/ai-notebooks-extended/dataproc-hub-example/build/infrastructure-builder/ain/
    
  2. Update the Terraform file to change the reference to the spawner image that you uploaded to Container Registry:

    sed -i "s~container = \"gcr.io/cloud-dataproc/dataproc-spawner:prod\"~container = \"${DOCKER_UPDATED_SPAWNER}\"~g" main.tf
    
  3. Initialize the Terraform working directory:

    terraform init
    
  4. Create an execution plan to validate your changes:

    terraform plan \
        -var="instance_name=${HUB_INSTANCE_NAME}" \
        -var="project_id=${PROJECT_ID}"
    

    If you want to see what will be deployed, examine the output of the command.

  5. Deploy Dataproc Hub with the updated spawner:

    terraform apply -auto-approve \
        -var="instance_name=${HUB_INSTANCE_NAME}" \
        -var="project_id=${PROJECT_ID}"
    

Creating a new service account for Dataproc

This section shows you how to create a service account that data scientists can enter in the modified form, and how to grant it the minimum roles that it needs to run Dataproc workloads. A sketch at the end of this section shows how that identity maps to the Dataproc API.

  1. Create a service account:

    SA_NAME=sa-example
    PROJECT_ID=$(gcloud config list --format 'value(core.project)')
    
    gcloud iam service-accounts create ${SA_NAME} \
        --display-name ${SA_NAME} \
        --project ${PROJECT_ID}
    
  2. Grant the minimum permissions to use Dataproc:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.admin
    
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/dataproc.worker
    
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/compute.imageUser
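
The service account that users enter in the form ultimately becomes the identity of the spawned Dataproc cluster, which the Dataproc API accepts in the cluster's gceClusterConfig. The following standalone sketch shows the equivalent request made with the google-cloud-dataproc Python client library; it's illustrative only, because the spawner builds this request internally, and the function name here is hypothetical:

    from google.cloud import dataproc_v1

    def create_cluster_with_sa(project_id, region, cluster_name, service_account):
        """Creates a Dataproc cluster that runs as the given service account."""
        client = dataproc_v1.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                # The service account becomes the identity of the cluster VMs.
                "gce_cluster_config": {"service_account": service_account},
            },
        }
        operation = client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        return operation.result()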
    

Checking the results

When the Dataproc Hub instance is running, you can check the results.

  1. In Cloud Shell, get the URL of the Inverting Proxy:

    gcloud compute instances get-serial-port-output ${HUB_INSTANCE_NAME} | grep Hostname | sed 's/^.*: //'
    

    The command returns a URL that has a format similar to the following:

    1a2b3cde4fgh5i6j-dot-us-west1.notebooks.googleusercontent.com
    
  2. Open a web browser and enter the URL.

  3. Check that the HTML form includes the <input> element and the label Service Account for Dataproc that you added earlier.

    Form for launching a cluster with sample values filled in.

  4. In the HTML form, enter the email address of a service account that has the required Identity and Access Management (IAM) roles, such as the service account that you created earlier.

  5. To spawn the notebook server on Dataproc, click Start. The Dataproc cluster takes a few minutes to start.

  6. From Cloud Shell, check that Dataproc uses the service account that you provided:

    gcloud dataproc clusters describe dataprochub-[USERNAME] --region us-central1 \
      | grep serviceAccount: \
      |  sed 's/^.*: //'
    

If all goes well, you see the email address of the service account that you created earlier.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • If you want to use a different Dataproc spawner image than the one used by Dataproc Hub, you can still use this Terraform approach. High-level steps for using any image are listed in the introduction document of this series.
  • To learn how to deploy Dataproc Hub on a managed instance group with Identity-Aware Proxy, see the Dataproc Hub Extended example tutorial.
  • Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.