Configure Dataproc Hub

Dataproc Hub is a customized JupyterHub server. Admins configure and create Dataproc Hub instances that can spawn single-user Dataproc clusters to host Jupyter and JupyterLab notebook environments (see Use Dataproc Hub).

Objectives

  1. Define a Dataproc Hub cluster configuration.

  2. Set Dataproc Hub instance environment variables.

  3. Create a Dataproc Hub instance.

Before you begin

If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket.

  1. Setting up your project

    1. Sign in to your Google Account.

      If you don't already have one, sign up for a new account.

    2. In the Cloud Console, on the project selector page, select or create a Cloud project.

      Go to the project selector page

    3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

    4. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

      Enable the APIs

    5. Install and initialize the Cloud SDK.

  2. Creating a Cloud Storage bucket in your project to hold the configuration files and notebooks used in this tutorial.

    1. In the Cloud Console, go to the Cloud Storage Browser page.

      Go to the Cloud Storage Browser page

    2. Click Create bucket.
    3. In the Create bucket dialog, specify the bucket attributes, such as a globally unique bucket name, a location, and a default storage class.
    4. Click Create.
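
If you prefer the command line, the setup steps above can also be scripted. The following is a minimal sketch using gcloud and gsutil; the project, region, and bucket names are placeholders that you must replace with your own values.

# Placeholder values; replace with your own.
PROJECT=project-id
REGION=region
BUCKET=bucket-name

# Enable the required APIs.
gcloud services enable dataproc.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com \
    --project=${PROJECT}

# Create the Cloud Storage bucket that will hold cluster configuration
# files, the environment variables file, and user notebooks.
gsutil mb -p ${PROJECT} -l ${REGION} gs://${BUCKET}/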

Define a cluster configuration

A Dataproc Hub instance creates a cluster from configuration values contained in a YAML cluster configuration file.

The cluster configuration can specify any feature or component available to Dataproc clusters (such as machine type, initialization actions, and optional components). The cluster image version must be 1.4.13 or higher; attempting to spawn a cluster with a lower image version fails with an error.

Sample YAML cluster configuration file

clusterName: cluster-name
config:
  gceClusterConfig:
    metadata:
      'PIP_PACKAGES': 'google-cloud-core>=1.3.0 google-cloud-storage>=1.28.1'
  initializationActions:
  - executableFile: gs://dataproc-initialization-actions/python/pip-install.sh
  softwareConfig:
    imageVersion: 1.5-ubuntu18
    optionalComponents:
    - ANACONDA
    - JUPYTER

Each configuration file must be saved in Cloud Storage. You can create and save multiple configuration files to give users a choice when they use Dataproc Hub to create their Dataproc notebook environment.
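
For example, alongside the sample configuration above you might offer a second configuration with larger machines. The following sketch writes such a file with a heredoc; the file name, machine types, and worker count are illustrative placeholders.

cat <<EOF > cluster-config-large.yaml
config:
  masterConfig:
    machineTypeUri: n1-standard-8
  workerConfig:
    machineTypeUri: n1-standard-8
    numInstances: 4
  softwareConfig:
    imageVersion: 1.5-ubuntu18
    optionalComponents:
    - ANACONDA
    - JUPYTER
EOF

Upload each file to Cloud Storage (see Save the YAML configuration file in Cloud Storage) and list all of the files in the DATAPROC_CONFIGS environment variable so that users can choose among them.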

There are two ways to create a YAML cluster configuration file:

  1. Create YAML cluster configuration file from the console

  2. Export a YAML cluster configuration file from an existing cluster

Create YAML cluster configuration file from the console

  1. Open the Create a cluster page in the Cloud Console.
  2. Select and fill in the fields to specify the type of cluster the Dataproc Hub will spawn for users.
    1. At the bottom of the page, select "Equivalent REST".
    2. Copy the generated JSON block, excluding the leading POST request line, then paste the JSON block into an online JSON-to-YAML converter (search for "Convert JSON to YAML").
    3. Copy the converted YAML into a local cluster-config-filename.yaml file.
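
If you prefer not to use an online converter, you can convert the copied JSON block locally. A minimal sketch, assuming the JSON was saved to a file named cluster-config.json and that Python 3 with the PyYAML package is installed:

python3 -c 'import json, sys, yaml; print(yaml.safe_dump(json.load(sys.stdin), default_flow_style=False))' \
    < cluster-config.json > cluster-config-filename.yaml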

Export a YAML cluster configuration file from an existing cluster

  1. Create a cluster that matches your requirements.
  2. Export the cluster configuration to a local cluster-config-filename.yaml file.
    gcloud dataproc clusters export cluster-name \
        --destination cluster-config-filename.yaml  \
        --region region
     

Save the YAML configuration file in Cloud Storage

Copy your local YAML cluster configuration file to your Cloud Storage bucket.

gsutil cp cluster-config-filename.yaml gs://bucket-name/
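
If you created more than one configuration file, you can copy them in a single command and verify the upload. The file, bucket, and folder names below are placeholders:

gsutil cp cluster-config-filename.yaml cluster-config-large.yaml gs://bucket-name/configs/
gsutil ls gs://bucket-name/configs/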

Set Dataproc Hub instance environment variables

Create a text file that sets Dataproc Hub instance environment variables. It must set required variables, and may set optional variables to allow users to customize the Dataproc cluster spawned by the Dataproc Hub instance.

NOTEBOOKS_LOCATION (Required)
  Cloud Storage bucket or bucket folder that contains user notebooks. The gs:// prefix is optional.
  Example: gs://bucket-name/

DATAPROC_CONFIGS (Required)
  Comma-delimited list of Cloud Storage paths to YAML cluster configuration files. The gs:// prefix is optional.
  Example: gs://bucket-name/folder/cluster-config-filename1.yaml,gs://bucket-name/folder/cluster-config-filename2.yaml

DATAPROC_LOCATIONS_LIST (Required)
  Zone suffixes in the region where the Dataproc Hub instance is located. Users can select one of these zones as the zone where their Dataproc cluster will be spawned.
  Example: b,c,d

DATAPROC_DEFAULT_SUBNET (Optional)
  Subnet on which the Dataproc Hub instance will spawn Dataproc clusters. Must be the same as the Dataproc Hub instance subnet.
  Example: https://www.googleapis.com/compute/v1/projects/project-id/regions/region/subnetworks/subnet-name

DATAPROC_SERVICE_ACCOUNT (Optional)
  Service account that spawned user clusters will run as. If not set, the default service account for Dataproc clusters is used.
  Example: service-account@project-id.iam.gserviceaccount.com

SPAWNER_DEFAULT_URL (Optional)
  Whether to show the Jupyter or JupyterLab UI on spawned Dataproc clusters by default.
  Example: / (Jupyter) or /lab (JupyterLab)

DATAPROC_ALLOW_CUSTOM_CLUSTERS (Optional)
  Whether to allow users to customize their Dataproc clusters.
  Example: true or false

DATAPROC_MACHINE_TYPES_LIST (Optional)
  List of machine types that users can choose for their spawned Dataproc clusters when cluster customization (DATAPROC_ALLOW_CUSTOM_CLUSTERS) is enabled.
  Example: n1-standard-4,n1-standard-8,e2-standard-4,n1-highcpu-4

NOTEBOOKS_EXAMPLES_LOCATION (Optional)
  Cloud Storage path to a notebooks bucket or bucket folder to be downloaded to the spawned Dataproc cluster when the cluster starts.
  Example: gs://bucket-name/

You can create a local environment variables file (referred to here as environment-variable-filename) that sets Dataproc Hub instance environment variables by running the following command. Fill in the placeholder values, and change or add variables and their values as needed.

cat <<EOF > environment-variable-filename
DATAPROC_CONFIGS=gs://bucket/cluster-config-filename.yaml
NOTEBOOKS_LOCATION=gs://bucket/notebooks
DATAPROC_DEFAULT_SUBNET=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/default
DATAPROC_LOCATIONS_LIST=b,c
EOF
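
As an example of adding optional variables from the table above, a file that also sets a subnet and service account, enables cluster customization, and defaults to the JupyterLab UI might look like the following sketch. All values are placeholders; replace them with your own.

cat <<EOF > environment-variable-filename
DATAPROC_CONFIGS=gs://bucket-name/configs/cluster-config-filename.yaml,gs://bucket-name/configs/cluster-config-large.yaml
NOTEBOOKS_LOCATION=gs://bucket-name/notebooks
DATAPROC_LOCATIONS_LIST=b,c
DATAPROC_DEFAULT_SUBNET=https://www.googleapis.com/compute/v1/projects/project-id/regions/region/subnetworks/subnet-name
DATAPROC_SERVICE_ACCOUNT=service-account@project-id.iam.gserviceaccount.com
SPAWNER_DEFAULT_URL=/lab
DATAPROC_ALLOW_CUSTOM_CLUSTERS=true
DATAPROC_MACHINE_TYPES_LIST=n1-standard-4,n1-standard-8
NOTEBOOKS_EXAMPLES_LOCATION=gs://bucket-name/examples
EOF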

Save the Dataproc Hub instance environment variables file in Cloud Storage

Copy your local Dataproc Hub instance environment variables file to your Cloud Storage bucket.

gsutil cp environment-variable-filename gs://bucket-name/folder-name/

Set IAM roles

By default, the Dataproc Hub instance runs as the project's default Compute Engine service account. You can choose to create the instance with a different service account. In either case, the Dataproc Hub instance service account must be granted the IAM (Identity and Access Management) roles described below.

To allow project users to access the Hub instance, you must also grant the roles/iam.serviceAccountUser role to individual users.

Configuring roles

Note that both the Hub service account and the notebook user must be granted the roles/iam.serviceAccountUser role, but on different targets: the Hub service account needs to be able to act as the Dataproc service account, and the notebook user needs to be able to act as the Hub service account. To apply these roles, either:

  1. Grant the serviceAccountUser role at the project level to the Hub service account (<hub-service-account>@<my-project>.iam.gserviceaccount.com) and the notebook user (<user>@gmail.com) to allow both to act as any service account in your project. Also grant the roles/dataproc.admin role at the project level to the Hub service account. OR

  2. Grant the following roles to the following principals:

    • Grant the roles/dataproc.worker role to the Dataproc service account (<dataproc-service-account>@<my-project>.iam.gserviceaccount.com).
    • Grant the roles/dataproc.admin role and the roles/iam.serviceAccountUser role to the Hub service account (<hub-service-account>@<my-project>.iam.gserviceaccount.com). The serviceAccountUser role can be granted at the project level or for access to the Dataproc service account (<dataproc-service-account>@<my-project>.iam.gserviceaccount.com).
    • Grant the roles/iam.serviceAccountUser role to the notebook user (<user>@gmail.com). This role can be granted at the project level or for access to the Hub service account (<hub-service-account>@<my-project>.iam.gserviceaccount.com).
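
The grants described in option 2 can be applied with gcloud. The following is a sketch; the project ID, service account names, and user email are placeholders that you must replace with your own values.

# Grant roles/dataproc.worker to the Dataproc service account at the project level.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:dataproc-service-account@project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"

# Grant roles/dataproc.admin to the Hub service account at the project level.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:hub-service-account@project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.admin"

# Allow the Hub service account to act as the Dataproc service account.
gcloud iam service-accounts add-iam-policy-binding \
    dataproc-service-account@project-id.iam.gserviceaccount.com \
    --member="serviceAccount:hub-service-account@project-id.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"

# Allow the notebook user to act as the Hub service account.
gcloud iam service-accounts add-iam-policy-binding \
    hub-service-account@project-id.iam.gserviceaccount.com \
    --member="user:user@gmail.com" \
    --role="roles/iam.serviceAccountUser"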

Create a Dataproc Hub instance

  1. Go to the Dataproc→Notebooks instances page in the Cloud Console.

  2. Click NEW INSTANCE, and then select "Smart Analytics Frameworks→Dataproc Hub".

  3. On the New notebook instance page, provide the following information:

    1. Instance name: Dataproc Hub instance name.
    2. Region: Select a region for the Dataproc Hub instance. Note: Dataproc clusters spawned by this Dataproc Hub instance will also be created in this region.
    3. Zone: Select a zone within the selected region.
    4. Environment: Dataproc Hub
      1. Environment variables:
        1. container-env-file: gs://bucket-name/folder-name/environment-variable-filename. Provide the name and the Cloud Storage location of your Dataproc Hub instance environment variables file.
    5. Machine configuration:
      1. Machine type: Select the machine type for the Compute Engine VM that runs the Dataproc Hub instance.
      2. Set other Machine configuration options.
    6. Click CREATE to launch the instance.
  4. When the instance is running, click the "JupyterLab" link on the Notebooks instances page to access the instance.
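
To confirm from the command line that the underlying Compute Engine VM is up before opening JupyterLab, you can check its status; the instance, zone, and project names are placeholders:

gcloud compute instances describe hub-instance-name \
    --zone=zone \
    --project=project-id \
    --format="value(status)"

The command prints RUNNING once the instance is ready.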

Cleaning up

Deleting the Dataproc Hub instance

  • To delete your Dataproc Hub instance:
    gcloud compute instances delete --project=${PROJECT} ${INSTANCE_NAME}
    

Deleting the bucket

  • To delete the Cloud Storage bucket you created in Before you begin, including the data files stored in the bucket:
    gsutil -m rm -r gs://${BUCKET_NAME}
    

What's next