Configure Dataproc Hub

Dataproc Hub is a customized JupyterHub server. Admins configure and create Dataproc Hub instances that can spawn single-user Dataproc clusters to host Jupyter and JupyterLab notebook environments (see Use Dataproc Hub).

Objectives

  1. Define a Dataproc cluster configuration (or use one of the predefined config files).

  2. Set Dataproc Hub instance environment variables.

  3. Create a Dataproc Hub instance.

Before you begin

If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket.

  1. Setting up your project

    1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
    2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

    4. Enable the Dataproc, Compute Engine, and Cloud Storage APIs.

    5. Install and initialize the Cloud SDK.

  2. Creating a Cloud Storage bucket in your project to hold the data used in this tutorial.

    1. In the Cloud Console, go to the Cloud Storage Browser page.

    2. Click Create bucket.
    3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
      • For Name your bucket, enter a name that meets the bucket naming requirements.
      • For Choose where to store your data, do the following:
        • Select a Location type option.
        • Select a Location option.
      • For Choose a default storage class for your data, select a storage class.
      • For Choose how to control access to objects, select an Access control option.
      • For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
    4. Click Create.

Define a cluster configuration

A Dataproc Hub instance creates a cluster from configuration values contained in a YAML cluster configuration file.

Your cluster configuration can specify any feature or component available to Dataproc clusters (such as machine type, initialization actions, and optional components). The cluster image version must be 1.4.13 or higher. Attempting to spawn a cluster with a lower image version fails with an error.
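
The version requirement above can be checked before a configuration file is uploaded. The following is a minimal sketch, not part of gcloud or Dataproc Hub: image_version_ok is a hypothetical helper name, and treating a version with no sub-minor part (such as 1.4) as acceptable is an assumption that such a value resolves to a recent release.

```shell
# image_version_ok VERSION
# Hypothetical helper: succeeds when VERSION meets the 1.4.13 minimum.
image_version_ok() (
  version="${1%%-*}"              # strip an OS suffix such as "-ubuntu18"
  case "$version" in
    ''|*[!0-9.]*) exit 1 ;;       # reject empty or non-numeric versions
  esac
  old_ifs=$IFS; IFS=.
  set -- $version                 # split "1.4.13" into 1, 4, 13
  IFS=$old_ifs
  major=${1:-0}; minor=${2:-0}; patch=${3-}
  [ "$major" -gt 1 ] && exit 0
  [ "$major" -lt 1 ] && exit 1
  [ "$minor" -gt 4 ] && exit 0
  [ "$minor" -lt 4 ] && exit 1
  # Exactly 1.4.x. Assumption: an omitted sub-minor version is acceptable.
  [ -z "$patch" ] && exit 0
  [ "$patch" -ge 13 ]
)

# Example: the imageVersion value used in the sample configuration.
if image_version_ok "1.5-ubuntu18"; then
  echo "image version is supported"
fi
```

The function runs in a subshell, so its temporary variables do not leak into the calling shell.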

Sample YAML cluster configuration file

clusterName: cluster-name
config:
  gceClusterConfig:
    metadata:
      'PIP_PACKAGES': 'google-cloud-core>=1.3.0 google-cloud-storage>=1.28.1'
  initializationActions:
  - executableFile: gs://dataproc-initialization-actions/python/pip-install.sh
  softwareConfig:
    imageVersion: 1.5-ubuntu18
    optionalComponents:
    - ANACONDA
    - JUPYTER

Each configuration must be saved in Cloud Storage. You can create and save multiple configuration files to give users a choice when they use Dataproc Hub to create their Dataproc cluster notebook environment.

There are two ways to create a YAML cluster configuration file:

  1. Create YAML cluster configuration file from the console

  2. Export a YAML cluster configuration file from an existing cluster

Create YAML cluster configuration file from the console

  1. Open the Create a cluster page in the Cloud Console, then select and fill in the fields to specify the type of cluster the Dataproc Hub will spawn for users.
    1. At the bottom of the left panel, select "Equivalent REST".
    2. Copy the generated JSON block, excluding the leading POST request line, then paste the JSON block into an online JSON-to-YAML converter (search online for "Convert JSON to YAML").
    3. Copy the converted YAML into a local cluster-config-filename.yaml file.

Export a YAML cluster configuration file from an existing cluster

  1. Create a cluster that matches your requirements.
  2. Export the cluster configuration to a local cluster-config-filename.yaml file.
    gcloud dataproc clusters export cluster-name \
        --destination cluster-config-filename.yaml  \
        --region region
     

Save the YAML configuration file in Cloud Storage

Copy your local YAML cluster configuration file to your Cloud Storage bucket.

gsutil cp cluster-config-filename.yaml gs://bucket-name/

Set Dataproc Hub instance environment variables

An administrator can set the hub environment variables listed in the table below to control attributes of the Dataproc clusters that hub users spawn.

Variable | Description | Example
NOTEBOOKS_LOCATION | Cloud Storage bucket or bucket folder that contains user notebooks. The `gs://` prefix is optional. Default: the Dataproc staging bucket. | gs://bucket-name/
DATAPROC_CONFIGS | Comma-delimited list of Cloud Storage paths to YAML cluster config files. The `gs://` prefix is optional. Default: gs://dataproc-spawner-dist/example-configs/, which contains the predefined example-cluster.yaml and example-single-node.yaml files. | gs://bucket-name/cluster-config-filename.yaml
DATAPROC_LOCATIONS_LIST | Zone suffixes in the region where the Dataproc Hub instance is located. Users can select one of these zones as the zone where their Dataproc cluster will be spawned. Default: "b". | b,c,d
DATAPROC_DEFAULT_SUBNET | Subnet on which the Dataproc Hub instance will spawn Dataproc clusters. Default: the Dataproc Hub instance subnet. | https://www.googleapis.com/compute/v1/projects/project-id/regions/region/subnetworks/subnet-name
DATAPROC_SERVICE_ACCOUNT | Service account that Dataproc VMs will run as. Default: the default Dataproc service account. | service-account@project-id.iam.gserviceaccount.com
SPAWNER_DEFAULT_URL | Whether to show the Jupyter or JupyterLab UI on spawned Dataproc clusters by default. Default: "/lab". | `/` or `/lab`, for Jupyter or JupyterLab, respectively
DATAPROC_ALLOW_CUSTOM_CLUSTERS | Whether to allow users to customize their Dataproc clusters. Default: false. | "true" or "false"
DATAPROC_MACHINE_TYPES_LIST | List of machine types that users are allowed to choose for their spawned Dataproc clusters, if cluster customization (DATAPROC_ALLOW_CUSTOM_CLUSTERS) is enabled. Default: empty (all machine types are allowed). | n1-standard-4,n1-standard-8,e2-standard-4,n1-highcpu-4
NOTEBOOKS_EXAMPLES_LOCATION | Cloud Storage path to a notebooks bucket or bucket folder to be downloaded to the spawned Dataproc cluster when the cluster starts. Default: empty. | gs://bucket-name/
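
A file of KEY=VALUE settings can be checked against the variable names and allowed values in the table before it is used. The sketch below is illustrative only: validate_hub_env is a hypothetical helper, it assumes unquoted values as in a plain environment file, and it checks only the two variables whose allowed values the table spells out.

```shell
# validate_hub_env FILE
# Hypothetical helper: fails if FILE contains an unknown variable name or a
# value outside the choices listed in the table above.
validate_hub_env() (
  status=0
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case "$key" in
      ''|'#'*) continue ;;          # skip blank lines and comments
      NOTEBOOKS_LOCATION|NOTEBOOKS_EXAMPLES_LOCATION|DATAPROC_CONFIGS) ;;
      DATAPROC_LOCATIONS_LIST|DATAPROC_MACHINE_TYPES_LIST) ;;
      DATAPROC_DEFAULT_SUBNET|DATAPROC_SERVICE_ACCOUNT) ;;
      SPAWNER_DEFAULT_URL)
        case "$value" in
          /|/lab) ;;
          *) echo "SPAWNER_DEFAULT_URL must be / or /lab" >&2; status=1 ;;
        esac ;;
      DATAPROC_ALLOW_CUSTOM_CLUSTERS)
        case "$value" in
          true|false) ;;
          *) echo "DATAPROC_ALLOW_CUSTOM_CLUSTERS must be true or false" >&2
             status=1 ;;
        esac ;;
      *) echo "unknown variable: $key" >&2; status=1 ;;
    esac
  done < "$1"
  exit "$status"
)
```

A typical call would be `validate_hub_env environment-variable-filename`, run before the file is copied to Cloud Storage.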

Setting hub environment variables

There are two ways to set hub environment variables:

  1. Set hub environment variables from the console

  2. Set hub environment variables in a text file

Set hub environment variables from the console

When you create a hub instance from the Dataproc→Notebooks instances page in the Cloud Console, you can click the POPULATE button to open a Populate Dataproc Hub form that allows you to set each variable (see Create a Dataproc Hub instance).

Set hub environment variables in a text file

  1. Create the file. You can use a text editor to set Dataproc Hub instance environment variables in a local file. Alternatively, you can create the file by running the following command after filling in placeholder values and changing or adding variables and their values.

    cat <<EOF > environment-variable-filename
    DATAPROC_CONFIGS=gs://bucket-name/cluster-config-filename.yaml
    NOTEBOOKS_LOCATION=gs://bucket-name/notebooks
    DATAPROC_LOCATIONS_LIST=b,c
    EOF
    

  2. Save the file in Cloud Storage. Copy your local Dataproc Hub instance environment variables file to your Cloud Storage bucket.

    gsutil cp environment-variable-filename gs://bucket-name/folder-name/
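
The DATAPROC_LOCATIONS_LIST value in the file holds zone suffixes rather than full zone names. As a small sketch of how those suffixes correspond to zones in the hub's region, the hypothetical helper expand_zones below joins a region with each suffix. The region us-central1 is only an example value, not one taken from this guide.

```shell
# expand_zones REGION SUFFIXES
# Hypothetical helper: prints one full zone name per comma-separated suffix.
expand_zones() (
  region=$1
  old_ifs=$IFS; IFS=,
  set -- $2                        # split "b,c" into separate suffixes
  IFS=$old_ifs
  for suffix in "$@"; do
    printf '%s-%s\n' "$region" "$suffix"
  done
)

# Example with an assumed region: prints us-central1-b and us-central1-c.
expand_zones us-central1 b,c
```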

Set Identity and Access Management (IAM) roles

Dataproc Hub involves the following identities, each with its own function:

  • Administrator: creates a Dataproc Hub instance
  • Data and ML user: accesses the Dataproc Hub UI
  • Dataproc Hub service account: represents Dataproc Hub
  • Dataproc service account: represents the Dataproc cluster that Dataproc Hub creates.

Each identity requires specific roles or permissions to perform its associated tasks. The table below summarizes the IAM roles and permissions required by each identity.

Identity | Type | Role or permission
Dataproc Hub administrator | User or Service account | roles/notebooks.admin
Dataproc Hub user | User | notebooks.instances.use, dataproc.clusters.use
Dataproc Hub | Service account | roles/dataproc.hubAgent
Dataproc | Service account | roles/dataproc.worker
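
As a rough illustration of how the three roles in the table might be granted, the sketch below prints (but does not run) gcloud projects add-iam-policy-binding commands for review. The helper name print_iam_bindings and every argument value are hypothetical. Granting the roles at the project level and treating the administrator as a user account are assumptions; your organization may use a narrower scope or a service account. The two user permissions (notebooks.instances.use and dataproc.clusters.use) are permissions rather than roles, so they are not covered here.

```shell
# print_iam_bindings PROJECT ADMIN_EMAIL HUB_SA_EMAIL DATAPROC_SA_EMAIL
# Hypothetical helper: prints one binding command per role in the table.
print_iam_bindings() (
  project=$1; admin=$2; hub_sa=$3; dataproc_sa=$4
  fmt='gcloud projects add-iam-policy-binding %s --member=%s:%s --role=%s\n'
  # Dataproc Hub administrator (assumed here to be a user account).
  printf "$fmt" "$project" user "$admin" roles/notebooks.admin
  # Dataproc Hub service account.
  printf "$fmt" "$project" serviceAccount "$hub_sa" roles/dataproc.hubAgent
  # Dataproc service account used by spawned clusters.
  printf "$fmt" "$project" serviceAccount "$dataproc_sa" roles/dataproc.worker
)

# Example with placeholder values; review the output before running it.
print_iam_bindings project-id admin@example.com \
  hub-sa@project-id.iam.gserviceaccount.com \
  service-account@project-id.iam.gserviceaccount.com
```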

Create a Dataproc Hub instance

  1. Before you begin: To create a Dataproc Hub instance from the Cloud Console, your user account must have the compute.instances.create permission. Also, the service account of the instance must have the iam.serviceAccounts.actAs permission. This service account is either the Compute Engine default service account or your user-specified service account listed in IAM & admin > Service Accounts (see Dataproc VM service account).

  2. Go to the Dataproc→Notebooks instances page in the Cloud Console.

  3. Click NEW INSTANCE→Dataproc Hub.

  4. On the New notebook instance page, provide the following information:

    1. Instance name: Dataproc Hub instance name.
    2. Region: Select a region for the Dataproc Hub instance. Note: Dataproc clusters spawned by this Dataproc Hub instance will also be created in this region.
    3. Zone: Select a zone within the selected region.
    4. Environment:
      1. Environment: Select "Dataproc Hub"
      2. Select a script to run after creation (optional): You can insert or browse and select an initialization action script or executable to run on the spawned Dataproc cluster.
      3. Populate Dataproc Hub (optional): Click POPULATE to open a form that allows you to set each of the hub environment variables (see Set Dataproc Hub instance environment variables for a description of each variable). As an alternative, use the container-env-file field below to point to a text file you created that contains your variable settings. Note: Dataproc uses default values for any environment variables that you do not set.
      4. Environment variables:
        1. container-env-file (optional): If you created a text file that contains your hub environment variable settings (see Setting hub environment variables), provide the name and the Cloud Storage location of your file.

          Example:

          gs://bucket-name/folder-name/environment-variable-filename

          Note: Dataproc uses default values for any environment variables that you do not set.

    5. Machine configuration:
      1. Machine Type: Select the Compute Engine machine type.
      2. Set other machine configuration options.
    6. Click CREATE to launch the instance.
  5. When the instance is running, click the "JupyterLab" link on the Notebooks instances page to access the instance.

Clean up

Deleting the Dataproc Hub instance

  • To delete your Dataproc Hub instance:
    gcloud compute instances delete --project=${PROJECT} ${INSTANCE_NAME}
    

Deleting the bucket

  • To delete the Cloud Storage bucket you created in Before you begin, including the data files stored in the bucket:
    gsutil -m rm -r gs://${BUCKET_NAME}
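
Both cleanup commands depend on shell variables, and an empty BUCKET_NAME would leave the gsutil command pointing at a bare gs:// path. As a cautious sketch, the hypothetical helper cleanup_commands below refuses to proceed unless PROJECT, INSTANCE_NAME, and BUCKET_NAME are all non-empty, and then prints the two commands from this section for review instead of running them.

```shell
# cleanup_commands
# Hypothetical helper: prints the cleanup commands only when every variable
# they depend on is set to a non-empty value.
cleanup_commands() (
  for name in PROJECT INSTANCE_NAME BUCKET_NAME; do
    eval "value=\${$name-}"        # read the variable named by $name
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      exit 1
    fi
  done
  printf 'gcloud compute instances delete --project=%s %s\n' \
    "$PROJECT" "$INSTANCE_NAME"
  printf 'gsutil -m rm -r gs://%s\n' "$BUCKET_NAME"
)
```

After setting the three variables, run `cleanup_commands` and inspect the printed commands before executing them.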
    

What's next