Dataproc Hub is a customized JupyterHub server. Admins configure and create Dataproc Hub instances that can spawn single-user Dataproc clusters to host Jupyter and JupyterLab notebook environments (see Use Dataproc Hub).
Launch notebooks for multiple users. You can create a Dataproc-enabled Vertex AI Workbench instance or install the Dataproc JupyterLab plugin on a VM to serve notebooks to multiple users.
Objectives
Define a Dataproc cluster configuration (or use one of the predefined config files).
Set Dataproc Hub instance environment variables.
Create a Dataproc Hub instance.
Before you begin
If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket.
Setting up your project
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataproc, Compute Engine, and Cloud Storage APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
Create a Cloud Storage bucket in your project to hold the data used in this tutorial.
- In the Google Cloud console, go to the Cloud Storage Buckets page.
- Click Create bucket.
- On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
- For Name your bucket, enter a name that meets the bucket naming requirements.
- For Choose where to store your data, do the following:
- Select a Location type option.
- Select a Location option.
- For Choose a default storage class for your data, select a storage class.
- For Choose how to control access to objects, select an Access control option.
- For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
- Click Create.
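As an alternative to the console steps above, you can create the bucket with the gcloud CLI; the bucket name and location below are placeholders to replace with your own values:

```shell
# Create a Cloud Storage bucket (bucket-name and US are placeholders).
gcloud storage buckets create gs://bucket-name --location=US
```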
Define a cluster configuration
A Dataproc Hub instance creates a cluster from configuration values contained in a YAML cluster configuration file.
Your cluster configuration can specify any feature or component available to Dataproc clusters (such as machine type, initialization actions, and optional components). The cluster image version must be 1.4.13 or higher; attempting to spawn a cluster with an earlier image version fails with an error.
Sample YAML cluster configuration file
clusterName: cluster-name
config:
  softwareConfig:
    imageVersion: 2.2-ubuntu22
    optionalComponents:
    - JUPYTER
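The sample above enables only the Jupyter component. A configuration can also set the other cluster features mentioned earlier, such as machine types and initialization actions. The following sketch uses Dataproc API field names; the zone, machine types, and initialization-script path are placeholders, not values required by this tutorial:

```yaml
clusterName: cluster-name
config:
  gceClusterConfig:
    # Placeholder zone for cluster VMs.
    zoneUri: us-central1-b
  masterConfig:
    machineTypeUri: n1-standard-4
  workerConfig:
    numInstances: 2
    machineTypeUri: n1-standard-4
  initializationActions:
    # Placeholder path to a startup script in Cloud Storage.
  - executableFile: gs://bucket-name/init-script.sh
  softwareConfig:
    imageVersion: 2.2-ubuntu22
    optionalComponents:
    - JUPYTER
```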
Each configuration must be saved in Cloud Storage. You can create and save multiple configuration files to give users a choice when they use Dataproc Hub to create their Dataproc cluster notebook environment.
There are two ways to create a YAML cluster configuration file:
Create YAML cluster configuration file from the console
- Open the Create a cluster page in the Google Cloud console, then select and fill in the fields to specify the type of cluster that Dataproc Hub will spawn for users.
- At the bottom of the left panel, select "Equivalent REST".
- Copy the generated JSON block, excluding the leading POST request line, then paste the JSON block into an online JSON-to-YAML converter (search online for "Convert JSON to YAML").
- Copy the converted YAML into a local cluster-config-filename.yaml file.
Export a YAML cluster configuration file from an existing cluster
- Create a cluster that matches your requirements.
- Export the cluster configuration to a local cluster-config-filename.yaml file.
gcloud dataproc clusters export cluster-name \
    --destination cluster-config-filename.yaml \
    --region region
Save the YAML configuration file in Cloud Storage
Copy your local YAML cluster configuration file to your Cloud Storage bucket.
gcloud storage cp cluster-config-filename.yaml gs://bucket-name/
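To confirm that the file was uploaded, you can list the bucket contents; the bucket name is the same placeholder used above:

```shell
# List the bucket to verify the configuration file was copied.
gcloud storage ls gs://bucket-name/
```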
Set Dataproc Hub instance environment variables
The administrator can set the hub environment variables listed in the following table to define attributes of the Dataproc clusters that hub users spawn.
| Variable | Description | Example |
|---|---|---|
| NOTEBOOKS_LOCATION | Cloud Storage bucket or bucket folder that contains user notebooks. The `gs://` prefix is optional. Default: the Dataproc staging bucket. | gs://bucket-name/ |
| DATAPROC_CONFIGS | Comma-delimited list of Cloud Storage paths to YAML cluster config files. The `gs://` prefix is optional. Default: gs://dataproc-spawner-dist/example-configs/, which contains the predefined example-cluster.yaml and example-single-node.yaml. | gs://cluster-config-filename.yaml |
| DATAPROC_LOCATIONS_LIST | Zone suffixes in the region where the Dataproc Hub instance is located. Users can select one of these zones as the zone where their Dataproc cluster will be spawned. Default: "b". | b,c,d |
| DATAPROC_DEFAULT_SUBNET | Subnet on which the Dataproc Hub instance spawns Dataproc clusters. Default: the Dataproc Hub instance subnet. | https://www.googleapis.com/compute/v1/projects/project-id/regions/region/subnetworks/subnet-name |
| DATAPROC_SERVICE_ACCOUNT | Service account that Dataproc VMs run as. Default: if not set, the default Dataproc service account is used. | service-account@project-id.iam.gserviceaccount.com |
| SPAWNER_DEFAULT_URL | Whether to show the Jupyter or JupyterLab UI on spawned Dataproc clusters by default. Default: "/lab". | `/` or `/lab`, for Jupyter or JupyterLab, respectively. |
| DATAPROC_ALLOW_CUSTOM_CLUSTERS | Whether to allow users to customize their Dataproc clusters. Default: false. | "true" or "false" |
| DATAPROC_MACHINE_TYPES_LIST | Machine types that users can choose for their spawned Dataproc clusters when cluster customization (DATAPROC_ALLOW_CUSTOM_CLUSTERS) is enabled. Default: empty (all machine types are allowed). | n1-standard-4,n1-standard-8,e2-standard-4,n1-highcpu-4 |
| NOTEBOOKS_EXAMPLES_LOCATION | Cloud Storage path to a notebooks bucket or bucket folder whose contents are downloaded to the spawned Dataproc cluster when it starts. Default: empty. | gs://bucket-name/ |
Setting hub environment variables
There are two ways to set hub environment variables:
Set hub environment variables from the console
When you create a Dataproc Hub instance from the User-Managed Notebooks tab on the Dataproc→Workbench page in the Google Cloud console, you can click the Populate button to open a Populate Dataproc Hub form that allows you to set each environment variable.
Set hub environment variables in a text file
Create the file. You can use a text editor to set Dataproc Hub instance environment variables in a local file. Alternatively, you can create the file by running the following command after filling in placeholder values and changing or adding variables and their values.
cat <<EOF > environment-variables-file
DATAPROC_CONFIGS=gs://bucket/cluster-config-filename.yaml
NOTEBOOKS_LOCATION=gs://bucket/notebooks
DATAPROC_LOCATIONS_LIST=b,c
EOF
Save the file in Cloud Storage. Copy your local Dataproc Hub instance environment variables file to your Cloud Storage bucket.
gcloud storage cp environment-variable-filename gs://bucket-name/folder-name/
Set Identity and Access Management (IAM) roles
Dataproc Hub includes the following identities with the following abilities:
- Administrator: creates a Dataproc Hub instance
- Data and ML user: accesses the Dataproc Hub UI
- Dataproc Hub service account: represents Dataproc Hub
- Dataproc service account: represents the Dataproc cluster that Dataproc Hub creates.
Each identity requires specific roles or permissions to perform their associated tasks. The table below summarizes IAM roles and permissions required by each identity.
Identity | Type | Role or permission |
---|---|---|
Dataproc Hub administrator | User or Service account | roles/notebooks.admin |
Dataproc Hub user | User | notebooks.instances.use, dataproc.clusters.use |
Dataproc Hub | Service account | roles/dataproc.hubAgent |
Dataproc | Service account | roles/dataproc.worker |
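As a sketch, the service-account and administrator roles in the table can be granted with project-level IAM bindings; the project ID and the user and service-account addresses below are placeholders. Note that the hub user's notebooks.instances.use and dataproc.clusters.use entries are permissions rather than predefined roles, so they are typically granted through a custom role or another role that contains them:

```shell
# Grant the Dataproc Hub administrator the Notebooks admin role.
gcloud projects add-iam-policy-binding project-id \
    --member="user:admin@example.com" \
    --role="roles/notebooks.admin"

# Grant the Dataproc Hub service account the hub agent role.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:hub-sa@project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.hubAgent"

# Grant the Dataproc VM service account the worker role.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:dataproc-sa@project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```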
Create a Dataproc Hub instance
Before you begin: To create a Dataproc Hub instance from the Google Cloud console, your user account must have compute.instances.create permission. Also, the service account of the instance (the Compute Engine default service account or your user-specified service account listed in IAM & admin > Service Accounts; see Dataproc VM service account) must have iam.serviceAccounts.actAs permission.
- Go to the Dataproc→Workbench page in the Google Cloud console, then select the User-Managed Notebooks tab.
- If not pre-selected as a filter, click in the Filter box, then select Environment: Dataproc Hub.
- Click New Notebook→Dataproc Hub.
- On the Create a user-managed notebook page, provide the following information:
- Notebook name: Dataproc Hub instance name.
- Region: Select a region for the Dataproc Hub instance. Dataproc clusters spawned by this Dataproc Hub instance will also be created in this region.
- Zone: Select a zone within the selected region.
- Environment:
  - Environment: Select Dataproc Hub.
  - Select a script to run after creation (optional): You can insert or browse and select an initialization action script or executable to run on the spawned Dataproc cluster.
  - Populate Dataproc Hub (optional): Click Populate to open a form that allows you to set each of the hub environment variables (see Set Dataproc Hub instance environment variables for a description of each variable). Dataproc uses default values for any unset environment variables. As an alternative, you can set Metadata key:value pairs to set environment variables (see next item).
  - Metadata: If you created a text file that contains your hub environment variable settings (see Setting hub environment variables), provide the name of the file as the key and the gs://bucket-name/folder-name/environment-variable-filename Cloud Storage location of the file as the value. Dataproc uses default values for any unset environment variables.
- Machine configuration:
  - Machine Type: Select the Compute Engine machine type.
  - Set other machine configuration options.
- Other Options:
  - You can expand and set or replace default values in the Disks, Networking, Permission, Security, and Environment upgrade and system health sections.
- Click Create to launch the Dataproc Hub instance.
The Open JupyterLab link for the Dataproc Hub instance becomes active after the instance is created. Users click this link to open the JupyterHub server page to configure and create a Dataproc JupyterLab cluster (see Use Dataproc Hub).
Clean up
Delete the Dataproc Hub instance
- To delete your Dataproc Hub instance:
gcloud compute instances delete --project=${PROJECT} ${INSTANCE_NAME}
Delete the bucket
- To delete the Cloud Storage bucket you created in Before you begin, including the data files stored in the bucket:
gcloud storage rm gs://${BUCKET_NAME} --recursive