Create a multi-tenant cluster using service accounts

Dataproc service-account-based secure multi-tenancy lets you share a cluster with multiple users, with a set of user accounts mapped to service accounts when the cluster is created. Users can submit interactive workloads, such as Jupyter notebooks, to kernels running on the multi-tenant cluster with isolated user environments.

When a user submits a job to the multi-tenant cluster:

The job runs as a specific OS user with a specific Kerberos principal.
The job accesses Google Cloud resources using a mapped service account.

This document shows you how to create a Dataproc multi-tenant cluster, and then launch and connect a Jupyter notebook to a PySpark Kernel running on the cluster.

Considerations and limitations

When you create a multi-tenant cluster:

The cluster is available only to Google Account users with mapped service accounts. Google groups can't be mapped. Unmapped users can't run jobs on the cluster.
Kerberos is enabled and configured on the cluster for secure intra-cluster communication. End user authentication through Kerberos is not supported.
Direct SSH access to the cluster and Compute Engine features, such as the ability to run startup scripts on cluster VMs, are blocked. Also, jobs cannot run with sudo privileges.
Dataproc Workflows are not supported.

Create a multi-tenant cluster

You enable the multi-tenant feature when you create a Dataproc cluster.

Console

Create a Dataproc cluster using the Google Cloud console, as follows:

In the Google Cloud console, go to the Dataproc Create a Dataproc cluster on Compute Engine page.
On the Set up cluster panel:
1. Under Components:
  1. Under Component Gateway, select Enable component gateway.
  2. Under Optional components, select Jupyter Kernel Gateway to let multiple users connect their Jupyter notebooks to the multi-tenant cluster.

On the Customize cluster panel:

Under Cluster properties:

To allow adding or removing multi-tenant users without re-creating the cluster (see Update multi-tenant cluster users), click Add Properties, then add the dataproc prefix and the dataproc.dynamic.multi.tenancy.enabled property, and set its value to true.

Recommendation: Since YARN consumes significant cluster resources for each notebook kernel running on a multi-tenant cluster, add Spark and YARN properties to increase resource allocation.

Example:

Prefix	Key	Value
spark	spark.driver.memory	5g
spark	spark.executor.memory	5g
spark	spark.executor.cores	2
capacity-scheduler	yarn.scheduler.capacity.maximum-am-resource-percent	0.5

On the Manage security panel:
1. Under Project access, select Enables the cloud-platform scope for this cluster.
2. Under Secure Multi Tenancy:
  1. Select Enable.
  2. Under Multi-tenancy Mapping:
    1. Click Add Multi-tenancy Mapping to add mappings of user accounts to service accounts.
Confirm or input other cluster settings (see Create a Dataproc cluster using the Google Cloud console).
Click Create.

gcloud

Use the gcloud dataproc clusters create command with the --secure-multi-tenancy-user-mapping flag to specify a list of user account to service-account mappings.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --secure-multi-tenancy-user-mapping=USER_MAPPINGS \
    --properties "dataproc:dataproc.dynamic.multi.tenancy.enabled=true" \
    --service-account=CLUSTER_SERVICE_ACCOUNT@iam.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/iam \
    --optional-components=JUPYTER_KERNEL_GATEWAY \
    --enable-component-gateway \
    other args ...

Notes:

USER_MAPPINGS: Specify a comma-separated list that maps user accounts to service accounts.

--secure-multi-tenancy-user-mapping=UserA@my-company.com:SERVICE_ACCOUNT_FOR_USERA@iam.gserviceaccount.com,UserB@my-company.com:SERVICE_ACCOUNT_FOR_USERB@iam.gserviceaccount.com,UserC@my-company.com:SERVICE_ACCOUNT_FOR_USERC@iam.gserviceaccount.com
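When the list of users is long, assembling the comma-separated value by hand is error-prone. The following is a minimal sketch, assuming a POSIX shell, that builds the USER_MAPPINGS value from one user and service account pair per line. The account names are the placeholders used in this document, and the loop is illustrative rather than part of the gcloud CLI.

```shell
# Build the USER_MAPPINGS value from "USER SERVICE_ACCOUNT" pairs, one per line.
# The accounts listed in the here-document are placeholders.
USER_MAPPINGS=""
while read -r user sa; do
  # Skip blank lines.
  [ -z "$user" ] && continue
  # Append "user:service-account", inserting a comma before every entry but the first.
  USER_MAPPINGS="${USER_MAPPINGS:+$USER_MAPPINGS,}${user}:${sa}"
done <<'EOF'
UserA@my-company.com SERVICE_ACCOUNT_FOR_USERA@iam.gserviceaccount.com
UserB@my-company.com SERVICE_ACCOUNT_FOR_USERB@iam.gserviceaccount.com
EOF

# Print the flag as it would be passed to gcloud dataproc clusters create.
echo "--secure-multi-tenancy-user-mapping=${USER_MAPPINGS}"
```

The resulting variable can be passed to the create command as --secure-multi-tenancy-user-mapping="${USER_MAPPINGS}".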

Use a YAML mapping file: Instead of using the --secure-multi-tenancy-user-mapping flag to specify the user account to service account mappings, you can use the --identity-config-file flag to specify a local or Cloud Storage YAML file that contains the mappings.

--identity-config-file=LOCAL_FILE or gs://BUCKET/FOLDER/FILENAME

Each line in the mapping file maps a user account to a service account. The first line contains the user_service_account_mapping: header.

user_service_account_mapping:
  UserA@my-company.com: SERVICE_ACCOUNT_FOR_USERA@iam.gserviceaccount.com
  UserB@my-company.com: SERVICE_ACCOUNT_FOR_USERB@iam.gserviceaccount.com
  UserC@my-company.com: SERVICE_ACCOUNT_FOR_USERC@iam.gserviceaccount.com
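As a sketch of how such a file might be produced, the following shell snippet writes a local mapping file with a here-document and runs a quick sanity check on it. The file name identity-config.yaml and the two-space indentation with a space after each colon are assumptions chosen to make the file valid YAML; the account names are placeholders.

```shell
# Write a local mapping file. The file name is an assumption; use any path you like.
CONFIG_FILE="identity-config.yaml"

cat > "$CONFIG_FILE" <<'EOF'
user_service_account_mapping:
  UserA@my-company.com: SERVICE_ACCOUNT_FOR_USERA@iam.gserviceaccount.com
  UserB@my-company.com: SERVICE_ACCOUNT_FOR_USERB@iam.gserviceaccount.com
EOF

# Sanity check: show the header line, then count the mapping lines.
head -n 1 "$CONFIG_FILE"
grep -c '@my-company.com: ' "$CONFIG_FILE"
```

The file can then be referenced with --identity-config-file=identity-config.yaml, or copied to Cloud Storage and referenced by its gs:// path.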

--properties "dataproc:dataproc.dynamic.multi.tenancy.enabled=true": This property allows adding or removing multi-tenant cluster users without re-creating the cluster (see Update multi-tenant cluster users).

Recommendation: Since YARN consumes significant cluster resources for each notebook kernel running on a multi-tenant cluster, add Spark and YARN properties to increase resource allocation.

Example:
```
--properties="\
spark:spark.driver.memory=5g,\
spark:spark.executor.memory=5g,\
spark:spark.executor.cores=2,\
capacity-scheduler:yarn.scheduler.capacity.maximum-am-resource-percent=0.5"
```
CLUSTER_SERVICE_ACCOUNT (Optional): You can use the --service-account flag to specify a custom VM service account for the cluster. If you omit this flag, the default cluster VM service account, PROJECT_NUMBER-compute@developer.gserviceaccount.com, is used.

Recommendation: Use different cluster service accounts for different clusters to allow each cluster VM service account to impersonate only a limited group of mapped user service accounts.

--scopes=https://www.googleapis.com/auth/iam: This scope is required for the cluster service account to perform impersonation.

--enable-component-gateway and --optional-components=JUPYTER_KERNEL_GATEWAY: Enabling the Dataproc Component Gateway and the Jupyter Kernel Gateway lets multiple users connect their Jupyter notebooks to the multi-tenant cluster.

API

Use the SecurityConfig.IdentityConfig.userServiceAccountMapping field to specify a list of user account to service account mappings.

Grant Identity and Access Management permissions

To connect user notebooks to notebook kernels running on a multi-tenant cluster, mapped users, mapped service accounts, and the cluster VM service account must have IAM permissions needed to access resources.

Mapped user permissions

Each mapped user must have the dataproc.clusters.get and dataproc.clusters.use permissions, which are needed for the user to access and connect to notebook kernels running on the multi-tenant cluster. You can grant the Dataproc Editor role (roles/dataproc.editor), which contains these permissions (see Grant a single IAM role), or create a custom role with these permissions.
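One possible way to grant the Dataproc Editor role to several mapped users is to generate one gcloud projects add-iam-policy-binding command per user and review the commands before running them. The sketch below only prints the commands. PROJECT_ID and the user list are hypothetical placeholders, and granting the role at the project level is an assumption; a custom role or a narrower grant may suit your environment better.

```shell
# Hypothetical placeholders: replace with your project ID and mapped users.
PROJECT_ID="my-project"
MAPPED_USERS="UserA@my-company.com UserB@my-company.com"

# Generate one grant command per mapped user so the list can be reviewed first.
GRANT_COMMANDS=$(
  for user in $MAPPED_USERS; do
    printf 'gcloud projects add-iam-policy-binding %s --member=user:%s --role=roles/dataproc.editor\n' \
      "$PROJECT_ID" "$user"
  done
)

# Print the commands; nothing is executed by this sketch.
echo "$GRANT_COMMANDS"
```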

Mapped service account permissions

Each mapped service account must have permissions needed by the mapped user's notebook application, such as access to a Cloud Storage bucket or access to a BigQuery table (see Manage access to service accounts).

VM service account permissions

The multi-tenant cluster VM service account must have the iam.serviceAccounts.getAccessToken permission on each mapped service account. You can grant the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator), which contains this permission (see Manage access to service accounts), or create a custom role with this permission. See Dataproc VM service account for information on other VM service account roles.
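The grant described above is made on each mapped service account, not on the project. As a minimal sketch, the following loop uses gcloud iam service-accounts add-iam-policy-binding to give the cluster VM service account the Service Account Token Creator role on every mapped service account. The account names are the placeholders used in this document, and the run helper with its DRY_RUN switch is a hypothetical convenience that prints each command instead of executing it.

```shell
# Placeholders: the cluster VM service account and the mapped service accounts.
CLUSTER_SA="CLUSTER_SERVICE_ACCOUNT@iam.gserviceaccount.com"
MAPPED_SAS="SERVICE_ACCOUNT_FOR_USERA@iam.gserviceaccount.com SERVICE_ACCOUNT_FOR_USERB@iam.gserviceaccount.com"

# Hypothetical helper: with DRY_RUN=1 the command is printed, otherwise it is executed.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Grant the role on each mapped service account (resource-level binding).
for sa in $MAPPED_SAS; do
  run gcloud iam service-accounts add-iam-policy-binding "$sa" \
    --member="serviceAccount:${CLUSTER_SA}" \
    --role="roles/iam.serviceAccountTokenCreator"
done
```

Setting DRY_RUN=0 runs the commands for real, which requires an authenticated gcloud installation and permission to set IAM policy on the mapped service accounts.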

Connect Jupyter notebooks to a multi-tenant cluster kernel

Mapped multi-tenant cluster users can connect their Vertex AI Workbench or user-managed Jupyter notebook to the kernels installed on the multi-tenant cluster.

Vertex AI notebook

To create and connect a Jupyter notebook to the multi-tenant cluster, do the following:

Create a Vertex AI Workbench instance.
On the Workbench Instances tab, click the Open JupyterLab link for your instance.
Under Dataproc Cluster Notebooks, click the PySpark (YARN Cluster) on MULTI_TENANCY_CLUSTER_NAME card to connect to and launch a new Jupyter PySpark notebook.

User-managed notebook

To create and connect a user-managed Jupyter notebook to your Dataproc multi-tenant cluster, follow the steps to Install the JupyterLab extension on your user-managed VM.

Update multi-tenant cluster users (Preview)

If you set the dataproc:dataproc.dynamic.multi.tenancy.enabled cluster property to true when you created a multi-tenant cluster, you can add, remove, or replace multi-tenant cluster users after cluster creation.

Add users

The following update command uses the --add-user-mappings flag to add two new user account to service account mappings to the secure multi-tenant cluster.

gcloud dataproc clusters update CLUSTER_NAME \
    --region=REGION \
    --add-user-mappings=new-user1@my-company.com=SERVICE_ACCOUNT_FOR_NEW_USER1@iam.gserviceaccount.com,new-user2@my-company.com=SERVICE_ACCOUNT_FOR_NEW_USER2@iam.gserviceaccount.com

Remove users

The following update command uses the --remove-user-mappings flag to remove two users from the multi-tenant cluster. The flag accepts the user accounts of the users to be removed.

gcloud dataproc clusters update CLUSTER_NAME \
    --region=REGION \
    --remove-user-mappings=UserB@my-company.com,UserC@my-company.com

Replace users

You can use the update command with the --identity-config-file flag to replace the existing set of users with a new set. This flag is useful to both add and remove users with one update command.

gcloud dataproc clusters update CLUSTER_NAME \
    --region=REGION \
    --identity-config-file=identity-config.yaml