Dataproc cooperative multi-tenancy
Susheel Kaushik
Product Manager, Data Analytics
Chao Yuan
Software Engineer
Data analysts run their BI workloads on Dataproc to generate dashboards, reports, and insights. Because analysts from many different teams share these clusters to analyze data, Dataproc workloads need multi-tenancy. Today, workloads from all users on a cluster run as a single service account, so every workload has the same data access. Dataproc Cooperative Multi-tenancy enables multiple users with distinct data access to run workloads on the same cluster.
A Dataproc cluster normally runs all workloads as the cluster service account. Creating a Dataproc cluster with Dataproc Cooperative Multi-tenancy enabled lets you isolate user identities when running jobs that access Cloud Storage resources. The mapping of Cloud IAM users to service accounts is specified at cluster creation time, and multiple service accounts can be configured for a given cluster. Interactions with Cloud Storage are then authenticated as the service account mapped to the user who submits the job, instead of as the cluster service account.
Considerations
Dataproc Cooperative Multi-Tenancy has the following considerations:
- Set up the mapping of Cloud IAM users to service accounts by setting the dataproc:dataproc.cooperative.multi-tenancy.user.mapping property. When a user submits a job to the cluster, the VM service account impersonates the service account mapped to that user and interacts with Cloud Storage as that service account, through the GCS connector.
- Requires GCS connector version 2.1.4 or later.
- Does not support clusters with Kerberos enabled.
- Intended for jobs submitted through the Dataproc Jobs API only.
Objectives
This blog demonstrates the following objectives:
- Create a Dataproc cluster with Dataproc Cooperative Multi-tenancy enabled.
- Submit jobs to the cluster with different user identities and observe different access rules applied when interacting with Cloud Storage.
- Verify that interactions with Cloud Storage are authenticated with different service accounts, using StackDriver logging.
Before You Begin
Create a Project
- In the Cloud Console, on the project selector page, select or create a Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
- Enable the Dataproc API.
- Enable the StackDriver API.
- Install and initialize the Cloud SDK.
Simulate a Second User
Usually a second user would simply be another person, but you can also simulate one with a separate service account. Since you are going to submit jobs to the cluster as different users, activate a service account in your gcloud settings to act as the second user.
First, get the currently active account in gcloud. In most cases this is your personal account:
FIRST_USER=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")
- Create a service account.
- Grant the service account the permissions it needs to submit jobs to a Dataproc cluster.
- Create a key for the service account and use the key to activate it in gcloud. You can delete the key file after the service account is activated. A sketch of these steps is shown below.
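Here is a minimal sketch of those steps; the service account name second-user and the roles/dataproc.editor role are illustrative choices, not prescribed by the feature:

# Create a service account to act as the second user (name is illustrative).
PROJECT=$(gcloud config get-value project)
gcloud iam service-accounts create second-user
SECOND_USER=second-user@${PROJECT}.iam.gserviceaccount.com

# Allow this service account to submit Dataproc jobs in the project.
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member="serviceAccount:${SECOND_USER}" \
    --role="roles/dataproc.editor"

# Create a key and activate the service account in gcloud, then remove the key file.
gcloud iam service-accounts keys create /tmp/second-user-key.json \
    --iam-account=${SECOND_USER}
gcloud auth activate-service-account --key-file=/tmp/second-user-key.json
rm /tmp/second-user-key.json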
Now if you run the following command:
gcloud auth list --filter=status:ACTIVE --format="value(account)"
You will see that this service account is the active account. To proceed with the examples below, switch back to your original active account:
gcloud config set account ${FIRST_USER}
Configure the Service Accounts
Create 3 additional service accounts: one as the Dataproc VM service account, and the other two as the service accounts mapped to users (user service accounts). Note: we recommend using a per-cluster VM service account and only allowing it to impersonate the user service accounts you intend to use on that specific cluster.
- Grant the iam.serviceAccountTokenCreator role to the VM service account on the two user service accounts, so it can impersonate them.
- Grant the dataproc.worker role to the VM service account so it can perform necessary jobs on the cluster VMs. A sketch of these steps is shown below.
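One way to create and configure these accounts; the names vm-sa, user-sa-allow, and user-sa-deny are illustrative:

# Create the VM service account and two user service accounts.
PROJECT=$(gcloud config get-value project)
gcloud iam service-accounts create vm-sa
gcloud iam service-accounts create user-sa-allow
gcloud iam service-accounts create user-sa-deny
VM_SA=vm-sa@${PROJECT}.iam.gserviceaccount.com
USER_SA_ALLOW=user-sa-allow@${PROJECT}.iam.gserviceaccount.com
USER_SA_DENY=user-sa-deny@${PROJECT}.iam.gserviceaccount.com

# Let the VM service account impersonate the two user service accounts.
for SA in ${USER_SA_ALLOW} ${USER_SA_DENY}; do
  gcloud iam service-accounts add-iam-policy-binding ${SA} \
      --member="serviceAccount:${VM_SA}" \
      --role="roles/iam.serviceAccountTokenCreator"
done

# Let the VM service account act as a Dataproc worker.
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member="serviceAccount:${VM_SA}" \
    --role="roles/dataproc.worker"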
Create Cloud Storage Resource and Configure Service Accounts
- Create a bucket (for example, using the command below).
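For example, assuming ${BUCKET} holds a globally unique bucket name of your choice:

BUCKET=your-unique-bucket-name
gsutil mb gs://${BUCKET}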
- Write a simple file to the bucket.
echo "This is a simple file" | gsutil cp - gs://${BUCKET}/file
- Grant only the first user service account, USER_SA_ALLOW, admin access to the bucket.
gsutil iam ch serviceAccount:${USER_SA_ALLOW}:admin gs://${BUCKET}
Create a Cluster and Configure Service Accounts
- In this example, we will map the user “FIRST_USER” (personal user) to the service account with GCS admin permissions, and the user “SECOND_USER” (simulated with a service account) to the service account without GCS access.
- Note that cooperative multi-tenancy is only available in GCS connector version 2.1.4 onwards. It is pre-installed on Dataproc image version 1.5.11 and up, but you can use the connectors initialization action to install a specific version of the GCS connector on older Dataproc images.
- The VM service account needs to call the generateAccessToken API to fetch access tokens for the job service accounts, so make sure your cluster has the right scopes. In the example below, we use the cloud-platform scope.
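A sketch of such a cluster creation command is shown below; the mapping value format (comma-separated user:service-account pairs) and the image version are illustrative, so check the Dataproc documentation for the exact syntax:

# The leading ^#^ changes gcloud's list delimiter to '#' so the commas inside the
# mapping value are not treated as property separators (see: gcloud topic escaping).
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region=${REGION} \
    --image-version=1.5 \
    --service-account=${VM_SA} \
    --scopes=cloud-platform \
    --properties="^#^dataproc:dataproc.cooperative.multi-tenancy.user.mapping=${FIRST_USER}:${USER_SA_ALLOW},${SECOND_USER}:${USER_SA_DENY}"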
Note:
- The user service accounts might need access to the config bucket associated with the cluster in order to run jobs, so make sure you grant them access.
- On Dataproc clusters with 1.5+ images, by default, Spark and MapReduce history files are written to the temp bucket associated with the cluster, so you might also want to grant the user service accounts access to this bucket. One way to do both is sketched below.
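For example (a sketch; the config and temp bucket names can be read from the cluster description, and the objectAdmin role here is an illustrative choice):

# Look up the config and temp buckets associated with the cluster.
CONFIG_BUCKET=$(gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region=${REGION} --format="value(config.configBucket)")
TEMP_BUCKET=$(gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region=${REGION} --format="value(config.tempBucket)")

# Grant both user service accounts access to these buckets.
for SA in ${USER_SA_ALLOW} ${USER_SA_DENY}; do
  gsutil iam ch serviceAccount:${SA}:objectAdmin gs://${CONFIG_BUCKET}
  gsutil iam ch serviceAccount:${SA}:objectAdmin gs://${TEMP_BUCKET}
done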
Run Example Jobs
- Run a Spark job as “FIRST_USER”. Since the mapped service account has access to the GCS file gs://${BUCKET}/file, the job will succeed (a sketch of such a job follows).
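The original post's exact job is not reproduced here; the sketch below submits a small PySpark script (read_file.py, a hypothetical helper) that reads the file, which is enough to exercise the GCS access path:

# Make sure jobs are submitted as FIRST_USER.
gcloud config set account ${FIRST_USER}

# A tiny PySpark script that reads the GCS file passed as its first argument.
cat > read_file.py <<'EOF'
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()
for row in spark.read.text(sys.argv[1]).collect():
    print(row.value)
spark.stop()
EOF

gcloud dataproc jobs submit pyspark read_file.py \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    -- gs://${BUCKET}/file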
- Now run the same job as “SECOND_USER”. Since the mapped service account has no access to the GCS file gs://${BUCKET}/file, the job will fail, and the driver output will show that this is because the service account used does not have storage.objects.get access to the GCS file.
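Switching the active gcloud account and resubmitting the same sketch job reproduces this:

gcloud config set account ${SECOND_USER}
gcloud dataproc jobs submit pyspark read_file.py \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    -- gs://${BUCKET}/file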
Similarly for a Hive job (creating an external table in GCS, inserting records, then reading the records): when running it as user “FIRST_USER” (see the sketch below), it will succeed because the mapped service account has access to the bucket <BUCKET>.
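The original queries are not shown; a minimal Hive job along these lines (the employee table schema is illustrative) exercises the same access path:

# Submit as the first user.
gcloud config set account ${FIRST_USER}

gcloud dataproc jobs submit hive \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --execute="
      CREATE EXTERNAL TABLE employee (id INT, name STRING)
      LOCATION 'gs://${BUCKET}/employee';
      INSERT INTO employee VALUES (1, 'alice'), (2, 'bob');
      SELECT * FROM employee;"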
However, when querying the table employee as a different user, “SECOND_USER”, the job will use the second user service account, which has no access to the bucket, and the job will fail.
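For example, reusing the hypothetical table from the sketch above:

gcloud config set account ${SECOND_USER}
gcloud dataproc jobs submit hive \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --execute="SELECT * FROM employee;"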
Verify Service Account Authentication With Cloud Storage Through StackDriver Logging
First, check the usage of the first service account which has access to the bucket.
- Make sure the gcloud active account is your personal account
gcloud config set account ${FIRST_USER}
- Find logs about access to the bucket using the service account with GCS permissions
gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${USER_SA_ALLOW}\""
The results show that permission was always granted.
Next, check the service account which has no access to the bucket.
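Assuming ${USER_SA_DENY} holds the email of the second user service account (the one without bucket access), the analogous query is:

gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${USER_SA_DENY}\""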
This time, we see that access was never granted.
Finally, we can verify that the VM service account was never used directly to access the bucket (the following gcloud command returns 0 log entries):
gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${VM_SA}\""
Cleanup
Delete the cluster
gcloud dataproc clusters delete ${CLUSTER_NAME} --region ${REGION} --quiet
Delete the bucket
gsutil rm -r gs://${BUCKET}
Deactivate the service account used to simulate a second user
gcloud auth revoke ${SECOND_USER}
Delete the service accounts
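Assuming the service account variables from the sketches above (${VM_SA}, ${USER_SA_ALLOW}, ${USER_SA_DENY}, and ${SECOND_USER}):

for SA in ${VM_SA} ${USER_SA_ALLOW} ${USER_SA_DENY} ${SECOND_USER}; do
  gcloud iam service-accounts delete ${SA} --quiet
done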
Note
- The cooperative multi-tenancy feature does not yet work on clusters with Kerberos enabled.
- Jobs submitted by users without service accounts mapped to them will fall back to using the VM service account when accessing GCS resources. However, you can set the core:fs.gs.auth.impersonation.service.account property to change the fallback service account. The VM service account must also be able to call generateAccessToken to fetch access tokens for this fallback service account.
This blog demonstrated how you can use Dataproc Cooperative Multi-tenancy to share Dataproc clusters across multiple users while keeping their data access distinct.