This page describes service accounts and VM access scopes and how they are used with Dataproc.
What are service accounts?
A service account is a special account that can be used by services and applications running on a Compute Engine virtual machine (VM) instance to interact with other Google Cloud APIs. Applications can use service account credentials to authorize themselves to a set of APIs and perform actions on the VM within the permissions granted to the service account.
Dataproc service accounts
The following service accounts are granted permissions required to perform Dataproc actions in the project where your cluster is located.
Dataproc VM service account: VMs in a Dataproc cluster use this service account for Dataproc data plane operations, such as reading and writing data from and to Cloud Storage and BigQuery (see Dataproc VM Service Account (Data Plane identity)). The Compute Engine default service account,
[project-number]-compute@developer.gserviceaccount.com, is used as the Dataproc VM service account unless you specify a user-managed VM service account when you create a cluster.
The Dataproc Worker role provides the VM service account with the minimum permissions necessary to operate with Dataproc. Additional roles are necessary to grant permissions to read and write data to Google Cloud resources, such as BigQuery.
Dataproc Service Agent service account: Dataproc creates this service account with the Dataproc Service Agent role in a Dataproc user's Google Cloud project. This service account cannot be replaced by a user-specified service account when you create a cluster. This service agent account is used to perform Dataproc control plane operations, such as the creation, update, and deletion of cluster VMs (see Dataproc Service Agent (Control Plane identity)).
By default, Dataproc uses
service-[project-number]@dataproc-accounts.iam.gserviceaccount.com as the service agent account. If that service account doesn't exist, Dataproc uses the Google APIs service agent account,
[project-number]@cloudservices.gserviceaccount.com, for control plane operations.
Shared VPC networks: If the cluster uses a Shared VPC network, a Shared VPC Admin must grant both of the above service accounts the role of Network User for the Shared VPC host project. For more information, see:
- Creating a cluster that uses a VPC network in another project
- Shared VPC documentation: configuring service accounts as Service Project Admins
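The Network User grant for both accounts can be sketched as follows. This is a minimal dry-run sketch that only prints the `gcloud` commands instead of running them. The function names and the `HOST_PROJECT` and `PROJECT_NUMBER` values are hypothetical placeholders, and the sketch assumes the cluster uses the default VM service account and the default Dataproc service agent account; substitute your own account if you use a user-managed VM service account.

```shell
#!/bin/sh
# Hypothetical placeholder values; replace with your own.
HOST_PROJECT="my-host-project"
PROJECT_NUMBER="123456789012"

# Default VM service account (the Compute Engine default service account).
vm_service_account() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Default Dataproc service agent account.
dataproc_service_agent() {
  printf 'service-%s@dataproc-accounts.iam.gserviceaccount.com\n' "$1"
}

# Print (do not run) the command that grants Network User on the host project.
print_network_user_grant() {
  printf 'gcloud projects add-iam-policy-binding %s --member=serviceAccount:%s --role=roles/compute.networkUser\n' "$1" "$2"
}

print_network_user_grant "$HOST_PROJECT" "$(vm_service_account "$PROJECT_NUMBER")"
print_network_user_grant "$HOST_PROJECT" "$(dataproc_service_agent "$PROJECT_NUMBER")"
```

A Shared VPC Admin can review the two printed commands and then run them against the host project.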
Dataproc VM access scopes
VM access scopes grant or limit the access that VM instances have to APIs. They work
together with the VM service account to determine API access.
For example, if cluster VMs are granted only the
https://www.googleapis.com/auth/storage-full scope, applications running
on cluster VMs can call Cloud Storage APIs, but they cannot
make requests to BigQuery, even if the VM service account they
are running as is granted a BigQuery role with broad permissions.
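One way to see which scopes a cluster VM actually has is to read them from the Compute Engine metadata server and check for the one you need. The sketch below is illustrative: `fetch_scopes` and `has_scope` are hypothetical helper names, `fetch_scopes` only works when run on a Compute Engine VM (it is defined here but not called), and the sample scope list stands in for a VM that was granted a single Cloud Storage scope.

```shell
#!/bin/sh
# Fetch the scopes granted to this VM, one per line.
# Only works on a Compute Engine VM; not called in this sketch.
fetch_scopes() {
  curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
}

# Succeed if the newline-separated scope list in $1 contains the exact scope in $2.
has_scope() {
  printf '%s\n' "$1" | grep -Fxq "$2"
}

# Illustrative sample: a VM granted only a Cloud Storage scope.
sample_scopes="https://www.googleapis.com/auth/devstorage.full_control"

if has_scope "$sample_scopes" "https://www.googleapis.com/auth/bigquery"; then
  echo "BigQuery scope present"
else
  echo "BigQuery scope missing: BigQuery requests are blocked regardless of IAM roles"
fi
```

On a cluster VM, `has_scope "$(fetch_scopes)" "<scope-url>"` performs the same check against the real scope list. The check is an exact match, so it does not account for a broader scope such as cloud-platform that would also permit the call.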
Default Dataproc VM scopes. If you do not specify scopes when you create a cluster (see gcloud dataproc clusters create --scopes), Dataproc VMs have the following default set of scopes:
- https://www.googleapis.com/auth/bigquery
- https://www.googleapis.com/auth/bigtable.admin.table
- https://www.googleapis.com/auth/bigtable.data
- https://www.googleapis.com/auth/cloud.useraccounts.readonly
- https://www.googleapis.com/auth/devstorage.full_control
- https://www.googleapis.com/auth/devstorage.read_write
- https://www.googleapis.com/auth/logging.write
If you specify scopes when creating a cluster, cluster VMs will have the scopes you specify and the following minimum set of required scopes (even if you don't specify them):
- https://www.googleapis.com/auth/cloud.useraccounts.readonly
- https://www.googleapis.com/auth/devstorage.read_write
- https://www.googleapis.com/auth/logging.write
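The resulting scope set is the union of the scopes you pass and the required minimum. The following sketch computes that union locally so you can preview what a cluster's VMs would end up with. The `effective_scopes` helper is hypothetical and only models the rule described above; it does not query Dataproc.

```shell
#!/bin/sh
# Minimum scopes that Dataproc always adds (taken from the list above).
REQUIRED_SCOPES="https://www.googleapis.com/auth/cloud.useraccounts.readonly
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write"

# Print the union of the user-specified scopes ($1, comma-separated)
# and the required scopes, one per line, sorted and de-duplicated.
effective_scopes() {
  {
    printf '%s\n' "$1" | tr ',' '\n'
    printf '%s\n' "$REQUIRED_SCOPES"
  } | sed '/^$/d' | sort -u
}

effective_scopes "https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/logging.write"
```

With the sample input, the output contains four scopes: the BigQuery scope plus the three required ones, with the duplicate logging.write entry collapsed.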
Creating a cluster with a user-managed VM service account
You can specify a VM service account when you create a cluster. Dataproc does not support specifying or changing the VM service account after the cluster is created.
Why specify a user-managed VM service account? Service accounts have IAM roles granted to them. Specifying a user-managed VM service account when creating a Dataproc cluster allows you to create clusters with fine-grained access to and control of project resources. Using different user-managed VM service accounts with different Dataproc clusters allows you to set up clusters with different access to Cloud resources.
Before creating the cluster, create the service account within the project in which the cluster will be created. Grant the service account the Dataproc Worker role and any additional roles that will be needed by your jobs, for example, to allow reading and writing data from and to Google Cloud resources, such as BigQuery.
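The preparation steps can be sketched as a pair of `gcloud` commands. This is a dry-run sketch that prints the commands rather than running them; `PROJECT_ID`, `SA_NAME`, and the helper function names are hypothetical placeholders, and the email format assumes a standard (non-domain-scoped) project ID. Add further role bindings for whatever your jobs need.

```shell
#!/bin/sh
# Hypothetical placeholder values; replace with your own.
PROJECT_ID="my-project"
SA_NAME="my-dataproc-vm-sa"

# Full email of a user-managed service account in the given project.
sa_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Print the commands that create the account and grant it a project-level role.
print_setup_commands() {
  name="$1"; project="$2"; role="$3"
  printf 'gcloud iam service-accounts create %s --project=%s\n' "$name" "$project"
  printf 'gcloud projects add-iam-policy-binding %s --member=serviceAccount:%s --role=%s\n' \
    "$project" "$(sa_email "$name" "$project")" "$role"
}

# roles/dataproc.worker is the Dataproc Worker role.
print_setup_commands "$SA_NAME" "$PROJECT_ID" "roles/dataproc.worker"
```

Printing first makes it easy to review the account name and role before anything changes in the project.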
Use the gcloud dataproc clusters create command to create a new cluster with a user-specified VM service account and VM access scopes.
gcloud dataproc clusters create cluster-name \
    --region=region \
    --service-account=service-account-name@project-id.iam.gserviceaccount.com \
    --scopes=scope[, ...]
Currently, setting a user-managed Dataproc VM service account in the Cloud Console is not supported. You can set the "cloud-platform" scope on the VMs in your cluster by clicking "Allow API access to all Google Cloud services in the same project" in the Project access section of the Manage security panel on the Dataproc Create a cluster page in the Cloud Console.
- Service Accounts
- Dataproc permissions and IAM roles
- Dataproc principals and roles
- Dataproc Service Account Based Secure Multi-tenancy
- Dataproc Personal Cluster Authentication
- Dataproc Granular IAM