This page describes service accounts and how they can be used with Dataproc.
What are service accounts?
A service account is a special account that can be used by services and applications running on your Compute Engine instance to interact with other Google Cloud APIs. Applications can use service account credentials to authenticate to a set of APIs and perform actions within the permissions granted to the service account and the virtual machine instance.
When created, Compute Engine virtual machines can be configured to use a specific service account. If a service account is not specified, a default service account is used. For more information, review the Compute Engine service account documentation.
Service accounts in Dataproc
Dataproc clusters are built on top of Compute Engine virtual machines. Specifying a user-managed service account when creating a Dataproc cluster allows you to use that service account for the Dataproc virtual machines in that cluster. If a service account is not specified, Dataproc virtual machines use the default Google-managed Compute Engine service account.
Why specify a service account?
Service accounts have IAM roles granted to them. Specifying a user-managed service account when creating a Dataproc cluster allows you to create and use clusters with fine-grained access to and control over Cloud resources. Using different user-managed service accounts with different Dataproc clusters gives each cluster different access to Cloud resources.
Service account requirements and limitations
- Service accounts can only be set when a cluster is created.
- You need to create a service account before creating the Dataproc cluster that will be associated with the service account.
- Once set, the service account used for a cluster cannot be changed.
- Make sure that service accounts have appropriate IAM roles for your needs.
- Service accounts used with Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).
- Service accounts must reside within the project the cluster will be created in.
- Compute Engine virtual machines used in Dataproc clusters still need specific access scopes. Access scopes are also limited to the service to which they apply. For example, if a Dataproc cluster has been granted only the https://www.googleapis.com/auth/devstorage.full_control scope for Cloud Storage, it can't use that scope to make requests to BigQuery.
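The requirements above can be sketched with gcloud: create the service account in the cluster's project, then grant it the Dataproc Worker role before creating the cluster. The account name `my-dataproc-sa` and project `my-project-id` are placeholders, not values from this page.

```shell
# Create a user-managed service account in the project that will host the
# cluster (my-dataproc-sa and my-project-id are placeholder names).
gcloud iam service-accounts create my-dataproc-sa \
    --project=my-project-id \
    --display-name="Dataproc cluster service account"

# Grant the Dataproc Worker role, which Dataproc cluster VMs require.
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:my-dataproc-sa@my-project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```

You can then pass this account's email to the cluster at creation time; remember that once the cluster is created, its service account cannot be changed.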
Default and minimum scopes
If service account scopes are not specified, Dataproc uses the following default set of scopes:
https://www.googleapis.com/auth/bigquery
https://www.googleapis.com/auth/bigtable.admin.table
https://www.googleapis.com/auth/bigtable.data
https://www.googleapis.com/auth/cloud.useraccounts.readonly
https://www.googleapis.com/auth/devstorage.full_control
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write

If custom scopes are specified, Dataproc uses the combination of the user-specified scopes and the following minimum set of Dataproc-required scopes:
https://www.googleapis.com/auth/cloud.useraccounts.readonly
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write
Using service accounts
gcloud Command
Use the gcloud dataproc clusters create command to create a cluster with a user-specified service account and access scopes.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --service-account=service-account-name@project-id.iam.gserviceaccount.com \
    --scopes=scope[, ...]
REST API
You can set the serviceAccount and serviceAccountScopes fields in the GceClusterConfig object as part of a clusters.create API request.
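As a sketch, a clusters.create request body setting these fields might look like the following. The project, service account email, and scope are placeholder values, not values from this page.

```json
{
  "clusterName": "cluster-name",
  "config": {
    "gceClusterConfig": {
      "serviceAccount": "my-dataproc-sa@my-project-id.iam.gserviceaccount.com",
      "serviceAccountScopes": [
        "https://www.googleapis.com/auth/cloud-platform"
      ]
    }
  }
}
```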