Dataproc principals and roles

When you use the Dataproc service to create clusters and run jobs on them, the service sets up the necessary Dataproc permissions and IAM roles in your project to access and use the Google Cloud resources it needs to accomplish these tasks. However, if you do cross-project work, for example to access data in another project, you must set up the roles and permissions needed to access cross-project resources.

To help you do cross-project work successfully, this document lists the different principals that use the Dataproc service and the roles and associated permissions necessary for those principals to access and use Google Cloud resources.

Dataproc API User (End User identity)

Example: username@example.com

This is the end user that calls the Dataproc service. The end user is usually an individual, but it can also be a service account if Dataproc is invoked through an API client or from another Google Cloud service such as Compute Engine, Cloud Functions, or Cloud Composer.
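
For example, an end user (or a service account acting on their behalf) often calls the service through the gcloud CLI or the Dataproc API. The following is a minimal sketch of a job submission; the bucket, cluster name, and region are placeholder values, not values defined in this document:

  # Submit a PySpark job as the calling end user (placeholder names).
  gcloud dataproc jobs submit pyspark gs://example-bucket/word_count.py \
      --cluster=example-cluster \
      --region=us-central1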

Related roles and permissions:

Good to Know:

  • Dataproc API-submitted jobs run as root
  • Dataproc clusters inherit project-wide Compute Engine SSH metadata unless you explicitly block it by setting --metadata=block-project-ssh-keys=true when you create your cluster (see Cluster metadata and the example command after this list)
  • If you use gcloud compute ssh to connect to a Dataproc cluster and submit a job from the command line, the job should run under the logged-in username, but this identity isn't enforced by default within the VM
  • HDFS user directories are created for each project-level SSH user. These directories are created at cluster-deployment time, and a new (post-deployment) SSH user won't be given an HDFS directory on existing clusters
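
As a minimal sketch of the metadata setting mentioned above, the following command creates a cluster that blocks project-wide SSH keys; the cluster name and region are placeholder values:

  # Create a cluster that does not inherit project-wide SSH metadata (placeholder names).
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --metadata=block-project-ssh-keys=true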

Dataproc Service Agent (Control Plane identity)

Example: service-project-number@dataproc-accounts.iam.gserviceaccount.com

Dataproc creates this service account with the Dataproc Service Agent role in a Dataproc user's Google Cloud project. This service account cannot be replaced by a user-specified service account when you create a cluster. You do not need to configure this service account unless you are creating a cluster that uses a shared VPC network in another project.
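
For example, when a cluster uses a Shared VPC network hosted in another project, a common setup is to grant this service agent the Compute Network User role on the host project (check the Shared VPC documentation for the exact requirements). The host project ID below is a placeholder; substitute your own project number in the service account name:

  # Grant the Dataproc Service Agent network access in the Shared VPC host project (placeholder IDs).
  gcloud projects add-iam-policy-binding shared-vpc-host-project \
      --member=serviceAccount:service-project-number@dataproc-accounts.iam.gserviceaccount.com \
      --role=roles/compute.networkUser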

This service account is used to perform a broad set of system operations, including:

  • get and list operations to confirm the configuration of resources such as images, firewalls, Dataproc initialization actions, and Cloud Storage buckets
  • Auto-creation of the Dataproc staging bucket if the staging bucket is not specified by the user
  • Writing cluster configuration metadata to the staging bucket
  • Creation of Compute Engine resources, including VM instances, instance groups, and instance templates

Related error: "The service account does not have read or list access to the resource."

Related roles and permissions:

  • Role: Dataproc Service Agent
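
If you encounter the related error above, one way to check whether this service account still holds the role is to inspect the project's IAM policy. The project ID below is a placeholder:

  # List the roles bound to the Dataproc Service Agent account (placeholder project ID).
  gcloud projects get-iam-policy example-project \
      --flatten="bindings[].members" \
      --filter="bindings.members:dataproc-accounts.iam.gserviceaccount.com" \
      --format="table(bindings.role, bindings.members)"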

Dataproc VM Service Account (Data Plane identity)

Example: project-number-compute@developer.gserviceaccount.com

Dataproc VMs run as this service account. User jobs are granted the permissions of this service account—your application code runs under this service account on Dataproc worker VMs.

You can specify a user-managed service account with the optional --service-account flag of the gcloud dataproc clusters create command, or with the GceClusterConfig.serviceAccount field of a Dataproc clusters.create API request. If you do not specify a user-managed service account when creating a cluster, the Compute Engine default service account, shown in the example above, is used.
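
For example, the following sketch creates a cluster whose VMs run as a user-managed service account; the cluster name, region, and service account are placeholder values:

  # Create a cluster that runs as a user-managed VM service account (placeholder names).
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --service-account=dataproc-vm-sa@example-project.iam.gserviceaccount.com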

The VM service account must have permissions to:

  • read and write to the Dataproc staging bucket (see the example command below)

The VM service account may also need permissions, according to job requirements, to:

  • read and write to Cloud Storage, BigQuery, Cloud Logging (formerly Stackdriver Logging), and other Google Cloud resources
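
As one example of the staging-bucket requirement in the first list above, you could grant the VM service account object read and write access on the bucket. The bucket name is a placeholder, and roles/storage.objectAdmin is one role that provides such access; narrower roles may also work:

  # Allow the VM service account to read and write objects in the staging bucket (placeholder bucket name).
  gcloud storage buckets add-iam-policy-binding gs://example-staging-bucket \
      --member=serviceAccount:project-number-compute@developer.gserviceaccount.com \
      --role=roles/storage.objectAdmin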

Related roles and permissions:
