Cloud Dataproc principals and roles

When you use the Cloud Dataproc service to create clusters and run jobs on your clusters, the service sets up the necessary Cloud Dataproc permissions and IAM roles in your project to access and use the Google Cloud Platform resources it needs to accomplish these tasks. However, if you do cross-project work, for example to access data in another project, you will need to set up the necessary roles and permissions to access cross-project resources.
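As a sketch of the cross-project case, granting a cluster's VM service account read access to a Cloud Storage bucket owned by another project might look like the following (the project IDs, service account name, and bucket name are placeholders, not values from this document):

```shell
# Grant a Dataproc VM service account from one project read access to
# objects in a bucket owned by another project. All names are placeholders.
gcloud storage buckets add-iam-policy-binding gs://other-project-data-bucket \
    --member="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"
```

The same pattern applies to other cross-project resources: identify the principal that needs access (see the sections below), then bind the appropriate role on the resource in the other project.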

To help you do cross-project work successfully, this document lists the different principals that use the Cloud Dataproc service and the roles and associated permissions necessary for those principals to access and use GCP resources.

Cloud Dataproc API User (End User identity)


This is the end user that calls the Cloud Dataproc service. The end user is usually an individual, but it can also be a service account if Cloud Dataproc is invoked through an API client or from another Google Cloud Platform service such as Compute Engine, Cloud Functions, or Cloud Composer.

Related roles and permissions:

Good to Know:

  • Cloud Dataproc API-submitted jobs run as root
  • Cloud Dataproc clusters inherit project-wide Compute Engine SSH metadata unless explicitly blocked by setting --metadata=block-project-ssh-keys=true when you create your cluster (see Cluster metadata)
  • If you gcloud compute ssh into a Cloud Dataproc cluster to submit a job via the command line, the job runs under the logged-in username, but this identity isn't enforced by default within the VM
  • HDFS user directories are created for each project-level SSH user. These HDFS directories are created at cluster-deployment time; a new (post-deployment) SSH user won't be given an HDFS directory on existing clusters
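For illustration, blocking project-wide SSH metadata as described above is done at cluster-creation time. The cluster name and region below are placeholders:

```shell
# Create a cluster that does not inherit project-wide Compute Engine
# SSH keys. Cluster name and region are placeholder values.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --metadata=block-project-ssh-keys=true
```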

Cloud Dataproc Service Agent (Control Plane identity)


Cloud Dataproc creates this service account with the Dataproc Service Agent role in a Cloud Dataproc user's GCP project. This service account cannot be replaced by a user-specified service account when you create a cluster. You do not need to configure this service account unless you are creating a cluster that uses a shared VPC network in another project.
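As a sketch of the shared VPC case mentioned above, the host project typically needs to grant the service agent permission to use the shared network. The service-agent address format, role, and project identifiers below are assumptions for illustration:

```shell
# Assumed: grant the Dataproc Service Agent from the service project
# permission to use the shared VPC network in the host project.
# The service-agent address, role, and project IDs are placeholders.
gcloud projects add-iam-policy-binding host-project-id \
    --member="serviceAccount:service-123456789012@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```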

This service account is used to perform a broad set of system operations, including:

  • get and list operations to confirm the configuration of resources such as images, firewalls, Cloud Dataproc initialization actions, and Cloud Storage buckets
  • Auto-creation of the Cloud Dataproc staging bucket if the staging bucket is not specified by the user
  • Writing cluster configuration metadata to the staging bucket
  • Creation of Compute Engine resources, including VM instances, instance groups, and instance templates
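For example, you can avoid auto-creation of the staging bucket by naming an existing bucket yourself at cluster-creation time (the cluster, region, and bucket names below are placeholders):

```shell
# Use an existing bucket as the Dataproc staging bucket instead of
# letting the service agent auto-create one. Names are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --bucket=my-existing-staging-bucket
```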

Related error: "The service account does not have read or list access to the resource."

Related roles and permissions:

  • Role: Dataproc Service Agent

Cloud Dataproc VM Service Account (Data Plane identity)


Cloud Dataproc VMs run as this service account, and user jobs are granted its permissions. If you do not specify a user-managed service account when creating a cluster, the default Compute Engine service account is used.
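Specifying a user-managed VM service account might look like this (the cluster, region, and account names are placeholders):

```shell
# Run cluster VMs as a user-managed service account rather than the
# default Compute Engine service account. Names are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --service-account=my-vm-sa@my-project.iam.gserviceaccount.com
```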

The VM service account must have permissions to:

  • read and write to the Cloud Dataproc staging bucket

The VM service account may also need permissions, depending on job requirements, to:

  • read and write to Cloud Storage, BigQuery, Stackdriver Logging, and to other Google Cloud Platform resources
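As an illustrative sketch of granting such job-dependent access, you would bind one role per product the jobs touch; the project ID, service account name, and role choice below are assumptions:

```shell
# Assumed illustration: grant the VM service account write access to
# BigQuery datasets for jobs that read and write BigQuery tables.
# Project ID and service account name are placeholders; repeat with
# other roles (e.g. Cloud Storage, Logging) as jobs require.
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-vm-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"
```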

Related roles and permissions:

For more information
