Dataproc on GKE IAM Roles and Identity

Data plane identity

Dataproc on GKE uses GKE workload identity to allow pods within the Dataproc on GKE cluster to act with the authority of the default Dataproc VM service account (the data plane identity). Workload identity requires the following permissions to update IAM policies on the Google service account (GSA) used by your Dataproc on GKE virtual cluster:

  • compute.projects.get
  • iam.serviceAccounts.getIamPolicy
  • iam.serviceAccounts.setIamPolicy

GKE workload identity links the following Kubernetes service accounts (KSAs) in your GKE cluster to the Dataproc VM service account:

  1. agent KSA (interacts with Dataproc control plane):
    serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]
  2. spark-driver KSA (runs Spark drivers):
    serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]
  3. spark-executor KSA (runs Spark executors):
    serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]
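The three member identifiers above share one pattern: project ID, namespace, and KSA name. As a minimal sketch of how they are composed, the following shell function builds the string from those three values. The function name wi_member and the example values are hypothetical and are not part of any Google tool.

```shell
# Hypothetical helper: prints the workload identity member for a KSA.
# Arguments: project ID, GKE namespace, KSA name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Placeholder values, for illustration only.
PROJECT="my-project"
DPGKE_NAMESPACE="my-dpgke-cluster"

# Print the member identifier for each of the three KSAs.
for ksa in agent spark-driver spark-executor; do
  wi_member "${PROJECT}" "${DPGKE_NAMESPACE}" "${ksa}"
done
```

Replace the placeholder values with your own project ID and namespace before reusing the output in an IAM binding.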

Assign roles

Grant permissions to the Dataproc VM service account to allow the spark-driver and spark-executor to access project resources, data sources, data sinks, and any other services required by your workload.

Example:

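The following commands grant roles to the default Dataproc VM service account so that Spark workloads running on Dataproc on GKE cluster VMs can access Cloud Storage buckets and BigQuery datasets in the project. Because each invocation of gcloud projects add-iam-policy-binding accepts a single --role flag, the command is run once per role. Replace project-number with your project number.

gcloud projects add-iam-policy-binding \
    --role=roles/storage.objectAdmin \
    --member="serviceAccount:project-number-compute@developer.gserviceaccount.com" \
    "${PROJECT}"

gcloud projects add-iam-policy-binding \
    --role=roles/bigquery.dataEditor \
    --member="serviceAccount:project-number-compute@developer.gserviceaccount.com" \
    "${PROJECT}"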

Custom IAM configuration

Dataproc on GKE uses GKE workload identity to link the default Dataproc VM service account (data plane identity) to the three GKE service accounts (KSAs).

To create and use a different Google service account (GSA) to link to the KSAs:

  1. Create the GSA (see Creating and managing service accounts).

    gcloud CLI example:

    gcloud iam service-accounts create "dataproc-${USER}" \
        --description "Used by Dataproc on GKE workloads."
    
    Notes:

    • The example sets the GSA name as "dataproc-${USER}", but you can use a different name.
  2. Set environment variables, replacing project-id and gke-namespace with your own values:

    PROJECT=project-id
    DPGKE_GSA="dataproc-${USER}@${PROJECT}.iam.gserviceaccount.com"
    DPGKE_NAMESPACE=gke-namespace
    
    Notes:

    • DPGKE_GSA: The examples set and use DPGKE_GSA as the name of the variable that contains the email address of your GSA. You can set and use a different variable name.
    • DPGKE_NAMESPACE: The default GKE namespace is the name of your Dataproc on GKE cluster.
  3. When you create the Dataproc on GKE cluster, add the following properties for Dataproc to use your GSA instead of the default GSA:

    --properties "dataproc:dataproc.gke.agent.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.driver.google-service-account=${DPGKE_GSA}" \
    --properties "dataproc:dataproc.gke.spark.executor.google-service-account=${DPGKE_GSA}" \

  4. Run the following commands to assign necessary Workload Identity permissions to the service accounts:

    1. Assign your GSA the dataproc.worker role to allow it to act as the agent:

      gcloud projects add-iam-policy-binding \
          --role=roles/dataproc.worker \
          --member="serviceAccount:${DPGKE_GSA}" \
          "${PROJECT}"
      
    2. Assign the agent KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/agent]" \
          "${DPGKE_GSA}"
      

    3. Grant the spark-driver KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-driver]" \
          "${DPGKE_GSA}"
      

    4. Grant the spark-executor KSA the iam.workloadIdentityUser role to allow it to act as your GSA:

      gcloud iam service-accounts add-iam-policy-binding \
          --role=roles/iam.workloadIdentityUser \
          --member="serviceAccount:${PROJECT}.svc.id.goog[${DPGKE_NAMESPACE}/spark-executor]" \
          "${DPGKE_GSA}"