Customer-managed encryption keys (CMEK)

When you use Dataproc, cluster and job data is stored on Persistent Disks (PDs) associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. This PD and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK). The CMEK feature allows you to create, use, and revoke the key encryption key (KEK). Google still controls the data encryption key (DEK). For more information on Google data encryption keys, see Encryption at Rest.

Using CMEK

You can use CMEK to encrypt data on the PDs associated with the VMs in your Dataproc cluster and/or the cluster metadata and job output written to the Dataproc staging bucket. Follow Steps 1 and 2, then follow Steps 3, 4, or 5, below, to use CMEK with your cluster's PDs, Cloud Storage bucket, or both, respectively.

  1. Create a key using the Cloud Key Management Service (Cloud KMS). Copy the resource name, which you use in the next steps. The resource name is constructed as follows:

    projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name

  2. To enable the Compute Engine and Cloud Storage service accounts to use your key:

    1. Follow Item #5 in Compute Engine→Protecting Resources with Cloud KMS Keys→Before you begin to assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent.
    2. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent.
  3. You can use the Google Cloud CLI or the Dataproc API to set the key you created in Step 1 on the PDs associated with the VMs in the Dataproc cluster.

    gcloud Command

    Pass the Cloud KMS resource name obtained in Step 1 to the --gce-pd-kms-key flag when you create the cluster with the gcloud dataproc clusters create command.


    gcloud dataproc clusters create my-cluster-name \
        --region=region \
        --gce-pd-kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
        other args ...

    You can verify the key setting with the gcloud command-line tool. The output includes the gcePdKmsKeyName field:

    gcloud dataproc clusters describe cluster-name \
        --region=region
    ...
    configBucket: dataproc- ...
    gcePdKmsKeyName: projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name


    REST API

    Use ClusterConfig.EncryptionConfig.gcePdKmsKeyName as part of a clusters.create request.

    You can verify the key setting by issuing a clusters.get request. The returned JSON lists the gcePdKmsKeyName:

    ...
    "projectId": "project-id",
    "clusterName": "cluster-name",
    "config": {
      "encryptionConfig": {
        "gcePdKmsKeyName": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
      }
    },
    ...

  4. To use CMEK on the Cloud Storage bucket used by Dataproc to read/write cluster and job data, create a bucket with CMEK. Note: Use the key created in Step 1 when adding the key on the bucket. Then, pass the bucket name to the gcloud dataproc clusters create command when you create the cluster.


    gcloud dataproc clusters create my-cluster \
        --region=region \
        --bucket=name-of-CMEK-bucket \
        other args ...

    You can also pass CMEK-enabled buckets to the gcloud dataproc jobs submit command if your job takes bucket arguments (see the cmek-bucket arguments in the following PySpark job submission example).

    gcloud dataproc jobs submit pyspark gs://cmek-bucket/ \
        --region=region \
        --cluster=cluster-name \
        -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts

  5. To use CMEK on the PDs in your cluster and the Cloud Storage bucket used by Dataproc, pass both the --gce-pd-kms-key and the --bucket flags to the gcloud dataproc clusters create command as explained in Steps 3 and 4. You can create and use separate keys for PD data and bucket data.
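
The steps above can be sketched end to end as a shell script. This is a minimal sketch under stated assumptions, not a definitive procedure: the project number, region, and resource names are hypothetical placeholders, the run and kms_key_name helpers are illustrative and not part of gcloud, and a single key is used for both PD and bucket data. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of Steps 1 to 5 with one key for PDs and the staging bucket.
# Every value below is a hypothetical placeholder; override via environment.
set -eu

PROJECT_ID="${PROJECT_ID:-project-id}"
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
REGION="${REGION:-us-central1}"
KEY_RING="${KEY_RING:-key-ring-name}"
KEY_NAME="${KEY_NAME:-key-name}"
BUCKET="${BUCKET:-name-of-cmek-bucket}"
CLUSTER="${CLUSTER:-my-cluster-name}"

# Illustrative helper: print the command unless DRY_RUN=0, then execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Illustrative helper: build the key resource name in the Step 1 format.
kms_key_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s' "$1" "$2" "$3" "$4"
}

KMS_KEY="$(kms_key_name "$PROJECT_ID" "$REGION" "$KEY_RING" "$KEY_NAME")"

# Step 1: create a key ring and a symmetric encryption key.
run gcloud kms keyrings create "$KEY_RING" \
  --project="$PROJECT_ID" --location="$REGION"
run gcloud kms keys create "$KEY_NAME" \
  --project="$PROJECT_ID" --keyring="$KEY_RING" --location="$REGION" \
  --purpose=encryption

# Step 2: grant the Encrypter/Decrypter role to the Compute Engine and
# Cloud Storage service agents.
for agent in \
  "service-${PROJECT_NUMBER}@compute-system.iam.gserviceaccount.com" \
  "service-${PROJECT_NUMBER}@gs-project-accounts.iam.gserviceaccount.com"
do
  run gcloud kms keys add-iam-policy-binding "$KEY_NAME" \
    --project="$PROJECT_ID" --keyring="$KEY_RING" --location="$REGION" \
    --member="serviceAccount:${agent}" \
    --role=roles/cloudkms.cryptoKeyEncrypterDecrypter
done

# Step 4: create a bucket whose default encryption key is the CMEK key.
run gcloud storage buckets create "gs://${BUCKET}" \
  --project="$PROJECT_ID" --location="$REGION" \
  --default-encryption-key="$KMS_KEY"

# Step 5: create the cluster with CMEK on both the PDs and the bucket.
run gcloud dataproc clusters create "$CLUSTER" \
  --project="$PROJECT_ID" --region="$REGION" \
  --gce-pd-kms-key="$KMS_KEY" \
  --bucket="$BUCKET"
```

If you do not know the project number, gcloud projects describe project-id --format='value(projectNumber)' prints it. To use separate keys, build a second resource name with kms_key_name and pass it to --default-encryption-key instead.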

Cloud External Key Manager

Cloud External Key Manager (EKM) allows you to protect Dataproc data using keys managed by a supported external key management partner. The steps you follow to use EKM in Dataproc are the same as those you use to set up CMEK keys, with the following difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).
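
The difference described above can be sketched with the Cloud KMS commands for externally managed keys. Treat this as an assumption-laden sketch: the key ring, key name, and external key URI are hypothetical placeholders, the run and create_external_key helpers are illustrative, and the flags shown reflect the Cloud KMS CLI as commonly documented, so confirm them with gcloud kms keys create --help before relying on them. By default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch: create a Cloud KMS key whose key version points to an external key URI.
# Every value below is a hypothetical placeholder; override via environment.
set -eu

REGION="${REGION:-us-central1}"
KEY_RING="${KEY_RING:-key-ring-name}"
KEY_NAME="${KEY_NAME:-ekm-key-name}"
EXTERNAL_KEY_URI="${EXTERNAL_KEY_URI:-https://ekm.example.com/v0/keys/example-key}"

# Illustrative helper: print the command unless DRY_RUN=0, then execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

create_external_key() {
  # Create the key with the external protection level and no initial version.
  run gcloud kms keys create "$KEY_NAME" \
    --keyring="$KEY_RING" --location="$REGION" \
    --purpose=encryption \
    --protection-level=external \
    --skip-initial-version-creation \
    --default-algorithm=external-symmetric-encryption

  # Add a key version that references the externally managed key by URI.
  run gcloud kms keys versions create \
    --key="$KEY_NAME" --keyring="$KEY_RING" --location="$REGION" \
    --external-key-uri="$EXTERNAL_KEY_URI" \
    --primary
}

create_external_key
```

The resulting key's resource name has the same format as in Step 1, so it can be passed to --gce-pd-kms-key or set on the bucket in the same way.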

Cloud EKM errors

When you use Cloud EKM, an attempt to create a cluster may fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Cloud Logging. You can examine the failed cluster's errors from the "View Logs" tab.
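
As an alternative to the "View Logs" tab, the logged errors can likely also be read from the command line. The sketch below assumes that the entries are recorded against the cloud_dataproc_cluster resource type with a cluster_name label; the project ID, cluster name, and the run and log_filter helpers are hypothetical. By default the command is printed, not executed.

```shell
#!/bin/sh
# Sketch: query Cloud Logging for error entries of a failed Dataproc cluster.
# Every value below is a hypothetical placeholder; override via environment.
set -eu

PROJECT_ID="${PROJECT_ID:-project-id}"
CLUSTER="${CLUSTER:-cluster-name}"

# Illustrative helper: print the command unless DRY_RUN=0, then execute it.
# Note: the printed form drops shell quoting around the filter argument.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Illustrative helper: build a filter for error-level entries of one cluster.
# The resource type and label are assumptions about how entries are labeled.
log_filter() {
  printf 'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="%s" AND severity>=ERROR' "$1"
}

run gcloud logging read "$(log_filter "$CLUSTER")" \
  --project="$PROJECT_ID" \
  --limit=20 \
  --freshness=1d
```

If the query returns nothing, widen --freshness or drop the severity clause before concluding that no errors were logged.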