Customer Managed Encryption Keys (CMEK)

When you use Cloud Dataproc, cluster and job data is stored on Persistent Disks (PDs) associated with the Compute Engine VMs in your cluster and in a Cloud Storage bucket. This PD and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK). The CMEK feature allows you to create, use, and revoke the KEK yourself; Google still controls the DEK. For more information on Google data encryption keys, see Encryption at Rest.

Using CMEK

You can use CMEK to encrypt data on the PDs associated with the VMs in your Cloud Dataproc cluster and/or the cluster metadata and job driver output written to the Cloud Storage bucket used by Cloud Dataproc (see ClusterConfig.configBucket and Accessing job driver output→CLOUD STORAGE tab). Follow Steps 1 and 2, then follow Steps 3, 4, or 5, below, to use CMEK with your cluster's PDs, Cloud Storage bucket, or both, respectively.

  1. Create a key using the Cloud Key Management Service (Cloud KMS). Copy the resource name, which you can use in the next steps. The resource name is constructed as follows:
    projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
    
  2. To enable the Compute Engine and Cloud Storage service accounts to use your key:
    1. Follow Item #5 in Compute Engine→Protecting Resources with Cloud KMS Keys→Before you begin to assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service account.
    2. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service account.
  3. You can use the gcloud command-line tool or the Cloud Dataproc API to set the key you created in Step 1 on the PDs associated with the VMs in the Cloud Dataproc cluster.

    gcloud Command

    Pass the Cloud KMS resource name obtained in Step 1 to the --gce-pd-kms-key flag when you create the cluster with the gcloud dataproc clusters create command.

    Example:

    gcloud dataproc clusters create my-cluster-name \
      --gce-pd-kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
      other args ...
    

    You can verify the key setting from the gcloud command-line tool.

    gcloud dataproc clusters describe cluster-name
    
    ...
      configBucket: dataproc-...
      encryptionConfig:
        gcePdKmsKeyName: projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
    ...
    

    REST API

    Use ClusterConfig.EncryptionConfig.gcePdKmsKeyName as part of a cluster.create request.

    You can verify the key setting by issuing a clusters.get request. The returned JSON lists the gcePdKmsKeyName:

    ...
    {
      "projectId": "project-id",
      "clusterName": "cluster-name",
      "config": {
        "encryptionConfig": {
          "gcePdKmsKeyName": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
        }
      },
    ...
    
  4. To use CMEK on the Cloud Storage bucket that Cloud Dataproc uses to read and write cluster and job data, create a bucket with CMEK, using the key created in Step 1 as the bucket's key. Then, pass the bucket name to the gcloud dataproc clusters create command when you create the cluster.

    Example:

    gcloud dataproc clusters create my-cluster \
      --bucket name-of-CMEK-bucket \
      other args
    

    You can also pass CMEK-enabled buckets to the gcloud dataproc jobs submit command if your job takes bucket arguments (see the ...cmek-bucket... bucket arguments in the following PySpark job submission example).

    Example:
    gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
      --cluster clustername -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
    
  5. To use CMEK on the PDs in your cluster and the Cloud Storage bucket used by Cloud Dataproc, pass both the --gce-pd-kms-key and the --bucket flags to the gcloud dataproc clusters create command as explained in Steps 3 and 4. You can create and use a separate key for PD data and bucket data.
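
The preparatory work in Steps 1, 2, and 4 can be sketched as a few shell functions. This is a minimal sketch under stated assumptions, not the documented procedure: the function names, the run wrapper, and the DRY_RUN variable are hypothetical helpers introduced here, and every location, key ring, key, bucket, and service account value you pass in is your own placeholder. The sketch assumes the gcloud kms and gsutil commands are installed and authenticated. By default it only prints the commands it would run.

```shell
# Hypothetical helpers for Steps 1, 2, and 4. Nothing runs until you call a
# function, and commands are only printed unless DRY_RUN=0 is set.

# Build the Cloud KMS key resource name in the format shown in Step 1.
# Arguments: project-id, location, key-ring-name, key-name.
kms_key_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Print the command when DRY_RUN is unset or 1; execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: create a key ring and a symmetric encryption key in it.
# Arguments: location, key-ring-name, key-name.
create_key() {
  run gcloud kms keyrings create "$2" --location "$1"
  run gcloud kms keys create "$3" --location "$1" \
    --keyring "$2" --purpose encryption
}

# Step 2: grant the Cloud KMS CryptoKey Encrypter/Decrypter role on the key
# to one service account. Call once per service account.
# Arguments: location, key-ring-name, key-name, service-account-email.
grant_key_access() {
  run gcloud kms keys add-iam-policy-binding "$3" \
    --location "$1" --keyring "$2" \
    --member "serviceAccount:$4" \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
}

# Step 4: create a bucket and set the key as its default encryption key.
# Arguments: bucket-name, bucket-location, full key resource name.
create_cmek_bucket() {
  run gsutil mb -l "$2" "gs://$1"
  run gsutil kms encryption -k "$3" "gs://$1"
}
```

For example, calling create_key us-central1 my-ring my-key with DRY_RUN unset prints the two gcloud kms commands without executing them, which lets you review them first; set DRY_RUN=0 to run them. The service account emails for grant_key_access are the ones identified in Step 2, and the output of kms_key_name is the value to pass to --gce-pd-kms-key in Step 3.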