Customer-managed encryption keys (CMEK)

When you use Dataproc, cluster and job data is stored on persistent disks associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. This persistent disk and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK).

The CMEK feature lets you create, use, and revoke the key encryption key (KEK). Google still controls the data encryption key (DEK). For more information on Google data encryption keys, see Encryption at Rest.

Use CMEK with cluster data

You can use customer-managed encryption keys (CMEK) to encrypt the following cluster data:

  • Data on the persistent disks attached to VMs in your Dataproc cluster
  • Job argument data submitted to your cluster, such as a query string submitted with a Spark SQL job
  • Cluster metadata, job driver output, and other data written to a Dataproc staging bucket that you create

Follow these steps to use CMEK with the encryption of cluster data:

  1. Create one or more keys using the Cloud Key Management Service. The resource name of a key, also called its resource ID, is used in the next steps and is constructed as follows:
    projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    
  2. Assign roles to the following service accounts and enable the Cloud KMS API:

    1. Follow item #5 in Compute Engine→Protecting Resources with Cloud KMS Keys→Before you begin to assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent service account.
    2. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent service account.

    3. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc service agent service account. You can use the Google Cloud CLI to assign the role:

        gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
        --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
        --role roles/cloudkms.cryptoKeyEncrypterDecrypter
      

      Replace the following:

      KMS_PROJECT_ID: the ID of your Google Cloud project that runs Cloud KMS. This project can also be the project that runs Dataproc resources.

      PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

    4. Enable the Cloud KMS API on the project that runs Dataproc resources.

    5. If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, then add the serviceusage.services.use permission to the custom role attached to the Dataproc Service Agent service account. If the Dataproc Service Agent role is attached to the Dataproc Service Agent service account, you can skip this step.
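The role assignments in step 2 can be sketched as a small dry-run script that prints the three binding commands instead of running them. This is an illustration only: the Compute Engine and Cloud Storage service agent email formats are assumptions based on the usual Google Cloud naming (only the Dataproc one appears above), and the function names and sample values are hypothetical. Confirm the actual service agent addresses on your project's IAM page before running the printed commands.

```shell
# Print the service agent emails that need the Encrypter/Decrypter role.
# Assumption: the compute-system and gs-project-accounts formats are the
# standard ones for these agents; verify them for your project.
service_agents() {
  printf '%s\n' \
    "service-$1@compute-system.iam.gserviceaccount.com" \
    "service-$1@gs-project-accounts.iam.gserviceaccount.com" \
    "service-$1@dataproc-accounts.iam.gserviceaccount.com"
}

# Print (do not run) one add-iam-policy-binding command per service agent.
# $1 = KMS_PROJECT_ID, $2 = PROJECT_NUMBER of the Dataproc project.
print_kms_bindings() {
  service_agents "$2" | while IFS= read -r agent; do
    printf 'gcloud projects add-iam-policy-binding %s --member serviceAccount:%s --role roles/cloudkms.cryptoKeyEncrypterDecrypter\n' \
      "$1" "$agent"
  done
}

# Hypothetical sample values.
print_kms_bindings my-kms-project 123456789012
```

Printing the commands first makes it easier to review the member addresses before any IAM policy is changed.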

  3. Pass the resource ID of your key to the Google Cloud CLI or the Dataproc API to use with cluster data encryption.

    gcloud CLI

    • To encrypt cluster persistent disk data using your key, pass the resource ID of your key to the --gce-pd-kms-key flag when you create the cluster.
      gcloud dataproc clusters create CLUSTER_NAME \
          --region=REGION \
          --gce-pd-kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
          other arguments ...
      

      You can verify the key setting with the gcloud CLI dataproc clusters describe command.

      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION
      

      Command output snippet:

      ...
      configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
      ...
      
    • To encrypt cluster persistent disk data and job argument data using your key, pass the resource ID of the key to the --kms-key flag when you create the cluster. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the --kms-key flag.
      gcloud dataproc clusters create CLUSTER_NAME \
          --region=REGION \
          --kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
          other arguments ...
        

      You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on gcePdKmsKeyName and kmsKey to use your key with the encryption of cluster persistent disk and job argument data.

      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION
        

      Command output snippet:

      ...
      configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
        kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
      ...
      

    • To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage, pass the name of a CMEK-enabled bucket to the --bucket flag when you create the cluster:
      gcloud dataproc clusters create CLUSTER_NAME \
          --region=REGION \
          --bucket=CMEK_BUCKET_NAME \
          other arguments ...
          

      You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:

      gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
          --region=REGION \
          --cluster=CLUSTER_NAME \
          -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
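As a rough sketch of how the gcloud CLI options above fit together, the following script assembles the key resource ID from its four parts and prints a cluster-create command that uses it. The function names and sample values are hypothetical, and combining --kms-key with --bucket in one command is an assumption made for illustration; the script only prints the command and does not call gcloud.

```shell
# Build a Cloud KMS key resource ID.
# $1 = PROJECT_ID, $2 = REGION, $3 = KEY_RING_NAME, $4 = KEY_NAME
kms_key_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s' "$1" "$2" "$3" "$4"
}

# Print (do not run) a cluster-create command.
# $1 = CLUSTER_NAME, $2 = REGION, $3 = key resource ID, $4 = CMEK_BUCKET_NAME
print_cluster_create() {
  printf "gcloud dataproc clusters create %s --region=%s --kms-key='%s' --bucket=%s\n" \
    "$1" "$2" "$3" "$4"
}

# Hypothetical sample values.
key_id="$(kms_key_name my-project us-central1 my-ring my-key)"
print_cluster_create my-cluster us-central1 "$key_id" my-cmek-bucket
```

Building the resource ID in one place avoids retyping the long path in several flags.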
        

    REST API

    • To encrypt cluster VM persistent disk data using your key, include the ClusterConfig.EncryptionConfig.gcePdKmsKeyName field as part of a cluster.create request.

      You can verify the key setting with the gcloud CLI dataproc clusters describe command.

      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION
      

      Command output snippet:

      ...
      configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
      ...
      
    • To encrypt cluster VM persistent disk data and job argument data using your key, include the Cluster.EncryptionConfig.kmsKey field as part of a cluster.create request. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the kmsKey field.

      You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on gcePdKmsKeyName and kmsKey to use your key with the encryption of cluster persistent disk and job argument data.

      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION
      

      Command output snippet:

      ...
      configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
        kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
      
    • To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage, pass the name of a CMEK-enabled bucket when you create the cluster:
      gcloud dataproc clusters create CLUSTER_NAME \
          --region=REGION \
          --bucket=CMEK_BUCKET_NAME \
          other arguments ...
      

      You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:

      gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
          --region=REGION \
          --cluster=CLUSTER_NAME \
          -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
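For the REST API path, a request body that sets the kmsKey field might look like the output of the sketch below. It assumes that encryptionConfig sits under the config field of the cluster resource, shows only the fields relevant to CMEK, and uses a hypothetical function name and sample values; a real cluster.create request would normally carry more configuration.

```shell
# Print a minimal cluster.create request body with encryptionConfig.kmsKey set.
# $1 = CLUSTER_NAME, $2 = key resource ID
# Assumption: encryptionConfig is nested under "config"; check the API reference.
cluster_create_body() {
  printf '{\n'
  printf '  "clusterName": "%s",\n' "$1"
  printf '  "config": {\n'
  printf '    "encryptionConfig": {\n'
  printf '      "kmsKey": "%s"\n' "$2"
  printf '    }\n'
  printf '  }\n'
  printf '}\n'
}

# Hypothetical sample values.
cluster_create_body my-cluster \
  "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key"
```

To encrypt only persistent disk data, the same shape would carry gcePdKmsKeyName instead of kmsKey, as described above.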
        

Use CMEK with workflow template data

Dataproc workflow template job argument data, such as the query string of a Spark SQL job, can be encrypted using CMEK. Follow steps 1, 2, and 3 in this section to use CMEK with your Dataproc workflow template. See WorkflowTemplate.EncryptionConfig.kmsKey for a list of workflow template job types and arguments that are encrypted using CMEK when this feature is enabled.

  1. Create a key using the Cloud Key Management Service (Cloud KMS). The resource name of the key, which you use in the next steps, is constructed as follows:
    projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    
  2. To enable the Dataproc service accounts to use your key:

    1. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc Service Agent service account. You can use the gcloud CLI to assign the role:

       gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
       --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
       --role roles/cloudkms.cryptoKeyEncrypterDecrypter
      

      Replace the following:

      KMS_PROJECT_ID: the ID of your Google Cloud project that runs Cloud KMS. This project can also be the project that runs Dataproc resources.

      PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

    2. Enable the Cloud KMS API on the project that runs Dataproc resources.

    3. If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, then add the serviceusage.services.use permission to the custom role attached to the Dataproc Service Agent service account. If the Dataproc Service Agent role is attached to the Dataproc Service Agent service account, you can skip this step.

  3. You can use the Google Cloud CLI or the Dataproc API to set the key you created in Step 1 on a workflow. Once the key is set on a workflow, all the workflow job arguments and queries are encrypted using the key for any of the job types and arguments listed in WorkflowTemplate.EncryptionConfig.kmsKey.

    gcloud CLI

    Pass the resource ID of your key to the --kms-key flag when you create the workflow template with the gcloud dataproc workflow-templates create command.

    Example:

    gcloud dataproc workflow-templates create my-template-name \
        --region=region \
        --kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
        other arguments ...
    
    You can verify the key setting with the gcloud CLI.

    gcloud dataproc workflow-templates describe TEMPLATE_NAME \
        --region=REGION

    Command output snippet:

    ...
    id: my-template-name
    encryptionConfig:
      kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    ...
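The verification step can be scripted by reading the describe output and checking for a kmsKey line. The sketch below is a minimal illustration with a hypothetical function name; it parses saved output text rather than calling gcloud, and it assumes the YAML layout shown in the snippet above.

```shell
# Read describe output on stdin; print the kmsKey value.
# Exits non-zero when no kmsKey line is present.
extract_kms_key() {
  awk '$1 == "kmsKey:" { print $2; found = 1 } END { exit (found ? 0 : 1) }'
}

# Sample input mirroring the output snippet above.
printf '%s\n' \
  'id: my-template-name' \
  'encryptionConfig:' \
  '  kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
  | extract_kms_key
```

In practice, the describe command's output could be piped into the function, and a non-zero exit status would indicate that no key is set on the template.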
    

    REST API

    Use WorkflowTemplate.EncryptionConfig.kmsKey as part of a workflowTemplates.create request.

    You can verify the key setting by issuing a workflowTemplates.get request. The returned JSON lists the kmsKey:

    ...
    "id": "my-template-name",
    "encryptionConfig": {
      "kmsKey": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
    },
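A quick way to check the returned JSON is to pull the kmsKey value out of the response text. The following sketch uses sed on a saved response and assumes the key and value sit on one line, as in the snippet above; the function name is hypothetical, and a dedicated JSON tool would be more robust for real responses.

```shell
# Read a workflowTemplates.get response on stdin; print the kmsKey value.
# Assumes "kmsKey": "<value>" appears on a single line.
json_kms_key() {
  sed -n 's/.*"kmsKey": *"\([^"]*\)".*/\1/p'
}

# Sample input mirroring the response snippet above.
printf '%s\n' \
  '{' \
  '  "id": "my-template-name",' \
  '  "encryptionConfig": {' \
  '    "kmsKey": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"' \
  '  }' \
  '}' \
  | json_kms_key
```

If the function prints nothing, the response contains no kmsKey in the expected layout.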
    

Cloud External Key Manager

Cloud External Key Manager (Cloud EKM) lets you protect Dataproc data using keys managed by a supported external key management partner. The steps you follow to use Cloud EKM in Dataproc are the same as those you use to set up CMEK keys, with the following difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).

Cloud EKM errors

When you use Cloud EKM, an attempt to create a cluster can fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between Cloud EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Cloud Logging. You can examine the failed cluster's errors from the View Log tab.
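The logged errors can also be searched from the command line. The sketch below builds a log filter for one cluster and prints a gcloud logging read command that uses it. The resource type and label names in the filter are assumptions about how Dataproc cluster logs are labeled, and the function name and sample cluster name are hypothetical; adjust the filter to match the entries you see in the View Log tab.

```shell
# Build a log filter for error-level entries of one cluster.
# Assumption: cluster logs use resource type cloud_dataproc_cluster
# with a cluster_name label.
cluster_error_filter() {
  printf 'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="%s" AND severity>=ERROR' "$1"
}

# Print (do not run) a command that reads the most recent matching entries.
printf "gcloud logging read '%s' --limit=20\n" "$(cluster_error_filter my-cluster)"
```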