When you use Dataproc, cluster and job data is stored on Persistent Disks (PDs) associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. By default, this PD and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK). The customer-managed encryption keys (CMEK) feature allows you to create, use, and revoke the key encryption key (KEK). Google still controls the data encryption key (DEK). For more information on Google data encryption keys, see Encryption at Rest.
Using CMEK
You can use CMEK to encrypt data on the PDs associated with the VMs in your Dataproc cluster and/or the cluster metadata and job output written to the Dataproc staging bucket. Follow Steps 1 and 2, then follow Steps 3, 4, or 5, below, to use CMEK with your cluster's PDs, Cloud Storage bucket, or both, respectively.
1. Create a key using the Cloud Key Management Service (Cloud KMS). Copy the resource name, which you use in the next steps. The resource name is constructed as follows:
   projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
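The resource name format above can be assembled and checked programmatically before it is passed to later steps. The following is a minimal sketch, not an official client: the function names are hypothetical, and each component is matched loosely as "anything except a slash", which is a simplification of whatever naming rules Cloud KMS actually enforces.

```python
import re

# Loose pattern for the resource name format shown above. Matching each
# component as "one or more non-slash characters" is a simplifying
# assumption, not the real Cloud KMS naming rule.
_KEY_NAME_RE = re.compile(
    r"projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<key_ring>[^/]+)"
    r"/cryptoKeys/(?P<key>[^/]+)"
)


def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Assemble a Cloud KMS key resource name from its four components."""
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


def parse_kms_key_name(name: str) -> dict:
    """Split a resource name into its components, or raise ValueError."""
    match = _KEY_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a Cloud KMS key resource name: {name!r}")
    return match.groupdict()
```

Parsing the name first is a cheap way to catch a truncated or mistyped value before a cluster creation request fails on it.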
2. To enable the Compute Engine and Cloud Storage service accounts to use your key:
   - Follow Item #5 in Compute Engine→Protecting Resources with Cloud KMS Keys→Before you begin to assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent.
   - Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent.
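The two role assignments above amount to one IAM binding on the key that names both service agents. The sketch below builds that binding as a plain dictionary. It assumes the usual service agent address formats (`service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com` for Compute Engine and `service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com` for Cloud Storage) and the role ID `roles/cloudkms.cryptoKeyEncrypterDecrypter`; confirm these against your project's IAM page before relying on them. The function names and the project number are hypothetical.

```python
from typing import List

# Role ID assumed to correspond to "Cloud KMS CryptoKey Encrypter/Decrypter".
ENCRYPTER_DECRYPTER_ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def service_agent_members(project_number: str) -> List[str]:
    """Return IAM member strings for the two service agents.

    The address formats are assumptions; verify them for your project.
    """
    return [
        # Compute Engine service agent (encrypts cluster PDs).
        f"serviceAccount:service-{project_number}"
        "@compute-system.iam.gserviceaccount.com",
        # Cloud Storage service agent (encrypts bucket objects).
        f"serviceAccount:service-{project_number}"
        "@gs-project-accounts.iam.gserviceaccount.com",
    ]


def encrypter_decrypter_binding(project_number: str) -> dict:
    """Build one IAM policy binding granting the role to both agents."""
    return {
        "role": ENCRYPTER_DECRYPTER_ROLE,
        "members": service_agent_members(project_number),
    }
```

The resulting dictionary has the shape of an entry in an IAM policy's `bindings` list; applying it to the key is still done through the console, the CLI, or the Cloud KMS API.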
3. You can use the Google Cloud CLI or the Dataproc API to set the key you created in Step 1 on the PDs associated with the VMs in the Dataproc cluster.
gcloud Command
Pass the Cloud KMS resource ID obtained in Step 1 to the --gce-pd-kms-key flag when you create the cluster with the gcloud dataproc clusters create command.
Example:
    gcloud dataproc clusters create my-cluster-name \
        --region=region \
        --gce-pd-kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
        other args ...
You can verify the key setting from the gcloud command-line tool.
    gcloud dataproc clusters describe cluster-name \
        --region=region
    ...
    configBucket: dataproc- ...
    encryptionConfig:
      gcePdKmsKeyName: projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
    ...
REST API
Use ClusterConfig.EncryptionConfig.gcePdKmsKeyName as part of a cluster.create request.
You can verify the key setting by issuing a clusters.get request. The returned JSON lists the gcePdKmsKeyName:
    ...
    {
      "projectId": "project-id",
      "clusterName": "cluster-name",
      "config": {
        "encryptionConfig": {
          "gcePdKmsKeyName": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
        }
      },
    ...
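To make the REST path concrete, the sketch below builds the part of a cluster.create request body that carries the key and reads the same field back from a clusters.get response. It uses only the field names shown above (`projectId`, `clusterName`, `config.encryptionConfig.gcePdKmsKeyName`). The function names are hypothetical, and a real request needs additional cluster configuration plus an authenticated HTTP call, both omitted here.

```python
import json
from typing import Optional


def cluster_create_body(
    project_id: str, cluster_name: str, gce_pd_kms_key_name: str
) -> dict:
    """Build the CMEK-related portion of a cluster.create request body."""
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "encryptionConfig": {
                "gcePdKmsKeyName": gce_pd_kms_key_name,
            },
        },
    }


def pd_kms_key_of(cluster: dict) -> Optional[str]:
    """Return the PD key name from a clusters.get response, or None."""
    return (
        cluster.get("config", {})
        .get("encryptionConfig", {})
        .get("gcePdKmsKeyName")
    )


if __name__ == "__main__":
    key = (
        "projects/project-id/locations/region"
        "/keyRings/key-ring-name/cryptoKeys/key-name"
    )
    body = cluster_create_body("project-id", "cluster-name", key)
    print(json.dumps(body, indent=2))
```

Comparing `pd_kms_key_of(response)` with the key name you sent is a simple way to confirm that the cluster was created with the intended key.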
4. To use CMEK on the Cloud Storage bucket used by Dataproc to read/write cluster and job data, create a bucket with CMEK. Note: Use the key created in Step 1 when adding the key on the bucket. Then, pass the bucket name to the gcloud dataproc clusters create command when you create the cluster.
Example:
    gcloud dataproc clusters create my-cluster \
        --region=region \
        --bucket=name-of-CMEK-bucket \
        other args ...
You can also pass CMEK-enabled buckets to the gcloud dataproc jobs submit command if your job takes bucket arguments (see the ...cmek-bucket... bucket arguments in the following PySpark job submission example).
Example:
    gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
        --region=region \
        --cluster=cluster-name \
        -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
5. To use CMEK on the PDs in your cluster and the Cloud Storage bucket used by Dataproc, pass both the --gce-pd-kms-key and the --bucket flags to the gcloud dataproc clusters create command as explained in Steps 3 and 4. You can create and use a separate key for PD data and bucket data.
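Steps 3, 4, and 5 differ only in which flags accompany gcloud dataproc clusters create. The sketch below builds the argument list for each case so the choice is explicit. The helper name is hypothetical; it uses only the --region, --gce-pd-kms-key, and --bucket flags described above and does not run anything itself.

```python
import shlex
from typing import List, Optional


def clusters_create_args(
    cluster_name: str,
    region: str,
    pd_kms_key: Optional[str] = None,
    bucket: Optional[str] = None,
) -> List[str]:
    """Build a `gcloud dataproc clusters create` argument list.

    Pass pd_kms_key for Step 3, bucket for Step 4, or both for Step 5.
    """
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    if pd_kms_key is not None:
        # CMEK for the Persistent Disks of the cluster VMs.
        args.append(f"--gce-pd-kms-key={pd_kms_key}")
    if bucket is not None:
        # Name of a bucket that was already created with CMEK.
        args.append(f"--bucket={bucket}")
    return args


if __name__ == "__main__":
    args = clusters_create_args(
        "my-cluster",
        "region",
        pd_kms_key=(
            "projects/project-id/locations/region"
            "/keyRings/key-ring-name/cryptoKeys/key-name"
        ),
        bucket="name-of-CMEK-bucket",
    )
    # Print the command for review instead of executing it.
    print(shlex.join(args))
```

To execute the command rather than print it, the list can be handed to `subprocess.run(args, check=True)` on a machine where the Google Cloud CLI is installed and authenticated.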
Cloud External Key Manager
Cloud External Key Manager (EKM) allows you to protect Dataproc data using keys managed by a supported external key management partner. The steps you follow to use EKM in Dataproc are the same as those you use to set up CMEK keys, with one difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).
Cloud EKM errors
When you use Cloud EKM, an attempt to create a cluster may fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Cloud Logging. You can examine the failed cluster's errors from the "View Logs" tab.