Customer Managed Encryption Keys (CMEK) for Dataproc is now generally available
Cloud Dataproc Software Engineer
Group Product Manager, Google
Encryption of data at rest is a foundational control in any data protection strategy. Google Cloud Platform encrypts data at rest, with no additional action from customers required. In addition to providing encryption by default, we are working to offer customers additional encryption and key management options for even greater levels of control across multiple services.
It’s up to you which controls you place on data that resides in Cloud Dataproc, our fully managed cloud service for running Apache Spark and Apache Hadoop clusters on GCP, and we want to provide tools that make it easy to exert that control. The latest is Cloud Dataproc Customer Managed Encryption Keys (CMEK), a feature that is now generally available. With this feature, you can create, use, and revoke the key encryption key (KEK) for the Compute Engine VMs in your cluster and the Cloud Storage buckets used with Cloud Dataproc. Within your managed Hadoop and Spark environments, using CMEK, you can now:
Create encryption keys with Cloud Key Management Service (Cloud KMS)
Rotate the keys according to your policies and preferences
Set up auto-rotation
Destroy encryption keys at the end of their lifecycle
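The key lifecycle operations listed above map onto Cloud KMS commands. As a rough sketch, the Python below only assembles `gcloud kms` command lines and prints them; it does not execute anything. The key ring, key, and location names and the example rotation timestamp are hypothetical placeholders, and the flags should be checked against the gcloud version you have installed.

```python
import shlex

# Hypothetical placeholder names; substitute your own.
LOCATION = "us-central1"
KEYRING = "dataproc-keyring"
KEY = "dataproc-cmek"


def _key_flags(location, keyring):
    """Flags shared by every command that addresses a key in a key ring."""
    return [f"--location={location}", f"--keyring={keyring}"]


def create_keyring_cmd(keyring, location):
    """Command to create the key ring that will hold the key."""
    return ["gcloud", "kms", "keyrings", "create", keyring, f"--location={location}"]


def create_key_cmd(key, keyring, location, rotation_period=None, next_rotation_time=None):
    """Command to create a key; pass both rotation arguments to enable auto-rotation."""
    if (rotation_period is None) != (next_rotation_time is None):
        raise ValueError("set rotation_period and next_rotation_time together")
    cmd = ["gcloud", "kms", "keys", "create", key,
           *_key_flags(location, keyring), "--purpose=encryption"]
    if rotation_period is not None:
        cmd += [f"--rotation-period={rotation_period}",
                f"--next-rotation-time={next_rotation_time}"]
    return cmd


def rotate_key_cmd(key, keyring, location):
    """Command for a manual rotation: create a new version and make it primary."""
    return ["gcloud", "kms", "keys", "versions", "create",
            f"--key={key}", *_key_flags(location, keyring), "--primary"]


def destroy_version_cmd(version, key, keyring, location):
    """Command to schedule one key version for destruction."""
    return ["gcloud", "kms", "keys", "versions", "destroy", str(version),
            f"--key={key}", *_key_flags(location, keyring)]


if __name__ == "__main__":
    commands = [
        create_keyring_cmd(KEYRING, LOCATION),
        create_key_cmd(KEY, KEYRING, LOCATION, "30d", "2030-01-01T00:00:00Z"),
        rotate_key_cmd(KEY, KEYRING, LOCATION),
        destroy_version_cmd(1, KEY, KEYRING, LOCATION),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than strings keeps quoting unambiguous if you later hand them to `subprocess.run`.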
In addition to Google Cloud’s default encryption, the Hadoop Distributed File System (HDFS) has its own encryption mechanisms. Therefore, most Dataproc customers won’t need to manage their own keys to maintain security. However, if you want additional control, or you have a specific policy or regulatory demand that specifies that you must manage keys or perform operations in a hardware security module (HSM), CMEK on Cloud Dataproc offers an easy way for you to meet these objectives.
The CMEK integration lets you rotate and revoke keys in a manner that aligns with your existing enterprise policies. It also provides an audit trail in Cloud Audit Logging that you can use to identify possible data misuse. In addition, for specific compliance mandates requiring that keys be stored and crypto operations be performed within a hardware environment, the Cloud KMS integration with Cloud HSM makes it simple to create a key protected by a FIPS 140-2 Level 3 device. Because Cloud HSM uses Cloud KMS as its front end, you can leverage the convenience and features that Cloud KMS provides and still meet security requirements without the administrative overhead typically associated with sophisticated security hardware.
Getting started with Customer Managed Encryption Keys (CMEK) in Cloud Dataproc
It is important to understand the data intersection points between Cloud Dataproc and the rest of GCP. While your Cloud Dataproc environment may pull from additional at-rest locations, there are at least three storage locations that you should consider when it comes to Customer Managed Encryption Keys in Cloud Dataproc.
Google Cloud Storage
It will be obvious to those who use Cloud Storage as their primary data warehouse that it is both a key source of data and the output destination for many Dataproc jobs.
However, it may be less obvious that in each region with a Cloud Dataproc cluster, there is also a staging bucket that contains miscellaneous configuration and control files needed by your cluster, including the job driver output. These buckets also receive output from the Diagnosis command, which often includes sensitive data from your cluster’s logs.
For customers that intend to use CMEK, the best practice is not only to protect your own data objects with CMEK, but also to create your own per-region staging bucket with CMEK and then explicitly specify that bucket in your cluster creation process as shown in the screenshot below.
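For readers who prefer the command line, one way to prepare such a staging bucket might look like the sketch below. It only builds and prints `gsutil` command lines; the project, region, bucket, and key names are hypothetical, and the sketch assumes the key lives in a location compatible with the bucket’s region.

```python
import shlex

# Hypothetical placeholders; substitute your own values.
PROJECT = "my-dataproc-project"
REGION = "us-central1"
BUCKET = "my-dataproc-staging-us-central1"
KMS_KEY = (
    "projects/my-kms-project/locations/us-central1/"
    "keyRings/dataproc-keyring/cryptoKeys/dataproc-cmek"
)


def staging_bucket_cmds(project, region, bucket, kms_key):
    """Return the gsutil commands for a CMEK-protected staging bucket."""
    url = f"gs://{bucket}"
    return [
        # Create the bucket in the same region as the cluster.
        ["gsutil", "mb", "-p", project, "-l", region, url],
        # Allow the project's Cloud Storage service agent to use the key.
        ["gsutil", "kms", "authorize", "-p", project, "-k", kms_key],
        # Make the key the bucket's default for newly written objects.
        ["gsutil", "kms", "encryption", "-k", kms_key, url],
    ]


if __name__ == "__main__":
    for cmd in staging_bucket_cmds(PROJECT, REGION, BUCKET, KMS_KEY):
        print(shlex.join(cmd))
```

Setting a default key on the bucket means objects written by the cluster pick up CMEK without each job having to name the key.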
Persistent Disks (PDs) on the Cloud Dataproc Cluster
While HDFS offers its own transparent encryption, it can be difficult to keep up with all of the locations where open source big data applications use disk space outside of HDFS. Because Google Cloud simply encrypts the underlying persistent disk in hardware, you can avoid having to reconfigure encryption for each application or deal with the slow performance and management overhead associated with software-level encryption on the file systems. Having at-rest encryption built into the platform means that you can avoid the complicated setup and inefficient performance that come with implementing tools such as LUKS to obtain full disk encryption of your cluster. You can simply let Google handle your at-rest encryption requirements.
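As an illustration of how the persistent disk key and the staging bucket come together at cluster creation, the sketch below assembles a `gcloud dataproc clusters create` command. The cluster, bucket, and key names are hypothetical, and the sketch assumes the `--bucket` and `--gce-pd-kms-key` flags are available in your gcloud release; it prints the command rather than running it.

```python
import shlex

# Hypothetical placeholders; substitute your own values.
CLUSTER = "cmek-cluster"
REGION = "us-central1"
STAGING_BUCKET = "my-dataproc-staging-us-central1"  # bucket name, no gs:// prefix
PD_KMS_KEY = (
    "projects/my-kms-project/locations/us-central1/"
    "keyRings/dataproc-keyring/cryptoKeys/dataproc-cmek"
)


def create_cluster_cmd(cluster, region, staging_bucket, pd_kms_key):
    """Build a cluster-creation command that applies CMEK to disks and staging."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        # Use the CMEK-protected staging bucket instead of the default one.
        f"--bucket={staging_bucket}",
        # Protect the persistent disks of the cluster VMs with this key.
        f"--gce-pd-kms-key={pd_kms_key}",
    ]


if __name__ == "__main__":
    print(shlex.join(create_cluster_cmd(CLUSTER, REGION, STAGING_BUCKET, PD_KMS_KEY)))
```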
BigQuery
Cloud Dataproc users can also access BigQuery datasets and tables that are protected with Customer Managed Encryption Keys (CMEK) (see Writing a MapReduce Job with the BigQuery Connector for an example).
For each at-rest source, you can define a unique key in Cloud Key Management Service. Each source can have its own administrators and its own rotation schedule. You may also choose to simply reuse the same key for every source. As a best practice, you should also consider creating a separate Google Cloud project for KMS management to help enforce separation of duties.
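To make the separate-project idea concrete, the sketch below shows how a key hosted in a dedicated KMS project could be named and shared with the services that need it. All project IDs, the project number, and the key names are hypothetical, and the service agent address formats are assumptions to verify against your own project’s IAM page. The code only builds and prints commands.

```python
import shlex

ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def key_resource_name(kms_project, location, keyring, key):
    """Full resource name of a Cloud KMS key, as expected by CMEK settings."""
    return (f"projects/{kms_project}/locations/{location}/"
            f"keyRings/{keyring}/cryptoKeys/{key}")


def service_agents(project_number):
    """Assumed service agent addresses for the project that runs the cluster."""
    return {
        "compute": f"service-{project_number}@compute-system.iam.gserviceaccount.com",
        "storage": f"service-{project_number}@gs-project-accounts.iam.gserviceaccount.com",
    }


def grant_key_cmd(kms_project, location, keyring, key, member_email):
    """Command letting one service agent encrypt and decrypt with the key."""
    return [
        "gcloud", "kms", "keys", "add-iam-policy-binding", key,
        f"--project={kms_project}",
        f"--location={location}",
        f"--keyring={keyring}",
        f"--member=serviceAccount:{member_email}",
        f"--role={ROLE}",
    ]


if __name__ == "__main__":
    # Hypothetical values: the key lives in "my-kms-project", the cluster elsewhere.
    print(key_resource_name("my-kms-project", "us-central1",
                            "dataproc-keyring", "dataproc-cmek"))
    for email in service_agents("123456789012").values():
        print(shlex.join(grant_key_cmd("my-kms-project", "us-central1",
                                       "dataproc-keyring", "dataproc-cmek", email)))
```

Keeping the key in its own project means key administrators and cluster administrators can hold different IAM roles, which is the separation of duties described above.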
Having each at-rest source controlled independently frees you to choose the encryption strategy that is right for your data. For instance, consider a common Extract, Transform, Load (ETL) pattern that touches all three storage locations.
In this familiar ETL cycle, you may want to use CMEK on the Cloud Storage and PD storage locations to meet the compliance mandates that apply to your personally identifiable information (PII), but then simply rely on Google’s managed encryption for BigQuery because it does not contain PII. You may also find yourself in the opposite situation: the PII resides in BigQuery, so you would prefer to use CMEK there.
Whichever at-rest encryption methodology you decide to implement, Google Cloud offers both the flexibility and the tools to help you implement your controls.
Our goal is to help you keep your data secure and provide the tooling needed to make sure you can meet your security policy and compliance objectives. To learn more about at-rest encryption, see our comprehensive documentation on Encryption at Rest on the Google Cloud Platform.
When you are ready to get started with CMEK in Cloud Dataproc, you can follow the five simple steps here to set up your environment.