Data Analytics

Introducing advanced security options for Cloud Dataproc, now generally available


Google Cloud Platform (GCP) offers security and governance products that help you meet your policy, regulatory, and business objectives. The controls and capabilities we offer are always expanding. We’re pleased to announce that we’ve expanded the security capabilities of Cloud Dataproc, our fully managed Hadoop and Spark service, by making Kerberos and Hadoop secure mode security configurations generally available. 

Cloud Dataproc’s new security configurations give you the best of both worlds: access to modern, best-in-class security features and infrastructure, and the familiar controls you’ve already developed for your Hadoop and Spark environments. 

Moving on-prem Hadoop clusters securely 
With Kerberos and Hadoop secure mode, you can migrate your existing Hadoop security controls directly into the cloud without having to change your security policies and procedures. You can now enable new capabilities in Cloud Dataproc, including: 

  • Connecting Cloud Dataproc back to Microsoft Active Directory
  • Encrypting data in flight between nodes in a cluster 
  • Supporting multi-tenant clusters
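As a sketch of how this looks in practice, a Kerberized cluster can be created from the gcloud CLI. The flags shown are the Kerberos options of `gcloud dataproc clusters create`; the project, bucket, and KMS resource names are placeholders:

```shell
# Create a Cloud Dataproc cluster with Kerberos and Hadoop secure mode
# enabled. The root principal password is supplied as a Cloud KMS-encrypted
# file stored in Cloud Storage. All resource names below are placeholders.
gcloud dataproc clusters create my-secure-cluster \
    --region=us-central1 \
    --kerberos-root-principal-password-uri=gs://my-bucket/root-password.encrypted \
    --kerberos-kms-key=projects/my-project/locations/global/keyRings/my-keyring/cryptoKeys/my-key
```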

Here’s a look at a common customer setup for Kerberos on Cloud Dataproc.

[Image: Kerberos on Cloud Dataproc]
  1. Each GCP user is associated with a cloud identity. This authentication mechanism lets users SSH into a cluster, run jobs via the API, and create cloud resources (e.g., a Cloud Dataproc cluster).
  2. To use a Kerberized Hadoop application, a user must obtain a Kerberos principal. A cross-realm trust with Microsoft Active Directory maps its users and groups to Cloud Dataproc Kerberos principals.
    Note: This setup requires Active Directory to be the source of truth for user identities; Cloud Identity is only a synchronized copy.  
  3. When the Hadoop application needs data from Cloud Storage, the Cloud Storage Connector is invoked. The connector lets Hadoop access Cloud Storage data at the block level as if it were a native part of Hadoop, and it relies on a service account to authenticate against Cloud Storage.
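The service account used in step 3 can be specified when the cluster is created, which makes it easy to attach a dedicated, least-privilege account per cluster. A sketch, with placeholder project and account names:

```shell
# Attach a dedicated service account to the cluster; the Cloud Storage
# Connector uses this account to authenticate to Cloud Storage.
# The account name and project are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --service-account=dataproc-sa@my-project.iam.gserviceaccount.com \
    --scopes=cloud-platform
```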

Standing on the shoulders of GCP security
Kerberos and Hadoop secure mode provide parity with legacy Hadoop security platforms, making it easy to port your existing procedures and policies. However, you may find that even while maintaining those existing practices, the overall security posture of your Hadoop and Spark environments improves greatly with the migration to GCP. 

This is because Cloud Dataproc and GCP take advantage of the same secure-by-design infrastructure, built-in protection, and global network that Google uses to protect your information, identities, applications, and devices. GCP and Cloud Dataproc also offer additional security features that help protect your data. Some of the GCP-specific security features most commonly used with Cloud Dataproc include: 

  • Default at-rest encryption, where GCP encrypts customer data stored at rest by default, with no additional action required from you. We offer a continuum of encryption key management options, including a customer-managed encryption keys (CMEK) feature that lets you create, use, and revoke the key encryption key (KEK). 

  • Stackdriver Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Stackdriver collects and ingests metrics, events, and metadata from Cloud Dataproc clusters to bring you insights via dashboards and charts.

  • VPC Service Controls allow you to define a security perimeter around Cloud Dataproc and the data stored in Cloud Storage buckets. Datasets can be constrained within a VPC to help mitigate data exfiltration risks. With VPC Service Controls, you can keep sensitive data private and still take advantage of the fully managed storage and data processing capabilities of GCP.

These features and many others are certified by third-party auditors. Cloud Dataproc certifications include the most widely recognized, internationally accepted independent security standards, including ISO for security controls, cloud security and privacy, as well as SOC 1, 2, and 3. These certifications help us meet the demands of industry standards such as HIPAA and PCI. We continue to expand our list of certifications globally to assist our customers with their compliance obligations.

End-to-end authorization with GCP Token Broker
As a typical cloud best practice, we recommend that the GCP service accounts associated with the virtual machines (or cloud infrastructure) access datasets on behalf of a user. Many Cloud Dataproc customers choose to provision small autoscaling clusters for each Cloud Dataproc user. This way, there is a clear audit log to see who was on which cluster when it accessed a Cloud Storage dataset. 

However, we also hear that many enterprise customers would prefer to use multi-tenant clusters and have strict compliance requirements that dictate that access to GCP resources (Cloud Storage, BigQuery, Cloud Bigtable, etc.) must be attributable to the individual user who initiated the request. In addition, to meet compliance requirements, this should be done in a way that ensures no long-lived credentials are stored on client machines or worker nodes.

To meet these customer goals, Google Cloud created an open source GCP Token Broker. The GCP Token Broker enables end-to-end Kerberos security and Cloud IAM integration for Hadoop workloads on GCP. You can use this open source software to bridge the gap between Kerberos and Cloud IAM to allow users to log in with Kerberos and access GCP resources.

The following diagram illustrates the overall architecture for direct authentication.

[Image: Overall architecture for direct authentication]

For more on how the GCP Token Broker extends the functionality of the generally available Kerberos and Hadoop secure mode in Cloud Dataproc, check out the joint Google and Cloudera session from Google Cloud Next ’19: Building and Securing Data Lakes.  

Getting started with secure mode
To get started with Kerberos and Hadoop secure mode, check “Enable Kerberos and Hadoop secure mode” in the Cloud Dataproc console, as shown here:

[Image: Cloud Dataproc console]

To securely exchange a secret key and administrator password, you will first need to create those files outside of the console and encrypt them using Cloud Key Management Service. 
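A minimal sketch of that preparation step, using `gcloud kms encrypt` and `gsutil` (the key ring, key, bucket, and password are placeholders):

```shell
# Write the root principal password to a local file, encrypt it with a
# Cloud KMS key, upload the ciphertext to Cloud Storage, and remove the
# plaintext. All names below are placeholders.
echo -n "my-strong-password" > root-password.txt
gcloud kms encrypt \
    --location=global \
    --keyring=my-keyring \
    --key=my-key \
    --plaintext-file=root-password.txt \
    --ciphertext-file=root-password.encrypted
gsutil cp root-password.encrypted gs://my-bucket/
rm root-password.txt
```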

By default, Cloud Dataproc will turn on all the features of Hadoop secure mode, including in-flight encryption. Cloud Dataproc will auto-generate a self-signed certificate for the encryption, or you can upload your own. 

Any default setting can be overridden using a cluster property. For example, if you want to enable a multi-tenant Cloud Dataproc cluster but don’t have compliance requirements that warrant the performance penalty of in-transit encryption within a VPC, you can disable the in-transit encryption by setting the following Cloud Dataproc properties: 

  core:hadoop.rpc.protection=authentication
  hdfs:dfs.encrypt.data.transfer=false
  hdfs:dfs.data.transfer.protection=authentication

You can set these properties from gcloud or in the cluster properties page, as shown here:
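From gcloud, these properties are passed via the `--properties` flag at cluster creation; a sketch with placeholder resource names:

```shell
# Create a Kerberized cluster with in-transit encryption disabled,
# overriding the secure-mode defaults via cluster properties.
# Bucket and KMS resource names are placeholders.
gcloud dataproc clusters create my-multi-tenant-cluster \
    --region=us-central1 \
    --kerberos-root-principal-password-uri=gs://my-bucket/root-password.encrypted \
    --kerberos-kms-key=projects/my-project/locations/global/keyRings/my-keyring/cryptoKeys/my-key \
    --properties='core:hadoop.rpc.protection=authentication,hdfs:dfs.encrypt.data.transfer=false,hdfs:dfs.data.transfer.protection=authentication'
```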

[Image: Cluster properties page]

A cross-realm trust option is also available if you want to rely on an external directory like Microsoft Active Directory. 
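A cross-realm trust is typically configured through a Kerberos config file passed at cluster creation. The sketch below shows the general shape; the field names should be checked against the Cloud Dataproc security configuration documentation, and every realm, host, and URI here is a placeholder:

```shell
# Sketch: configure a cross-realm trust with an Active Directory domain.
# The exact config-file schema is documented in the Cloud Dataproc
# security configuration guide; all values below are placeholders.
cat > kerberos-config.yaml <<'EOF'
root_principal_password_uri: gs://my-bucket/root-password.encrypted
kms_key_uri: projects/my-project/locations/global/keyRings/my-keyring/cryptoKeys/my-key
cross_realm_trust:
  realm: AD.EXAMPLE.COM
  kdc: ad-kdc.example.com
  admin_server: ad-admin.example.com
  shared_password_uri: gs://my-bucket/trust-password.encrypted
EOF
gcloud dataproc clusters create my-trusted-cluster \
    --region=us-central1 \
    --kerberos-config-file=kerberos-config.yaml
```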

For complete instructions on setting up different types of security configurations, check out Cloud Dataproc security configuration.