Dataproc Security Configuration

When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos to provide multi-tenancy via user authentication, isolation, and encryption inside a Dataproc cluster.

User Authentication and Other Google Cloud Platform Services. Per-user authentication via Kerberos only applies within the cluster. Interactions with other Google Cloud services, such as Cloud Storage, continue to be authenticated as the service account for the cluster.

Enabling Hadoop Secure Mode via Kerberos

Enabling Kerberos and Hadoop Secure Mode for a cluster will include the MIT distribution of Kerberos and configure Apache Hadoop YARN, HDFS, Hive, Spark, and related components to use it for authentication.

Enabling Kerberos creates an on-cluster Key Distribution Center (KDC) that contains service principals and a root principal. The root principal is the account with administrator permissions to the on-cluster KDC. The KDC can also contain standard user principals, or it can be connected via cross-realm trust to another KDC that contains the user principals.

Create a Kerberos cluster

You can use the Google Cloud CLI, the Dataproc API, or the Google Cloud console to enable Kerberos on clusters that use Dataproc image version 1.3 and later.

gcloud command

To automatically configure a new Kerberos Dataproc cluster (image version 1.3 and later), use the gcloud dataproc clusters create command.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=2.0 \
    --enable-kerberos

Cluster property: Instead of using the --enable-kerberos flag as shown above, you can automatically configure Kerberos by passing the --properties "dataproc:kerberos.beta.automatic-config.enable=true" flag to the clusters create command (see Dataproc service properties).

REST API

Kerberos clusters can be created through the ClusterConfig.SecurityConfig.KerberosConfig as part of a clusters.create request. You must set enableKerberos to true.
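
As a rough sketch of such a request, the snippet below writes a minimal clusters.create body and defines a helper that posts it with curl. The file name request.json, the helper name create_cluster_via_rest, and the cluster-name, project-id, and region placeholders are illustrative assumptions. Only the securityConfig.kerberosConfig.enableKerberos field comes from the text above, and all other ClusterConfig fields are left to service defaults.

```shell
# Write a minimal clusters.create request body (placeholder names are illustrative).
cat > request.json <<'EOF'
{
  "clusterName": "cluster-name",
  "config": {
    "securityConfig": {
      "kerberosConfig": {
        "enableKerberos": true
      }
    }
  }
}
EOF

# Hypothetical helper: POST a body file to the Dataproc v1 clusters endpoint.
# Usage: create_cluster_via_rest project-id region request.json
create_cluster_via_rest() {
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @"$3" \
    "https://dataproc.googleapis.com/v1/projects/$1/regions/$2/clusters"
}
```

The helper is only defined here, not called. Running it requires curl and an authenticated gcloud installation.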

Console

You can automatically configure Kerberos on a new cluster by selecting "Enable" from the Kerberos and Hadoop Secure Mode section of the Manage security panel on the Dataproc Create a cluster page of the Google Cloud console.

Create a Kerberos cluster With Your Own Root Principal Password

Follow the steps below to set up a Kerberos cluster that uses your own root principal password.

Set up your Kerberos root principal password

The Kerberos root principal is the account with administrator permissions to the on-cluster KDC. To securely provide the password for the Kerberos root principal, you can encrypt it with a Key Management Service (KMS) key and then store it in a Cloud Storage bucket that the cluster service account can access. The cluster service account must be granted the cloudkms.cryptoKeyDecrypter IAM role.

  1. Grant the Cloud KMS CryptoKey Decrypter role to the cluster service account:

    gcloud projects add-iam-policy-binding project-id \
        --member serviceAccount:project-number-compute@developer.gserviceaccount.com \
        --role roles/cloudkms.cryptoKeyDecrypter
    

  2. Create a key ring:

    gcloud kms keyrings create my-keyring --location global
    

  3. Create a key in the key ring:

    gcloud kms keys create my-key \
        --location global \
        --keyring my-keyring \
        --purpose encryption
    

  4. Encrypt your Kerberos root principal password:

    echo "my-password" | \
      gcloud kms encrypt \
        --location=global \
        --keyring=my-keyring \
        --key=my-key \
        --plaintext-file=- \
        --ciphertext-file=kerberos-root-principal-password.encrypted
    

  5. Upload the encrypted password to a Cloud Storage bucket in your project. For example:

    gcloud storage cp kerberos-root-principal-password.encrypted gs://my-bucket
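
Before creating the cluster, it can be worth confirming that the encrypted file decrypts back to the intended password. The following is a minimal sketch of such a check, reusing the my-keyring and my-key names from the steps above. The helper name show_root_password is hypothetical. The account running it needs decrypt permission on the key, and the plaintext is printed to the terminal.

```shell
# Hypothetical helper: decrypt a local copy of the encrypted password file
# and write the plaintext to stdout for a one-time manual check.
# Usage: show_root_password kerberos-root-principal-password.encrypted
show_root_password() {
  gcloud kms decrypt \
    --location=global \
    --keyring=my-keyring \
    --key=my-key \
    --ciphertext-file="$1" \
    --plaintext-file=-
}
```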
        

Create the cluster

You can use the gcloud command or the Dataproc API to enable Kerberos on clusters with your own root principal password.

gcloud command

To create a Kerberos Dataproc cluster (image version 1.3 and later), use the gcloud dataproc clusters create command.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=2.0 \
    --kerberos-root-principal-password-uri=gs://my-bucket/kerberos-root-principal-password.encrypted \
    --kerberos-kms-key=projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key

Use a YAML (or JSON) config file. Instead of passing kerberos-* flags to the gcloud command as shown above, you can place Kerberos settings in a YAML (or JSON) config file, then reference the config file to create the Kerberos cluster.

  1. Create a config file (see SSL Certificates, Additional Kerberos Settings, and Cross-realm trust for additional config settings that can be included in the file):
    root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
    kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
    
  2. Use the following gcloud command to create the Kerberos cluster:
    gcloud dataproc clusters create cluster-name \
        --region=region \
        --kerberos-config-file=local-path-to-config-file \
        --image-version=2.0
    

Security Considerations. Dataproc discards the decrypted form of the password after adding the root principal to the KDC. For security purposes, after creating the cluster you may decide to delete the password file and the key used to decrypt the secret, and remove the cloudkms.cryptoKeyDecrypter role from the service account. Don't do this if you plan to scale the cluster up, because scaling up requires the password file, the key, and the service account role.

REST API

Kerberos clusters can be created through the ClusterConfig.SecurityConfig.KerberosConfig as part of a clusters.create request. Set enableKerberos to true and set the rootPrincipalPasswordUri and kmsKeyUri fields.
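
For illustration, a request body with these three fields set might look like the sketch below. The file name request-with-password.json and the cluster-name placeholder are assumptions, and the URIs reuse the example bucket and key names from the earlier steps.

```shell
# Write a clusters.create body that supplies your own root principal password.
cat > request-with-password.json <<'EOF'
{
  "clusterName": "cluster-name",
  "config": {
    "securityConfig": {
      "kerberosConfig": {
        "enableKerberos": true,
        "rootPrincipalPasswordUri": "gs://my-bucket/kerberos-root-principal-password.encrypted",
        "kmsKeyUri": "projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key"
      }
    }
  }
}
EOF
```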

Console

When creating a cluster with image version 1.3+, select "Enable" from the Kerberos and Hadoop Secure Mode section of the Manage security panel on the Dataproc Create a cluster page of the Google Cloud console, then complete the security options (discussed in the following sections).

OS Login

On-cluster KDC management can be performed with the kadmin command using the root Kerberos user principal or using sudo kadmin.local. Enable OS Login to control who can run superuser commands.
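
As a small sketch of what such management can look like, the helpers below wrap two common kadmin.local queries. The function names and the example user name are hypothetical, and the commands are meant to be run on the cluster's master node by a user who is allowed to use sudo.

```shell
# Hypothetical helpers, intended to be run on the cluster's master node.

# Add a user principal; kadmin.local prompts for the new principal's password.
# Usage: add_user_principal alice
add_user_principal() {
  sudo kadmin.local -q "addprinc $1"
}

# List all principals currently in the on-cluster KDC database.
list_principals() {
  sudo kadmin.local -q "listprincs"
}
```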

SSL Certificates

As part of enabling Hadoop Secure Mode, Dataproc creates a self-signed certificate to enable cluster SSL encryption. As an alternative, you can provide your own certificate for cluster SSL encryption by adding the following settings to the configuration file when you create a Kerberos cluster:

  • ssl:keystore_password_uri: Location in Cloud Storage of the KMS-encrypted file containing the password to the keystore file.
  • ssl:key_password_uri: Location in Cloud Storage of the KMS-encrypted file containing the password to the key in the keystore file.
  • ssl:keystore_uri: Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes.
  • ssl:truststore_password_uri: Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file.
  • ssl:truststore_uri: Location in Cloud Storage of the truststore file containing trusted certificates.

Sample config file:

root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
ssl:
  key_password_uri: gs://bucket/key_password.encrypted
  keystore_password_uri: gs://bucket/keystore_password.encrypted
  keystore_uri: gs://bucket/keystore.jks
  truststore_password_uri: gs://bucket/truststore_password.encrypted
  truststore_uri: gs://bucket/truststore.jks

Additional Kerberos Settings

To specify a Kerberos realm, create a Kerberos cluster with the following property added in the Kerberos configuration file:

  • realm: The name of the on-cluster Kerberos realm.

If this property is not set, the domain of the cluster hostnames (in uppercase) is used as the realm.

To specify the master key of the KDC database, create a Kerberos cluster with the following property added in the Kerberos configuration file:

  • kdc_db_key_uri: Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key.

If this property is not set, Dataproc generates the master key.

To specify the ticket granting ticket's maximum lifetime (in hours), create a Kerberos cluster with the following property added in the Kerberos configuration file:

  • tgt_lifetime_hours: Maximum lifetime of the ticket granting ticket, in hours.

If this property is not set, Dataproc sets the ticket granting ticket's lifetime to 10 hours.
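
Putting the three optional properties together, a config file could be generated as in the sketch below. The file name kerberos-config.yaml, the realm name EXAMPLE.REALM, the kdc-db-key.encrypted object name, and the 8-hour lifetime are illustrative values, not defaults.

```shell
# Write a Kerberos config file that sets the optional realm, KDC database
# master key, and TGT lifetime alongside the root principal password settings.
cat > kerberos-config.yaml <<'EOF'
root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
realm: EXAMPLE.REALM
kdc_db_key_uri: gs://my-bucket/kdc-db-key.encrypted
tgt_lifetime_hours: 8
EOF
```

The resulting file can then be passed to gcloud dataproc clusters create with the --kerberos-config-file flag.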

Cross-realm trust

The KDC on the cluster initially contains only the root administrator principal and service principals. You can add user principals manually or establish a cross-realm trust with an external KDC or Active Directory server that holds user principals. Cloud VPN or Cloud Interconnect is recommended for connecting to an on-premises KDC or Active Directory server.

To create a Kerberos cluster that supports cross-realm trust, add the settings listed below to the Kerberos configuration file when you create the cluster. Encrypt the shared password with KMS and store it in a Cloud Storage bucket that the cluster service account can access.

  • cross_realm_trust:admin_server: hostname/address of the remote admin server.
  • cross_realm_trust:kdc: hostname/address of the remote KDC.
  • cross_realm_trust:realm: name of the remote realm to be trusted.
  • cross_realm_trust:shared_password_uri: Location in Cloud Storage of the KMS-encrypted shared password.

Sample config file:

root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
cross_realm_trust:
  admin_server: admin.remote.realm
  kdc: kdc.remote.realm
  realm: REMOTE.REALM
  shared_password_uri: gs://bucket/shared_password.encrypted

To enable cross-realm trust to a remote KDC:

  1. Add the following in the /etc/krb5.conf file in the remote KDC:

    [realms]
    DATAPROC.REALM = {
      kdc = MASTER-NAME-OR-ADDRESS
      admin_server = MASTER-NAME-OR-ADDRESS
    }
    

  2. Create the trust user:

    kadmin -q "addprinc krbtgt/DATAPROC.REALM@REMOTE.REALM"
    

  3. When prompted, enter the user's password. The password should match the contents of the encrypted shared password file.

To enable cross-realm trust with Active Directory, run the following commands in PowerShell as Administrator:

  1. Create a KDC definition in Active Directory.

    ksetup /addkdc DATAPROC.REALM DATAPROC-CLUSTER-MASTER-NAME-OR-ADDRESS
    

  2. Create trust in Active Directory.

    netdom trust DATAPROC.REALM /Domain AD.REALM /add /realm /passwordt:TRUST-PASSWORD
    
    The password should match the contents of the encrypted shared password file.

dataproc principal

When you submit a job via the Dataproc jobs API to a Dataproc Kerberos cluster, it runs as the dataproc Kerberos principal from the cluster's Kerberos realm.

Multi-tenancy is supported within a Dataproc Kerberos cluster if you submit a job directly to the cluster, for example via SSH. However, if the job reads from or writes to other Google Cloud services, such as Cloud Storage, it acts as the cluster's service account.
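
A direct submission under a user principal might look like the following sketch, run in an SSH session on a cluster node. The helper name run_spark_pi_as and the user name are hypothetical, and the path to the Spark examples jar is an assumption about the image layout that may differ on your cluster.

```shell
# Hypothetical helper, run in an SSH session on a cluster node.
# Usage: run_spark_pi_as alice
run_spark_pi_as() {
  # Obtain a ticket granting ticket for the user principal (prompts for the password).
  kinit "$1" || return 1
  # Submit directly to YARN so the job uses the ticket obtained above.
  # The jar path below is an assumed location, not confirmed by this document.
  spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    /usr/lib/spark/examples/jars/spark-examples.jar 10
}
```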

Default and Custom Cluster Properties

Hadoop Secure Mode is configured with properties in config files. Dataproc sets default values for these properties.

You can override the default properties when you create the cluster with the gcloud dataproc clusters create --properties flag or by calling the clusters.create API and setting SoftwareConfig properties (see cluster properties examples).
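
As an illustration of the flag's prefix:property=value form, the sketch below overrides one Hadoop setting at creation time. The helper name is hypothetical, and the choice of hadoop.rpc.protection with the value privacy is only an example of a core-site.xml property. It is not a statement about what Dataproc sets by default, so check the defaults for your image before overriding anything.

```shell
# Hypothetical helper: create a Kerberos cluster with one property override.
# Usage: create_cluster_with_overrides cluster-name region
create_cluster_with_overrides() {
  gcloud dataproc clusters create "$1" \
    --region="$2" \
    --enable-kerberos \
    --properties="core:hadoop.rpc.protection=privacy"
}
```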

High-Availability Mode

In High Availability (HA) mode, a Kerberos cluster has three KDCs, one on each master. The KDC running on the "first" master ($CLUSTER_NAME-m-0) is the Master KDC and also serves as the Admin Server. A cron job syncs the Master KDC's database to the two replica KDCs at 5-minute intervals, and all three KDCs serve read traffic.

Kerberos does not natively support real-time replication or automatic failover if the master KDC is down. To perform a manual failover:

  1. On all KDC machines, in /etc/krb5.conf, change admin_server to the new Master's FQDN (Fully Qualified Domain Name). Remove the old Master from the KDC list.
  2. On the new Master KDC, set up a cron job to propagate the database.
  3. On the new Master KDC, restart the admin_server process (krb5-admin-server).
  4. On all KDC machines, restart the KDC process (krb5-kdc).
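
Steps 2 through 4 can be sketched with standard MIT Kerberos tools as shown below. This is an assumption-laden outline, not the script Dataproc itself installs: the dump file path, the function names, and the use of systemctl are all hypothetical, and the commands need root privileges.

```shell
# Hypothetical propagation routine for the new Master KDC.
# Usage: propagate_kdc_db replica-1-fqdn replica-2-fqdn
propagate_kdc_db() {
  dump_file=/var/lib/krb5kdc/replica_datatrans   # assumed path
  # Dump the KDC database to a file.
  kdb5_util dump "$dump_file" || return 1
  # Push the dump to each replica KDC given as an argument.
  for replica in "$@"; do
    kprop -f "$dump_file" "$replica" || return 1
  done
}

# Restart the services named in steps 3 and 4 (assumes systemd is in use).
restart_admin_server() { sudo systemctl restart krb5-admin-server; }
restart_kdc() { sudo systemctl restart krb5-kdc; }
```

To match the 5-minute interval described above, a script that calls propagate_kdc_db could be scheduled with a crontab schedule of */5 * * * *.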

Network Configuration

To make sure that worker nodes can talk to the KDC and Kerberos Admin Server running on the master(s), verify that the VPC firewall rules allow ingress TCP and UDP traffic on port 88 and ingress TCP traffic on port 749 on the master(s). In High-Availability mode, also make sure that the VPC firewall rules allow ingress TCP traffic on port 754 on the masters so that changes made to the Master KDC can be propagated. Kerberos requires reverse DNS to be properly set up for the cluster's network; this is also needed for host-based service principal canonicalization.
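
The port requirements above can be expressed as a firewall rule along the lines of the sketch below. The rule name allow-kerberos, the helper names, and the network and source range arguments are hypothetical. Adapt them to your VPC, and consider restricting the rule further, for example with target tags.

```shell
# Print the --allow value for the Kerberos ports described above.
# Pass "ha" to include the propagation port used in High-Availability mode.
kerberos_allow_rules() {
  rules="tcp:88,udp:88,tcp:749"
  if [ "${1:-}" = "ha" ]; then
    rules="$rules,tcp:754"
  fi
  printf '%s\n' "$rules"
}

# Hypothetical helper: create an ingress rule on the cluster's VPC network.
# Usage: create_kerberos_firewall_rule my-network 10.128.0.0/20 ha
create_kerberos_firewall_rule() {
  gcloud compute firewall-rules create allow-kerberos \
    --network="$1" \
    --direction=INGRESS \
    --source-ranges="$2" \
    --allow="$(kerberos_allow_rules "${3:-}")"
}
```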

For More Information

See the MIT Kerberos Documentation.