When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos to provide multi-tenancy via user authentication, isolation, and encryption inside a Dataproc cluster.
User Authentication and Other Google Cloud Platform Services. Per-user authentication via Kerberos only applies within the cluster. Interactions with other Google Cloud services, such as Cloud Storage, continue to be authenticated as the service account for the cluster.
Enabling Hadoop Secure Mode via Kerberos
Enabling Kerberos and Hadoop Secure Mode for a cluster will include the MIT distribution of Kerberos and configure Apache Hadoop YARN, HDFS, Hive, Spark, and related components to use it for authentication.
Enabling Kerberos creates an on-cluster Key Distribution Center (KDC) that contains service principals and a root principal. The root principal is the account with administrator permissions to the on-cluster KDC. The KDC can also contain standard user principals, or it can be connected via cross-realm trust to another KDC that contains the user principals.
Create a Kerberos cluster
You can use the Google Cloud CLI, the Dataproc API, or the Google Cloud console to enable Kerberos on clusters that use Dataproc image version 1.3 and later.
gcloud command
To automatically configure a new Kerberos Dataproc cluster (image version 1.3 and later), use the gcloud dataproc clusters create command.
gcloud dataproc clusters create cluster-name \
    --image-version=2.0 \
    --enable-kerberos
Cluster property: Instead of using the --enable-kerberos flag as shown above, you can automatically configure Kerberos by passing the --properties "dataproc:kerberos.beta.automatic-config.enable=true" flag to the clusters create command (see Dataproc service properties).
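For example, a minimal sketch of the property-based equivalent of the command above (the cluster name and image version are placeholders):

gcloud dataproc clusters create cluster-name \
    --image-version=2.0 \
    --properties "dataproc:kerberos.beta.automatic-config.enable=true"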
REST API
Kerberos clusters can be created through the ClusterConfig.SecurityConfig.KerberosConfig as part of a clusters.create request. You must set enableKerberos to true.
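For illustration, a minimal clusters.create request with Kerberos enabled might look like the following sketch (the project, region, and cluster name are placeholders):

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters" \
    -d '{
      "clusterName": "cluster-name",
      "config": {
        "securityConfig": {
          "kerberosConfig": {
            "enableKerberos": true
          }
        }
      }
    }'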
Console
You can automatically configure Kerberos on a new cluster by selecting "Enable" from the Kerberos and Hadoop Secure Mode section of the Manage security panel on the Dataproc Create a cluster page of the Google Cloud console.
Create a Kerberos cluster with your own root principal password
Follow the steps below to set up a Kerberos cluster that uses your root principal password.
Set up your Kerberos root principal password
The Kerberos root principal is the account with administrator permissions to the on-cluster KDC. To securely provide the password for the Kerberos root principal, users can encrypt it with a Key Management Service (KMS) key, and then store it in a Cloud Storage bucket that the cluster service account can access. The cluster service account must be granted the cloudkms.cryptoKeyDecrypter IAM role.
Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the cluster service account:
gcloud projects add-iam-policy-binding project-id \
    --member serviceAccount:project-number-compute@developer.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyDecrypter
Create a key ring:
gcloud kms keyrings create my-keyring --location global
Create a key in the key ring:
gcloud kms keys create my-key \
    --location global \
    --keyring my-keyring \
    --purpose encryption
Encrypt your Kerberos root principal password:
echo "my-password" | \ gcloud kms encrypt \ --location=global \ --keyring=my-keyring \ --key=my-key \ --plaintext-file=- \ --ciphertext-file=kerberos-root-principal-password.encrypted
Upload the encrypted password to a Cloud Storage bucket in your project:
gcloud storage cp kerberos-root-principal-password.encrypted gs://my-bucket
Create the cluster
You can use the gcloud command or the Dataproc API to enable Kerberos on clusters with your own root principal password.
gcloud command
To create a Kerberos Dataproc cluster (image version 1.3 and later), use the gcloud dataproc clusters create command.
gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=2.0 \
    --kerberos-root-principal-password-uri=gs://my-bucket/kerberos-root-principal-password.encrypted \
    --kerberos-kms-key=projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
Use a YAML (or JSON) config file: Instead of passing kerberos-* flags to the gcloud command as shown above, you can place Kerberos settings in a YAML (or JSON) config file, then reference the config file to create the Kerberos cluster.
- Create a config file (see SSL Certificates, Additional Kerberos Settings, and Cross-realm trust for additional config settings that can be included in the file):
root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
- Use the following gcloud command to create the Kerberos cluster:
gcloud dataproc clusters create cluster-name \
    --region=region \
    --kerberos-config-file=local path to config-file \
    --image-version=2.0
Security Considerations. Dataproc discards the decrypted form of the password after adding the root principal to the KDC. For security purposes, after creating the cluster you may decide to delete the password file and the key used to decrypt the secret, and remove the service account from the cloudkms.cryptoKeyDecrypter role. Don't do this if you plan to scale up the cluster, since scaling up requires the password file, the key, and the service account role.
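As an illustrative sketch of that cleanup (the bucket, key, and project values are the placeholders used above; note that Cloud KMS keys themselves can't be deleted, only their key versions destroyed):

gcloud storage rm gs://my-bucket/kerberos-root-principal-password.encrypted
gcloud kms keys versions destroy 1 \
    --location global \
    --keyring my-keyring \
    --key my-key
gcloud projects remove-iam-policy-binding project-id \
    --member serviceAccount:project-number-compute@developer.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyDecrypter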
REST API
Kerberos clusters can be created through the ClusterConfig.SecurityConfig.KerberosConfig as part of a clusters.create request. Set enableKerberos to true and set the rootPrincipalPasswordUri and kmsKeyUri fields.
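Extending the earlier request sketch, the kerberosConfig fragment for this case might look like the following (the URIs are the placeholders from the gcloud example above):

"kerberosConfig": {
  "enableKerberos": true,
  "rootPrincipalPasswordUri": "gs://my-bucket/kerberos-root-principal-password.encrypted",
  "kmsKeyUri": "projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key"
}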
Console
When creating a cluster with image version 1.3+, select "Enable" from the Kerberos and Hadoop Secure Mode section of the Manage security panel on the Dataproc Create a cluster page of the Google Cloud console, then complete the security options (discussed in the following sections).
OS Login
On-cluster KDC management can be performed with the kadmin command using the root Kerberos user principal or with sudo kadmin.local. Enable OS Login to control who can run superuser commands.
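OS Login is a Compute Engine feature controlled through metadata; as a sketch, you might enable it project-wide with the following command (it applies to all VMs in the project, not just the cluster):

gcloud compute project-info add-metadata \
    --metadata enable-oslogin=TRUE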
SSL Certificates
As part of enabling Hadoop Secure Mode, Dataproc creates a self-signed certificate to enable cluster SSL encryption. As an alternative, you can provide a certificate for cluster SSL encryption by adding the following settings to the configuration file when you create a Kerberos cluster:
- ssl:keystore_password_uri: Location in Cloud Storage of the KMS-encrypted file containing the password to the keystore file.
- ssl:key_password_uri: Location in Cloud Storage of the KMS-encrypted file containing the password to the key in the keystore file.
- ssl:keystore_uri: Location in Cloud Storage of the keystore file containing the wildcard certificate and the private key used by cluster nodes.
- ssl:truststore_password_uri: Location in Cloud Storage of the KMS-encrypted file that contains the password to the truststore file.
- ssl:truststore_uri: Location in Cloud Storage of the truststore file containing trusted certificates.
Sample config file:
root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
ssl:
  key_password_uri: gs://bucket/key_password.encrypted
  keystore_password_uri: gs://bucket/keystore_password.encrypted
  keystore_uri: gs://bucket/keystore.jks
  truststore_password_uri: gs://bucket/truststore_password.encrypted
  truststore_uri: gs://bucket/truststore.jks
Additional Kerberos Settings
To specify a Kerberos realm, create a Kerberos cluster with the following property added in the Kerberos configuration file:
- realm: The name of the on-cluster Kerberos realm.
If this property is not set, the hostnames' domain (in uppercase) will be the realm.
To specify the master key of the KDC database, create a Kerberos cluster with the following property added in the Kerberos configuration file:
- kdc_db_key_uri: Location in Cloud Storage of the KMS-encrypted file containing the KDC database master key.
If this property is not set, Dataproc will generate the master key.
To specify the maximum lifetime (in hours) of the ticket-granting ticket, create a Kerberos cluster with the following property added in the Kerberos configuration file:
- tgt_lifetime_hours: Maximum lifetime of the ticket-granting ticket, in hours.
If this property is not set, Dataproc will set the ticket-granting ticket's lifetime to 10 hours.
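For example, a config file combining these settings might look like the following sketch (the realm name and URIs are illustrative placeholders):

root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
realm: EXAMPLE.REALM
kdc_db_key_uri: gs://my-bucket/kdc_db_key.encrypted
tgt_lifetime_hours: 12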
Cross-realm trust
The KDC on the cluster initially contains only the root administrator principal and service principals. You can add user principals manually or establish a cross-realm trust with an external KDC or Active Directory server that holds user principals. Cloud VPN or Cloud Interconnect is recommended for connecting to an on-premises KDC or Active Directory.
To create a Kerberos cluster that supports cross-realm trust, add the settings listed below to the Kerberos configuration file when you create the cluster. Encrypt the shared password with KMS and store it in a Cloud Storage bucket that the cluster service account can access.
- cross_realm_trust:admin_server: hostname/address of the remote admin server.
- cross_realm_trust:kdc: hostname/address of the remote KDC.
- cross_realm_trust:realm: name of the remote realm to be trusted.
- cross_realm_trust:shared_password_uri: Location in Cloud Storage of the KMS-encrypted shared password.
Sample config file:
root_principal_password_uri: gs://my-bucket/kerberos-root-principal-password.encrypted
kms_key_uri: projects/project-id/locations/global/keyRings/my-keyring/cryptoKeys/my-key
cross_realm_trust:
  admin_server: admin.remote.realm
  kdc: kdc.remote.realm
  realm: REMOTE.REALM
  shared_password_uri: gs://bucket/shared_password.encrypted
To enable cross-realm trust to a remote KDC:
Add the following in the /etc/krb5.conf file in the remote KDC:
[realms]
DATAPROC.REALM = {
  kdc = MASTER-NAME-OR-ADDRESS
  admin_server = MASTER-NAME-OR-ADDRESS
}
Create the trust user:
kadmin -q "addprinc krbtgt/DATAPROC.REALM@REMOTE.REALM"
When prompted, enter the user's password. The password should match the contents of the encrypted shared password file.
To enable cross-realm trust with Active Directory, run the following commands in a PowerShell as Administrator:
Create a KDC definition in Active Directory.
ksetup /addkdc DATAPROC.REALM DATAPROC-CLUSTER-MASTER-NAME-OR-ADDRESS
Create trust in Active Directory.
netdom trust DATAPROC.REALM /Domain AD.REALM /add /realm /passwordt:TRUST-PASSWORD
The password should match the contents of the encrypted shared password file.
dataproc principal
When you submit a job via the Dataproc jobs API to a Dataproc Kerberos cluster, it runs as the dataproc Kerberos principal from the cluster's Kerberos realm.
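For example, a job submitted with the gcloud CLI runs as the dataproc principal (a sketch; the example jar and class ship with Dataproc's Spark installation):

gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000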
Multi-tenancy is supported within a Dataproc Kerberos cluster if you submit a job directly to the cluster, for example, via SSH. However, if the job reads from or writes to other Google Cloud services, such as Cloud Storage, the job acts as the cluster's service account.
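As a sketch of direct, per-user submission over SSH, a user could authenticate as their own principal before launching a job (user-principal is a hypothetical user principal added to the on-cluster KDC):

kinit user-principal
spark-submit --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/examples/jars/spark-examples.jar 1000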
Default and Custom Cluster Properties
Hadoop secure mode is configured with properties in config files. Dataproc sets default values for these properties.
You can override the default properties when you create the cluster with the gcloud dataproc clusters create --properties flag or by calling the clusters.create API and setting SoftwareConfig properties (see cluster properties examples).
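For example, the following sketch overrides a standard Hadoop secure mode setting in core-site.xml (hadoop.rpc.protection controls RPC protection; the value shown is an assumption for illustration):

gcloud dataproc clusters create cluster-name \
    --region=region \
    --enable-kerberos \
    --properties "core:hadoop.rpc.protection=privacy"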
High-Availability Mode
In High Availability (HA) mode, a Kerberos cluster will have 3 KDCs: one on each master. The KDC running on the "first" master ($CLUSTER_NAME-m-0) will be the Master KDC and also serve as the Admin Server. The Master KDC's database will be synced to the two replica KDCs at 5-minute intervals through a cron job, and the 3 KDCs will serve read traffic.
Kerberos does not natively support real-time replication or automatic failover if the master KDC is down. To perform a manual failover:
- On all KDC machines, in /etc/krb5.conf, change admin_server to the new Master's FQDN (Fully Qualified Domain Name), and remove the old Master from the KDC list.
- On the new Master KDC, set up a cron job to propagate the database (see the sketch after this list).
- On the new Master KDC, restart the admin_server process (krb5-admin-server).
- On all KDC machines, restart the KDC process (krb5-kdc).
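The exact propagation job isn't specified here; as an assumption, a standard MIT Kerberos approach dumps the database and pushes it to each replica with kprop, for example:

# Dump the KDC database, then push it to each replica (hostnames are placeholders).
kdb5_util dump /var/lib/krb5kdc/slave_datatrans
kprop -f /var/lib/krb5kdc/slave_datatrans kdc-replica-1.example.com
kprop -f /var/lib/krb5kdc/slave_datatrans kdc-replica-2.example.com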
Network Configuration
To make sure that worker nodes can talk to the KDC and Kerberos Admin Server running on the master(s), verify that the VPC firewall rules allow ingress TCP and UDP traffic on port 88 and ingress TCP traffic on port 749 on the master(s). In High-Availability mode, also make sure that VPC firewall rules allow ingress TCP traffic on port 754 on the masters to allow the propagation of changes made to the Master KDC. In addition, Kerberos requires reverse DNS to be properly set up on the cluster's network for host-based service principal canonicalization.
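For illustration, a firewall rule along these lines might open the Kerberos ports on the masters (the network name, source range, and target tag are placeholder assumptions for your environment):

gcloud compute firewall-rules create allow-kerberos-to-masters \
    --network=my-network \
    --allow=tcp:88,udp:88,tcp:749,tcp:754 \
    --source-ranges=10.128.0.0/20 \
    --target-tags=cluster-master-tag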
For More Information
See the MIT Kerberos Documentation.