This page explains how to configure Kerberos for your Dataproc Metastore service that uses the Thrift endpoint protocol. If your Dataproc Metastore service uses the gRPC endpoint protocol, see Configure Kerberos for gRPC endpoints.
Before you begin
Understand the basics of Kerberos.
In these instructions, you use a Dataproc cluster to create the following Kerberos assets:
- A keytab file.
- A krb5.conf file.
- A Kerberos principal.
For more information about how these Kerberos assets work with a Dataproc Metastore service, see About Kerberos.
Create and host your own Kerberos KDC or learn how to use the local KDC of a Dataproc cluster.
Create a Cloud Storage bucket or get access to an existing one. You must store your krb5.conf file in this bucket.
Network considerations
Before configuring Kerberos, consider the following network settings:
Set up an IP connection between your VPC network and your KDC. You must do this to authenticate your KDC with the Dataproc Metastore service.
Set up any necessary firewall rules on your KDC. These rules are necessary to permit traffic from Dataproc Metastore. For more information, see Firewall rules for your services.
If you're using VPC Service Controls, then the Secret Manager secret and krb5.conf Cloud Storage object must belong to a project that resides in the same service perimeter as the Dataproc Metastore service.
Decide which VPC peering network you want to use. You must configure your Dataproc cluster and Dataproc Metastore service with the same VPC peering network.
Required roles
To get the permission that you need to create a Dataproc Metastore with Kerberos, ask your administrator to grant you the following IAM roles on your project, based on the principle of least privilege:
- Grant full control of Dataproc Metastore resources (roles/metastore.editor)
- Grant full access to all Dataproc Metastore resources, including IAM policy administration (roles/metastore.admin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the metastore.services.create permission, which is required to create a Dataproc Metastore with Kerberos.
You might also be able to get this permission with custom roles or other predefined roles.
For more information about specific Dataproc Metastore roles and permissions, see Manage access with IAM and Dataproc Metastore IAM and access control.
Enable Kerberos for Dataproc Metastore
The following instructions show you how to configure Kerberos for a Dataproc Metastore service that is attached to a Dataproc cluster.
Create a Dataproc cluster and enable Kerberos
gcloud
To set up a Dataproc cluster with Kerberos, run the following gcloud dataproc clusters create command:
gcloud dataproc clusters create CLUSTER_NAME \
    --image-version=2.0 \
    --enable-kerberos \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'
Replace the following:
CLUSTER_NAME: the name of your Dataproc cluster.
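For example, a filled-in command might look like the following. The cluster name and region here are placeholder values for illustration only:
gcloud dataproc clusters create example-kerberos-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --enable-kerberos \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'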
Configure the Dataproc cluster for Kerberos
The following instructions show you how to use SSH to connect to the primary node of the Dataproc cluster that's associated with your Dataproc Metastore service. Afterwards, you modify the hive-site.xml file and configure Kerberos for your service.
- In the Google Cloud console, go to the VM instances page.
In the list of virtual machine instances, click SSH in the row of the Dataproc primary node (your-cluster-name-m).
A browser window opens in your home directory on the node.
Open the /etc/hive/conf/hive-site.xml file:
sudo vim /etc/hive/conf/hive-site.xml
You see output similar to the following:
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>PRINCIPAL_NAME</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>METASTORE_PRINCIPAL_KEYTAB</value>
</property>
Replace the following:
PRINCIPAL_NAME: a principal name, in the format primary/instance@REALM. For example, hive/test@C.MY-PROJECT.INTERNAL.
METASTORE_PRINCIPAL_KEYTAB: the location of your Hive Metastore keytab file. Use the value /etc/security/keytab/metastore.service.keytab.
Create a keytab file
The following instructions show you how to create a keytab file.
A keytab file contains pairs of Kerberos principals and encrypted keys. You use a keytab file to authenticate a service principal with a Kerberos KDC.
To create a keytab file:
In the Dataproc SSH session, create the keytab and principal.
sudo kadmin.local -q "addprinc -randkey PRINCIPAL_NAME"
sudo kadmin.local -q "ktadd -k /etc/security/keytab/metastore.service.keytab PRINCIPAL_NAME"
In the Dataproc SSH session, create and upload the keytab file to Secret Manager.
gcloud secrets create SECRET_NAME --replication-policy automatic
sudo gcloud secrets versions add SECRET_NAME --data-file /etc/security/keytab/metastore.service.keytab
Replace the following:
SECRET_NAME: the name of your secret.
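Optionally, to confirm that the keytab uploaded successfully, you can list the secret's versions. This check is a suggestion and isn't part of the required steps:
gcloud secrets versions list SECRET_NAME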
Update the krb5.conf file
Next, you must update the krb5.conf file to associate it with your Dataproc cluster.
Determine the primary internal IP address of the Dataproc cluster's primary instance.
gcloud compute instances list
For example, running this command produces output similar to the following:
~$ gcloud compute instances list --project kerberos-project
NAME           ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP  STATUS
cluster-236-m  us-central1-c  n2-standard-4               192.0.2.2    *.*.*.*      RUNNING
...
In this case, the internal IP address of the cluster is 192.0.2.2.
Open the krb5.conf file:
sudo vim /etc/krb5.conf
In the file, replace the existing kdc and admin_server parameters with the internal IP address of your Dataproc cluster.
For example, using the IP address value from the previous steps looks similar to the following output:
[realms]
US-CENTRAL1-A.C.MY-PROJECT.INTERNAL = {
  kdc = 192.0.2.2
  admin_server = 192.0.2.2
}
Upload the /etc/krb5.conf file from your Dataproc primary VM to your Cloud Storage bucket:
gcloud storage cp /etc/krb5.conf gs://PATH_TO_KRB5
Replace the following:
PATH_TO_KRB5: the Cloud Storage URI that contains your krb5.conf file.
After the upload finishes, copy the path of the upload. You need to use it when you create your Dataproc Metastore service.
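Optionally, to confirm the upload and copy the exact URI, you can list the object in your bucket. This check is a suggestion and isn't part of the required steps:
gcloud storage ls gs://PATH_TO_KRB5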
Grant IAM roles and permissions
Provide the Dataproc Metastore service account with permission to access the keytab file. This account is Google-managed. To view it on the IAM permissions page, select Include Google-provided role grants.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com \
    --role roles/secretmanager.secretAccessor
Provide the Dataproc Metastore service account with permission to access the krb5.conf file:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com \
    --role roles/storage.objectViewer
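Optionally, to verify that both role bindings were applied, one way is to filter the project's IAM policy for the Dataproc Metastore service account. This check is a suggestion and isn't part of the required steps:
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:gcp-sa-metastore.iam.gserviceaccount.com" \
    --format="table(bindings.role)"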
Create a Dataproc Metastore service with Kerberos
Create a new Dataproc Metastore service that's configured with your Kerberos files.
Make sure you create your service in the same VPC network as your Dataproc cluster.
gcloud metastore services create SERVICE \
    --location=LOCATION \
    --instance-size=medium \
    --network=VPC_NETWORK \
    --kerberos-principal=KERBEROS_PRINCIPAL \
    --krb5-config=KRB5_CONFIG \
    --keytab=CLOUD_SECRET
Replace the following:
SERVICE: the name of your Dataproc Metastore service.
LOCATION: the location of your Dataproc Metastore service.
VPC_NETWORK: the name of your VPC network. Use the same network that's configured in your Dataproc cluster.
KERBEROS_PRINCIPAL: the name of the Kerberos principal that you created earlier.
KRB5_CONFIG: the location of the krb5.conf file. Use the Cloud Storage object URI that points to your file.
CLOUD_SECRET: the relative resource name of a Secret Manager secret version.
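As an illustration, a filled-in command might look like the following. Every value here is a placeholder example, not a value from your project:
gcloud metastore services create example-service \
    --location=us-central1 \
    --instance-size=medium \
    --network=default \
    --kerberos-principal=hive/test@C.MY-PROJECT.INTERNAL \
    --krb5-config=gs://example-bucket/krb5.conf \
    --keytab=projects/example-project/secrets/example-secret/versions/1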
After you create your service, Dataproc Metastore attempts to connect with your Kerberos credentials using the provided principal, keytab, and krb5.conf file. If the connection fails, then the Dataproc Metastore creation also fails.
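Optionally, you can check the result by describing the service and confirming that its state is ACTIVE. This verification is a suggestion and isn't part of the required steps:
gcloud metastore services describe SERVICE --location=LOCATION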
Link the Dataproc cluster to Dataproc Metastore
After the Dataproc Metastore service has been created, find your Thrift endpoint URI and warehouse directory.
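One way to retrieve the Thrift endpoint URI is to describe the service. The endpointUri field referenced here is an assumption based on the service resource, and the warehouse directory typically appears in the same describe output under the Hive metastore configuration:
gcloud metastore services describe SERVICE \
    --location=LOCATION \
    --format="value(endpointUri)"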
SSH into the primary instance of your Dataproc cluster.
In the SSH session, open the /etc/hive/conf/hive-site.xml file:
sudo vim /etc/hive/conf/hive-site.xml
Modify /etc/hive/conf/hive-site.xml on the Dataproc cluster:
<property>
  <name>hive.metastore.uris</name>
  <!-- Update this value. -->
  <value>ENDPOINT_URI</value>
</property>
<!-- Add this property entry. -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>WAREHOUSE_DIR</value>
</property>
Restart HiveServer2:
sudo systemctl restart hive-server2.service
Configure Dataproc before submitting jobs
To run your Dataproc jobs, you must add the hive user to the allowed.system.users property in the Hadoop container-executor.cfg file. This lets users run queries that access data, such as select * from statements.
In the SSH session, open the Hadoop container-executor.cfg file:
sudo vim /etc/hadoop/conf/container-executor.cfg
Add the following line on every Dataproc node.
allowed.system.users=hive
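For example, one way to append the line is shown below. This is a sketch: run it on each node, and adjust it if the property already exists in the file:
echo 'allowed.system.users=hive' | sudo tee -a /etc/hadoop/conf/container-executor.cfg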
Get a Kerberos ticket
Get a Kerberos ticket before connecting to the Dataproc Metastore instance.
sudo klist -kte /etc/security/keytab/metastore.service.keytab
sudo kinit -kt /etc/security/keytab/metastore.service.keytab PRINCIPAL_NAME
sudo klist  # gets the ticket information.
sudo hive
Replace the following:
PRINCIPAL_NAME: the name of your principal.
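Optionally, to confirm that the ticket lets the cluster reach the metastore, you can run a simple query from the Hive CLI. The databases listed depend on your environment:
sudo hive -e "SHOW DATABASES;"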