Configure Kerberos for a service

Kerberos is a network authentication protocol that is designed to provide strong authentication for client/server applications by using secret-key cryptography. It's commonly used among the Hadoop stack for authentication throughout the software ecosystem.

Dataproc Metastore supports Kerberos through a customer-hosted Key Distribution Center (KDC). The API requirements to support Kerberos are a keytab file, a principal, and a krb5.conf file.

This page explains how to enable and configure Kerberos for your Dataproc Metastore Hive metastore service.

Before you begin

  • If you'd like to enable Kerberos for your Hive metastore instance, you must have the following set up:

    • Your own hosted Kerberos KDC.

      The KDC is an application that issues Kerberos tickets. It's responsible for authenticating users when Kerberos is used.

    • IP connectivity between the VPC network and your KDC in order to perform the initial authentication.

    • Firewall rules on your KDC to permit traffic from Dataproc Metastore. Also see Firewall rules for your services.

    • A Secret Manager secret that contains the contents of a keytab file.

      A keytab file contains pairs of Kerberos principals and encrypted keys, which can be used to authenticate a service principal with a Kerberos KDC. You must generate a keytab file with Dataproc's on-cluster KDC and use it to configure your Dataproc Metastore service.

      This keytab file must contain the entry for the service principal created for a Hive metastore. You must pin the Secret Manager secret to a specific secret version; the latest version alias isn't supported.

    • A principal that is in both the KDC and the keytab file.

      A valid Kerberos keytab file and principal are required to start the Hive metastore. The principal must exist in both the KDC and the keytab file, and must contain three parts: primary/instance@REALM. The _HOST instance is not supported.

    • A krb5.conf file in a Cloud Storage bucket.

      A valid krb5.conf file contains Kerberos configuration information, such as the KDC IP address, port, and realm name. You must specify the KDC IP address, not the KDC FQDN.

      Dataproc Metastore takes the entire krb5.conf file as a Cloud Storage object. During service creation, you must provide the Cloud Storage URI that specifies the path to your krb5.conf file. A typical URI is of the form gs://{bucket_name}/path/to/krb5.conf. A minimal example appears after this list.

    • For best results, use Cloud Storage buckets that are located in the same region as your Dataproc Metastore service. Although Dataproc Metastore doesn't enforce region restrictions, co-located resources and global resources perform better. For example, a global bucket is fine for any service region, but an EU multi-region bucket doesn't work well with a us-central1 service. Cross-region access results in higher latency, lack of regional failure isolation, and charges for cross-region network bandwidth.
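
A minimal krb5.conf sketch looks like the following; the realm name and IP address are placeholders, and your file can contain additional settings that your KDC requires:

  [libdefaults]
    default_realm = EXAMPLE.REALM

  [realms]
    EXAMPLE.REALM = {
      kdc = 192.0.2.2
      admin_server = 192.0.2.2
    }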

Access control

  • To create a service, you must request an IAM role containing the metastore.services.create IAM permission. The Dataproc Metastore-specific roles roles/metastore.admin and roles/metastore.editor include the create permission; see the example grant after this list.

  • You can give the create permission to users or groups by using the roles/owner and roles/editor legacy roles.

  • If you're using VPC Service Controls, then the Secret Manager secret and krb5.conf Cloud Storage object must belong to a project that resides in the same service perimeter as the Dataproc Metastore service.
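
For example, the following command sketch grants the roles/metastore.editor role, which includes the create permission, to a user; the user email is a placeholder:

  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member user:USER_EMAIL \
      --role roles/metastore.editor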

For more information, see Dataproc Metastore IAM and access control.

Enable Kerberos for a service

The following instructions demonstrate how to enable Kerberos for a Dataproc Metastore service that is integrated with Dataproc.

  1. Set up a Dataproc cluster with Kerberos enabled in the same VPC network that is going to be peered with the Dataproc Metastore service.

    1. Enable Project access when creating the Dataproc cluster to allow API access to all Google Cloud services in the same project. You can do this by passing --scopes 'https://www.googleapis.com/auth/cloud-platform' to the gcloud command that creates the cluster, as shown in the following example.
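
      The following command sketch shows one way to create such a cluster; the cluster name, region, and network are placeholders, and any other flags your environment requires are omitted:

      gcloud dataproc clusters create CLUSTER_NAME \
          --region REGION \
          --network NETWORK_NAME \
          --enable-kerberos \
          --scopes 'https://www.googleapis.com/auth/cloud-platform'
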
  2. SSH into the Dataproc cluster's primary instance. You can do this from either a browser or from the command line. Run the following commands on the primary instance:

    1. Modify /etc/hive/conf/hive-site.xml on the Dataproc cluster. Select a principal name (it should be of the format primary/instance@REALM). Look for the pre-existing hive.metastore.kerberos.principal property in /etc/hive/conf/hive-site.xml to find the REALM, and replace the primary and instance segments. An example principal name is hive/test@C.MY-PROJECT.INTERNAL.

      Make note of the principal name to use during Dataproc Metastore service creation:

      <property>
        <name>hive.metastore.kerberos.principal</name>
        <!-- Update this value. -->
        <value>PRINCIPAL_NAME</value>
      </property>
      <property>
        <name>hive.metastore.kerberos.keytab.file</name>
        <!-- Update to this value. -->
        <value>/etc/security/keytab/metastore.service.keytab</value>
      </property>
      
    2. Create the keytab/principal combination on the Dataproc cluster's primary VM:

      sudo kadmin.local -q "addprinc -randkey PRINCIPAL_NAME"
      sudo kadmin.local -q "ktadd -k /etc/security/keytab/metastore.service.keytab PRINCIPAL_NAME"
      
    3. Upload the keytab to Secret Manager from the Dataproc cluster's primary VM. This requires the identity running on the Dataproc VM to have the Secret Manager Admin role (roles/secretmanager.admin) so that it can create secrets. Make note of the secret version created to use during Dataproc Metastore service creation.

        gcloud secrets create SECRET_NAME --replication-policy automatic
        sudo gcloud secrets versions add SECRET_NAME --data-file /etc/security/keytab/metastore.service.keytab
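
      If that identity lacks the role, a project administrator can grant it with a command like the following sketch; the VM service account email is a placeholder:

        gcloud projects add-iam-policy-binding PROJECT_ID \
            --member serviceAccount:VM_SERVICE_ACCOUNT_EMAIL \
            --role roles/secretmanager.admin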
        

    4. Determine the primary internal IP address of the Dataproc cluster's primary instance (from Compute Engine UI or by gcloud compute instances list) and populate it as the cluster realm's kdc and admin_server in /etc/krb5.conf.
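
      The following command sketch prints that address, assuming a standard (non-HA) cluster whose primary VM uses the default CLUSTER_NAME-m name:

      gcloud compute instances list \
          --filter="name=CLUSTER_NAME-m" \
          --format="value(networkInterfaces[0].networkIP)"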

      For example (say the internal IP address of the primary is 192.0.2.2):

      [realms]
        US-CENTRAL1-A.C.MY-PROJECT.INTERNAL = {
          kdc = 192.0.2.2
          admin_server = 192.0.2.2
        }
      
    5. Upload the /etc/krb5.conf file from the Dataproc primary VM to Cloud Storage. Make note of the Cloud Storage path to use during Dataproc Metastore service creation.

      gsutil cp /etc/krb5.conf gs://bucket-name/path/to/krb5.conf
      
  3. Provide the Dataproc Metastore service account (this account is Google-managed and listed on the IAM permissions page when you select Include Google-provided role grants) with permission to access the keytab:

       gcloud projects add-iam-policy-binding PROJECT_ID \
           --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com \
           --role roles/secretmanager.secretAccessor
       

  4. Provide the Dataproc Metastore service account with permission to access the krb5.conf file:

       gcloud projects add-iam-policy-binding PROJECT_ID \
           --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com \
           --role roles/storage.objectViewer
       

  5. Make sure that you have configured ingress firewall rules for the KDC. These rules must be configured on the VPC network used to create the Dataproc cluster, and must allow TCP and UDP ingress traffic to reach the KDC.
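
    For example, a rule like the following sketch allows Kerberos traffic on the standard KDC port 88; the rule name, network, and source range are placeholders, and your KDC might require additional ports (such as the kadmin port):

       gcloud compute firewall-rules create allow-kerberos-kdc \
           --network NETWORK_NAME \
           --direction INGRESS \
           --action ALLOW \
           --rules tcp:88,udp:88 \
           --source-ranges SOURCE_IP_RANGE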

  6. Create a new Dataproc Metastore service or update an existing one with the above principal name, Secret Manager secret version, and krb5.conf Cloud Storage object URI. Make sure to specify the same VPC network that you used during the Dataproc cluster creation.

    The Dataproc Metastore service creation or update operation tests that a successful login occurs using the provided principal, keytab, and krb5.conf file. If the test fails, the operation fails.
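
    For example, a service creation command might look like the following sketch; the service name, location, and network are placeholders, and the exact Kerberos-related flag names can vary by gcloud release (check gcloud metastore services create --help):

      gcloud metastore services create SERVICE_NAME \
          --location LOCATION \
          --network NETWORK_NAME \
          --kerberos-principal PRINCIPAL_NAME \
          --krb5-config gs://bucket-name/path/to/krb5.conf \
          --keytab projects/PROJECT_ID/secrets/SECRET_NAME/versions/VERSION_NUMBER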

  7. After the Dataproc Metastore service has been created, make note of its Thrift endpoint URI and warehouse directory. The Thrift endpoint URI looks like thrift://10.1.2.3:9083, and the warehouse directory looks like gs://gcs-bucket-service-name-deadbeef/hive-warehouse. SSH into the Dataproc cluster's primary instance again and perform the following:

    1. Modify /etc/hive/conf/hive-site.xml on the Dataproc cluster:

      <property>
        <name>hive.metastore.uris</name>
        <!-- Update this value. -->
        <value>ENDPOINT_URI</value>
      </property>
      <!-- Add this property entry. -->
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>WAREHOUSE_DIR</value>
      </property>
      
    2. Restart HiveServer2:

      sudo systemctl restart hive-server2.service
      
    3. Modify /etc/hadoop/conf/container-executor.cfg to add the following line on every Dataproc node:

       allowed.system.users=hive
      
    4. Get a Kerberos ticket before connecting to the Dataproc Metastore instance:

      sudo klist -kte /etc/security/keytab/metastore.service.keytab
      sudo kinit -kt /etc/security/keytab/metastore.service.keytab PRINCIPAL_NAME
      sudo klist # gets the ticket information.
      

What's next