Cloud Dataproc Optional Components

When you create a cluster, standard Apache Hadoop ecosystem components are automatically installed on it (see Cloud Dataproc Version List). You can install additional components at cluster creation time using the Cloud Dataproc Optional Components feature described on this page. Adding components with the Optional Components feature is similar to adding them with initialization actions, but has the following advantages:

  • Faster cluster startup times
  • Tested compatibility with specific Cloud Dataproc versions
  • Use of a cluster parameter instead of an initialization action script
  • Optional components are integrated. For example, when Anaconda and Zeppelin are installed on a cluster using the Optional Components feature, Zeppelin will make use of Anaconda's Python interpreter and libraries.

Optional components can be added to clusters created with Cloud Dataproc version 1.3 and later.

Using optional components

gcloud command

To create a Cloud Dataproc cluster that uses optional components, run the gcloud beta dataproc clusters create command with the --optional-components flag and an image version of 1.3 or later.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=OPTIONAL_COMPONENT(s) \
  --image-version=1.3 \
  ... other flags

REST API

Optional components can be specified through the Cloud Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
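
For example, a minimal clusters.create request that sets optional components in SoftwareConfig might look like the following sketch. The project, region, and cluster names are placeholders, and the v1beta2 endpoint is assumed here because optional components are a Beta feature.

# Sketch: create a cluster with optional components through the REST API
# (project, region, and cluster names are placeholders).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "projectId": "my-project",
        "clusterName": "cluster-name",
        "config": {
          "softwareConfig": {
            "imageVersion": "1.3",
            "optionalComponents": ["ANACONDA", "ZEPPELIN"]
          }
        }
      }' \
  "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/us-central1/clusters"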

Console

Currently, the Cloud Dataproc Optional Components feature is not supported in the Google Cloud Platform Console.

Optional components

The following optional components and Web interfaces are available for installation on Cloud Dataproc clusters.

Anaconda

Anaconda (Anaconda2-5.1.0) is a Python distribution and package manager with over 1000 popular data science packages. Anaconda is installed on all cluster nodes in /opt/conda/anaconda, and becomes the default Python interpreter.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA \
  --image-version=1.3
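
After the cluster is created, you can verify the installation from a terminal on a cluster node; the following is a minimal sketch using the install path noted above.

# List the packages bundled with the Anaconda installation on a cluster node.
/opt/conda/anaconda/bin/conda list

# The default python interpreter should resolve to the Anaconda installation.
which python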

Hive WebHCat

The Hive WebHCat server (2.3.2) provides a REST API for HCatalog. The REST service is available on port 50111 on the cluster's first master node.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=HIVE_WEBHCAT \
  --image-version=1.3
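
Once the cluster is running, you can check that the WebHCat service is up from a terminal on the first master node. The status endpoint below is the standard WebHCat (Templeton) path, shown here as a sketch.

# Query the WebHCat (Templeton) status endpoint on the first master node.
curl -s http://localhost:50111/templeton/v1/status
# An "ok" status in the JSON response indicates the service is running.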

Jupyter Notebook

Jupyter (4.4.0) is a Web-based notebook for interactive data analytics. The Jupyter Web UI is available on port 8123 on the cluster's first master node (see Connecting to web interfaces to set up an SSH tunnel from your local machine to the Jupyter notebook running on the cluster). The notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Cloud Dataproc staging bucket (specified by the user or auto-created). The location can be changed at cluster creation time via the dataproc:jupyter.notebook.gcs.dir property.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA,JUPYTER \
  --image-version=1.3
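
For example, to save notebooks to a bucket of your own rather than the staging bucket, you can set the property mentioned above at creation time (the bucket path below is a placeholder).

# Sketch: save Jupyter notebooks to a custom Cloud Storage location.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA,JUPYTER \
  --properties="dataproc:jupyter.notebook.gcs.dir=gs://my-bucket/notebooks" \
  --image-version=1.3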

Zeppelin Notebook

Zeppelin Notebook (0.8.0) is a Web-based notebook for interactive data analytics. The Zeppelin Web UI is available on port 8080 on the cluster's first master node.

By default, notebooks are saved in Cloud Storage in the Cloud Dataproc staging bucket (specified by the user or auto-created). The location can be changed at cluster creation time via the zeppelin:zeppelin.notebook.gcs.dir property.

Zeppelin can be configured by providing zeppelin and zeppelin-env prefixed cluster properties.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ZEPPELIN \
  --image-version=1.3
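
For example, the following sketch sets the notebook location property mentioned above along with a zeppelin-env variable (ZEPPELIN_MEM is shown only as an illustrative environment variable; the bucket path is a placeholder).

# Sketch: configure Zeppelin with zeppelin and zeppelin-env prefixed properties.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=ZEPPELIN \
  --properties="zeppelin:zeppelin.notebook.gcs.dir=gs://my-bucket/zeppelin-notebooks,zeppelin-env:ZEPPELIN_MEM=-Xmx2g" \
  --image-version=1.3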

Presto

Presto (0.215) is an open source distributed SQL query engine. The Presto server and Web UI are available on port 8060 (or port 7778 if Kerberos is enabled) on the cluster's first master node. The Presto CLI (Command Line Interface) can be invoked with the presto command from a terminal window on the cluster's first master node.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=PRESTO \
  --image-version=1.3
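
After creation, you can run queries from the first master node with the Presto CLI; the catalog and query below are only illustrative.

# Sketch: run an ad hoc query against the Hive catalog from the first master node.
presto --catalog hive --schema default --execute "SHOW TABLES;"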

Kerberos

This component enables Kerberos/Hadoop Secure Mode, providing user authentication, isolation, and encryption inside a Cloud Dataproc cluster. The cluster includes the MIT distribution of Kerberos (1.15.1), and Apache Hadoop YARN, HDFS, Hive, Spark, and related components are configured to use it for authentication.

This component creates an on-cluster Key Distribution Center (KDC) that contains service principals and a root principal. The root principal is the account with administrator permissions to the on-cluster KDC. The KDC can also contain standard user principals or be connected via cross-realm trust to another KDC that contains the user principals.

You must provide a password for the Kerberos root principal. To provide the password securely, encrypt it with a Cloud Key Management Service (KMS) key and store it in a Cloud Storage bucket that the cluster service account can access. The cluster service account must be granted the cloudkms.cryptoKeyDecrypter IAM role.
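
The following sketch shows one way to prepare the password; the key ring, key, bucket, and service account names are placeholders assumed for illustration.

# Encrypt the root principal password with a KMS key (names are placeholders).
gcloud kms encrypt \
  --location=global \
  --keyring=my-keyring \
  --key=my-key \
  --plaintext-file=root-password.txt \
  --ciphertext-file=root-password.encrypted

# Store the encrypted password where the cluster service account can read it.
gsutil cp root-password.encrypted gs://my-bucket/kerberos/

# Grant the cluster service account permission to decrypt with the key.
gcloud kms keys add-iam-policy-binding my-key \
  --location=global \
  --keyring=my-keyring \
  --member=serviceAccount:my-cluster-sa@my-project.iam.gserviceaccount.com \
  --role=roles/cloudkms.cryptoKeyDecrypter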

Creating a kerberized cluster

To create a kerberized cluster with the gcloud command-line tool:

gcloud beta dataproc clusters create cluster-name \
    --optional-components=KERBEROS \
    --properties="dataproc:kerberos.root.principal.password.uri=PASSWORD-URI,dataproc:kerberos.kms.key.uri=KMS-KEY-URI" \
    --image-version=1.3

where PASSWORD-URI is the Cloud Storage URI of the KMS-encrypted password for the Kerberos root principal, and KMS-KEY-URI is the URI of the KMS key used to decrypt that password. Note that both properties are passed in a single, comma-separated --properties flag.

OS Login

On-cluster KDC management can be performed with the kadmin command using the root Kerberos user principal or using sudo kadmin.local. Enable OS Login to control who can run superuser commands.
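
For example, from the first master node (a sketch; the principal name is a placeholder):

# List existing principals in the on-cluster KDC.
sudo kadmin.local -q "listprincs"

# Add a standard user principal (the name is a placeholder).
sudo kadmin.local -q "addprinc user1"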

SSL Certificates

As part of enabling Hadoop Secure Mode, Cloud Dataproc creates a self-signed certificate to enable cluster SSL encryption. As an alternative, you can provide your own certificate for cluster SSL encryption by adding the certificate-related properties to the --properties flag when you create a kerberized cluster (see Cloud Dataproc service properties for more information).

Default Properties

Hadoop Secure Mode is configured with properties in configuration files. Cloud Dataproc sets default values for these properties.

Custom Properties

You can override the default properties when you create the cluster with the gcloud beta dataproc clusters create --properties flag or by calling the clusters.create API and setting SoftwareConfig properties (see cluster properties examples).
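
For example, a default could be overridden at creation time as in the sketch below. The hadoop.rpc.protection property is shown only as an illustrative Hadoop setting; check the default values for your image version, and note that the Kerberos password and key properties shown earlier are still required.

# Sketch: override a Hadoop security default when creating a kerberized cluster.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="core:hadoop.rpc.protection=privacy" \
  --image-version=1.3 \
  ... other flags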

Additional Properties

To specify the master key of the KDC database, create a kerberized cluster using the --properties flag to set the dataproc:kerberos.kdc.db.key.uri property. If this property is not set, Cloud Dataproc will generate the master key.

To specify the ticket granting ticket's maximum life time (in hours), create a kerberized cluster using the --properties flag to set the dataproc:kerberos.tgt.lifetime.hours property. If this property is not set, Cloud Dataproc will set the ticket granting ticket's life time to 10 hours.
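
For example (a sketch; the Cloud Storage URI is a placeholder, and the other required Kerberos properties are omitted for brevity):

# Sketch: supply a KDC database master key and shorten the TGT lifetime to 3 hours.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="dataproc:kerberos.kdc.db.key.uri=gs://my-bucket/kerberos/kdc-db-key.encrypted,dataproc:kerberos.tgt.lifetime.hours=3" \
  --image-version=1.3 \
  ... other flags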

Cross-realm trust

The KDC on the cluster initially contains only the root administrator principal and service principals. You can add user principals manually or establish a cross-realm trust with an external KDC or Active Directory server that holds the user principals. To connect to an on-premises KDC or Active Directory server, Cloud VPN or Cloud Interconnect is recommended.

To create a kerberized cluster that supports cross-realm trust, add the cross-realm trust properties to the --properties flag when you create the cluster. The shared password should be encrypted with KMS and stored in a Cloud Storage bucket that the cluster service account has access to (see Cloud Dataproc service properties for more information).
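
As an illustration only, a creation command could look like the following sketch. The cross-realm trust property names shown here are assumptions; verify them against the Cloud Dataproc service properties reference for your image version. Realm names, host names, and URIs are placeholders.

# Sketch: create a kerberized cluster that trusts an external realm
# (property names are assumptions; confirm them before use).
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="dataproc:kerberos.cross-realm-trust.realm=REMOTE.REALM,dataproc:kerberos.cross-realm-trust.kdc=remote-kdc-hostname,dataproc:kerberos.cross-realm-trust.admin-server=remote-kdc-hostname,dataproc:kerberos.cross-realm-trust.shared-password.uri=gs://my-bucket/kerberos/shared-password.encrypted" \
  --image-version=1.3 \
  ... other flags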

To enable cross-realm trust with a remote KDC:

  1. Add the following to the /etc/krb5.conf file on the remote KDC:

    [realms]
    DATAPROC.REALM = {
      kdc = MASTER-NAME-OR-ADDRESS
      admin_server = MASTER-NAME-OR-ADDRESS
    }
    

  2. Create the trust user

    kadmin -q "addprinc krbtgt/DATAPROC.REALM@REMOTE.REALM"
    

  3. When prompted, enter the password for the user. The password should match the contents of the encrypted shared password file.

To enable cross-realm trust with Active Directory, run the following commands in a PowerShell window as Administrator:

  1. Create a KDC definition in Active Directory

    ksetup /addkdc DATAPROC.REALM DATAPROC-CLUSTER-MASTER-NAME-OR-ADDRESS
    

  2. Create trust in Active Directory

    netdom trust DATAPROC.REALM /Domain AD.REALM /add /realm /passwordt:TRUST-PASSWORD
    
    The password should match the contents of the encrypted shared password file.

dataproc user

A kerberized Cloud Dataproc cluster is multi-tenant only within the cluster. When it reads or writes to other Google Cloud Platform services, the cluster acts as the cluster service account. When you submit jobs to a kerberized cluster, they run as a single dataproc user.

High-Availability Mode

In High Availability (HA) mode, a kerberized cluster has three KDCs: one on each master. The KDC running on the "first" master ($CLUSTER_NAME-m-0) is the Master KDC and also serves as the Admin Server. The Master KDC's database is synced to the two slave KDCs at 5-minute intervals through a cron job, and all three KDCs serve read traffic.

Kerberos does not natively support real-time replication or automatic failover if the master KDC is down. To perform a manual failover:

  1. On all KDC machines, in /etc/krb5.conf, change admin_server to the new Master's FQDN (Fully Qualified Domain Name). Remove the old Master from the KDC list.
  2. On the new Master KDC, set up a cron job to propagate the database (a sketch of such a job is shown after this list).
  3. On the new Master KDC, restart the admin_server process (krb5-admin-server).
  4. On all KDC machines, restart the KDC process (krb5-kdc).
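
The following is a minimal sketch of the propagation job referenced in step 2, assuming the standard MIT Kerberos kdb5_util and kprop tools, a working kpropd setup on the slave KDCs, and a Debian-style /etc/cron.d entry. The slave host names are placeholders.

#!/bin/bash
# /usr/local/sbin/krb5-propagate.sh (sketch): dump the KDC database and push
# it to each slave KDC; the host names below are placeholders.
/usr/sbin/kdb5_util dump /var/lib/krb5kdc/slave_datatrans
for kdc in cluster-name-m-1 cluster-name-m-2; do
  /usr/sbin/kprop -f /var/lib/krb5kdc/slave_datatrans "$kdc"
done

# /etc/cron.d/krb5-prop (sketch): run the propagation every 5 minutes as root.
*/5 * * * * root /usr/local/sbin/krb5-propagate.sh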

For More Information

See the MIT Kerberos Documentation.
