Cloud Dataproc Optional Components

When you create a cluster, standard Apache Hadoop ecosystem components are automatically installed on the cluster (see Cloud Dataproc Version List). You can install additional components at cluster creation time by using the Cloud Dataproc Optional Components feature described on this page. Adding components to a cluster using the Optional Components feature is similar to adding components through the use of initialization actions, but has the following advantages:

  • Faster cluster startup times
  • Tested compatibility with specific Cloud Dataproc versions
  • Use of a cluster parameter instead of an initialization action script
  • Optional components are integrated. For example, when Anaconda and Zeppelin are installed on a cluster using the Optional Components feature, Zeppelin will make use of Anaconda's Python interpreter and libraries.

Optional components can be added to clusters created with Cloud Dataproc version 1.3 and later.

Using Optional Components

gcloud command

To create a Cloud Dataproc cluster that uses Optional Components, use the gcloud beta dataproc clusters create command with the --optional-components flag (using image version 1.3 or later).

gcloud beta dataproc clusters create cluster-name \
  --optional-components=OPTIONAL_COMPONENT(s) \
  --image-version=1.3-deb9 \
  ... other flags

REST API

Optional components can be specified through the Cloud Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
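As an illustrative sketch, the components can be listed in the softwareConfig.optionalComponents field of the cluster configuration in a clusters.create request. The project ID, region, and cluster name below are placeholders, and an OAuth access token is assumed to be available through gcloud:

```shell
# Sketch of a clusters.create request that installs Anaconda and Jupyter.
# Replace project-id, region, and cluster-name with your own values.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1beta2/projects/project-id/regions/region/clusters" \
  -d '{
    "clusterName": "cluster-name",
    "config": {
      "softwareConfig": {
        "imageVersion": "1.3-deb9",
        "optionalComponents": ["ANACONDA", "JUPYTER"]
      }
    }
  }'
```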

Console

Currently, the Cloud Dataproc Optional Components feature is not supported in the Google Cloud Platform Console.

Optional components

The following optional components and Web interfaces are available for installation on Cloud Dataproc clusters.

Anaconda

Anaconda (Anaconda2-5.1.0) is a Python distribution and package manager with over 1000 popular data science packages. Anaconda is installed on all cluster nodes in /opt/conda/anaconda, and becomes the default Python interpreter.
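After SSH-ing to a cluster node, you can confirm that Anaconda is the default interpreter; the commands below are a sketch based on the install location stated above:

```shell
# Confirm the default Python resolves to the Anaconda install.
which python          # expected to point under /opt/conda
python --version

# List the packages bundled with the Anaconda distribution.
/opt/conda/anaconda/bin/conda list | head
```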

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA \
  --image-version=1.3-deb9

Hive WebHCat

The Hive WebHCat server (2.3.2) provides a REST API for HCatalog. The REST service is available on port 50111 on the cluster's first master node.
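For example, from a terminal on the cluster's first master node you can check that the service is up; the /templeton/v1 path is the standard WebHCat API root:

```shell
# Query WebHCat's status endpoint on the local REST port.
curl "http://localhost:50111/templeton/v1/status"
```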

gcloud beta dataproc clusters create cluster-name \
  --optional-components=HIVE_WEBHCAT \
  --image-version=1.3-deb9

Jupyter Notebook

Jupyter (4.4.0) is a Web-based notebook for interactive data analytics. The Jupyter Web UI is available on port 8123 on the cluster's first master node. By default, notebooks are saved in Cloud Storage in the Cloud Dataproc staging bucket (specified by the user or auto-created). Python2 and PySpark kernels are available for Jupyter notebooks.
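One way to reach the Jupyter Web UI from your workstation is an SSH tunnel to the first master node (named cluster-name-m by convention) that forwards port 8123; the cluster name and zone below are placeholders:

```shell
# Forward local port 8123 to the Jupyter UI on the first master node.
gcloud compute ssh cluster-name-m \
  --zone=us-central1-a \
  -- -L 8123:localhost:8123

# Then browse to http://localhost:8123 on your workstation.
```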

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA,JUPYTER \
  --image-version=1.3-deb9

Zeppelin Notebook

Zeppelin Notebook (0.8.0) is a Web-based notebook for interactive data analytics. The Zeppelin Web UI is available on port 8080 on the cluster's first master node.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ZEPPELIN \
  --image-version=1.3-deb9

Presto

Presto (0.206) is an open source distributed SQL query engine. The Presto server and Web UI are available on port 8060 on the cluster's first master node. The Presto CLI (Command Line Interface) can be invoked with the presto command from a terminal window on the cluster's first master node.
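As a sketch of CLI usage from a terminal on the first master node (the hive catalog and default schema names assume the default Presto-on-Dataproc configuration):

```shell
# Run a one-off query against the Hive catalog via the Presto CLI.
presto --catalog hive --schema default \
  --execute "SHOW TABLES;"
```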

gcloud beta dataproc clusters create cluster-name \
  --optional-components=PRESTO \
  --image-version=1.3-deb9