Cloud Dataproc Optional Components

When you create a cluster, standard Apache Hadoop ecosystem components are automatically installed on it (see Cloud Dataproc Version List). You can install additional components at cluster creation time using the Cloud Dataproc Optional Components feature described on this page. Adding components with the Optional Components feature is similar to adding them with initialization actions, but has the following advantages:

  • Faster cluster startup times
  • Tested compatibility with specific Cloud Dataproc versions
  • Use of a cluster parameter instead of an initialization action script
  • Optional components are integrated. For example, when Anaconda and Zeppelin are installed on a cluster using the Optional Components feature, Zeppelin will make use of Anaconda's Python interpreter and libraries.

Optional components can be added to clusters created with Cloud Dataproc version 1.3 and later.

Using optional components

gcloud command

To create a Cloud Dataproc cluster that uses optional components, run the gcloud beta dataproc clusters create command with the --optional-components flag and an image version of 1.3 or later.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=OPTIONAL_COMPONENT(s) \
  --image-version=1.3 \
  ... other flags

REST API

Optional components can be specified through the Cloud Dataproc API using SoftwareConfig.Component as part of a clusters.create request.
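
For example, a minimal clusters.create request that sets optional components in SoftwareConfig might look like the following sketch. The project, region, and cluster names are placeholders, and the v1beta2 endpoint is assumed here because optional components are a Beta feature.

# Sketch: create a cluster with optional components through the REST API
# (project, region, and cluster names are placeholders).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "projectId": "my-project",
        "clusterName": "cluster-name",
        "config": {
          "softwareConfig": {
            "imageVersion": "1.3",
            "optionalComponents": ["ANACONDA", "ZEPPELIN"]
          }
        }
      }' \
  "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/us-central1/clusters"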

Console

Currently, the Cloud Dataproc Optional Components feature is not supported in the Google Cloud Platform Console.

Optional components

The following optional components and Web interfaces are available for installation on Cloud Dataproc clusters.

Anaconda

Anaconda (Anaconda2-5.1.0) is a Python distribution and package manager with over 1000 popular data science packages. Anaconda is installed on all cluster nodes in /opt/conda/anaconda, and becomes the default Python interpreter.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA \
  --image-version=1.3
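
After the cluster is created, you can verify the installation from a terminal on a cluster node; the following is a minimal sketch using the install path noted above.

# List the packages bundled with the Anaconda installation on a cluster node.
/opt/conda/anaconda/bin/conda list

# The default python interpreter should resolve to the Anaconda installation.
which python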

Hive WebHCat

The Hive WebHCat server (2.3.2) provides a REST API for HCatalog. The REST service is available on port 50111 on the cluster's first master node.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=HIVE_WEBHCAT \
  --image-version=1.3
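
Once the cluster is running, you can check that the WebHCat service is up from a terminal on the first master node. The status endpoint below is the standard WebHCat (Templeton) path, shown here as a sketch.

# Query the WebHCat (Templeton) status endpoint on the first master node.
curl -s http://localhost:50111/templeton/v1/status
# An "ok" status in the JSON response indicates the service is running.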

Jupyter Notebook

Jupyter (4.4.0) is a Web-based notebook for interactive data analytics. The Jupyter Web UI is available on port 8123 on the cluster's first master node (see Connecting to web interfaces to set up an SSH tunnel from your local machine to the Jupyter notebook running on the cluster). The notebook provides a Python kernel to run Spark code, and a PySpark kernel. By default, notebooks are saved in Cloud Storage in the Cloud Dataproc staging bucket (specified by the user or auto-created). The location can be changed at cluster creation time via the dataproc:jupyter.notebook.gcs.dir property.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA,JUPYTER \
  --image-version=1.3
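
For example, to save notebooks to a bucket of your own rather than the staging bucket, you can set the property mentioned above at creation time (the bucket path below is a placeholder).

# Sketch: save Jupyter notebooks to a custom Cloud Storage location.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=ANACONDA,JUPYTER \
  --properties="dataproc:jupyter.notebook.gcs.dir=gs://my-bucket/notebooks" \
  --image-version=1.3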

Zeppelin Notebook

Zeppelin Notebook (0.8.0) is a Web-based notebook for interactive data analytics. The Zeppelin Web UI is available on port 8080 on the cluster's first master node.

By default, notebooks are saved in Cloud Storage in the Cloud Dataproc staging bucket (specified by the user or auto-created). The location can be changed at cluster creation time via the zeppelin:zeppelin.notebook.gcs.dir property.

Zeppelin can be configured by providing zeppelin and zeppelin-env prefixed cluster properties.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=ZEPPELIN \
  --image-version=1.3
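
For example, the following sketch sets the notebook location property mentioned above along with a zeppelin-env variable (ZEPPELIN_MEM is shown only as an illustrative environment variable; the bucket path is a placeholder).

# Sketch: configure Zeppelin with zeppelin and zeppelin-env prefixed properties.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=ZEPPELIN \
  --properties="zeppelin:zeppelin.notebook.gcs.dir=gs://my-bucket/zeppelin-notebooks,zeppelin-env:ZEPPELIN_MEM=-Xmx2g" \
  --image-version=1.3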

Presto

Presto (0.215) is an open source distributed SQL query engine. The Presto server and Web UI are available on port 8060 (or port 7778 if Kerberos is enabled) on the cluster's first master node. The Presto CLI (Command Line Interface) can be invoked with the presto command from a terminal window on the cluster's first master node.

gcloud beta dataproc clusters create cluster-name \
  --optional-components=PRESTO \
  --image-version=1.3
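
After creation, you can run queries from the first master node with the Presto CLI; the catalog and query below are only illustrative.

# Sketch: run an ad hoc query against the Hive catalog from the first master node.
presto --catalog hive --schema default --execute "SHOW TABLES;"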

Kerberos

This component enables Kerberos/Hadoop Secure Mode, providing user authentication, isolation, and encryption inside a Cloud Dataproc cluster. The cluster includes the MIT distribution of Kerberos (1.15.1), and Apache Hadoop YARN, HDFS, Hive, Spark, and related components are configured to use it for authentication.

This component creates an on-cluster Key Distribution Center (KDC) that contains service principals and a root principal. The root principal is the account with administrator permissions to the on-cluster KDC. The KDC can also contain standard user principals or be connected via cross-realm trust to another KDC that contains the user principals.

You must provide a password for the Kerberos root principal. To provide the password securely, encrypt it with a Cloud Key Management Service (KMS) key and store it in a Cloud Storage bucket that the cluster service account can access. The cluster service account must be granted the cloudkms.cryptoKeyDecrypter IAM role.
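
The following sketch shows one way to prepare the password; the key ring, key, bucket, and service account names are placeholders assumed for illustration.

# Encrypt the root principal password with a KMS key (names are placeholders).
gcloud kms encrypt \
  --location=global \
  --keyring=my-keyring \
  --key=my-key \
  --plaintext-file=root-password.txt \
  --ciphertext-file=root-password.encrypted

# Store the encrypted password where the cluster service account can read it.
gsutil cp root-password.encrypted gs://my-bucket/kerberos/

# Grant the cluster service account permission to decrypt with the key.
gcloud kms keys add-iam-policy-binding my-key \
  --location=global \
  --keyring=my-keyring \
  --member=serviceAccount:my-cluster-sa@my-project.iam.gserviceaccount.com \
  --role=roles/cloudkms.cryptoKeyDecrypter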

Creating a kerberized cluster

To create a kerberized cluster with the gcloud command-line tool:

gcloud beta dataproc clusters create cluster-name \
    --optional-components=KERBEROS \
    --properties="dataproc:kerberos.root.principal.password.uri=PASSWORD-URI,dataproc:kerberos.kms.key.uri=KMS-KEY-URI" \
    --image-version=1.3

where PASSWORD-URI is the Cloud Storage URI of the KMS-encrypted password for the Kerberos root principal, and KMS-KEY-URI is the URI of the KMS key used to decrypt that password. Note that both properties are passed in a single, comma-separated --properties flag.

OS Login

On-cluster KDC management can be performed with the kadmin command using the root Kerberos user principal or using sudo kadmin.local. Enable OS Login to control who can run superuser commands.
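
For example, from the first master node (a sketch; the principal name is a placeholder):

# List existing principals in the on-cluster KDC.
sudo kadmin.local -q "listprincs"

# Add a standard user principal (the name is a placeholder).
sudo kadmin.local -q "addprinc user1"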

SSL Certificates

As part of enabling Hadoop Secure Mode, Cloud Dataproc creates a self-signed certificate to enable cluster SSL encryption. As an alternative, you can provide your own certificate for cluster SSL encryption by adding the certificate-related properties to the --properties flag when you create a kerberized cluster (see Cloud Dataproc service properties for more information).

Default Properties

Hadoop Secure Mode is configured with properties in configuration files. Cloud Dataproc sets default values for these properties.

Custom Properties

You can override the default properties when you create the cluster with the gcloud beta dataproc clusters create --properties flag or by calling the clusters.create API and setting SoftwareConfig properties (see cluster properties examples).
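
For example, a default could be overridden at creation time as in the sketch below. The hadoop.rpc.protection property is shown only as an illustrative Hadoop setting; check the default values for your image version, and note that the Kerberos password and key properties shown earlier are still required.

# Sketch: override a Hadoop security default when creating a kerberized cluster.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="core:hadoop.rpc.protection=privacy" \
  --image-version=1.3 \
  ... other flags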

Additional Properties

To specify the master key of the KDC database, create a kerberized cluster using the --properties flag to set the dataproc:kerberos.kdc.db.key.uri property. If this property is not set, Cloud Dataproc will generate the master key.

To specify the ticket granting ticket's maximum life time (in hours), create a kerberized cluster using the --properties flag to set the dataproc:kerberos.tgt.lifetime.hours property. If this property is not set, Cloud Dataproc will set the ticket granting ticket's life time to 10 hours.
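
For example (a sketch; the Cloud Storage URI is a placeholder, and the other required Kerberos properties are omitted for brevity):

# Sketch: supply a KDC database master key and shorten the TGT lifetime to 3 hours.
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="dataproc:kerberos.kdc.db.key.uri=gs://my-bucket/kerberos/kdc-db-key.encrypted,dataproc:kerberos.tgt.lifetime.hours=3" \
  --image-version=1.3 \
  ... other flags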

Cross-realm trust

The KDC on the cluster initially contains only the root administrator principal and service principals. You can add user principals manually or establish a cross-realm trust with an external KDC or Active Directory server that holds the user principals. To connect to an on-premises KDC or Active Directory server, Cloud VPN or Cloud Interconnect is recommended.

To create a kerberized cluster that supports cross-realm trust, add the cross-realm trust properties to the --properties flag when you create the cluster. The shared password should be encrypted with KMS and stored in a Cloud Storage bucket that the cluster service account has access to (see Cloud Dataproc service properties for more information).
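
As an illustration only, a creation command could look like the following sketch. The cross-realm trust property names shown here are assumptions; verify them against the Cloud Dataproc service properties reference for your image version. Realm names, host names, and URIs are placeholders.

# Sketch: create a kerberized cluster that trusts an external realm
# (property names are assumptions; confirm them before use).
gcloud beta dataproc clusters create cluster-name \
  --optional-components=KERBEROS \
  --properties="dataproc:kerberos.cross-realm-trust.realm=REMOTE.REALM,dataproc:kerberos.cross-realm-trust.kdc=remote-kdc-hostname,dataproc:kerberos.cross-realm-trust.admin-server=remote-kdc-hostname,dataproc:kerberos.cross-realm-trust.shared-password.uri=gs://my-bucket/kerberos/shared-password.encrypted" \
  --image-version=1.3 \
  ... other flags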

To enable cross-realm trust with a remote KDC:

  1. Add the following to the /etc/krb5.conf file on the remote KDC:

    [realms]
    DATAPROC.REALM = {
      kdc = MASTER-NAME-OR-ADDRESS
      admin_server = MASTER-NAME-OR-ADDRESS
    }
    

  2. Create the trust user

    kadmin -q "addprinc krbtgt/DATAPROC.REALM@REMOTE.REALM"
    

  3. When prompted, enter the password for the user. The password should match the contents of the encrypted shared password file.

To enable cross-realm trust with Active Directory, run the following commands in a PowerShell window as Administrator:

  1. Create a KDC definition in Active Directory

    ksetup /addkdc DATAPROC.REALM DATAPROC-CLUSTER-MASTER-NAME-OR-ADDRESS
    

  2. Create trust in Active Directory

    netdom trust DATAPROC.REALM /Domain AD.REALM /add /realm /passwordt:TRUST-PASSWORD
    
    The password should match the contents of the encrypted shared password file.

dataproc user

A kerberized Cloud Dataproc cluster is multi-tenant only within the cluster. When it reads or writes to other Google Cloud Platform services, the cluster acts as the cluster service account. When you submit jobs to a kerberized cluster, they run as a single dataproc user.

High-Availability Mode

In High Availability (HA) mode, a kerberized cluster has three KDCs: one on each master. The KDC running on the "first" master ($CLUSTER_NAME-m-0) is the Master KDC and also serves as the Admin Server. The Master KDC's database is synced to the two slave KDCs at 5-minute intervals through a cron job, and all three KDCs serve read traffic.

Kerberos does not natively support real-time replication or automatic failover if the master KDC is down. To perform a manual failover:

  1. On all KDC machines, in /etc/krb5.conf, change admin_server to the new Master's FQDN (Fully Qualified Domain Name). Remove the old Master from the KDC list.
  2. On the new Master KDC, set up a cron job to propagate the database (a sketch of such a job is shown after this list).
  3. On the new Master KDC, restart the admin_server process (krb5-admin-server).
  4. On all KDC machines, restart the KDC process (krb5-kdc).
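
The following is a minimal sketch of the propagation job referenced in step 2, assuming the standard MIT Kerberos kdb5_util and kprop tools, a working kpropd setup on the slave KDCs, and a Debian-style /etc/cron.d entry. The slave host names are placeholders.

#!/bin/bash
# /usr/local/sbin/krb5-propagate.sh (sketch): dump the KDC database and push
# it to each slave KDC; the host names below are placeholders.
/usr/sbin/kdb5_util dump /var/lib/krb5kdc/slave_datatrans
for kdc in cluster-name-m-1 cluster-name-m-2; do
  /usr/sbin/kprop -f /var/lib/krb5kdc/slave_datatrans "$kdc"
done

# /etc/cron.d/krb5-prop (sketch): run the propagation every 5 minutes as root.
*/5 * * * * root /usr/local/sbin/krb5-propagate.sh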

For More Information

See the MIT Kerberos Documentation.
