Dataproc services

This page lists the services that run on Dataproc cluster nodes, along with the image versions that include each service.

All nodes

The following services run on all nodes in a cluster.

Node type	Service	Image versions	Description
All nodes	google-dataproc-agent	all	Receives jobs from Dataproc and launches job drivers
All nodes	google-fluentd	all	Collects and pushes logs to Logging

Standard clusters

The following services run on standard clusters.

Node type	Service	Image versions	Description
Master node	hadoop-hdfs-namenode	all	Manages the HDFS filesystem
	hadoop-hdfs-secondarynamenode	all	Checkpoints the NameNode
	hadoop-mapreduce-historyserver	all	Serves MapReduce application history information
	hadoop-yarn-resourcemanager	all	Schedules and manages YARN applications
	hadoop-yarn-timelineserver	1.3+	Serves YARN application history information
	hive-metastore	all	Manages Hive table metadata. By default, uses the local `mariadb` (image versions < 1.5) or `mysql` (image versions 1.5+) database on the master node as the Hive table metadata store. The default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: Dataproc Metastore, or a Cloud SQL instance.
	hive-server2	all	Serves queries received from clients (primarily beeline shell queries) against Hive
	mariadb	< 1.5	A relational database used as the default underlying database for Hive metastore in Dataproc < 1.5 images
	mysql	1.5+	A relational database used as the default underlying database for Hive metastore in Dataproc 1.5+ images
	nfs-kernel-server	< 1.3	Runs the Network File System (NFS) server
	spark-history-server	all	Serves Spark application history information
All Workers	hadoop-yarn-nodemanager	all	Launches and manages YARN containers
Primary Workers only	hadoop-hdfs-datanode	all	Stores HDFS blocks
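
The "Image versions" column uses three forms: `all`, a minimum version such as `1.3+`, and an upper bound such as `< 1.5`. As a minimal sketch of how to read that column, the following Python snippet lists the master node services from the table above that apply to a given image version. The function names, the `MASTER_SERVICES` mapping, and the assumption that an image version string starts with `major.minor` (for example `1.5` or `2.0-debian10`) are illustrative only and are not part of any Dataproc API.

```python
# Illustrative sketch only: these names are hypothetical, not a Dataproc API.

def parse_version(text):
    """Return (major, minor) from a string such as '1.5' or '2.0-debian10'."""
    major, minor = text.strip().split("-")[0].split(".")[:2]
    return int(major), int(minor)


def applies(constraint, image_version):
    """Check an 'Image versions' cell ('all', '1.3+', '< 1.5') against a version."""
    constraint = constraint.strip()
    version = parse_version(image_version)
    if constraint == "all":
        return True
    if constraint.endswith("+"):
        # '1.3+' means image version 1.3 or later.
        return version >= parse_version(constraint[:-1])
    if constraint.startswith("<"):
        # '< 1.5' means any image version earlier than 1.5.
        return version < parse_version(constraint[1:])
    raise ValueError(f"Unrecognized constraint: {constraint!r}")


# Master node services and constraints, copied from the standard clusters table.
MASTER_SERVICES = {
    "hadoop-hdfs-namenode": "all",
    "hadoop-hdfs-secondarynamenode": "all",
    "hadoop-mapreduce-historyserver": "all",
    "hadoop-yarn-resourcemanager": "all",
    "hadoop-yarn-timelineserver": "1.3+",
    "hive-metastore": "all",
    "hive-server2": "all",
    "mariadb": "< 1.5",
    "mysql": "1.5+",
    "nfs-kernel-server": "< 1.3",
    "spark-history-server": "all",
}


def master_services(image_version):
    """Return the sorted master node services for the given image version."""
    return sorted(
        name
        for name, constraint in MASTER_SERVICES.items()
        if applies(constraint, image_version)
    )


if __name__ == "__main__":
    for version in ("1.2", "1.4", "2.0-debian10"):
        print(version, master_services(version))
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating `1.10` as earlier than `1.5`.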

HA Clusters

In Dataproc High Availability (HA) clusters, different services run on different master nodes, as shown below. HA cluster worker node services are the same as those listed for standard clusters.

Node type	Service	Image versions	Description
All masters	hadoop-hdfs-journalnode	all	A quorum of journal nodes maintains an edit log of HDFS namespace modifications. If a failover occurs, the Standby NameNode reads the edit log and takes control from the Active NameNode.
	hadoop-yarn-resourcemanager	all	Schedules and manages YARN applications
	hive-metastore	all	Manages Hive table metadata. By default, uses the local `mariadb` (image versions < 1.5) or `mysql` (image versions 1.5+) database on the master node as the Hive table metadata store. The default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: Dataproc Metastore, or a Cloud SQL instance.
	hive-server2	all	Serves queries received from clients (primarily beeline shell queries) against Hive
	zookeeper-server	all	A ZooKeeper quorum is used for distributed coordination. In High Availability (HA) clusters, it is used for leader election among HDFS NameNodes and among YARN ResourceManagers.
Masters 0 and 1 only	hadoop-hdfs-namenode	all	Manages the HDFS filesystem
Masters 0 and 1 only	hadoop-hdfs-zkfc	all	ZKFC is the `ZKFailoverController` process, which runs with the HDFS NameNode. It monitors the health of the NameNode, and manages leader election via ZooKeeper in the event of a failover.
Master 0 only	hadoop-mapreduce-historyserver	all	Serves MapReduce application history information
	hadoop-yarn-timelineserver	1.3+	Serves YARN application history information
	mariadb	< 1.5	A relational database used as the default underlying database for Hive metastore in Dataproc < 1.5 images
	mysql	1.5+	A relational database used as the default underlying database for Hive metastore in Dataproc 1.5+ images
	nfs-kernel-server	< 1.3	Runs the Network File System (NFS) server
	spark-history-server	all	Serves Spark application history information
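
The placement rules in the HA table can be sketched as a lookup from master index to services. The snippet below is an illustration under stated assumptions: it assumes an HA cluster with three masters indexed 0, 1, and 2, the names `HA_PLACEMENT` and `services_on_master` are hypothetical, and image version constraints are not applied, so version-specific services such as `mariadb` and `mysql` are both listed for master 0.

```python
# Illustrative sketch only: these names are hypothetical, not a Dataproc API.
# Assumes three HA masters with indexes 0, 1, and 2.

HA_PLACEMENT = [
    # (master indexes the group runs on, services copied from the HA table)
    ({0, 1, 2}, [
        "hadoop-hdfs-journalnode",
        "hadoop-yarn-resourcemanager",
        "hive-metastore",
        "hive-server2",
        "zookeeper-server",
    ]),
    ({0, 1}, [
        "hadoop-hdfs-namenode",
        "hadoop-hdfs-zkfc",
    ]),
    ({0}, [
        "hadoop-mapreduce-historyserver",
        "hadoop-yarn-timelineserver",  # image versions 1.3+
        "mariadb",                     # image versions < 1.5
        "mysql",                       # image versions 1.5+
        "nfs-kernel-server",           # image versions < 1.3
        "spark-history-server",
    ]),
]


def services_on_master(index):
    """Return the sorted services listed for the HA master with this index."""
    if index not in (0, 1, 2):
        raise ValueError(f"Expected master index 0, 1, or 2, got {index!r}")
    found = []
    for masters, services in HA_PLACEMENT:
        if index in masters:
            found.extend(services)
    return sorted(found)


if __name__ == "__main__":
    for i in range(3):
        print(f"master {i}:", services_on_master(i))
```

Under these assumptions, master 2 runs only the services listed for all masters, while master 0 additionally hosts the history servers and the default Hive metastore database.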