Dataproc services

This page lists the services that run on Dataproc cluster nodes and the image versions in which each service runs.

All nodes

The following services run on all nodes in a cluster.

Node type | Service | Image versions | Description
All nodes | google-dataproc-agent | all | Receives jobs from Dataproc and launches job drivers
All nodes | google-fluentd | all | Collects and pushes logs to Logging
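
To confirm that these services are running on a node, one option is to query the init system for each name. The following is a minimal sketch, not an official tool. It assumes the services are registered as systemd units under the names listed above and that the script runs on the cluster node itself. The `check_services` helper and its `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

# Service names copied from the "All nodes" table above.
ALL_NODE_SERVICES = ["google-dataproc-agent", "google-fluentd"]


def is_active_command(service: str) -> List[str]:
    """Build the systemctl call; it exits with 0 only if the unit is active."""
    return ["systemctl", "is-active", "--quiet", service]


def check_services(
    services: Iterable[str],
    run: Callable = subprocess.run,
) -> Dict[str, bool]:
    """Return {service: True/False} depending on each unit's active state.

    Assumption: every service is a systemd unit with the same name.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    results = {}
    for service in services:
        completed = run(is_active_command(service), check=False)
        results[service] = completed.returncode == 0
    return results
```

On a cluster node, `check_services(ALL_NODE_SERVICES)` would return a dictionary mapping each service name to `True` or `False`. The same helper could be given any of the service names from the tables on this page.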

Standard clusters

The following services run on standard clusters.

Node type | Service | Image versions | Description
Master | hadoop-hdfs-namenode | all | Manages the HDFS filesystem
Master | hadoop-hdfs-secondarynamenode | all | Checkpoints the NameNode
Master | hadoop-mapreduce-historyserver | all | Serves MapReduce application history information
Master | hadoop-yarn-resourcemanager | all | Schedules and manages YARN applications
Master | hadoop-yarn-timelineserver | 1.3+ | Serves YARN application history information
Master | hive-metastore | all | Manages Hive table metadata. By default, uses the local MariaDB (image versions < 1.5) or MySQL (image versions 1.5+) database on the master node as the Hive table metadata store. The default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: (1) Dataproc Metastore, (2) a Cloud SQL instance.
Master | hive-server2 | all | Serves queries received from clients (primarily beeline shell queries) against Hive
Master | mariadb | < 1.5 | Relational database used as the default underlying database for the Hive metastore in Dataproc < 1.5 images
Master | mysql | 1.5+ | Relational database used as the default underlying database for the Hive metastore in Dataproc 1.5+ images
Master | nfs-kernel-server | < 1.3 | Runs the Network File System (NFS) server
Master | spark-history-server | all | Serves Spark application history information
All workers | hadoop-yarn-nodemanager | all | Launches and manages YARN containers
Primary workers only | hadoop-hdfs-datanode | all | Stores HDFS blocks
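
The image-version conditions ("all", "1.3+", "< 1.5") can be easier to read as a lookup. The following sketch transcribes the standard-cluster and all-node tables into a small data structure and returns the services expected for a given node role and image version. The role labels (`master`, `primary-worker`, `secondary-worker`) and all function names are hypothetical, chosen for illustration. The sketch places the NameNode and the other non-worker services on the master node, and it assumes image versions begin with a `major.minor` number.

```python
from typing import Callable, List, Tuple

Version = Tuple[int, int]
Condition = Callable[[Version], bool]


def parse_version(image_version: str) -> Version:
    """Reduce '1.5', '1.4.80', or '2.1-debian11' to a (major, minor) tuple."""
    numeric = image_version.split("-")[0]
    major, minor = numeric.split(".")[:2]
    return int(major), int(minor)


def always(version: Version) -> bool:
    """Condition for rows marked 'all'."""
    return True


def before(limit: Version) -> Condition:
    """Condition for rows marked '< X.Y'."""
    return lambda version: version < limit


def at_least(limit: Version) -> Condition:
    """Condition for rows marked 'X.Y+'."""
    return lambda version: version >= limit


MASTER = ("master",)
ALL_WORKERS = ("primary-worker", "secondary-worker")
ALL_NODES = MASTER + ALL_WORKERS

# (roles, service, image-version condition), transcribed from the tables above.
SERVICES: List[Tuple[Tuple[str, ...], str, Condition]] = [
    (ALL_NODES, "google-dataproc-agent", always),
    (ALL_NODES, "google-fluentd", always),
    (MASTER, "hadoop-hdfs-namenode", always),
    (MASTER, "hadoop-hdfs-secondarynamenode", always),
    (MASTER, "hadoop-mapreduce-historyserver", always),
    (MASTER, "hadoop-yarn-resourcemanager", always),
    (MASTER, "hadoop-yarn-timelineserver", at_least((1, 3))),
    (MASTER, "hive-metastore", always),
    (MASTER, "hive-server2", always),
    (MASTER, "mariadb", before((1, 5))),
    (MASTER, "mysql", at_least((1, 5))),
    (MASTER, "nfs-kernel-server", before((1, 3))),
    (MASTER, "spark-history-server", always),
    (ALL_WORKERS, "hadoop-yarn-nodemanager", always),
    (("primary-worker",), "hadoop-hdfs-datanode", always),
]


def services_for(role: str, image_version: str) -> List[str]:
    """List the services expected on a node with this role and image version."""
    version = parse_version(image_version)
    return [name for roles, name, applies in SERVICES
            if role in roles and applies(version)]
```

For example, `services_for("master", "1.4")` includes `mariadb` but not `mysql`, while `services_for("secondary-worker", "2.0")` contains only the two all-node services and `hadoop-yarn-nodemanager`.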

HA clusters

In Dataproc High Availability (HA) clusters, different services run on different master nodes, as shown below. HA cluster worker node services are the same as those listed for standard clusters.

Node type | Service | Image versions | Description
All masters | hadoop-hdfs-journalnode | all | A quorum of journal nodes maintains an edit log of HDFS namespace modifications. If a failover occurs, the Standby NameNode reads the edit log and takes control from the Active NameNode.
All masters | hadoop-yarn-resourcemanager | all | Schedules and manages YARN applications
All masters | hive-metastore | all | Manages Hive table metadata. By default, uses the local MariaDB (image versions < 1.5) or MySQL (image versions 1.5+) database on the master node as the Hive table metadata store. The default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: (1) Dataproc Metastore, (2) a Cloud SQL instance.
All masters | hive-server2 | all | Serves queries received from clients (primarily beeline shell queries) against Hive
All masters | zookeeper-server | all | A ZooKeeper quorum is used for distributed coordination. In HA clusters, it handles leader election for the HDFS NameNodes and the YARN ResourceManagers.
Masters 0 and 1 only | hadoop-hdfs-namenode | all | Manages the HDFS filesystem
Masters 0 and 1 only | hadoop-hdfs-zkfc | all | ZKFC is the ZKFailoverController process, which runs alongside the HDFS NameNode. It monitors the health of the NameNode and manages leader election through ZooKeeper if a failover occurs.
Master 0 only | hadoop-mapreduce-historyserver | all | Serves MapReduce application history information
Master 0 only | hadoop-yarn-timelineserver | 1.3+ | Serves YARN application history information
Master 0 only | mariadb | < 1.5 | Relational database used as the default underlying database for the Hive metastore in Dataproc < 1.5 images
Master 0 only | mysql | 1.5+ | Relational database used as the default underlying database for the Hive metastore in Dataproc 1.5+ images
Master 0 only | nfs-kernel-server | < 1.3 | Runs the Network File System (NFS) server
Master 0 only | spark-history-server | all | Serves Spark application history information
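
The journal node and ZooKeeper rows both refer to a quorum. As a rough illustration of what that implies, the sketch below computes a simple majority quorum and the number of member failures it can absorb. It assumes the three-master layout of Dataproc HA clusters and the usual majority rule for JournalNode and ZooKeeper ensembles. The function names are hypothetical, and this is not how either service is implemented.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - majority(members)


# Assumption: a Dataproc HA cluster has three master nodes, each running
# hadoop-hdfs-journalnode and zookeeper-server.
HA_MASTERS = 3

quorum_size = majority(HA_MASTERS)
spare_members = tolerated_failures(HA_MASTERS)
```

Under these assumptions, two of the three masters must be reachable for the quorum to hold, so the loss of a single master can be tolerated.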