High Availability Mode

When creating a Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time.

Currently, Dataproc supports two master configurations:

  • 1 master (default, non-HA)
  • 3 masters (Hadoop HA)

Comparing default and Hadoop High Availability mode

  • Compute Engine failure: In the rare case of an unexpected Compute Engine failure, Dataproc instances experience a machine reboot. The default single-master configuration is designed to recover and continue processing new work in such cases, but in-flight jobs fail and must be retried, and HDFS is inaccessible until the single NameNode fully recovers after the reboot. In HA mode, HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failure or reboot.

  • Job driver termination: The driver (main program) of any job you run is still a potential single point of failure if the correctness of the job depends on the driver running successfully. Jobs submitted through the Dataproc Jobs API are not considered "high availability" and are terminated if the master node that runs their driver program fails. For an individual job to be resilient against single-node failures on an HA Dataproc cluster, the job must either 1) run without a synchronous driver program or 2) run the driver program inside a YARN container and be written to handle driver-program restarts. See Launching Spark on YARN for an example of how restartable driver programs can run inside YARN containers for fault tolerance.

  • Zonal failure: As is the case with all Dataproc clusters, all nodes in a High Availability cluster reside in the same zone. If there is a failure that impacts all nodes in a zone, the failure will not be mitigated.
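
For the job driver case above, one way to run the driver inside a YARN container is Spark's YARN cluster deploy mode, submitted with spark-submit rather than through the Dataproc Jobs API. The following is a minimal sketch under stated assumptions, not a definitive recipe: submit_restartable, DRY_RUN, and MAX_ATTEMPTS are hypothetical names introduced here, and the default attempt count is an illustrative value. spark.yarn.maxAppAttempts is a standard Spark-on-YARN property, and YARN's own yarn.resourcemanager.am.max-attempts setting caps it.

```shell
# Hypothetical wrapper (not part of Spark or Dataproc): submit a Spark
# application in YARN cluster mode so the driver runs in a YARN container
# and YARN can start a new application attempt if that container is lost.
# Set DRY_RUN=1 to print the command instead of running it.
submit_restartable() {
  ${DRY_RUN:+echo} spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf "spark.yarn.maxAppAttempts=${MAX_ATTEMPTS:-2}" \
    "$@"
}
```

With DRY_RUN=1 set, calling submit_restartable --class org.example.MyApp my-app.jar prints the full spark-submit command without running it; the class and JAR names are placeholders. The application still has to tolerate a driver restart on its own, for instance by making its output writes idempotent.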

Instance Names

The default master is named cluster-name-m; HA masters are named cluster-name-m-0, cluster-name-m-1, and cluster-name-m-2.
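
The naming rule can be sketched as a small shell helper, which may be convenient when scripting against master nodes. master_names is a hypothetical function written for this illustration; it is not a gcloud command, and it only mirrors the naming scheme described above.

```shell
# Hypothetical helper: print the expected master instance names, one per line.
#   $1 = cluster name
#   $2 = number of masters (1 for the default mode, 3 for HA); defaults to 1
master_names() {
  cluster="$1"
  num_masters="${2:-1}"
  if [ "$num_masters" -eq 1 ]; then
    # Single-master clusters use an unnumbered name.
    printf '%s-m\n' "$cluster"
  else
    # HA masters are numbered from 0.
    i=0
    while [ "$i" -lt "$num_masters" ]; do
      printf '%s-m-%s\n' "$cluster" "$i"
      i=$((i + 1))
    done
  fi
}
```

For a cluster named demo, master_names demo prints demo-m, and master_names demo 3 prints demo-m-0, demo-m-1, and demo-m-2 on separate lines.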

Apache ZooKeeper

In an HA Dataproc cluster, the ZooKeeper component is automatically installed on cluster master nodes. All masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.

HDFS

In a standard Dataproc cluster:

  • cluster-name-m runs:
    • NameNode
    • Secondary NameNode

In a High Availability Dataproc cluster:

  • cluster-name-m-0 and cluster-name-m-1 run:
    • NameNode
    • ZKFailoverController
  • All masters run JournalNode
  • There is no Secondary NameNode

See the HDFS High Availability documentation for additional details on components.

YARN

In a standard Dataproc cluster, cluster-name-m runs ResourceManager.

In a High Availability Dataproc cluster, all masters run ResourceManager.

See the YARN High Availability documentation for additional details on components.

Creating a High Availability cluster

gcloud command

To create an HA cluster, pass --num-masters=3 to gcloud dataproc clusters create:

gcloud dataproc clusters create cluster-name \
    --region=region \
    --num-masters=3 \
    ... other args

REST API

To create an HA cluster, use the clusters.create API, setting masterConfig.numInstances to 3.
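
As an illustration, a clusters.create request body for an HA cluster might look like the output of the sketch below. It assumes the Dataproc v1 Cluster resource layout (projectId, clusterName, config.masterConfig, config.workerConfig). ha_cluster_request_body is a hypothetical helper, and the project, cluster name, and worker count passed to it are placeholders chosen by the caller.

```shell
# Hypothetical helper: print a minimal clusters.create request body that
# requests 3 masters (HA mode).
#   $1 = project ID, $2 = cluster name, $3 = number of workers
ha_cluster_request_body() {
  project_id="$1"
  cluster_name="$2"
  num_workers="$3"
  cat <<EOF
{
  "projectId": "${project_id}",
  "clusterName": "${cluster_name}",
  "config": {
    "masterConfig": { "numInstances": 3 },
    "workerConfig": { "numInstances": ${num_workers} }
  }
}
EOF
}
```

Calling ha_cluster_request_body my-project my-cluster 2 prints a JSON document that could be sent as the body of a clusters.create request for the chosen region. A real request would normally carry further settings, which are omitted here.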

Console

To create an HA cluster, select High Availability (3 masters, N workers) in the Cluster type section of the Set up cluster panel on the Dataproc Create a cluster page.