When creating a Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time.
Currently, Dataproc supports two master configurations:
- 1 master (default, non HA)
- 3 masters (Hadoop HA)
Differences between default and Hadoop High Availability mode
In the rare case of an unexpected Compute Engine failure, Dataproc instances will experience a machine reboot. The default single-master configuration for Dataproc is designed to recover and continue processing new work in such cases, but in-flight jobs will necessarily fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
Note that the driver/main program of any jobs you run still represents a potential single point of failure if the correctness of your job depends on the driver program running successfully. Jobs submitted through the Dataproc Jobs API are not considered "high availability," and will still be terminated on failure of the master node that runs the corresponding job driver programs. For individual jobs to be resilient against single-node failures using a HA Cloud Dataproc cluster, the job must either 1) run without a synchronous driver program or 2) it must run the driver program itself inside a YARN container and be written to handle driver-program restarts. See Launching Spark on YARN for an example of how restartable driver programs can run inside YARN containers for fault tolerance.
The default master is named
cluster-name-m; HA masters are named
In an HA Dataproc cluster, the Zookeeper component is automatically installed on cluster master nodes. All masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.
In a standard Dataproc cluster:
- Secondary NameNode
In a High Availability Dataproc cluster:
- All masters run JournalNode
- There is no Secondary NameNode
Please see the HDFS High Availability documentation for additional details on components.
In a standard Dataproc cluster,
cluster-name-m runs ResourceManager.
In a High Availability Dataproc cluster, all masters run ResourceManager.
Please see the YARN High Availability documentation for additional details on components.
Creating a High Availability cluster
To create an HA cluster with gcloud dataproc clusters create, run the following command:
gcloud dataproc clusters create cluster-name \ --region=region \ --num-masters=3 \ ... other args
To create an HA cluster, select "High Availability (3 masters, N workers)" from the Cluster mode selector on the Dataproc Create a cluster page.