When creating a Google Cloud Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time.
Currently, Cloud Dataproc supports two master configurations:
- 1 master (default, non-HA)
- 3 masters (Hadoop HA)
Differences between default and Hadoop High Availability mode
In the rare case of an unexpected Google Compute Engine failure, Cloud Dataproc instances will experience a machine reboot. The default single-master configuration for Cloud Dataproc is designed to recover and continue processing new work in such cases, but in-flight jobs will necessarily fail and need to be retried, and HDFS will be inaccessible until the single NameNode fully recovers on reboot.
Note that the driver/main program of any job you run still represents a potential single point of failure if the correctness of your job depends on the driver program running successfully. Jobs submitted through the Cloud Dataproc Jobs API are not considered "high availability," and will still be terminated on failure of the master node that runs the corresponding job driver programs. For individual jobs to be resilient against single-node failures on an HA Cloud Dataproc cluster, the job must either 1) run without a synchronous driver program, or 2) run the driver program inside a YARN container and be written to handle driver-program restarts. See Launching Spark on YARN for an example of how restartable driver programs can run inside YARN containers for fault tolerance.
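As a sketch of option 2, a Spark driver can be made restartable by submitting it in YARN cluster mode, so the driver itself runs inside a YARN container that YARN can re-attempt on another node. The jar name, main class, and retry count below are placeholders, not values from this document:

```shell
# Submit the driver in YARN cluster mode so it runs inside a YARN container
# and can be restarted (up to 4 attempts) if the node running it fails.
# com.example.MyJob and my-job.jar are hypothetical.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --class com.example.MyJob \
  my-job.jar
```

The application code must still be written to tolerate a restart, for example by checkpointing progress so a re-attempted driver does not redo or duplicate completed work.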
The default master is named cluster-name-m; HA masters are named cluster-name-m-0, cluster-name-m-1, and cluster-name-m-2.
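You can confirm the naming convention by listing the master VMs with gcloud. The cluster name my-cluster below is illustrative:

```shell
# List the master instances of an HA cluster named "my-cluster";
# for a 3-master cluster, expect my-cluster-m-0, -m-1, and -m-2.
gcloud compute instances list --filter="name ~ ^my-cluster-m"
```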
In an HA Cloud Dataproc cluster, the Zookeeper component is automatically installed on cluster master nodes. All masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.
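One hedged way to spot-check the ZooKeeper quorum from a cluster node is the "ruok" four-letter-word command, assuming the default client port 2181 and that four-letter-word commands are enabled on your ZooKeeper version; the cluster name is a placeholder:

```shell
# Ask each ZooKeeper server whether it is healthy; a healthy server
# replies "imok". "my-cluster" is a hypothetical cluster name.
for m in my-cluster-m-0 my-cluster-m-1 my-cluster-m-2; do
  echo "ruok" | nc "$m" 2181
  echo
done
```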
In a standard Cloud Dataproc cluster:
- cluster-name-m runs both NameNode and Secondary NameNode
In a High Availability Cloud Dataproc cluster:
- cluster-name-m-0 and cluster-name-m-1 run NameNode
- All masters run JournalNode
- There is no Secondary NameNode
Please see the HDFS High Availability documentation for additional details on components.
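To see which NameNode is currently active, you can query HDFS HA state from a master node. The NameNode IDs used below (nn0, nn1) are assumptions; the actual IDs for your cluster are configured in hdfs-site.xml under dfs.ha.namenodes.<nameservice>:

```shell
# Report the HA state (active or standby) of each NameNode.
# nn0 and nn1 are placeholder NameNode IDs; check hdfs-site.xml for yours.
hdfs haadmin -getServiceState nn0
hdfs haadmin -getServiceState nn1
```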
In a standard Cloud Dataproc cluster, cluster-name-m runs ResourceManager.
In a High Availability Cloud Dataproc cluster, all masters run ResourceManager.
Please see the YARN High Availability documentation for additional details on components.
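Analogously, ResourceManager failover state can be checked from a master node. The RM IDs below (rm1, rm2) are assumptions; the configured IDs are in yarn-site.xml under yarn.resourcemanager.ha.rm-ids:

```shell
# Report the HA state (active or standby) of each ResourceManager.
# rm1 and rm2 are placeholder RM IDs; check yarn-site.xml for yours.
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```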
Creating a High Availability cluster
To create an HA cluster with gcloud dataproc clusters create, run the following command:
gcloud dataproc clusters create cluster-name --num-masters 3
Alternatively, to create an HA cluster from the Google Cloud console, select "High Availability (3 masters, N workers)" from the Cluster mode selector on the Cloud Dataproc Create a cluster page.
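A fuller gcloud invocation might look like the following; the cluster name, region, worker count, and machine type are illustrative values, not requirements:

```shell
# Create a 3-master HA cluster with 4 workers.
# Name, region, and machine type are hypothetical examples.
gcloud dataproc clusters create my-ha-cluster \
  --region us-central1 \
  --num-masters 3 \
  --num-workers 4 \
  --master-machine-type n1-standard-4
```

Remember that the number of masters cannot be changed after creation, so the --num-masters 3 choice must be made up front.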