Dataproc reliability guide

Last reviewed 2023-07-20 UTC

Dataproc is a fully managed, scalable service for running Apache Hadoop and Spark jobs. With Dataproc, virtual machines can be customized and scaled up and down as needed. Dataproc integrates tightly with Cloud Storage, BigQuery, Bigtable, and other Google Cloud services.

Best practices

  • Dataproc High Availability mode - compare Hadoop High Availability (HA) mode with the default non-HA mode in terms of instance names, Apache ZooKeeper, Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN), and learn how to create a high availability cluster.
  • Autoscaling clusters - when to use Dataproc autoscaling, how to create an autoscaling policy, multi-cluster policy usage, reliability best practices for autoscaling configuration, and metrics and logs.
  • Dataproc Enhanced Flexibility Mode (EFM) - examples of using Enhanced Flexibility Mode to minimize job progress delays, advanced configuration such as partitioning and parallelism, and YARN graceful decommissioning on EFM clusters.
  • Graceful decommissioning - using graceful decommissioning to minimize the impact of removing workers from a cluster, how to use this feature with secondary workers, and command examples for graceful decommissioning.
  • Restartable jobs - by using optional settings, you can set jobs to restart on failure to mitigate common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots.
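
As a quick illustration of HA mode, a cluster can be created with three master nodes instead of one. This is a minimal sketch; the cluster name and region are placeholders:

```shell
# Create an HA cluster: --num-masters=3 provisions three masters
# (m-0, m-1, m-2) with ZooKeeper, HDFS, and YARN configured for HA.
gcloud dataproc clusters create my-ha-cluster \
    --region=us-central1 \
    --num-masters=3
```

The default (non-HA) value of `--num-masters` is 1; HA mode requires exactly 3.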
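
Autoscaling is configured by importing a policy and attaching it to a cluster. The sketch below assumes a policy file named `policy.yaml`; the names, bounds, and region are illustrative:

```shell
# policy.yaml (example contents):
#   workerConfig:
#     minInstances: 2
#     maxInstances: 20
#   basicAlgorithm:
#     cooldownPeriod: 4m
#     yarnConfig:
#       scaleUpFactor: 0.5
#       scaleDownFactor: 1.0
#       gracefulDecommissionTimeout: 1h

# Import the policy, then create a cluster that uses it.
gcloud dataproc autoscaling-policies import my-policy \
    --source=policy.yaml \
    --region=us-central1

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --autoscaling-policy=my-policy
```

The same policy can be attached to multiple clusters, which is the multi-cluster policy usage the bullet above refers to.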
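
Enhanced Flexibility Mode is enabled through a cluster property at creation time. A minimal sketch, assuming the primary-worker shuffle mode for Spark (cluster name and region are placeholders):

```shell
# Create a cluster with EFM: Spark shuffle data is written to
# primary workers, so removing secondary workers does not lose
# intermediate shuffle output.
gcloud dataproc clusters create my-efm-cluster \
    --region=us-central1 \
    --properties=dataproc:efm.spark.shuffle=primary-worker
```
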
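
Graceful decommissioning is invoked when scaling a cluster down. A sketch with placeholder names, showing the timeout during which YARN waits for running work on removed workers to finish:

```shell
# Scale down to 2 workers, giving YARN up to 1 hour to drain
# tasks from the workers being removed before they are deleted.
gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --graceful-decommission-timeout=1h
```

To remove secondary (preemptible) workers instead, use `--num-secondary-workers` with the same timeout flag.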
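
Restartable jobs are configured with a per-job retry limit at submission time. A minimal sketch; the cluster, class, and jar are placeholders:

```shell
# Submit a Spark job that Dataproc will restart on failure,
# up to 5 times per hour, instead of failing it permanently.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --max-failures-per-hour=5
```

This mitigates transient failures such as out-of-memory errors and unexpected Compute Engine VM reboots, as described above.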