Dataproc Hadoop data storage

Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS). The following features and considerations can be important when selecting compute and data storage options for Dataproc clusters and jobs:

  • HDFS with Cloud Storage: Dataproc uses the Hadoop Distributed File System (HDFS) for storage. Additionally, Dataproc automatically installs the HDFS-compatible Cloud Storage connector, which lets you use Cloud Storage in parallel with HDFS. You can move data in and out of a cluster by uploading it to or downloading it from HDFS or Cloud Storage (see the first sketch after this list).
  • VM disks:
    • By default, when no local SSDs are attached, HDFS data and intermediate shuffle data are stored on the VM boot disks, which are persistent disks.
    • If you use local SSDs, HDFS data and intermediate shuffle data are stored on the SSDs.
    • Persistent disk (PD) size and type affect performance, whether data is stored in HDFS or Cloud Storage (see the second sketch after this list).
    • VM boot disks are deleted when the cluster is deleted.
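
The following PySpark sketch illustrates how the preinstalled Cloud Storage connector lets a single job address HDFS and Cloud Storage through the same API; the bucket name and paths are illustrative assumptions, not values from this document.

```python
# Minimal PySpark sketch. "example-bucket" and the input path are
# hypothetical; replace them with your own locations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-and-gcs").getOrCreate()

# Read from cluster-local HDFS on the VM disks.
df = spark.read.parquet("hdfs:///user/example/input/")

# The Cloud Storage connector resolves gs:// URIs, so the same DataFrame
# API writes results durably to Cloud Storage, outside the cluster.
df.write.mode("overwrite").parquet("gs://example-bucket/output/")

spark.stop()
```

Writing final output to a gs:// path rather than to HDFS is a common pattern because the results remain available after the cluster, and its boot disks, are deleted.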
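
To show where the disk choices above are made, the sketch below creates a cluster with the google-cloud-dataproc Python client library and sets the worker boot persistent disk type and size plus a number of local SSDs. The project ID, region, cluster name, machine types, and sizes are assumptions for illustration only.

```python
# Hedged sketch of cluster creation with explicit disk configuration.
# "example-project", "example-cluster", and the sizes are hypothetical.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "example-project",
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-4",
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {
                # Boot persistent disk: its type and size affect I/O performance.
                "boot_disk_type": "pd-ssd",
                "boot_disk_size_gb": 500,
                # With local SSDs attached, HDFS and shuffle data are stored
                # on the SSDs instead of the boot disk.
                "num_local_ssds": 2,
            },
        },
    },
}

operation = client.create_cluster(
    request={
        "project_id": "example-project",
        "region": region,
        "cluster": cluster,
    }
)
result = operation.result()
print(f"Created cluster: {result.cluster_name}")
```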