What is Dataproc Metastore?

Dataproc Metastore is a fully managed, highly available, autohealing, serverless Apache Hive metastore (HMS) that runs on Google Cloud. It serves as a critical component for managing the metadata of relational entities, and it provides interoperability between data processing applications in the open source data ecosystem.

Why use Dataproc Metastore?

Dataproc Metastore use cases

Dataproc Metastore use cases include:

  • A centralized metadata repository that can be shared among various ephemeral Dataproc clusters running different open source engines, such as Apache Hive, Apache Spark, and Presto.

  • A unified view of your open source tables across Google Cloud, providing interoperability between cloud-native services like Dataproc and various other open source-based partner offerings on Google Cloud.
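As a rough illustration of the shared-repository use case, the following sketch builds the Spark properties that point a session at an external Hive metastore over Thrift. The helper name `spark_hive_conf` and the endpoint address are hypothetical and chosen only for this example. The property names `spark.sql.catalogImplementation` and `hive.metastore.uris` are standard Spark and Hive settings. Treat this as a minimal sketch under those assumptions, not as the required configuration for any particular cluster.

```python
from urllib.parse import urlparse


def spark_hive_conf(endpoint_uri: str) -> dict:
    """Return Spark properties that point a session at an external Hive metastore."""
    parsed = urlparse(endpoint_uri)
    # Hive metastore clients speak Thrift, so reject other URI shapes early.
    if parsed.scheme != "thrift" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected thrift://host:port, got {endpoint_uri!r}")
    return {
        # Use the Hive catalog instead of Spark's in-memory catalog.
        "spark.sql.catalogImplementation": "hive",
        # The spark.hadoop. prefix passes the Hive property through to the client.
        "spark.hadoop.hive.metastore.uris": endpoint_uri,
    }


# Hypothetical endpoint; a real value comes from your own metastore service.
conf = spark_hive_conf("thrift://10.0.0.5:9083")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Because every engine reads the same endpoint, tables defined by one cluster are visible to the others without copying metadata between them.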

Dataproc Metastore features

Dataproc Metastore provides:

  • OSS compatibility — Dataproc Metastore offers a fully OSS-compatible metastore. It integrates with your existing data processing stack, such as Apache Hive, Apache Spark, and Presto, which improves interoperability between Google Cloud services and open source-centric partners.

  • Management — Dataproc Metastore offloads the burden of managing your HMS. You can create or update an HMS instance in minutes with fully configured monitoring and operations tasks.

  • Integration — In addition to simplifying the service management of HMS, Dataproc Metastore can integrate with existing Google Cloud products such as Dataproc. You can use a running Dataproc Metastore service as the source of metadata for a Dataproc cluster.

  • Simple import — The import feature lets you import existing metadata stored in an external database into Dataproc Metastore.

  • Security — You can secure Dataproc Metastore services with Google Cloud security solutions. You can also set up Cloud IAM permissions and use Kerberos authentication.

  • Reliability — Dataproc Metastore services are backed up regularly, so you don't have to worry about the durability of your HMS data.

  • High performance — Each tier provides guaranteed resource allocations for high-intensity workloads and can respond to spikes in HMS calls without requiring pre-warming or caching.

  • Scalability as your data lake grows — You can move between tiers when your data lake outgrows its current tier, or quickly create new metastores.

  • Reduced downtime and increased productivity — Google Cloud provides SLAs and support.
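To make the integration feature more concrete, the following sketch assembles a `gcloud dataproc clusters create` command that attaches a running Dataproc Metastore service to a new cluster through the `--dataproc-metastore` flag. The project, region, cluster, and service names are placeholders, and the helper functions are hypothetical. The sketch also assumes that the cluster and the service are in the same region. It only builds and prints the command and does not run it.

```python
import shlex


def metastore_service_name(project: str, location: str, service: str) -> str:
    """Build the full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{location}/services/{service}"


def cluster_create_command(cluster: str, project: str, location: str, service: str) -> list:
    """Build the gcloud arguments that create a cluster backed by the service."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--project={project}",
        # Assumption: the cluster is created in the same region as the service.
        f"--region={location}",
        f"--dataproc-metastore={metastore_service_name(project, location, service)}",
    ]


# Placeholder names; replace them with your own values.
cmd = cluster_create_command(
    "example-cluster", "example-project", "us-central1", "example-metastore"
)
print(shlex.join(cmd))
```

Building the argument list separately from running it lets you review the command, or pass it to `subprocess.run`, once the placeholders are replaced.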

What is included in Dataproc Metastore?

For information on the open source (Apache Hive) versions supported by Dataproc Metastore, see the Dataproc Metastore version policy.

Getting started with Dataproc Metastore

To quickly get started with Dataproc Metastore, see Quickstart for deploying Dataproc Metastore. You can access Dataproc Metastore in the following ways: