Dataproc Metastore overview


Dataproc Metastore is a fully managed, highly available, autohealing, serverless Apache Hive metastore (HMS) that runs on Google Cloud.

Dataproc Metastore provides you with a fully compatible Hive Metastore (HMS), which is the established standard in the open source big data ecosystem for managing technical metadata. This service helps you manage the metadata of your data lakes and provides interoperability between the various data processing tools you're using.

How Dataproc Metastore works

You use a Dataproc Metastore service by connecting it to a Dataproc cluster. A Dataproc cluster includes components that rely on an HMS to drive query planning and execution.

This integration lets you keep your table information between jobs or make metadata available to other clusters and other processing engines.
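As a rough illustration of what the connection involves, a cluster refers to a metastore service by its full resource name. The sketch below assembles such a name in Python. The project, location, and service IDs are hypothetical placeholders, and metastore_service_name is an illustrative helper, not part of any Google Cloud library.

```python
def metastore_service_name(project: str, location: str, service: str) -> str:
    """Build the full resource name of a Dataproc Metastore service."""
    parts = (("project", project), ("location", location), ("service", service))
    for label, value in parts:
        # Reject empty IDs and IDs that would break the path structure.
        if not value or "/" in value:
            raise ValueError(f"invalid {label} ID: {value!r}")
    return f"projects/{project}/locations/{location}/services/{service}"


# Hypothetical IDs, for illustration only.
name = metastore_service_name("example-project", "us-central1", "example-metastore")
print(name)
```

A name in this form is what you would typically supply when attaching the service to a cluster, for example through the --dataproc-metastore flag of gcloud dataproc clusters create. Check the current gcloud reference for the exact syntax.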

For example, a metastore lets you record that a subset of your files contains revenue data, instead of tracking the filenames manually. In this case, you can define a table for those files and store the metadata in Dataproc Metastore. Afterward, you can connect the service to a Dataproc cluster and query the table using Hive, Spark SQL, or other query services.
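To make the revenue example more concrete, the following sketch generates the kind of HiveQL statements involved: a table definition that points at the files, and a query against that table. It is only an illustration under stated assumptions. The bucket path, table name, column names, and file format are all hypothetical, and external_table_ddl is an illustrative helper rather than a Dataproc Metastore API.

```python
from typing import Dict


def external_table_ddl(table: str, columns: Dict[str, str], location: str,
                       file_format: str = "PARQUET") -> str:
    """Return a HiveQL statement that defines an external table over existing files."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and location, for illustration only.
ddl = external_table_ddl(
    table="revenue",
    columns={"region": "STRING", "amount": "DECIMAL(18,2)", "booked_on": "DATE"},
    location="gs://example-bucket/revenue/",
)

# Once the table definition is stored in the metastore, any connected engine
# can refer to the data by table name instead of by file path.
query = "SELECT region, SUM(amount) AS total FROM revenue GROUP BY region"

print(ddl)
print(query)
```

You would run statements like these through Hive or Spark SQL on a cluster that is connected to the metastore. The files stay where they are, and only the table definition is stored as metadata.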

Common use cases

  • Assign meaning to your data. Create a centralized metadata repository that is shared among ephemeral Dataproc clusters. Use different open source software (OSS) engines, such as Apache Hive, Apache Spark, and Presto.

  • Build a unified view of your data. Provide interoperability between Google Cloud services, such as Dataproc, Dataplex, and BigQuery, or use other open source-based partner offerings on Google Cloud.

Features and benefits

  • OSS compatibility. Connect to your existing data processing stacks, such as Apache Hive, Apache Spark, and Presto.

  • Management. Create or update a metastore within minutes, complete with fully configured monitoring and operation tasks.

  • Integration. Integrate with existing Google Cloud products, such as using a Dataproc Metastore service as the source of metadata for a Dataproc cluster.

  • Built-in security. Use established Google Cloud security protocols, such as Identity and Access Management (IAM) and Kerberos authentication.

  • Simple import. Import existing metadata stored in an external database into a metastore.

  • Automatic backups. Configure automatic metastore backups to help avoid data loss.

  • Performance monitoring. Set performance tiers to dynamically respond to highly intensive workloads and spikes, without pre-warming or caching.

  • Scalability. Move between performance tiers when you need more resources or create new metastores to handle the workload.

  • Support. Benefit from standard Google Cloud SLAs and support channels.

Integrations

  • Dataproc: Connect to a Dataproc cluster, so you can serve metadata for OSS big data workloads.
  • Dataplex: Query structured and semi-structured data discovered in a Dataplex lake.
  • Data Catalog: Sync Dataproc Metastore with Data Catalog to enable search and discovery of metadata.
  • Logging and Monitoring: Integrate Dataproc Metastore with Cloud Monitoring and Logging products.
  • Authentication and IAM: Rely on standard OAuth authentication used by other Google Cloud products, which supports using granular IAM roles to enable access control for individual resources.

Supported Apache Hive versions

Dataproc Metastore supports a limited set of Apache Hive versions. For more information, see the Dataproc Metastore version policy.

Core concepts

For more information about getting started with Dataproc Metastore, see Core concepts.

Next steps