Dataproc Metastore core concepts

Dataproc Metastore fully manages your metadata by mapping files in the Apache Hadoop Distributed File System (HDFS), or in another Hive-compatible storage system, to Apache Hive tables.

The following concepts and considerations are important when you attach a Dataproc cluster, or another self-managed cluster, to a Dataproc Metastore service that serves as its Hive metastore.

Common terms

Services

  • Apache Hive. Hive is a popular open source data warehouse system built on Apache Hadoop. Hive offers a SQL-like query language called HiveQL, which is used to analyze large, structured datasets; see the example after this list.
  • Apache Hive metastore. The Hive metastore holds metadata about Hive tables, such as their schema and location.
  • Dataproc. Dataproc is a fast, easy-to-use, fully managed service on Google Cloud for running Apache Spark and Apache Hadoop workloads in a simple, cost-efficient way. After you create a Dataproc Metastore, you can connect to it from a Dataproc cluster.
  • Dataproc Metastore service. The name of the metastore instance you create in Google Cloud. You can have one or many different metastore services in your implementation.
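
The following is a minimal sketch of a HiveQL query submitted to a Dataproc cluster, illustrating the Apache Hive entry above. The cluster name, region, table, and columns (my-cluster, orders, country, amount) are hypothetical placeholders, not values from this documentation.

  # Submit a simple HiveQL aggregation to an existing Dataproc cluster.
  # All names here are hypothetical.
  gcloud dataproc jobs submit hive \
    --cluster=my-cluster \
    --region=us-central1 \
    --execute="SELECT country, SUM(amount) AS total_amount
               FROM orders
               GROUP BY country;"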

Concepts

  • Tables. Hive applications store your data in tables, which are either managed (internal) or unmanaged (external).
  • Dataproc cluster. After you create a Dataproc Metastore service, you can connect to it from a cluster. Dataproc Metastore can be used with various clusters, including Dataproc clusters and self-managed Apache Hive, Apache Spark, or Presto clusters. For an example, see the sketch after this list.
  • Artifacts bucket. A Cloud Storage bucket that is created in your project automatically with every metastore service that you create. This bucket can be used to store your service artifacts, such as exported metadata and managed table data. By default, the artifacts bucket stores the default warehouse directory of your Dataproc Metastore service.
  • Endpoints. A Dataproc Metastore service provides clients access to the stored Hive Metastore metadata through one or more network endpoints. Dataproc Metastore provides you with URIs for these endpoints.
  • Endpoint protocol. The over-the-wire network protocol used for communication between Dataproc Metastore and Hive Metastore clients. Dataproc Metastore supports Apache Thrift and gRPC.
  • Metadata Federation. A feature that lets you access metadata that is stored in multiple Dataproc Metastore instances.
  • Auxiliary versions. A feature that lets you connect multiple Hive client versions to the same Dataproc Metastore service.
  • Hive warehouse directory. The default location where managed table data is stored.
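
A minimal sketch of the Dataproc cluster connection mentioned above; the project, region, cluster, and service names (my-project, us-central1, my-cluster, my-metastore) are placeholders:

  # Create a Dataproc cluster that uses an existing Dataproc Metastore
  # service as its Hive metastore.
  gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --dataproc-metastore=projects/my-project/locations/us-central1/services/my-metastore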

Data storage for internal tables

Hive metastore concepts

Hive applications can use managed (internal) or unmanaged (external) tables. For internal tables, Hive metastore manages both the metadata and the data for those tables. For external tables, Hive metastore manages only the metadata; the table data itself is managed outside of Hive.

For example, when you delete a table definition using the DROP TABLE Hive SQL statement:

DROP TABLE foo;
  • For internal tables — Hive metastore removes the table's metadata entries, and it also deletes the files that hold the table's data.

  • For external tables — Hive metastore deletes only the metadata and keeps the data associated with the table.
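
The following sketch makes the distinction concrete. The table names, cluster name, region, and bucket path are hypothetical placeholders; the statements are submitted as Dataproc Hive jobs here only as one convenient way to run HiveQL:

  # Managed (internal) table: data is stored under the Hive warehouse directory.
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
    --execute="CREATE TABLE sales_managed (id INT, amount DOUBLE);"

  # External table: data stays at the location you specify.
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
    --execute="CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
               LOCATION 'gs://my-bucket/sales/';"

  # Dropping each table: the managed table's data files are deleted,
  # while the external table's files at gs://my-bucket/sales/ are kept.
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
    --execute="DROP TABLE sales_managed; DROP TABLE sales_external;"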

Artifacts Cloud Storage buckets

You use the artifacts bucket to store your service artifacts, such as exported metadata and managed table data.

Creating your bucket

When you create a Dataproc Metastore service, a Cloud Storage bucket is automatically created for you in your project. By default, the artifacts bucket stores the default warehouse directory of your Dataproc Metastore service.
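
To see which bucket was created for a given service, you can describe the service. The following is a sketch that assumes a service named my-service in us-central1 and that the bucket URI is reported in the artifactGcsUri field:

  # Print the Cloud Storage URI of the service's artifacts bucket.
  gcloud metastore services describe my-service \
    --location=us-central1 \
    --format="value(artifactGcsUri)"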

Deleting your bucket

Deleting your Dataproc Metastore service doesn't delete your Cloud Storage artifacts bucket. Your bucket isn't automatically deleted because it might contain useful post-service data. To delete your bucket, you must run a manual deletion operation.
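
For example, a manual deletion could look like the following; the bucket name is a placeholder, and the command permanently removes the bucket and everything in it:

  # Permanently delete the artifacts bucket and all of its objects.
  gcloud storage rm --recursive gs://my-artifacts-bucket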

Hive warehouse directory

For Hive metastore to function correctly, whether you use internal or external tables, you might need to provide Dataproc Metastore with a Hive warehouse directory when you create the service. This directory contains subdirectories that correspond to your internal tables and hold the actual table data. The warehouse directory is typically a location in a Cloud Storage bucket.

Whether or not you provide a warehouse directory, Dataproc Metastore creates a Cloud Storage artifacts bucket that stores service artifacts for you.

If you choose to provide a Hive warehouse directory:

  • Ensure that your Dataproc Metastore service has permission to access the warehouse directory. To do this, grant the Dataproc Metastore control service agent object read/write access (roles/storage.objectAdmin). This grant must be set at the bucket level or higher. The service account that you must grant access to is service-customer-project-number@gcp-sa-metastore.iam.gserviceaccount.com, where customer-project-number is your project number.

    See Cloud Storage access control to learn how to grant read/write permissions to Dataproc Metastore service agents on your warehouse directory.

  • Configure Dataproc Metastore to use the warehouse directory by providing the path to the bucket and object prefix in the hive.metastore.warehouse.dir config override, for example, gs://my-bucket/path/to/location. For an example of both steps, see the sketch after this list.
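
A minimal sketch of both steps. The bucket name, project number, region, and service name (my-bucket, 123456789, us-central1, my-metastore) are placeholders, and the --hive-metastore-configs flag is used here on the assumption that you set the override at service creation time:

  # Grant the Dataproc Metastore service agent read/write access on the
  # warehouse bucket (a bucket-level grant).
  gcloud storage buckets add-iam-policy-binding gs://my-bucket \
    --member="serviceAccount:service-123456789@gcp-sa-metastore.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

  # Create the service with the warehouse directory override.
  gcloud metastore services create my-metastore \
    --location=us-central1 \
    --hive-metastore-configs="hive.metastore.warehouse.dir=gs://my-bucket/path/to/location"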

If you choose not to provide a Hive warehouse directory:

  • Dataproc Metastore automatically creates a warehouse directory for you at a default location, gs://your-artifacts-bucket/hive-warehouse.

  • Ensure that your Dataproc Metastore VM service account has permission to access the warehouse directory. To do this, grant the Dataproc Metastore VM service agent object read/write access (roles/storage.objectAdmin). This grant must be set at the bucket level or higher.

Network requirements

The Dataproc Metastore service requires networking access to work correctly. For more information, see Configure network requirements.

Project configurations

The following diagram provides an overview of the possible project configurations when deploying a Dataproc cluster and a Dataproc Metastore that uses the Apache Thrift endpoint protocol.

Figure: Overview of the possible project configurations when deploying a Dataproc Metastore and Dataproc cluster.

What's next