Dataproc Metastore uses mappings between Apache Hadoop Distributed File System (HDFS) files, or files in other Hive-compatible storage systems, and Apache Hive tables to fully manage your metadata. The following concepts and considerations are important when you attach a Dataproc cluster or other self-managed cluster that uses a Dataproc Metastore service as its Hive metastore.
Hive metastore concepts
Hive applications can work with two kinds of tables: managed internal tables and unmanaged external tables. For internal tables, Hive metastore manages not only the metadata, but also the data for those tables. For external tables, Hive metastore manages only the metadata; the table data itself is left untouched.
For example, when you delete a table definition using the DROP TABLE Hive SQL statement:

drop table foo
For internal tables: Hive metastore removes the metadata entries and also deletes the data files associated with the table.
For external tables: Hive metastore deletes only the metadata and keeps the data associated with the table.
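The following sketch illustrates this difference using PySpark with Hive support enabled, as you might run it on a cluster attached to a Dataproc Metastore service. The table names and the gs:// location are placeholders, not values from this page.

```python
# A minimal sketch, assuming a Spark environment with Hive support
# (for example, a Dataproc cluster attached to a Dataproc Metastore service).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("managed-vs-external-tables")
    .enableHiveSupport()
    .getOrCreate()
)

# Internal (managed) table: Hive metastore owns both the metadata and the
# data, which is written under the Hive warehouse directory.
spark.sql("CREATE TABLE IF NOT EXISTS foo_internal (id INT, name STRING)")

# External table: Hive metastore records only the metadata; the data stays
# at the location you specify (placeholder path below).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS foo_external (id INT, name STRING)
    LOCATION 'gs://example-bucket/tables/foo_external'
""")

# Dropping the internal table deletes its data files as well...
spark.sql("DROP TABLE foo_internal")
# ...while dropping the external table removes only the metadata and leaves
# the files under gs://example-bucket/tables/foo_external in place.
spark.sql("DROP TABLE foo_external")
```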
Creation of a Cloud Storage bucket
When you create a Dataproc Metastore service, a Cloud Storage bucket is automatically created for you in your project. This bucket serves as permanent storage for service artifacts and metadata, such as dump files and debug logs.
Hive warehouse directory
For Hive metastore to function correctly, whether with internal or external tables, you might need to provide Dataproc Metastore with a Hive warehouse directory when you create the service. This directory contains the subdirectories that correspond to internal tables and hold the actual table data, and it is typically a location in a Cloud Storage bucket.
Whether or not you provide a warehouse directory, Dataproc Metastore creates a Cloud Storage artifacts bucket that stores service artifacts for you to consume.
If you choose to provide a Hive warehouse directory:
Ensure that your Dataproc Metastore service has permission to access the warehouse directory. You do this by granting the Dataproc Metastore service agent object read/write access, that is, roles/storage.objectAdmin. This grant must be set at the bucket level or higher. The service account that you need to grant access to is service-customer-project-number@gcp-sa-metastore.iam.gserviceaccount.com. See Cloud Storage access control to learn how to grant read/write permissions to Dataproc Metastore service agents on your warehouse directory. A sketch of this kind of grant follows the next case.
If you choose not to provide a Hive warehouse directory:
Dataproc Metastore automatically creates a warehouse directory for you at a default location, gs://your-artifacts-bucket/hive-warehouse. Ensure that your Dataproc VM service account has permission to access the warehouse directory. You do this by granting the Dataproc VM service account object read/write access, that is, roles/storage.objectAdmin. This grant must be set at the bucket level or higher, as shown in the sketch below.
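In both cases, the permission is an IAM policy binding on the warehouse bucket. The following Python sketch shows one way to add such a binding with the google-cloud-storage client library; the bucket name and the member to grant (the Dataproc Metastore service agent or the Dataproc VM service account, depending on your setup) are placeholder values.

```python
# A minimal sketch, assuming the google-cloud-storage client library and
# credentials that are allowed to change the bucket's IAM policy.
from google.cloud import storage

# Placeholders: replace with your warehouse bucket and the account to grant,
# for example the Dataproc Metastore service agent or the Dataproc VM
# service account.
BUCKET_NAME = "your-warehouse-bucket"
MEMBER = "serviceAccount:service-customer-project-number@gcp-sa-metastore.iam.gserviceaccount.com"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Read the current bucket-level policy, append an objectAdmin binding,
# and write the updated policy back.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectAdmin", "members": {MEMBER}}
)
bucket.set_iam_policy(policy)
```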
Deletion of your Cloud Storage bucket
Deleting your Dataproc Metastore service doesn't delete your Cloud Storage artifacts bucket, because the bucket might contain data that is still useful after the service is gone. If you no longer need the bucket, you must delete it explicitly.
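As an illustration only, the following sketch deletes an artifacts bucket with the google-cloud-storage client library, assuming you have permission to delete it; the bucket name is a placeholder.

```python
# A minimal sketch, assuming the google-cloud-storage client library and
# credentials that are allowed to delete the bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-artifacts-bucket")

# force=True also deletes the objects in the bucket, but it only works for
# buckets that contain a small number of objects; for larger buckets, delete
# the objects first (for example with a lifecycle rule or a batch delete).
bucket.delete(force=True)
```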