Data storage for internal tables

Dataproc Metastore fully manages your metadata by using mappings between Apache Hive tables and the files that back them in the Apache Hadoop Distributed File System (HDFS) or another Hive-compatible storage system. The following concepts and considerations are important when you attach a Dataproc cluster or other self-managed cluster to a Dataproc Metastore service and use that service as the cluster's Hive metastore.

Hive metastore concepts

All Hive applications can have managed (internal) or unmanaged (external) tables. For internal tables, Hive metastore manages not only the metadata but also the data for those tables. For external tables, Hive metastore manages only the metadata; the table data itself is not managed by Hive metastore.
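
To illustrate the difference, the following commands create one table of each type from a cluster node where beeline can reach HiveServer2. The table names, schema, JDBC URL, and gs:// path are placeholders for this example, not values that Dataproc Metastore requires.

# Internal (managed) table: Hive stores its data under the Hive warehouse directory.
beeline -u "jdbc:hive2://localhost:10000/default" -e "CREATE TABLE sales_internal (id INT, amount DOUBLE)"

# External table: the data stays at the Cloud Storage location that you specify.
beeline -u "jdbc:hive2://localhost:10000/default" -e "CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE) LOCATION 'gs://example-data-bucket/sales/'"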

For example, when you delete a table definition using the DROP TABLE Hive SQL statement:

drop table foo
  • For internal tables: Hive metastore removes the table's metadata entries and also deletes the files associated with the table.

  • For external tables: Hive metastore deletes only the metadata and keeps the data associated with the table, as the example after this list illustrates.
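
Continuing the example above with the same placeholder names, dropping both tables shows the difference: the internal table's files are removed from the warehouse directory, while the external table's files remain in Cloud Storage.

# Drop both example tables.
beeline -u "jdbc:hive2://localhost:10000/default" -e "DROP TABLE sales_internal"
beeline -u "jdbc:hive2://localhost:10000/default" -e "DROP TABLE sales_external"

# The external table's files are untouched and can still be listed:
gsutil ls gs://example-data-bucket/sales/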

Creating a Cloud Storage bucket

When you create a Dataproc Metastore service, a Cloud Storage bucket is automatically created for you in your project. This bucket serves as permanent storage for service artifacts and metadata, such as dump files and debug logs.
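
One way to find this bucket is to describe the service with gcloud. The service name and region below are placeholders, and the artifactGcsUri field name is based on the current API surface, so verify it with gcloud metastore services describe --help if your results differ.

# Print the Cloud Storage URI of the service's artifacts bucket.
gcloud metastore services describe SERVICE_NAME \
  --location=REGION \
  --format="value(artifactGcsUri)"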

Hive warehouse directory

For Hive metastore to function correctly, whether for internal or external tables, you might need to provide Dataproc Metastore with the Hive warehouse directory when you create the service. This directory contains subdirectories that correspond to internal tables and hold the actual table data. The warehouse directory is usually a location in a Cloud Storage bucket.
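
As an illustrative example, you can point the warehouse directory at your own bucket when you create the service by overriding the hive.metastore.warehouse.dir Hive property. The service name, region, and bucket below are placeholders, and the flag spelling may vary across gcloud versions, so confirm it with gcloud metastore services create --help.

# Create a service whose Hive warehouse directory points at your own bucket.
gcloud metastore services create SERVICE_NAME \
  --location=REGION \
  --hive-metastore-configs="hive.metastore.warehouse.dir=gs://example-warehouse-bucket/hive-warehouse"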

Whether or not you provide a warehouse directory, Dataproc Metastore creates a Cloud Storage artifacts bucket to store service artifacts for you.

If you choose to provide a Hive warehouse directory:

  • Ensure that your Dataproc Metastore service has permission to access the warehouse directory. You do this by granting the Dataproc Metastore service agent object read/write access, that is, roles/storage.objectAdmin, as shown in the example after this list. This grant must be set at the bucket level or higher. The service agent that you need to grant access to is service-customer-project-number@gcp-sa-metastore.iam.gserviceaccount.com, where customer-project-number is your project number.

    See Cloud Storage access control to learn how to grant read/write permissions to Dataproc Metastore service agents on your warehouse directory.
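
For example, a bucket-level grant with gsutil might look like the following, where example-warehouse-bucket stands in for the bucket that contains your warehouse directory and customer-project-number is your project number:

# Grant the Dataproc Metastore service agent object read/write access on the warehouse bucket.
gsutil iam ch \
  serviceAccount:service-customer-project-number@gcp-sa-metastore.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://example-warehouse-bucket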

If you choose not to provide a Hive warehouse directory:

  • Dataproc Metastore automatically creates a warehouse directory for you at a default location, gs://your-artifacts-bucket/hive-warehouse.

  • Ensure that your Dataproc VM service account has permission to access the warehouse directory. You do this by granting the Dataproc VM service account object read/write access, that is, roles/storage.objectAdmin, as shown in the example after this list. This grant must be set at the bucket level or higher.
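
For example, a gsutil grant on the artifacts bucket might look like the following. The service account address is a placeholder; by default, Dataproc cluster VMs run as the Compute Engine default service account unless you configured a custom one.

# Grant the cluster's VM service account object read/write access on the artifacts bucket.
gsutil iam ch \
  serviceAccount:your-dataproc-vm-service-account@your-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://your-artifacts-bucket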

Deleting your Cloud Storage bucket

Deleting your Dataproc Metastore service does not delete your Cloud Storage artifacts bucket, because the bucket might contain data that is still useful after the service is gone. If you want to delete the bucket, you must delete it explicitly.
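
If you do decide to remove the bucket, you can delete it and everything in it with a command like the following, where your-artifacts-bucket is the placeholder name used earlier. Note that if you used the default warehouse location, this also deletes any internal table data stored under hive-warehouse.

# Permanently delete the artifacts bucket and all objects in it.
gsutil rm -r gs://your-artifacts-bucket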

What's next