Data storage for internal tables

Dataproc Metastore manages your metadata by maintaining mappings between Apache Hive tables and files in the Apache Hadoop Distributed File System (HDFS) or a Hive-compatible storage system. The following concepts and considerations are important when you attach a Dataproc cluster or other self-managed cluster that uses a Dataproc Metastore service as its Hive metastore.

Hive metastore concepts

A Hive table is either managed (internal) or external. For internal tables, Hive metastore manages not only the metadata but also the data for those tables. For external tables, Hive metastore manages only the metadata, not the data.

For example, when you delete a table definition using the DROP TABLE Hive SQL statement:

DROP TABLE foo;

  • For internal tables — Hive metastore deletes the metadata entries for the table and also deletes the files that hold the table's data.

  • For external tables — Hive metastore deletes only the metadata and keeps the data associated with the table.
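The difference is visible in a short Hive SQL session. The table names and the gs:// path below are illustrative:

```sql
-- Internal (managed) table: Hive metastore owns both metadata and data.
CREATE TABLE managed_sales (id INT, amount DOUBLE);

-- External table: Hive metastore owns only the metadata; the files at
-- the LOCATION are not managed by Hive.
CREATE EXTERNAL TABLE external_sales (id INT, amount DOUBLE)
LOCATION 'gs://example-bucket/sales/';

DROP TABLE managed_sales;   -- removes metadata AND the underlying data files
DROP TABLE external_sales;  -- removes metadata only; the data files remain
```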

Creating a Cloud Storage bucket

When you create a Dataproc Metastore service, a Cloud Storage bucket is automatically created for you in your project. This bucket serves as permanent storage for service artifacts and metadata, such as dump files and debug logs.

Hive warehouse directory

For Hive metastore to function correctly, whether for internal or external tables, you might need to provide Dataproc Metastore with a Hive warehouse directory at service creation time. This directory contains subdirectories that correspond to internal tables and hold those tables' actual data. Warehouse directories are typically Cloud Storage buckets.

Whether you provide Dataproc Metastore with a warehouse directory or not, Dataproc Metastore creates a Cloud Storage artifacts bucket to store service artifacts for you to consume.
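As a sketch, a warehouse directory can be supplied at creation time through a Hive metastore configuration override. The service name, region, and bucket below are placeholders; check the gcloud metastore reference for the exact flags available in your gcloud version:

```shell
# Create a Dataproc Metastore service with an explicit Hive warehouse
# directory (placeholder names; verify flag spelling against gcloud docs).
gcloud metastore services create example-service \
  --location=us-central1 \
  --hive-metastore-configs="hive.metastore.warehouse.dir=gs://example-bucket/hive-warehouse"
```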

If you choose to provide a Hive warehouse directory:

  • Ensure that your Dataproc Metastore service has read/write access to the warehouse directory. Do this by granting the Dataproc Metastore control plane service agent the roles/storage.objectAdmin role (read/write access), set at the bucket level or higher.

    See Cloud Storage access control to learn how to grant read/write permissions to Dataproc Metastore service agents on your warehouse directory.
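For example, a bucket-level grant might look like the following, where both the service agent address and the bucket name are placeholders for your own values:

```shell
# Grant the Dataproc Metastore service agent read/write access to the
# warehouse bucket. SERVICE_AGENT_EMAIL and the bucket name are placeholders.
gsutil iam ch \
  serviceAccount:SERVICE_AGENT_EMAIL:roles/storage.objectAdmin \
  gs://example-warehouse-bucket
```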

If you choose not to provide a Hive warehouse directory:

  • Dataproc Metastore automatically creates a warehouse directory for you at the default location, gs://your-artifacts-bucket/hive-warehouse.

Deleting your Cloud Storage bucket

Deleting your Dataproc Metastore service does not delete your Cloud Storage artifacts bucket, because the bucket may contain useful post-service data. If you no longer need the bucket, you must delete it explicitly.

What's next?