
Dataproc Metastore: Fully managed Hive metastore now in public preview

December 16, 2020
Feng Lu

Tech Lead and Manager

James Malone

Sr. Product Manager

The Apache Hive metastore service has become a building block for data lakes that use the diverse world of open-source software, such as Apache Spark and Presto. Today we’re launching the Dataproc Metastore into public preview, making these powerful tools easier for any Google Cloud customer to use, with fewer distractions and delays. The Dataproc Metastore is a fully managed, highly available, auto-healing, open-source Apache Hive metastore service that simplifies technical metadata management for customers building data lakes on Google Cloud. And for a limited time only, it’s free! This launch exemplifies our commitment to fast-paced innovation and delivery, combining cloud technology with open source, and closely follows our announcement of the private preview in June of this year.

Before we go into more detail, we would also like to thank our private preview users for testing and providing rich feedback. Your valuable input has made today’s launch better.

What does this mean for my data lake?

If you are familiar with the Hive Metastore, you likely already know it is a critical component of many data lakes because it acts as a central repository of metadata. In fact, a whole ecosystem of tools, open-source and otherwise, is built around the Hive Metastore, some of which this diagram illustrates.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Dataproc_Metastore.max-1600x1600.jpg

The Dataproc Metastore is a serverless Hive Metastore that unlocks several key data lake use cases in Google Cloud, including:

  • Many ephemeral Dataproc clusters can share one Dataproc Metastore, so many users of open-source tools, such as Spark, Hive, and Presto, can access consistent metadata at the same time.

  • Unifying metadata between open-source tables and Data Fusion, so ETL and ELT on those tables are easier and code-free.

  • Tying together metadata into a central store so cloud-native services like Dataproc can seamlessly interoperate with other open-source tools or partner technologies.

With the Dataproc Metastore, your data lake is easier to manage, more unified, and increasingly serverless, with fewer distractions.
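To make the shared-metastore use case more concrete, here is a minimal sketch in Python that only assembles (and does not run) the gcloud command lines for attaching several ephemeral clusters to one metastore service. It assumes the `--dataproc-metastore` flag of `gcloud dataproc clusters create` and the `projects/.../locations/.../services/...` resource name format; during the preview the flag may only be available in the beta command group, so check the current gcloud reference before relying on it. The project, region, service, and cluster names are hypothetical.

```python
from typing import List


def metastore_service_name(project: str, region: str, service: str) -> str:
    """Build the full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{region}/services/{service}"


def cluster_create_command(cluster: str, project: str, region: str,
                           service: str) -> List[str]:
    """Return a gcloud argv that creates a cluster attached to the metastore.

    Assumption: the --dataproc-metastore flag accepts the full service
    resource name. Nothing is executed here; the list is only assembled.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--project={project}",
        f"--region={region}",
        f"--dataproc-metastore={metastore_service_name(project, region, service)}",
    ]


# Hypothetical names, for illustration only: two short-lived clusters
# that point at the same metastore service.
commands = [
    cluster_create_command(name, "my-project", "us-central1", "shared-metastore")
    for name in ("etl-nightly", "adhoc-spark")
]

for argv in commands:
    print(" ".join(argv))
```

Because both clusters reference the same service name, tables defined from one cluster should be visible from the other, which is the point of the first use case in the list above.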

New features in Dataproc Metastore

Throughout the private preview period, and since our initial announcement in June, we have added many new features to the Dataproc Metastore. Several of these new features are launching with this release today.

  • IAM and Kerberos: Fine-grained Cloud Identity and Access Management (Cloud IAM) support, along with out-of-the-box support for Kerberos and other security tools such as Apache Ranger.

  • Import/export: Metadata can be imported and exported, enabling bidirectional integration with, and migration from, other Hive Metastores, such as those on-premises.

  • VPC-SC: Support for Google Cloud VPC Service Controls to mitigate data exfiltration risks.

  • ACID transactions: Dataproc Metastore supports ACID transactions through Hive's ACID transaction capabilities.

  • Cloud Monitoring integration: Dataproc Metastore instances are logged and monitored directly in Cloud Monitoring and Cloud Logging.

  • Broad Dataproc compatibility: Compatible with a broad range of Dataproc releases, including the Dataproc 2.0 preview release with Spark, Hadoop, and Hive 3.x.

  • Service updates: You can transactionally update elements of the Hive Metastore service, including configurations, tiers, ports, maintenance windows, and more.

  • Cloud Console and Cloud SDK: Dataproc Metastore supports both the Cloud Console and the Cloud SDK command line (gcloud beta metastore).
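As a rough illustration of the Cloud SDK entry point mentioned in the list above, the following Python sketch builds (but does not execute) a `gcloud beta metastore services create` command line. The `services create` subcommand and the `--location` and `--tier` flags are assumptions based on the gcloud CLI as we understand it, and the service and region names are hypothetical; consult the gcloud reference for the exact syntax in your SDK version.

```python
from typing import List

# The two service tiers named in this post.
VALID_TIERS = ("developer", "enterprise")


def service_create_command(service: str, region: str,
                           tier: str = "developer") -> List[str]:
    """Return a gcloud argv that would create a Dataproc Metastore service.

    Assumption: `gcloud beta metastore services create` accepts
    --location and --tier. Nothing is executed here.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {VALID_TIERS}, got {tier!r}")
    return [
        "gcloud", "beta", "metastore", "services", "create", service,
        f"--location={region}",
        f"--tier={tier}",
    ]


# Hypothetical service and region, for illustration only.
print(" ".join(service_create_command("dev-metastore", "us-central1")))
```

Validating the tier locally is a design convenience for the sketch only; the service itself remains the authority on which tiers are accepted.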

We will continue to move quickly to get the Dataproc Metastore into general availability while also adding highly requested features such as customer-managed encryption keys.

Dataproc Metastore public preview pricing

During the public preview period, which starts today and lasts until GA, the Dataproc Metastore will be offered at a 100% discount. This discount lets you use and test the technology without incurring costs.

The Dataproc Metastore is offered in two service tiers, developer and enterprise, each of which offers different features, service levels, and pricing because they are intended for different use cases.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Dataproc_Metastore_1.max-700x700.jpg

This pricing allows you to create developer instances for quick testing and prototyping without needing to test against your production environment or create multiple copies of your production database. The enterprise tier is intended for production deployments that require high availability, performance, and stability. Future releases will also incorporate features targeted at specific tiers, such as Data Catalog integration.

You can find more information in the pricing documentation for Dataproc Metastore.

Serverless open source

The Dataproc Metastore is a good example of how the best of Google Cloud infrastructure can be used to run managed open source. As a result of innovations in how we run, secure, and scale the Hive Metastore, we have been able to make the Dataproc Metastore serverless. This launch is the beginning of how we’re reshaping managed open source for data analytics in the cloud. As a team passionate about both cloud and open source, our goal is to bring the very things that make the Hive Metastore uniquely great, including no infrastructure to manage, automated scalability, enhanced hands-off high availability, and easier pricing, to other popular open-source components in the future.

Get started

Any Google Cloud customer can use the Dataproc Metastore, for free during preview, starting today. You can follow the quickstart guide or review the full documentation for more information on how to get started.
