Data Analytics

Data Lake management just got easier with Dataproc Metastore GA

March 31, 2021

Chris Crosbie

Product Manager, Data Analytics

Wilson Lian

Software Engineer

https://storage.googleapis.com/gweb-cloudblog-publish/images/Dataproc_Metastore.max-1100x1100.png

Today, we are excited to announce the general availability of Dataproc Metastore. A fully managed, serverless technical metadata repository based on the Apache Hive metastore.

Enterprises building and migrating open source data lakes to Google Cloud now have a central and persistent metastore for their open source data analytics frameworks. A completely serverless, no hassle setup allows enterprises to migrate their open source metadata without having to worry about the overhead of setting up highly available architectures, backups, and performing maintenance tasks.

Dataproc Metastore provides a unified view of your open source tables across Google Cloud, and provides interoperability between data lake processing frameworks like Apache Hadoop, Apache Spark, Apache Hive, Trino, Presto, and many others. If it works with Hive Metastore, it will most likely work with Dataproc Metastore. It’s fully integrated with other Google Cloud services like Dataproc, Data Catalog, Data Fusion, and partner services like Databricks and Collibra.

Using Dataproc Metastore with your Google Cloud Data Lake

You can easily attach the Dataproc Metastore to one or more Dataproc clusters or even self-managed clusters to share Hive tables across various open source processing engines.

Dataproc Metastore provides a native integration with the Google Cloud Data Catalog, allowing you to search your open source tables right alongside your BigQuery and Cloud Pub/Sub topics and tag them with business metadata.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Data_Catalog.max-700x700.png

Dataproc Metastore also connects directly with Cloud Data Fusion so you can use the Dataproc Metastore as either a source or a sink in a visual point-and-click code-free ETL/ELT pipeline.

In addition to Google Cloud native services, the Dataproc Metastore is also committed to Google’s mission of being the most open cloud and offers compatibility with various other open partner offerings including Collibra, Qubole, and now Databricks as an external metastore in remote mode.

Databricks integrates with Dataproc Metastore so our joint customers can continue to consolidate their data, analytics, and metadata on one open platform. This integration makes it easier for on-prem data lakes to migrate to Google Cloud, unify their open source metastore, and deploy analytics and machine learning using Databricks in order to unlock new business value.

Ajay Singh, AVP, Field and Partner Engineering at Databricks

Tweet this quote

This generally available release comes coupled with additional capabilities above and beyond our initial preview features to power additional metadata management use cases, including;

The first use case provides customers support for open source Delta Lake tables using metastore-defined Delta Lake tables as a way to write from multiple clusters while maintaining ACID transactions. You simply need to generate a manifest file from the Delta Lake format and store that in the Dataproc Metastore as a Hive Metastore table location.
The second use case makes it easy to set up multi-region disaster recovery for the Dataproc Metastore. The Dataproc Metastore Enterprise tier provides high availability within a region. However, for customers that need to keep processing available despite the unlikely failure of an entire region, the below multi-regional approach employs multi-region buckets for storing Hive datasets along with Dataproc metadata exports.

https://storage.googleapis.com/gweb-cloudblog-publish/images/DPMS.max-1600x1600.png

With this general availability launch, you can confidently run mission critical data lake applications that rely on a centralized metadata repository without having to deal with the toil of backups, testing, updates, database configurations, and creating highly available architectures. You can rely on Google’s expertise running data applications at a global scale.

How to get started

To learn more about how Google Cloud has modernized the Apache Hive Metastore for cloud native applications in the Dataproc Metastore, check out the replay from the open source Meetup, “How to Collaborate with Data Lake Management Communities”. This session also includes a talk on “How Delta Lake Addresses Data Lake Challenges”. We hope that you will also join us for future meetups to share Data Lake management best practices.

The Dataproc Metastore is offered in two service tiers, developer and enterprise, each of which offer different features, service levels, and pricing because they are intended for different use cases. Customers who have been using the Public Preview will automatically be upgraded to the generally available version and will receive charges according to the pricing documentation.

Get started with Dataproc Metastore today by importing your metadata into the Dataproc Metastore!

Posted in