Data Analytics

Introducing BigQuery metastore, a unified metadata service with Apache Iceberg support

January 22, 2025

Yuri Volobuev

Principal Engineer

Vinod Ramachandran

Senior Product Manager

Does your organization use multiple data processing engines like BigQuery, Apache Spark, Apache Flink and Apache Hive? Wouldn’t it be great if you could provide a single source of truth for all of your analytics workloads? Now you can, with the public preview of BigQuery metastore, a fully managed, unified metadata service that provides processing engine interoperability while enabling consistent data governance.

BigQuery metastore is a highly scalable runtime metadata service that works with multiple engines, for example, BigQuery, Apache Spark, Apache Hive and Apache Flink, and supports the open Apache Iceberg table format. This allows analytics engines to query one copy of the data with a single schema, whether the data is stored in BigQuery storage tables, BigQuery tables for Apache Iceberg, or BigLake external tables. BigQuery metastore serves as a critical component for customers looking to migrate and modernize from legacy data lakes to a modern lakehouse architecture. Integrated deeply with BigQuery’s enterprise capabilities, this solution provides built-in security and governance for user interactions with data.

The challenges of metadata management

Traditionally, metastores and other metadata management systems are tightly coupled with data processing engines. If you are using multiple processing engines, that means maintaining multiple copies of the data and metadata persisted in different metastores. For example, when you create a table definition in Hive Metastore for querying from an open-source engine like Spark, you have to recreate the table definition to query the same data in BigQuery. You also have to build pipelines to keep table definitions synchronized across different metastores. This fragmentation can result in stale metadata, lack of visibility into data lineage, security and access challenges, and a subpar user experience.
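To make the synchronization burden concrete, here is a minimal sketch of the kind of drift check such a pipeline ends up needing. It compares two copies of one table definition, each represented as a plain dictionary of column name to a simplified type string. The function name `schema_drift` and the sample schemas are hypothetical and do not correspond to any Hive Metastore or BigQuery API.

```python
def schema_drift(source_schema, target_schema):
    """Compare two column-name -> type mappings and report how they differ."""
    source_cols = set(source_schema)
    target_cols = set(target_schema)
    # Columns defined in the source copy but absent from the target copy
    missing = sorted(source_cols - target_cols)
    # Columns present only in the target copy
    extra = sorted(target_cols - source_cols)
    # Columns present in both copies whose declared types disagree
    changed = sorted(
        col for col in source_cols & target_cols
        if source_schema[col] != target_schema[col]
    )
    return {"missing": missing, "extra": extra, "changed": changed}


# Hypothetical copies of the same table definition kept in two metastores
hive_copy = {"order_id": "bigint", "amount": "double", "region": "string"}
bigquery_copy = {"order_id": "bigint", "amount": "string"}

drift = schema_drift(hive_copy, bigquery_copy)
print(drift)
# {'missing': ['region'], 'extra': [], 'changed': ['amount']}
```

Every additional metastore adds another pair of definitions to compare. With a single shared metastore there is only one definition, so there is nothing to reconcile.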

A metastore for the lakehouse era

BigQuery metastore is designed for the lakehouse architecture, which combines the benefits of data lakes and data warehouses on a unified platform for any data, any user and any workload, so you don't have to manage a data lake and a data warehouse separately. It supports open data formats such as Apache Iceberg that are accessible by a variety of processing engines, including BigQuery, Spark, Flink and Hive. Unifying metadata across engines makes it easier to discover and use data, and supports self-service BI and ML tools while maintaining data governance.

Furthermore, BigQuery metastore is serverless with no setup or configuration required and automatically scales with your workloads. This no-ops environment reduces TCO and democratizes your data for data analysts, data engineers and data scientists.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_irraJcJ.max-2000x2000.png

Key benefits of BigQuery metastore include:

Cross-engine interoperability: BigQuery metastore provides a single shared metastore for the lakehouse architecture, with a unified view of all metadata for all data sources in the lakehouse, making it easy for your users to find and understand the data they need. This enables query processing and DML for data stored in open and proprietary formats across object stores, BigQuery storage, and across analytics runtimes.
Support for open formats and catalogs: BigQuery metastore provides support for BigQuery storage tables, BigQuery tables for Apache Iceberg and external tables.
Built-in governance: BigQuery metastore is integrated with key governance capabilities provided in BigQuery, such as automated cataloging and universal search, business metadata, data profiling, data quality, fine-grained access controls, data masking, sharing, data lineage and audit logging.
Fully managed at BigQuery scale: Being a serverless, fully managed service, BigQuery metastore is very easy to use and has integrations with key engines (BigQuery, Spark, Hive and Flink). The infrastructure foundation used for BigQuery metastore ensures that it scales to the growing query processing volume of your application and can handle traffic at BigQuery scale.

BigQuery metastore in action

Now, let’s take a look at how to use BigQuery metastore. The PySpark script below sets up a Spark environment to interact with a BigQuery storage table, a BigQuery table for Apache Iceberg, and a read-only Apache Iceberg external table in BigQuery. For details, see the BigQuery metastore documentation.

from pyspark.sql import SparkSession

# Create a Spark session with an Iceberg catalog backed by BigQuery metastore
spark = (
    SparkSession.builder
    .appName("BigQuery Metastore Iceberg")
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog")
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID")
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION")
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY")
    .getOrCreate()
)
spark.conf.set("viewsEnabled", "true")

# Use the CATALOG_NAME catalog and the DATASET_NAME namespace
spark.sql("USE `CATALOG_NAME`")
spark.sql("USE NAMESPACE DATASET_NAME")

# Configure Spark for temp results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES")
df.show()

# Query a BigQuery storage table
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a BigQuery table for Apache Iceberg
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a BigQuery read-only Apache Iceberg external table
sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

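To customize this script for your environment, replace the following placeholders:

PROJECT_ID: the ID of your Google Cloud project
LOCATION: the location of your BigQuery resources
WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that contains your data warehouse
CATALOG_NAME: the name of the catalog that you're using
DATASET_NAME: the dataset (namespace) that contains the tables you want to query
TABLE_NAME, ICEBERG_TABLE_NAME, READONLY_ICEBERG_TABLE_NAME: the names of the tables to query
MATERIALIZATION_NAMESPACE: the namespace for storing temporary results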
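Because the catalog name is repeated in every configuration key, a typo in one of them is easy to miss. As a minimal sketch, the key/value pairs can be generated in one place and then passed to the session builder. The helper name `metastore_catalog_conf` and the sample values are hypothetical; the configuration keys and class names are the ones used in the script above.

```python
def metastore_catalog_conf(catalog_name, project_id, location, warehouse_dir):
    """Return the Spark settings that register an Iceberg catalog
    backed by BigQuery metastore.

    Hypothetical helper: only the keys and class names come from the
    script above.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
        f"{prefix}.gcp_project": project_id,
        f"{prefix}.gcp_location": location,
        f"{prefix}.warehouse": warehouse_dir,
    }


# Placeholder values for illustration, not real resources
conf = metastore_catalog_conf(
    "my_catalog", "my-project", "us", "gs://my-bucket/warehouse"
)
for key, value in conf.items():
    print(f"{key}={value}")
```

To apply the settings, loop over the dictionary and call `.config(key, value)` on `SparkSession.builder` for each pair before calling `.getOrCreate()`.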

Learn more

With BigQuery metastore, you now have a modern, serverless solution for your metadata management needs that enables cross-engine interoperability with built-in governance. To try out BigQuery metastore today, see the documentation. If you would like to migrate from Dataproc Metastore to BigQuery metastore, see the documentation on migration tooling.
