BigQuery now supports manifest files for querying open table formats
Jinpeng Zhou
Software Engineer
Lakshmi Bobba
Data & Analytics Consultant
In this blog, we give you an overview of manifest support in BigQuery and explain how it enables querying open table formats like Apache Hudi and Delta Lake.
Open table formats rely on embedded metadata to provide transactionally consistent DML and time travel features. They keep multiple versions of their data files and can generate manifests: lists of data files that represent a point-in-time snapshot. Many data runtimes, like Delta Lake and Apache Hudi, can export these manifests for load and query use cases. BigQuery now supports manifest files, which makes it easier to query open table formats with BigQuery.
BigQuery supports manifest files in SymlinkTextInputFormat, which is simply a newline-delimited list of URIs. Customers can now set the file_set_spec_type option to NEW_LINE_DELIMITED_MANIFEST in table options to indicate that the provided URIs are newline-delimited manifest files, with one URI per line. This feature also supports partition pruning for hive-style partitioned tables, which leads to better performance and lower cost.
Here is an example of creating a BigLake table using a manifest file.
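A minimal sketch, assuming a Cloud resource connection already exists; the project, dataset, connection, bucket, and manifest paths below are placeholders:

    -- BigLake table over data files listed in newline-delimited manifest files
    CREATE EXTERNAL TABLE `my_project.my_dataset.my_table`
    WITH CONNECTION `my_project.us.my-connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/my-table/manifest/*'],
      file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'
    );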
Querying Apache Hudi using Manifest Support
Apache Hudi is an open-source data management framework for big data workloads. It's built on top of Apache Hadoop and provides a mechanism to manage data in a Hadoop Distributed File System (HDFS) or any cloud storage system.
Hudi tables can be queried from BigQuery as external tables using the Hudi-BigQuery Connector. The Hudi-BigQuery integration only works for hive-style partitioned Copy-On-Write tables. The implementation precludes the use of some important query processing optimizations, which hurts performance and increases slot consumption.
To overcome these pain points, the Hudi-BigQuery Connector has been upgraded to leverage BigQuery's manifest file support. Here is a step-by-step process to query Apache Hudi workloads using the connector.
Step 1: Download and build the BigQuery Hudi connector
Download and build the latest hudi-gcp-bundle to run the BigQuerySyncTool.
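A sketch of one way to do this, assuming a Maven build of the hudi-gcp-bundle module from the Apache Hudi repository; the module path and build profiles may differ across releases:

    # Clone the Apache Hudi repository and build the GCP bundle jar
    git clone https://github.com/apache/hudi.git && cd hudi
    mvn clean package -DskipTests -pl packaging/hudi-gcp-bundle -am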
Step 2: Run the spark application to generate a BigQuery external table
Here are the steps to use the connector with the manifest approach:
Drop the existing view that represents the Hudi table in BigQuery [if the old implementation is used]
The Hudi connector looks for the table name and, if one exists, it just updates the manifest file; queries will then start failing because of a schema mismatch. Make sure you drop the view before triggering the latest connector.
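For example, with hypothetical dataset and view names:

    -- Drop the view created by the old connector implementation
    DROP VIEW IF EXISTS `my_project.my_dataset.my_hudi_table`;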
Run the latest Hudi Connector to trigger the manifest approach
Run the BigQuerySyncTool with the --use-bq-manifest-file flag.
If you are transitioning from the old implementation, append the --use-bq-manifest-file flag to the current spark submit that runs the existing connector. Using the same table name is recommended, as it allows keeping the existing downstream pipeline code.
Running the connector with the --use-bq-manifest-file flag will export a manifest file in a format supported by BigQuery and use it to create an external table with the name specified in the --table parameter.
Here is a sample spark submit for the manifest approach.
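This sketch uses placeholder project, bucket, table, and partition names; the bundle and package versions are illustrative, and the flags follow the BigQuerySyncTool's documented options:

    spark-submit --master yarn \
      --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
      --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
      hudi-gcp-bundle-0.14.0.jar \
      --project-id my_project \
      --dataset-name my_dataset \
      --dataset-location us \
      --table my_hudi_table \
      --source-uri gs://my-bucket/my-hudi-table/dt=* \
      --source-uri-prefix gs://my-bucket/my-hudi-table/ \
      --base-path gs://my-bucket/my-hudi-table \
      --partitioned-by dt \
      --use-bq-manifest-file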
Step 3 (recommended): Upgrade to an accelerated BigLake table
Customers running large-scale analytics can upgrade external tables to BigLake tables to set appropriate fine-grained controls and accelerate the performance of these workloads by taking advantage of metadata caching and materialized views.
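A sketch of enabling metadata caching on the table created above; the names are placeholders and the staleness interval is an example value, per BigQuery's metadata caching options:

    -- Let BigQuery cache and automatically refresh file metadata,
    -- tolerating query results up to 4 hours stale
    ALTER TABLE `my_project.my_dataset.my_hudi_table`
    SET OPTIONS (
      metadata_cache_mode = 'AUTOMATIC',
      max_staleness = INTERVAL 4 HOUR
    );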
Querying Delta Lake using Manifest Support
Delta Lake is an open-source storage framework that enables building a lakehouse architecture. It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. It also provides an option to export a manifest file that contains a list of data files that represent the point-in-time snapshot.
With manifest support, users can create a BigLake table to query a Delta Lake table on GCS. It is the user's responsibility to regenerate the manifest whenever the underlying Delta Lake table changes, and this approach only supports querying Delta Lake reader v1 tables.
Here is a step-by-step process to query Delta Lake tables using manifest support.
Step 1: Generate the Delta table’s manifests using Apache Spark
Delta Lake supports exporting manifest files. The generate command creates manifest files at <path-to-delta-table>/_symlink_format_manifest/. The files in this directory list the data files (that is, Parquet files) that should be read to query a snapshot of the Delta table.
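For example, using Spark SQL (the GCS path is a placeholder):

    -- Export manifests for the current snapshot of the Delta table
    GENERATE symlink_format_manifest FOR TABLE delta.`gs://my-bucket/my-delta-table`;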
Step 2: Create a BigLake table on the generated manifests
Create a manifest-file-based BigLake table using the manifest files generated in the previous step. If the underlying Delta Lake table is partitioned, you can create a hive-style partitioned BigLake table.
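A sketch for a hive-style partitioned Delta table; the project, dataset, connection, and bucket names are placeholders:

    CREATE EXTERNAL TABLE `my_project.my_dataset.my_delta_table`
    WITH PARTITION COLUMNS
    WITH CONNECTION `my_project.us.my-connection`
    OPTIONS (
      format = 'PARQUET',
      hive_partition_uri_prefix = 'gs://my-bucket/my-delta-table',
      uris = ['gs://my-bucket/my-delta-table/_symlink_format_manifest/*/manifest'],
      file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'
    );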
Step 3 (recommended): Upgrade to an accelerated BigLake table
Customers running large-scale analytics on Delta Lake workloads can accelerate the performance by taking advantage of metadata caching and materialized views.
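For example, a materialized view can precompute common aggregations over the BigLake table. This sketch assumes metadata caching is enabled on the base table; all names, columns, and interval values are placeholders:

    CREATE MATERIALIZED VIEW `my_project.my_dataset.my_delta_mv`
    OPTIONS (
      allow_non_incremental_definition = true,
      max_staleness = INTERVAL 4 HOUR
    )
    AS
    SELECT dt, COUNT(*) AS row_count
    FROM `my_project.my_dataset.my_delta_table`
    GROUP BY dt;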
What’s Next?
If you are an OSS customer looking to query your Delta Lake or Apache Hudi workloads on GCS, leverage the manifest support. If you are also looking to further accelerate performance, take advantage of metadata caching and materialized views.
For more information
Accelerate BigLake performance to run large-scale analytics workloads
Introduction to BigLake tables.
Visit BigLake on Google Cloud.
Acknowledgments: Micah Kornfield, Brian Hulette, Silvian Calman, Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena and the rest of the BigQuery Engineering team.