Use BigQuery metastore with Dataproc

This document explains how to use BigQuery metastore with Dataproc on Compute Engine. This connection provides you with a single, shared metastore that works across open source software engines, such as Apache Spark.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
  2. Enable the BigQuery, and Dataproc APIs.

    Enable the APIs

  3. Optional: Understand how BigQuery metastore works and why you should use it.

Required roles

To get the permissions that you need to use Spark and Dataproc with BigQuery metastore as a metadata store, ask your administrator to grant you the following IAM roles:

  • Create a Dataproc cluster: Dataproc Worker (roles/dataproc.worker) on the Compute Engine default service account in the project
  • Create BigQuery metastore tables in Spark:
  • Query BigQuery metastore tables in BigQuery:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

General workflow

To use Dataproc on Compute Engine with BigQuery metastore, you follow these general steps:

  1. Create a Dataproc cluster or configure an existing cluster.
  2. Connect to your preferred open source software engine, such as Spark.
  3. Use a JAR file to install the Apache Iceberg catalog plugin on the cluster.
  4. Create and manage your BigQuery metastore resources as needed, depending on the open source software engine that you're using.
  5. In BigQuery, access and use your BigQuery metastore resources.

Connect BigQuery metastore to Spark

The following instructions show you how to connect Dataproc to BigQuery metastore using interactive Spark SQL.

Download the Iceberg catalog plugin

To connect BigQuery metastore with Dataproc and Spark, you must use the BigQuery metastore Iceberg catalog plugin jar file.

This file is included by default in Dataproc image version 2.2. If your Dataproc clusters don't have direct access to the internet, you must download the plugin and upload it to a Cloud Storage bucket that your Dataproc cluster can access.

Download the BigQuery metastore Apache Iceberg catalog plugin.

Configure a Dataproc cluster

Before you connect to BigQuery metastore, you must set up a Dataproc cluster.

To do this, you can create a new cluster or use an existing cluster. After, you use this cluster to run interactive Spark SQL and manage your BigQuery metastore resources.

  • The subnet in the region where the cluster is created must have Private Google Access (PGA) enabled. By default, Dataproc cluster VMs, created with a 2.2 (default) or later image version, have internal IP addresses only. To allow cluster VMs to communicate with Google APIs, enable Private Google Access on the default (or user-specified network name, if applicable) network subnet in the region where the cluster is created.

  • If you want to run the Zeppelin web interface example in this guide, you must use or create a Dataproc cluster with the Zeppelin optional component enabled.

New cluster

To create a new Dataproc cluster, run the following gcloud dataproc clusters create command. This configuration contains the settings that you need to use the BigQuery metastore.

gcloud dataproc clusters create CLUSTER_NAME \
  --project=PROJECT_ID \
  --region=LOCATION \
  --optional-components=ZEPPELIN \
  --enable-component-gateway \
  --single-node

Replace the following:

  • CLUSTER_NAME: a name for your Dataproc cluster.
  • PROJECT_ID: the ID of the Google Cloud project where you're creating the cluster.
  • LOCATION: the Google Cloud region where you're creating the cluster.

Existing cluster

To configure an existing cluster, add the following Iceberg Spark runtime to your cluster.

org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2

You can add the runtime using one of the following options:

Submit a Spark job

To submit a Spark job, use one of the following methods:

gcloud CLI

gcloud dataproc jobs submit spark-sql \
--project=PROJECT_ID \
--cluster=CLUSTER_NAME \
--region==REGION \
--jars=https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.0-beta.jar \
--properties=spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog, \
spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog, \
spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID, \
spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION, \
spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY \
--execute="SPARK_SQL_COMMAND"

Replace the following:

  • PROJECT_ID: the ID of the Google Cloud project that contains the Dataproc cluster.
  • CLUSTER_NAME: the name of the Dataproc cluster that you're using to run the Spark SQL job.
  • REGION: the Compute Engine region where your cluster is located.
  • LOCATION: the location of the BigQuery resources.
  • CATALOG_NAME: the name of the Spark catalog to that you're using with your SQL job.
  • WAREHOUSE_DIRECTORY: the Cloud Storage folder that contains your data warehouse. This value starts with gs://.
  • SPARK_SQL_COMMAND: the Spark SQL query that you want to run. This query includes the commands to create your resources. For example, to create a namespace and table.

Interactive Spark

Connect to Spark and install the catalog plugin

To install the catalog plugin for BigQuery metastore, connect to your Dataproc cluster using SSH.

  1. In the Google Cloud console, go to the VM Instances page.
  2. To connect to a Dataproc VM instance, click SSH in the list of virtual machine instances. The output is similar to the following:

    Connected, host fingerprint: ssh-rsa ...
    Linux cluster-1-m 3.16.0-0.bpo.4-amd64 ...
    ...
    example-cluster@cluster-1-m:~$
    
  3. In the terminal, run the following BigQuery metastore initialization command:

    spark-sql \
    --jars https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.0-beta.jar \
    --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog \
    --conf spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID \
    --conf spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION \
    --conf spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY

    Replace the following:

    • CATALOG_NAME: the name of the Spark catalog to that you're using with your SQL job.
    • PROJECT_ID: the Google Cloud project ID of the BigQuery metastore catalog that your Spark catalog links with.
    • LOCATION: the Google Cloud location of the BigQuery metastore.
    • WAREHOUSE_DIRECTORY: the Cloud Storage folder that contains your data warehouse. This value starts with gs://.

    After you successfully connect to a cluster, your Spark terminal displays the spark-sql prompt.

    spark-sql (default)>
    

Manage BigQuery metastore resources

You are now connected to the BigQuery metastore. You can view your existing resources or create new resources based on your metadata stored in BigQuery metastore.

For example, try running the following commands in the interactive Spark SQL session to create an Iceberg namespace and table.

  • Use the custom Iceberg catalog:

    USE `CATALOG_NAME`;
  • Create a namespace:

    CREATE NAMESPACE IF NOT EXISTS NAMESPACE_NAME;
  • Use the created namespace:

    USE NAMESPACE_NAME;
  • Create an Iceberg table:

    CREATE TABLE TABLE_NAME (id int, data string) USING ICEBERG;
  • Insert a table row:

    INSERT INTO TABLE_NAME VALUES (1, "first row");
  • Add a table column:

    ALTER TABLE TABLE_NAME ADD COLUMNS (newDoubleCol double);
  • View table metadata:

    DESCRIBE EXTENDED TABLE_NAME;
  • List tables in the namespace:

    SHOW TABLES;

Zeppelin notebook

  1. In the Google Cloud console, go to the Dataproc Clusters page.

    Go to Dataproc clusters

  2. Click the name of cluster that you want to use.

    The Cluster Details page opens.

  3. In the navigation menu, click Web interfaces.

  4. Under Component gateway, click Zeppelin. The Zeppelin notebook page opens.

  5. In the navigation menu, click Notebook and then click +Create new note.

  6. In the dialog, enter a notebook name. Leave Spark selected as the default interpreter.

  7. Click Create. A new notebook is created.

  8. In the notebook, click the settings menu and then click Interpreter.

  9. In the Search interpreters field, search for Spark.

  10. Click Edit.

  11. In the Spark.jars field, enter the URI of the Spark jar.

    https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.0-beta.jar
    
  12. Click Save.

  13. Click OK.

  14. Copy the following PySpark code into your Zeppelin notebook.

    %pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
    .appName("BigQuery Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
    spark.sql("select version()").show()
    spark.sql("USE `CATALOG_NAME`;")
    spark.sql("CREATE NAMESPACE IF NOT EXISTS NAMESPACE_NAME;")
    spark.sql("USE NAMESPACE_NAME;")
    spark.sql("CREATE TABLE TABLE_NAME (id int, data string) USING ICEBERG;")
    spark.sql("DESCRIBE TABLE_NAME;").show()

    Replace the following:

    • CATALOG_NAME: the name of the Spark catalog to use for the SQL job.
    • PROJECT_ID: the ID of the Google Cloud project that contains the Dataproc cluster.
    • WAREHOUSE_DIRECTORY: the Cloud Storage folder that contains your data warehouse. This value starts with gs://.
    • NAMESPACE_NAME: the namespace name that references your Spark table.
    • WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder where your data warehouse is stored.
    • TABLE_NAME: a table name for your Spark table.
  15. Click the run icon or press Shift-Enter to run the code. When the job completes, the status message shows "Spark Job Finished", and the output displays the table contents:

What's next