Use BigQuery metastore with Dataproc Serverless

This document explains how to use BigQuery metastore with Dataproc Serverless.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
  2. Enable the BigQuery and Dataproc APIs (see the example command after this list).

  3. Optional: Understand how BigQuery metastore works and why you should use it.
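
For step 2, if you prefer the gcloud CLI to the Google Cloud console, you can enable both APIs with the gcloud services enable command. This is a minimal sketch that assumes you have already set your default project with gcloud config set project:

    gcloud services enable bigquery.googleapis.com dataproc.googleapis.com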

Required roles

To get the permissions that you need to use Spark and Dataproc Serverless with BigQuery metastore as a metadata store, ask your administrator to grant you the following IAM roles:

  • Create BigQuery metastore tables in Spark:
    • Dataproc Worker (roles/dataproc.worker) on the Dataproc Serverless service account in the project
    • BigQuery Data Editor (roles/bigquery.dataEditor) on the Dataproc Serverless service account in the project
    • Storage Object Admin (roles/storage.objectAdmin) on the Dataproc Serverless service account in the project
  • Query BigQuery metastore tables in BigQuery:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
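
For example, one way to grant the table-creation roles to the Dataproc Serverless service account is with gcloud projects add-iam-policy-binding. The following sketch assumes that your batch workload runs as the default Compute Engine service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com); substitute your own service account email if you use a custom one:

    gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"

Repeat the command with roles/bigquery.dataEditor and roles/storage.objectAdmin to grant the remaining roles.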

General workflow

To use BigQuery metastore with Dataproc Serverless, follow these general steps:

  1. Create a file with the commands that you want to run in BigQuery metastore.
  2. Connect to your open source software engine of choice.
  3. Submit a batch job using the method of your choice, such as Spark SQL or PySpark.

Connect BigQuery metastore with Spark

The following instructions show you how to connect Dataproc Serverless to BigQuery metastore:

Spark SQL

To submit a Spark SQL batch job, complete the following steps.

  1. Create a SQL file with the Spark SQL commands that you want to run in BigQuery metastore. For example, the following commands create a namespace and a table.

    CREATE NAMESPACE `CATALOG_NAME`.NAMESPACE_NAME;
    CREATE TABLE `CATALOG_NAME`.NAMESPACE_NAME.TABLE_NAME (id int, data string) USING ICEBERG LOCATION 'WAREHOUSE_DIRECTORY';

    Replace the following:

    • CATALOG_NAME: the catalog name that references your Spark table.
    • NAMESPACE_NAME: the namespace name that references your Spark table.
    • TABLE_NAME: a table name for your Spark table.
    • WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder where your data warehouse is stored.
  2. Submit a Spark SQL batch job by running the following gcloud dataproc batches submit spark-sql command:

    gcloud dataproc batches submit spark-sql SQL_SCRIPT_PATH \
    --project=PROJECT_ID \
    --region=REGION \
    --subnet=projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --deps-bucket=BUCKET_PATH \
    --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY"

    Replace the following:

    • SQL_SCRIPT_PATH: the path to the SQL file that the batch job uses.
    • PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
    • REGION: the region where your workload runs.
    • SUBNET_NAME: optional. The name of a VPC subnet in REGION that has Private Google Access enabled and meets other session subnet requirements.
    • LOCATION: the location to run the batch job in.
    • BUCKET_PATH: the Cloud Storage bucket where the workload uploads its dependencies. The WAREHOUSE_DIRECTORY is located in this bucket. The gs:// URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, mybucketname1.

    For more information on submitting Spark batch jobs, see Run a Spark batch workload.
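
After the batch job succeeds, you can confirm that the new table is visible in BigQuery. As a quick check, assuming the namespace is surfaced in BigQuery as a dataset with the same name, you can query the table with the bq CLI, using the same placeholder values as in the previous steps:

    bq query --use_legacy_sql=false \
    'SELECT * FROM `PROJECT_ID.NAMESPACE_NAME.TABLE_NAME` LIMIT 10'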

PySpark

To submit a PySpark batch job, complete the following steps.

  1. Create a Python file with the PySpark commands that you want to run in BigQuery metastore.

    For example, the following code sets up a Spark session to interact with Iceberg tables stored in BigQuery metastore. It then creates a new namespace and an Iceberg table within that namespace.

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder \
    .appName("BigQuery Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
    
    spark.sql("USE `CATALOG_NAME`;")
    spark.sql("CREATE NAMESPACE IF NOT EXISTS NAMESPACE_NAME;")
    spark.sql("USE NAMESPACE_NAME;")
    spark.sql("CREATE TABLE TABLE_NAME (id int, data string) USING ICEBERG LOCATION 'WAREHOUSE_DIRECTORY';")

    Replace the following:

    • PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
    • LOCATION: the location where the BigQuery resources are located.
    • CATALOG_NAME: the catalog name that references your Spark table.
    • TABLE_NAME: a table name for your Spark table.
    • WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder where your data warehouse is stored.
    • NAMESPACE_NAME: the namespace name that references your Spark table.
  2. Submit the batch job using the following gcloud dataproc batches submit pyspark command:

    gcloud dataproc batches submit pyspark PYTHON_SCRIPT_PATH \
     --version=2.2 \
     --project=PROJECT_ID \
     --region=REGION \
     --deps-bucket=BUCKET_PATH \
     --jars=https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.0-beta.jar \
     --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY"

    Replace the following:

    • PYTHON_SCRIPT_PATH: the path to the Python script that the batch job uses.
    • PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
    • REGION: the region where your workload runs.
    • BUCKET_PATH: the Cloud Storage bucket where the workload uploads its dependencies. The gs:// URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, mybucketname1.

    For more information on submitting PySpark batch jobs, see the PySpark gcloud reference.
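
With either method, you can monitor a submitted batch from the gcloud CLI. In the following sketch, BATCH_ID is the batch identifier shown in the output of the submit command:

    gcloud dataproc batches list --region=REGION
    gcloud dataproc batches describe BATCH_ID --region=REGION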

What's next