Use BigQuery metastore with Dataproc Serverless
This document explains how to use BigQuery metastore with Dataproc Serverless.
Before you begin
- Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
- Enable the BigQuery and Dataproc APIs.
- Optional: Understand how BigQuery metastore works and why you should use it.
Required roles
To get the permissions that you need to use Spark and Dataproc Serverless with BigQuery metastore as a metadata store, ask your administrator to grant you the following IAM roles:
- Create BigQuery metastore tables in Spark:
  - Dataproc Worker (roles/dataproc.worker) on the Dataproc Serverless service account in the project
  - BigQuery Data Editor (roles/bigquery.dataEditor) on the Dataproc Serverless service account in the project
  - Storage Object Admin (roles/storage.objectAdmin) on the Dataproc Serverless service account in the project
- Query BigQuery metastore tables in BigQuery:
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on the project
  - BigQuery User (roles/bigquery.user) on the project
  - Storage Object Viewer (roles/storage.objectViewer) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
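For example, once the roles for querying are granted, a principal can read a metastore-backed table directly from BigQuery. The following is a minimal, non-authoritative sketch that uses the google-cloud-bigquery Python client; it assumes that the namespace and table created later in this document (NAMESPACE_NAME and TABLE_NAME in PROJECT_ID) surface in BigQuery under those names.

from google.cloud import bigquery

# Hypothetical example: PROJECT_ID, NAMESPACE_NAME, and TABLE_NAME are the
# placeholders used in the Spark examples later in this document.
client = bigquery.Client(project="PROJECT_ID")

# Assumes the Spark namespace is visible in BigQuery as a dataset and the
# Iceberg table as a table within it.
rows = client.query(
    "SELECT * FROM `PROJECT_ID.NAMESPACE_NAME.TABLE_NAME` LIMIT 10"
).result()
for row in rows:
    print(row)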
General workflow
To use BigQuery metastore with Dataproc Serverless, follow these general steps:
- Create a file with the commands that you want to run in BigQuery metastore.
- Connect to your open source software engine of choice.
- Submit a batch job using the method of your choice, such as Spark SQL or PySpark.
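Each of these steps relies on the same set of Spark catalog properties. As an illustrative, non-authoritative summary, the following sketch collects those properties in one place; CATALOG_NAME, PROJECT_ID, LOCATION, and WAREHOUSE_DIRECTORY are the placeholders defined in the sections that follow, and the dictionary itself is only a convenient way to show the key-value pairs.

# Illustrative only: the properties that register a Spark catalog backed by
# BigQuery metastore. The same key-value pairs appear in the --properties
# flag and in the SparkSession config in the sections below.
bigquery_metastore_catalog_properties = {
    # Expose the catalog to Spark as an Iceberg catalog.
    "spark.sql.catalog.CATALOG_NAME": "org.apache.iceberg.spark.SparkCatalog",
    # Back the Iceberg catalog with BigQuery metastore.
    "spark.sql.catalog.CATALOG_NAME.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    # Project and location that own the metastore resources.
    "spark.sql.catalog.CATALOG_NAME.gcp_project": "PROJECT_ID",
    "spark.sql.catalog.CATALOG_NAME.gcp_location": "LOCATION",
    # Cloud Storage folder where table data is stored.
    "spark.sql.catalog.CATALOG_NAME.warehouse": "WAREHOUSE_DIRECTORY",
}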
Connect BigQuery metastore with Spark
The following instructions show you how to connect Dataproc Serverless to BigQuery metastore:
SparkSQL
To submit a Spark SQL batch job, complete the following steps.
Create a SQL file with the Spark SQL commands that you want to run in BigQuery metastore. For example, the following commands create a namespace and a table.
CREATE NAMESPACE `CATALOG_NAME`.NAMESPACE_NAME;
CREATE TABLE `CATALOG_NAME`.NAMESPACE_NAME.TABLE_NAME (id int, data string) USING ICEBERG LOCATION 'WAREHOUSE_DIRECTORY';
Replace the following:
- CATALOG_NAME: the catalog name that references your Spark table.
- NAMESPACE_NAME: the namespace name that references your Spark table.
- TABLE_NAME: a table name for your Spark table.
- WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder where your data warehouse is stored.
Submit a Spark SQL batch job by running the following gcloud dataproc batches submit spark-sql gcloud CLI command:

gcloud dataproc batches submit spark-sql SQL_SCRIPT_PATH \
  --project=PROJECT_ID \
  --region=REGION \
  --subnet=projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
  --deps-bucket=BUCKET_PATH \
  --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY"
Replace the following:
- SQL_SCRIPT_PATH: the path to the SQL file that the batch job uses.
- PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
- REGION: the region where your workload runs.
- SUBNET_NAME: Optional. The name of a VPC subnet in the REGION that has Private Google Access enabled and meets other session subnet requirements.
- LOCATION: the location to run the batch job in.
- BUCKET_PATH: the location of the Cloud Storage bucket to upload workload dependencies. The WAREHOUSE_DIRECTORY is located in this bucket. The gs:// URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, mybucketname1.
For more information on submitting Spark batch jobs, see Run a Spark batch workload.
PySpark
To submit a PySpark batch job, complete the following steps.
Create a Python file with the PySpark commands that you want to run in BigQuery metastore.
For example, the following code sets up a Spark environment to interact with Iceberg tables stored in BigQuery metastore, and then creates a namespace and an Iceberg table within that namespace.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("BigQuery Metastore Iceberg") \
  .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
  .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
  .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
  .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
  .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
  .getOrCreate()

spark.sql("USE `CATALOG_NAME`;")
spark.sql("CREATE NAMESPACE IF NOT EXISTS NAMESPACE_NAME;")
spark.sql("USE NAMESPACE_NAME;")
spark.sql("CREATE TABLE TABLE_NAME (id int, data string) USING ICEBERG LOCATION 'WAREHOUSE_DIRECTORY';")
Replace the following:
- PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
- LOCATION: the location where the BigQuery resources are located.
- CATALOG_NAME: the catalog name that references your Spark table.
- TABLE_NAME: a table name for your Spark table.
- WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder where your data warehouse is stored.
- NAMESPACE_NAME: the namespace name that references your Spark table.
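Optionally, you can extend the script with a quick check before submitting it. The following lines are an illustrative sketch rather than a required part of the setup; they assume the SparkSession, catalog, and namespace from the script above and simply write one row and read it back.

# Optional, illustrative check appended to the end of the script above:
# insert a row into the new Iceberg table and read it back.
spark.sql("INSERT INTO TABLE_NAME VALUES (1, 'first row');")
spark.sql("SELECT * FROM TABLE_NAME;").show()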
Submit the batch job using the following gcloud dataproc batches submit pyspark command:

gcloud dataproc batches submit pyspark PYTHON_SCRIPT_PATH \
  --version=2.2 \
  --project=PROJECT_ID \
  --region=REGION \
  --deps-bucket=BUCKET_PATH \
  --jars=https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.0-beta.jar \
  --properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=WAREHOUSE_DIRECTORY"
Replace the following:
- PYTHON_SCRIPT_PATH: the path to the Python script that the batch job uses.
- PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
- REGION: the region where your workload runs.
- BUCKET_PATH: the location of the Cloud Storage bucket to upload workload dependencies. The gs:// URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, mybucketname1.
For more information on submitting PySpark batch jobs, see the PySpark gcloud reference.
What's next
- Set up optional BigQuery metastore features.
- View and query tables from Spark in the BigQuery console.