Create an Apache Iceberg table with metadata in BigQuery Metastore
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see the launch stage descriptions.
This document shows you how to create an Apache Iceberg table with metadata in BigQuery Metastore using the Dataproc Jobs service, the Spark SQL CLI, or the Zeppelin web interface running on a Dataproc cluster.
Before you begin
If you haven't done so, create a Google Cloud project, a Cloud Storage bucket, and a Dataproc cluster.
Set up your project
Sign in to your Google Cloud account. If you're new to
Google Cloud,
create an account to evaluate how our products perform in
real-world scenarios. New customers also get $300 in free credits to
run, test, and deploy workloads.
In the Google Cloud console, on the project selector page,
select or create a Google Cloud project.
Note : If you don't plan to keep the
resources that you create in this procedure, create a project instead of
selecting an existing project. After you finish these steps, you can
delete the project, removing all resources associated with the project.
Go to project selector
Make sure that billing is enabled for your Google Cloud project.
Enable the Dataproc, BigQuery, and Cloud Storage APIs.
Enable the APIs
Install the Google Cloud CLI.
To initialize the gcloud CLI, run the following command:
gcloud init
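With the gcloud CLI initialized, you can also enable the required APIs from the command line instead of using the Enable the APIs button in the earlier step. The following command is a sketch that uses the standard service names for the Dataproc, BigQuery, and Cloud Storage APIs:
# Enable the Dataproc, BigQuery, and Cloud Storage APIs in the current project.
gcloud services enable dataproc.googleapis.com bigquery.googleapis.com storage.googleapis.com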
Create a Cloud Storage bucket
in your project.
In the Google Cloud console, go to the Cloud Storage
Buckets page.
Go to Buckets page
Click Create bucket.
On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
For Name your bucket, enter a name that meets the bucket naming requirements.
For Choose where to store your data, do the following:
Select a Location type option.
Select a Location option.
For Choose a default storage class for your data, select a storage class.
For Choose how to control access to objects, select an Access control option.
For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
Click Create.
Create a Dataproc cluster. To save resources and costs, you can create a single-node Dataproc cluster to run the examples presented in this document.
The subnet in the region where the cluster is created must have Private Google Access (PGA) enabled.
By default, Dataproc cluster VMs created with a 2.2 (default) or later image version have internal IP addresses only. To allow cluster VMs to communicate with Google APIs, enable Private Google Access on the default network subnet (or on the subnet of the user-specified network, if applicable) in the region where the cluster is created.
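For example, you can enable Private Google Access on the default subnet in the cluster region from the gcloud CLI. This is a sketch that assumes the cluster uses the default network; substitute your own subnet and region as needed:
# Enable Private Google Access on the default subnet in the cluster region.
gcloud compute networks subnets update default \
    --region=REGION \
    --enable-private-ip-google-access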
If you want to run the Zeppelin web interface example in this guide, you must use or create a Dataproc cluster with the Zeppelin optional component enabled.
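As a starting point, the following gcloud command sketches how you might create a single-node cluster with the Zeppelin optional component and Component Gateway enabled. The cluster name is a placeholder; add or change flags to match your environment:
# Create a single-node Dataproc cluster with the Zeppelin component and Component Gateway.
gcloud dataproc clusters create example-iceberg-cluster \
    --project=PROJECT_ID \
    --region=REGION \
    --single-node \
    --optional-components=ZEPPELIN \
    --enable-component-gateway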
Grant roles to a custom service account (if needed): By default,
Dataproc cluster VMs use the
Compute Engine default service account
to interact with Dataproc. If you want to specify a
custom service account when you
create your cluster, it must have the Dataproc Worker role (roles/dataproc.worker) or a custom role with the needed Worker role permissions.
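For example, you can grant the Dataproc Worker role to a custom service account with the gcloud CLI. In this sketch, SERVICE_ACCOUNT_EMAIL is a placeholder for the email address of your custom service account:
# Grant the Dataproc Worker role to the custom service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/dataproc.worker"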
OSS database to BigQuery dataset mapping
Note the following mapping between open source database and BigQuery
dataset terms:
OSS database → BigQuery dataset
Namespace, Database → Dataset
Partitioned or Unpartitioned Table → Table
View → View
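For example, after you create the example_namespace namespace in a later section, you can confirm that it appears as a BigQuery dataset by listing the datasets in your project with the bq command-line tool (installed with the gcloud CLI):
# List the datasets in the project; the Iceberg namespace appears as a dataset.
bq ls --project_id=PROJECT_ID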
Create an Iceberg table
This section shows you how to create an Iceberg table with metadata in BigQuery Metastore by submitting Spark SQL code to the Dataproc service, the Spark SQL CLI, or the Zeppelin component web interface, which run on a Dataproc cluster.
You can submit a job to a Dataproc cluster through the Dataproc service by using the Google Cloud console or the Google Cloud CLI, or by making an HTTP REST request or a programmatic gRPC call (through the Dataproc Cloud Client Libraries) to the Dataproc Jobs API.
The examples in this section show you how to submit a Spark SQL job to the Dataproc service to create an Iceberg table with metadata in BigQuery Metastore using the gcloud CLI, the Google Cloud console, or the Dataproc REST API.
Prepare job files
Perform the following steps to create a Spark SQL job file. The file contains
Spark SQL commands to create and update an Iceberg table.
In a local terminal window or in Cloud Shell, use a text editor, such as vi or nano, to copy the following commands into an iceberg-table.sql file, then save the file in the current directory.
USE CATALOG_NAME ;
CREATE NAMESPACE IF NOT EXISTS example_namespace;
USE example_namespace;
DROP TABLE IF EXISTS example_table;
CREATE TABLE example_table (id int, data string) USING ICEBERG LOCATION 'gs://BUCKET/WAREHOUSE_FOLDER';
INSERT INTO example_table VALUES (1, 'first row');
ALTER TABLE example_table ADD COLUMNS (newDoubleCol double);
DESCRIBE TABLE example_table;
Replace the following:
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket and folder used for the Iceberg warehouse.
Use the gsutil tool to copy the local iceberg-table.sql file to your bucket in Cloud Storage.
gsutil cp iceberg-table.sql gs://BUCKET/
Next, download and copy the iceberg-spark-runtime-3.5_2.12-1.5.2
JAR file to
Cloud Storage.
In a local terminal window or in Cloud Shell, run the following curl command to download the iceberg-spark-runtime-3.5_2.12-1.5.2 JAR file to the current directory.
curl -o iceberg-spark-runtime-3.5_2.12-1.5.2.jar https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar
Use the gsutil
tool to copy the local
iceberg-spark-runtime-3.5_2.12-1.5.2
JAR file from the current directory
to your bucket in Cloud Storage.
gsutil cp iceberg-spark-runtime-3.5_2.12-1.5.2.jar gs://BUCKET/
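Optionally, you can verify that both files were copied by listing the contents of your bucket:
# List the bucket contents to confirm the SQL file and JAR file were uploaded.
gsutil ls gs://BUCKET/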
Submit the Spark SQL job
Select a tab to follow the instructions to submit the Spark SQL job to the
Dataproc service using the gcloud CLI,
Google Cloud console, or Dataproc
REST API.
Run the following gcloud dataproc jobs submit spark-sql command in a local terminal window or in Cloud Shell to submit the Spark SQL job to create the Iceberg table.
gcloud dataproc jobs submit spark-sql \
--project=PROJECT_ID \
--cluster=CLUSTER_NAME \
--region=REGION \
--jars="gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar" \
--properties="spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog,spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID,spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION,spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/WAREHOUSE_FOLDER" \
-f="gs://BUCKET/iceberg-table.sql"
Notes:
PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
CLUSTER_NAME: The name of your Dataproc cluster.
REGION: The Compute Engine region where your cluster is located.
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket and folder used for the Iceberg warehouse.
LOCATION: A supported BigQuery location. The default location is "US".
--jars: The listed JARs are necessary to create table metadata in BigQuery Metastore.
--properties: Catalog properties.
-f: The iceberg-table.sql job file you copied to your bucket in Cloud Storage.
View the table description in the terminal output when the job finishes.
Time taken: 2.194 seconds
id int
data string
newDoubleCol double
Time taken: 1.479 seconds, Fetched 3 row(s)
Job JOB_ID finished successfully.
To view table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
Go to BigQuery Studio
View Iceberg table metadata.
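If you prefer the command line, a bq show command such as the following can also display the table's schema and metadata. This sketch assumes the example_namespace and example_table names used in the job file:
# Show the schema and metadata of the Iceberg table registered in BigQuery.
bq show --project_id=PROJECT_ID --format=prettyjson example_namespace.example_table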
Perform the following steps to use the Google Cloud console to submit
the Spark SQL job to the Dataproc service to create an
Iceberg table with metadata in BigQuery Metastore.
In the Google Cloud console, go to the Dataproc Submit a job page.
Go to Submit a job
Then complete the following fields:
Job ID: Accept the suggested ID or insert your own ID.
Region: Select the region where your cluster is located.
Cluster: Select your cluster.
Job type: Select SparkSql.
Query source type: Select Query file.
Query file: Insert gs://BUCKET/iceberg-table.sql
Jar files: Insert the following:
gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar
Properties: Click Add Property five times to create a list of five key-value input fields, then copy the following Key and Value pairs to define five properties.
1. Key: spark.sql.catalog.CATALOG_NAME
   Value: org.apache.iceberg.spark.SparkCatalog
2. Key: spark.sql.catalog.CATALOG_NAME.catalog-impl
   Value: org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog
3. Key: spark.sql.catalog.CATALOG_NAME.gcp_project
   Value: PROJECT_ID
4. Key: spark.sql.catalog.CATALOG_NAME.gcp_location
   Value: LOCATION
5. Key: spark.sql.catalog.CATALOG_NAME.warehouse
   Value: gs://BUCKET/WAREHOUSE_FOLDER
Notes:
CATALOG_NAME: Iceberg catalog name.
PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
LOCATION: A supported BigQuery location. The default location is "US".
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket and folder used for the Iceberg warehouse.
Click Submit.
To monitor job progress and view job output, go to the Dataproc
Jobs page in the Google Cloud console,
then click the Job ID
to open the Job details page.
To view table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
Go to BigQuery Studio
View Iceberg table metadata.
You can use the Dataproc
jobs.submit API
to submit the Spark SQL job to the Dataproc service to create an
Iceberg table with metadata in BigQuery Metastore.
Before using any of the request data,
make the following replacements:
PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
CLUSTER_NAME: The name of your Dataproc cluster.
REGION: The Compute Engine region where your cluster is located.
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket and folder used for the Iceberg warehouse.
LOCATION: A supported BigQuery location. The default location is "US".
jarFileUris: The listed JARs are necessary to create table metadata in BigQuery Metastore.
properties: Catalog properties.
queryFileUri: The iceberg-table.sql job file you copied to your bucket in Cloud Storage.
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit
Request JSON body:
{
  "projectId": "PROJECT_ID",
  "job": {
    "placement": {
      "clusterName": "CLUSTER_NAME"
    },
    "statusHistory": [],
    "reference": {
      "jobId": "",
      "projectId": "PROJECT_ID"
    },
    "sparkSqlJob": {
      "properties": {
        "spark.sql.catalog.CATALOG_NAME": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.CATALOG_NAME.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
        "spark.sql.catalog.CATALOG_NAME.gcp_project": "PROJECT_ID",
        "spark.sql.catalog.CATALOG_NAME.gcp_location": "LOCATION",
        "spark.sql.catalog.CATALOG_NAME.warehouse": "gs://BUCKET/WAREHOUSE_FOLDER"
      },
      "jarFileUris": [
        "gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.5.2.jar",
        "gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar"
      ],
      "scriptVariables": {},
      "queryFileUri": "gs://BUCKET/iceberg-table.sql"
    }
  }
}
To send your request, expand one of these options:
curl (Linux, macOS, or Cloud Shell)
Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit"
PowerShell (Windows)
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/jobs:submit" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{
  "reference": {
    "projectId": "PROJECT_ID",
    "jobId": "..."
  },
  "placement": {
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "..."
  },
  "status": {
    "state": "PENDING",
    "stateStartTime": "..."
  },
  "submittedBy": "USER",
  "sparkSqlJob": {
    "queryFileUri": "gs://BUCKET/iceberg-table.sql",
    "properties": {
      "spark.sql.catalog.USER_catalog": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.USER_catalog.catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
      "spark.sql.catalog.USER_catalog.gcp_project": "PROJECT_ID",
      "spark.sql.catalog.USER_catalog.gcp_location": "LOCATION",
      "spark.sql.catalog.USER_catalog.warehouse": "gs://BUCKET/WAREHOUSE_FOLDER"
    },
    "jarFileUris": [
      "gs://BUCKET/iceberg-spark-runtime-3.5_2.12-1.5.2.jar",
      "gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar"
    ]
  },
  "driverControlFilesUri": "gs://dataproc-...",
  "driverOutputResourceUri": "gs://dataproc-.../driveroutput",
  "jobUuid": "...",
  "region": "REGION"
}
To monitor job progress and view job output, go to the Dataproc
Jobs page in the Google Cloud console,
then click the Job ID
to open the Job details page.
To view table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
Go to BigQuery Studio
View Iceberg table metadata.
The following steps show you how to create an Iceberg table with table metadata
stored in BigQuery Metastore using the Spark SQL CLI running on the
master node of a Dataproc cluster.
Use SSH to connect to the master node
of your Dataproc cluster.
In the SSH session terminal, use the vi
or nano
text editor to copy the
following commands into an iceberg-table.sql
file.
SET CATALOG_NAME = `CATALOG_NAME`;
SET BUCKET = `BUCKET`;
SET WAREHOUSE_FOLDER = `WAREHOUSE_FOLDER`;
USE `${CATALOG_NAME}`;
CREATE NAMESPACE IF NOT EXISTS `${CATALOG_NAME}`.example_namespace;
DROP TABLE IF EXISTS `${CATALOG_NAME}`.example_namespace.example_table;
CREATE TABLE `${CATALOG_NAME}`.example_namespace.example_table (id int, data string) USING ICEBERG LOCATION 'gs://${BUCKET}/${WAREHOUSE_FOLDER}';
INSERT INTO `${CATALOG_NAME}`.example_namespace.example_table VALUES (1, 'first row');
ALTER TABLE `${CATALOG_NAME}`.example_namespace.example_table ADD COLUMNS (newDoubleCol double);
DESCRIBE TABLE `${CATALOG_NAME}`.example_namespace.example_table;
Replace the following:
CATALOG_NAME: Iceberg catalog name.
BUCKET and WAREHOUSE_FOLDER: Cloud Storage bucket and folder used for the Iceberg warehouse.
In the SSH session terminal, run the following spark-sql command to create the Iceberg table.
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
--jars https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar \
--conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.CATALOG_NAME.catalog-impl=org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog \
--conf spark.sql.catalog.CATALOG_NAME.gcp_project=PROJECT_ID \
--conf spark.sql.catalog.CATALOG_NAME.gcp_location=LOCATION \
--conf spark.sql.catalog.CATALOG_NAME.warehouse=gs://BUCKET/WAREHOUSE_FOLDER \
-f iceberg-table.sql
Replace the following:
PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
LOCATION: A supported BigQuery location. The default location is "US".
View table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
Go to BigQuery Studio
View Iceberg table metadata.
The following steps show you how to create an Iceberg table with table metadata stored in BigQuery Metastore using the Zeppelin web interface running on the master node of a Dataproc cluster.
In the Google Cloud console, go to the Dataproc Clusters
page.
Go to the Dataproc Clusters page
Select your cluster name to open the Cluster details page.
Click the Web Interfaces tab to display a list of Component Gateway links to the web interfaces of default and optional components installed on the cluster.
Click the Zeppelin link to open the Zeppelin web interface.
In the Zeppelin web interface, click the anonymous menu, then click
Interpreter to open the Interpreters page.
Add two jars to the Zeppelin Spark interpreter, as follows:
Type "Spark" in Search interpreters
box to scroll to the Spark
interpreter section.
Click edit .
Paste the following in the spark.jars field:
https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,gs://spark-lib/bigquery/iceberg-bigquery-catalog-1.5.2-1.0.1-beta.jar
Click Save at the bottom of the Spark interpreter section, then
click OK to update the interpreter and restart the Spark interpreter
with the new settings.
From the Zeppelin notebook menu, click Create new note.
In the Create new note dialog, enter a name for the notebook, and accept the default spark interpreter. Click Create to open the notebook.
Copy the following PySpark code into your Zeppelin notebook
after filling in the variables.
%pyspark
from pyspark.sql import SparkSession

project_id = "PROJECT_ID"
catalog = "CATALOG_NAME"
namespace = "NAMESPACE"
location = "LOCATION"
warehouse_dir = "gs://BUCKET/WAREHOUSE_DIRECTORY"

spark = SparkSession.builder \
  .appName("BigQuery Metastore Iceberg") \
  .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
  .config(f"spark.sql.catalog.{catalog}.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
  .config(f"spark.sql.catalog.{catalog}.gcp_project", f"{project_id}") \
  .config(f"spark.sql.catalog.{catalog}.gcp_location", f"{location}") \
  .config(f"spark.sql.catalog.{catalog}.warehouse", f"{warehouse_dir}") \
  .getOrCreate()

spark.sql(f"USE `{catalog}`;")
spark.sql(f"CREATE NAMESPACE IF NOT EXISTS `{namespace}`;")
spark.sql(f"USE `{namespace}`;")

# Create the table and display its schema (without LOCATION).
spark.sql("DROP TABLE IF EXISTS example_iceberg_table")
spark.sql("CREATE TABLE example_iceberg_table (id int, data string) USING ICEBERG")
spark.sql("DESCRIBE example_iceberg_table;")

# Insert table data.
spark.sql("INSERT INTO example_iceberg_table VALUES (1, 'first row');")

# Alter the table, then display the schema.
spark.sql("ALTER TABLE example_iceberg_table ADD COLUMNS (newDoubleCol double);")

# Select and display the contents of the table.
spark.sql("SELECT * FROM example_iceberg_table").show()
Replace the following:
PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
CATALOG_NAME and NAMESPACE: The Iceberg catalog name and namespace combine to identify the Iceberg table (catalog.namespace.table_name).
LOCATION: A supported BigQuery location. The default location is "US".
BUCKET and WAREHOUSE_DIRECTORY: Cloud Storage bucket and folder used as the Iceberg warehouse directory.
Click the run icon or press Shift-Enter to run the code. When the job completes, the status message shows "Spark Job Finished", and the output displays the table contents.
View table metadata in BigQuery
In the Google Cloud console, go to the BigQuery page.
Go to BigQuery Studio
View Iceberg table metadata.