To get the permissions that you need to use Spark and Dataproc with BigLake metastore as a metadata store, ask your administrator to grant you the following IAM roles:
- Create BigLake metastore tables in Spark:
  - Dataproc Worker (`roles/dataproc.worker`) on the Dataproc Serverless service account in the project
  - BigQuery Data Editor (`roles/bigquery.dataEditor`) on the Dataproc Serverless service account in the project
  - Storage Object Admin (`roles/storage.objectAdmin`) on the Dataproc Serverless service account in the project
- Query BigLake metastore tables in BigQuery:
  - BigQuery Data Viewer (`roles/bigquery.dataViewer`) on the project
  - BigQuery User (`roles/bigquery.user`) on the project
  - Storage Object Viewer (`roles/storage.objectViewer`) on the project
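For illustration, grants like these can be made with `gcloud projects add-iam-policy-binding`; the project ID and service account email below are placeholders, not values from this guide:

```bash
# Placeholder values; substitute your project and your Dataproc Serverless service account.
PROJECT_ID="my-project"
DATAPROC_SA="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com"

# Grant the roles needed to create BigLake metastore tables from Spark.
for role in roles/dataproc.worker roles/bigquery.dataEditor roles/storage.objectAdmin; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$DATAPROC_SA" \
    --role="$role"
done
```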
Replace the following in the Apache Iceberg external table definition (a sample `CREATE EXTERNAL TABLE` statement is sketched after this list):

- READONLY_ICEBERG_TABLE_NAME: a name for your read-only table.
- BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format `['gs://bucket_name/[folder_name/]file_name']`.
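A minimal sketch of that statement, assuming a Cloud resource connection already exists (for example, one named like `myproject.us.myconnection`):

```googlesql
CREATE OR REPLACE EXTERNAL TABLE `PROJECT_ID`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME
WITH CONNECTION `CONNECTION_NAME`
OPTIONS (
  format = 'ICEBERG',
  uris = ['BUCKET_PATH'],
  require_partition_filter = FALSE);
```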
From PySpark, query the standard table, the BigLake table for Apache Iceberg in BigQuery, and the Apache Iceberg external table.
```python
from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder \
  .appName("BigLake Metastore Iceberg") \
  .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
  .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
  .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
  .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
  .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
  .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the blms_catalog
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure spark for temp results
spark.sql("CREATE namespace if not exists MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
```
Replace the following:

- WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that's connected to your BigLake Iceberg table in BigQuery and your Iceberg external table.
- CATALOG_NAME: the name of the catalog that you're using.
- MATERIALIZATION_NAMESPACE: the namespace for storing temporary results.
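The same catalog configuration can also be used to create BigLake metastore tables from Spark, which is what the Dataproc Serverless roles listed earlier cover. A minimal sketch, reusing the session configured in the previous script; the table name `spark_created_table` is hypothetical:

```python
# Create an Iceberg table through the configured catalog and write a couple of rows.
# CATALOG_NAME and DATASET_NAME are the same placeholders used above;
# spark_created_table is a hypothetical table name for this example.
spark.sql("""
  CREATE TABLE IF NOT EXISTS CATALOG_NAME.DATASET_NAME.spark_created_table (
    name STRING,
    id BIGINT
  )
  USING iceberg
""")

spark.sql("""
  INSERT INTO CATALOG_NAME.DATASET_NAME.spark_created_table
  VALUES ('test_name4', 101), ('test_name5', 102)
""")
```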
Run the PySpark script using Serverless Spark.
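For example, using Dataproc Serverless runtime version 2.2:

```bash
gcloud dataproc batches submit pyspark SCRIPT_PATH \
  --version=2.2 \
  --project=PROJECT_ID \
  --region=REGION \
  --deps-bucket=YOUR_BUCKET
```

Replace the following: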
- SCRIPT_PATH: the path to the script that the batch job uses.
- PROJECT_ID: the ID of the Google Cloud project to run the batch job in.
- REGION: the region where your workload runs.
- YOUR_BUCKET: the location of the Cloud Storage bucket to upload workload dependencies. The `gs://` URI prefix of the bucket is not required; you can specify the bucket path or bucket name, for example, `mybucketname1`.
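After submission, you can check on the batch from the same CLI; a brief sketch, where BATCH_ID is a placeholder for the ID reported when the batch was submitted:

```bash
# List recent batches in the region.
gcloud dataproc batches list --project=PROJECT_ID --region=REGION

# Show the state and any error details for a specific batch.
gcloud dataproc batches describe BATCH_ID --project=PROJECT_ID --region=REGION
```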
[[["Facile à comprendre","easyToUnderstand","thumb-up"],["J'ai pu résoudre mon problème","solvedMyProblem","thumb-up"],["Autre","otherUp","thumb-up"]],[["Difficile à comprendre","hardToUnderstand","thumb-down"],["Informations ou exemple de code incorrects","incorrectInformationOrSampleCode","thumb-down"],["Il n'y a pas l'information/les exemples dont j'ai besoin","missingTheInformationSamplesINeed","thumb-down"],["Problème de traduction","translationIssue","thumb-down"],["Autre","otherDown","thumb-down"]],["Dernière mise à jour le 2025/09/04 (UTC)."],[],[],null,["# Use BigLake metastore with tables in BigQuery\n=============================================\n\nThis document explains how to use BigLake metastore with BigQuery\ntables and Spark.\n\nWith BigLake metastore, you can create and use\n[standard (built-in) tables](/bigquery/docs/tables),\n[BigLake tables for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables),\nand [Apache Iceberg external tables](/bigquery/docs/iceberg-external-tables)\nfrom BigQuery.\n\nBefore you begin\n----------------\n\n1. Enable billing for your Google Cloud project. Learn how to [check if billing is enabled on a project](/billing/docs/how-to/verify-billing-enabled).\n2. Enable the BigQuery, and Dataproc APIs.\n\n [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,dataproc.googleapis.com)\n3. Optional: Understand how [BigLake metastore works](/bigquery/docs/about-blms) and why you should\n use it.\n\n### Required roles\n\n\nTo get the permissions that\nyou need to use Spark and Dataproc with BigLake metastore as a metadata store,\n\nask your administrator to grant you the\nfollowing IAM roles:\n\n- Create BigLake metastore tables in Spark:\n - [Dataproc Worker](/iam/docs/roles-permissions/dataproc#dataproc.worker) (`roles/dataproc.worker`) on the Dataproc Serverless service account in the project\n - [BigQuery Data Editor](/iam/docs/roles-permissions/bigquery#bigquery.dataEditor) (`roles/bigquery.dataEditor`) on the Dataproc Serverless service account in the project\n - [Storage Object Admin](/iam/docs/roles-permissions/storage#storage.objectAdmin) (`roles/storage.objectAdmin`) on the Dataproc Serverless service account in the project\n- Query BigLake metastore tables in BigQuery:\n - [BigQuery Data Viewer](/iam/docs/roles-permissions/bigquery#bigquery.dataViewer) (`roles/bigquery.dataViewer`) on the project\n - [BigQuery User](/iam/docs/roles-permissions/bigquery#bigquery.user) (`roles/bigquery.user`) on the project\n - [Storage Object Viewer](/iam/docs/roles-permissions/storage#storage.objectViewer) (`roles/storage.objectViewer`) on the project\n\n\nFor more information about granting roles, see [Manage access to projects, folders, and organizations](/iam/docs/granting-changing-revoking-access).\n\n\nYou might also be able to get\nthe required permissions through [custom\nroles](/iam/docs/creating-custom-roles) or other [predefined\nroles](/iam/docs/roles-overview#predefined).\n\nConnect to a table\n------------------\n\n1. Create a [dataset](/bigquery/docs/datasets) in the Google Cloud console.\n\n ```googlesql\n CREATE SCHEMA `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME;\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the ID of the Google Cloud project to create the dataset.\n - \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e: a name for your dataset.\n2. 
Create a [Cloud Resource Connection](/bigquery/docs/create-cloud-resource-connection).\n\n3. Create a standard BigQuery table.\n\n ```googlesql\n CREATE TABLE `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME.TABLE_NAME (name STRING,id INT64);\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eTABLE_NAME\u003c/var\u003e: a name for your table.\n4. Insert data into the standard BigQuery table.\n\n ```googlesql\n INSERT INTO `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME.TABLE_NAME VALUES ('test_name1', 123),('test_name2', 456),('test_name3', 789);\n ```\n5. Create a\n [BigLake table for Apache Iceberg in BigQuery](/bigquery/docs/iceberg-tables#iceberg-table-workflows).\n\n For example, to create a table, run the following `CREATE` statement. \n\n ```googlesql\n CREATE TABLE `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME.ICEBERG_TABLE_NAME(\n name STRING,id INT64\n )\n WITH CONNECTION `\u003cvar translate=\"no\"\u003eCONNECTION_NAME\u003c/var\u003e`\n OPTIONS (\n file_format = 'PARQUET',\n table_format = 'ICEBERG',\n storage_uri = '\u003cvar translate=\"no\"\u003eSTORAGE_URI\u003c/var\u003e');\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eICEBERG_TABLE_NAME\u003c/var\u003e: a name for your BigLake table for Apache Iceberg in BigQuery. For example, `iceberg_managed_table`.\n - \u003cvar translate=\"no\"\u003eCONNECTION_NAME\u003c/var\u003e: the name of your connection. You created this in the previous step. For example, `myproject.us.myconnection`.\n - \u003cvar translate=\"no\"\u003eSTORAGE_URI\u003c/var\u003e: a fully qualified Cloud Storage URI. For example, `gs://mybucket/table`.\n6. Insert data into the BigLake table for Apache Iceberg in BigQuery.\n\n ```googlesql\n INSERT INTO `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME.ICEBERG_TABLE_NAME VALUES ('test_name1', 123),('test_name2', 456),('test_name3', 789);\n ```\n7. Create an [Apache Iceberg external table](/bigquery/docs/iceberg-external-tables).\n\n For example, to create an Iceberg external table,\n run the following `CREATE` statement. \n\n ```googlesql\n CREATE OR REPLACE EXTERNAL TABLE `\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e`.DATASET_NAME.READONLY_ICEBERG_TABLE_NAME\n WITH CONNECTION `\u003cvar translate=\"no\"\u003eCONNECTION_NAME\u003c/var\u003e`\n OPTIONS (\n format = 'ICEBERG',\n uris =\n ['\u003cvar translate=\"no\"\u003eBUCKET_PATH\u003c/var\u003e'],\n require_partition_filter = FALSE);\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eREADONLY_ICEBERG_TABLE_NAME\u003c/var\u003e: a name for your read-only table.\n - \u003cvar translate=\"no\"\u003eBUCKET_PATH\u003c/var\u003e: the path to the Cloud Storage bucket that contains the data for the external table, in the format `['gs://bucket_name/[folder_name/]file_name']`.\n8. 
From PySpark, query the standard table, BigLake table for Apache Iceberg in BigQuery, and\n Apache Iceberg external table.\n\n ```googlesql\n from pyspark.sql import SparkSession\n\n # Create a spark session\n spark = SparkSession.builder \\\n .appName(\"BigLake Metastore Iceberg\") \\\n .config(\"spark.sql.catalog.\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e\", \"org.apache.iceberg.spark.SparkCatalog\") \\\n .config(\"spark.sql.catalog.\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e.catalog-impl\", \"org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog\") \\\n .config(\"spark.sql.catalog.\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e.gcp_project\", \"\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e\") \\\n .config(\"spark.sql.catalog.\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e.gcp_location\", \"\u003cvar translate=\"no\"\u003eLOCATION\u003c/var\u003e\") \\\n .config(\"spark.sql.catalog.\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e.warehouse\", \"\u003cvar translate=\"no\"\u003eWAREHOUSE_DIRECTORY\u003c/var\u003e\") \\\n .getOrCreate()\n spark.conf.set(\"viewsEnabled\",\"true\")\n\n # Use the blms_catalog\n spark.sql(\"USE `\u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e`;\")\n spark.sql(\"USE NAMESPACE \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e;\")\n\n # Configure spark for temp results\n spark.sql(\"CREATE namespace if not exists \u003cvar translate=\"no\"\u003eMATERIALIZATION_NAMESPACE\u003c/var\u003e\");\n spark.conf.set(\"materializationDataset\",\"\u003cvar translate=\"no\"\u003eMATERIALIZATION_NAMESPACE\u003c/var\u003e\")\n\n # List the tables in the dataset\n df = spark.sql(\"SHOW TABLES;\")\n df.show();\n\n # Query the tables\n sql = \"\"\"SELECT * FROM \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e.\u003cvar translate=\"no\"\u003eTABLE_NAME\u003c/var\u003e\"\"\"\n df = spark.read.format(\"bigquery\").load(sql)\n df.show()\n\n sql = \"\"\"SELECT * FROM \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e.\u003cvar translate=\"no\"\u003eICEBERG_TABLE_NAME\u003c/var\u003e\"\"\"\n df = spark.read.format(\"bigquery\").load(sql)\n df.show()\n\n sql = \"\"\"SELECT * FROM \u003cvar translate=\"no\"\u003eDATASET_NAME\u003c/var\u003e.\u003cvar translate=\"no\"\u003eREADONLY_ICEBERG_TABLE_NAME\u003c/var\u003e\"\"\"\n df = spark.read.format(\"bigquery\").load(sql)\n df.show()\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eWAREHOUSE_DIRECTORY\u003c/var\u003e: the URI of the Cloud Storage folder that's connected to your BigLake Iceberg table in BigQuery and your Iceberg external table.\n - \u003cvar translate=\"no\"\u003eCATALOG_NAME\u003c/var\u003e: the name of the catalog that you're using.\n - \u003cvar translate=\"no\"\u003eMATERIALIZATION_NAMESPACE\u003c/var\u003e: the namespace for storing temp results.\n9. 
Run the PySpark script using Serverless Spark.\n\n ```bash\n gcloud dataproc batches submit pyspark SCRIPT_PATH \\\n --version=2.2 \\\n --project=PROJECT_ID \\\n --region=REGION \\\n --deps-bucket=YOUR_BUCKET \\\n ```\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eSCRIPT_PATH\u003c/var\u003e: the path to the script that the batch job uses.\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the ID of the Google Cloud project to run the batch job in.\n - \u003cvar translate=\"no\"\u003eREGION\u003c/var\u003e: the region where your workload runs.\n - \u003cvar translate=\"no\"\u003eYOUR_BUCKET\u003c/var\u003e: the location of the Cloud Storage bucket to upload workload dependencies. The `gs://` URI prefix of the bucket is not required. You can specify the bucket path or bucket name, for example, `mybucketname1`.\n\nWhat's next\n-----------\n\n- Set up [optional BigLake metastore features](/bigquery/docs/blms-features)."]]