You can install additional components, such as Delta Lake, when you create a Dataproc cluster by using the Optional components feature. This page describes how you can optionally install the Delta Lake component on a Dataproc cluster.
When installed on a Dataproc cluster, the Delta Lake component installs the Delta Lake libraries and configures Spark and Hive on the cluster to work with Delta Lake.
Compatible Dataproc image versions
You can install the Delta Lake component on Dataproc clusters created with Dataproc image version 2.2.46 and later image versions.
See Supported Dataproc versions for the Delta Lake component version included in Dataproc image releases.
Delta Lake related properties
When you create a Dataproc cluster with the Delta Lake component enabled, the following Spark properties are configured to work with Delta Lake.
| Config file | Property | Default value |
|---|---|---|
| /etc/spark/conf/spark-defaults.conf | spark.sql.extensions | io.delta.sql.DeltaSparkSessionExtension |
| /etc/spark/conf/spark-defaults.conf | spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog |
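On a cluster with the component enabled, you do not need to set these properties yourself. As a minimal sketch, assuming you are building a Spark session outside of the cluster defaults (for example, in a locally tested PySpark script that already has the Delta Lake libraries on its classpath), the same two property values can be applied when the session is created:

# A sketch only: these are the same property values the component writes to
# spark-defaults.conf, applied directly to a SparkSession builder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-lake-example")  # hypothetical application name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)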
Install the component
Install the component when you create a Dataproc cluster by using the Google Cloud console, the Google Cloud CLI, or the Dataproc API.
Console
In the Google Cloud console, go to the Dataproc Create a cluster page. The Set up cluster panel is selected.
In the Components section, under Optional components, select Delta Lake and any other optional components to install on your cluster.
gcloud CLI
To create a Dataproc cluster that includes the Delta Lake component, use the gcloud dataproc clusters create command with the --optional-components flag.
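gcloud dataproc clusters create CLUSTER_NAME \
    --optional-components=DELTA \
    --region=REGION \
    ... other flags

Notes:
- CLUSTER_NAME: Specify the name of the cluster.
- REGION: Specify a Compute Engine region where the cluster will be located.

REST API
The Delta Lake component can be specified through the Dataproc API using SoftwareConfig.Component as part of a clusters.create request.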
Usage examples
This section provides data read and write examples using Delta Lake tables.
Delta Lake table
Write to a Delta Lake table
You can use a Spark DataFrame to write data to a Delta Lake table. The following examples create a DataFrame with sample data, create a my_delta_table Delta Lake table in Cloud Storage, and then write the data to the Delta Lake table.
PySpark
# Create a DataFrame with sample data.
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
  id integer,
  name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

# Write the DataFrame to the Delta Lake table in Cloud Storage.
data.writeTo("my_delta_table").append()
Scala
// Create a DataFrame with sample data.
val data = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// Create a Delta Lake table in Cloud Storage.
spark.sql("""CREATE TABLE IF NOT EXISTS my_delta_table (
  id integer,
  name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'""")

// Write the DataFrame to the Delta Lake table in Cloud Storage.
data.write.format("delta").mode("append").saveAsTable("my_delta_table")
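Spark SQL

CREATE TABLE IF NOT EXISTS my_delta_table (
  id integer,
  name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

INSERT INTO my_delta_table VALUES ("1", "Alice"), ("2", "Bob");

Read from a Delta Lake table
The following examples read the my_delta_table table and display its contents.

PySpark

# Read the Delta Lake table into a DataFrame.
df = spark.table("my_delta_table")

# Display the data.
df.show()

Scala

// Read the Delta Lake table into a DataFrame.
val df = spark.table("my_delta_table")

// Display the data.
df.show()

Spark SQL

SELECT * FROM my_delta_table;

Hive with Delta Lake
Write to a Delta table in Hive.
The Dataproc Delta Lake optional component is preconfigured to work with Hive external tables. For more information, see the Hive connector documentation (https://github.com/delta-io/delta/tree/master/connectors/hive#hive-connector).

Run the examples in a beeline client:

beeline -u jdbc:hive2://

Create a Spark Delta Lake table.
The Delta Lake table must be created using Spark before a Hive external table can reference it.

CREATE TABLE IF NOT EXISTS my_delta_table (
  id integer,
  name string)
USING delta
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

INSERT INTO my_delta_table VALUES ("1", "Alice"), ("2", "Bob");

Create a Hive external table.

SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;

CREATE EXTERNAL TABLE deltaTable(id INT, name STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';

Notes: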
- The io.delta.hive.DeltaStorageHandler class implements the Hive data source APIs. It can load a Delta table and extract its metadata. If the table schema in the CREATE TABLE statement is not consistent with the underlying Delta Lake metadata, an error is thrown.
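As a hypothetical illustration, declaring id as STRING in the Hive DDL while the underlying Delta table stores it as an integer is reported as an error rather than silently coerced:

-- Hypothetical schema mismatch: the Delta table created above stores id as an
-- integer, so this declaration is rejected by the storage handler.
CREATE EXTERNAL TABLE deltaTableMismatched(id STRING, name STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';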
Read from a Delta Lake table in Hive.
To read data from a Delta table, use a SELECT statement:
SELECT * FROM deltaTable;
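Assuming only the two sample rows inserted earlier, the result should contain the rows (1, Alice) and (2, Bob).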
Drop a Delta Lake table.
To drop a Delta table, use the DROP TABLE statement:
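DROP TABLE deltaTable;

Because deltaTable is an external table, dropping it removes only the table definition from the Hive metastore; the underlying Delta Lake files in Cloud Storage are not deleted.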
[[["Fácil de comprender","easyToUnderstand","thumb-up"],["Resolvió mi problema","solvedMyProblem","thumb-up"],["Otro","otherUp","thumb-up"]],[["Difícil de entender","hardToUnderstand","thumb-down"],["Información o código de muestra incorrectos","incorrectInformationOrSampleCode","thumb-down"],["Faltan la información o los ejemplos que necesito","missingTheInformationSamplesINeed","thumb-down"],["Problema de traducción","translationIssue","thumb-down"],["Otro","otherDown","thumb-down"]],["Última actualización: 2025-09-04 (UTC)"],[],[],null,["You can install additional components like [Delta Lake](https://delta.io/) when\nyou create a Dataproc cluster using the\n[Optional components](/dataproc/docs/concepts/components/overview#available_optional_components)\nfeature. This page describes how you can optionally install the Delta Lake component\non a Dataproc cluster.\n\nWhen installed on a Dataproc cluster, the Delta Lake component installs\nDelta Lake libraries and configures Spark and Hive in the cluster to work with Delta Lake.\n\nCompatible Dataproc image versions\n\nYou can install the Delta Lake component on Dataproc clusters created with\nDataproc image version [2.2.46](/dataproc/docs/concepts/versioning/dataproc-release-2.2)\nand later image versions.\n\nSee\n[Supported Dataproc versions](/dataproc/docs/concepts/versioning/dataproc-versions#supported_cloud_dataproc_versions)\nfor the Delta Lake component version included in Dataproc image releases.\n\nDelta Lake related properties\n\nWhen you create a Dataproc cluster with the Delta Lake component enabled,\nthe following Spark properties are configured to work with Delta Lake.\n\n| Config file | Property | Default value |\n|---------------------------------------|-----------------------------------|---------------------------------------------------|\n| `/etc/spark/conf/spark-defaults.conf` | `spark.sql.extensions` | `io.delta.sql.DeltaSparkSessionExtension` |\n| `/etc/spark/conf/spark-defaults.conf` | `spark.sql.catalog.spark_catalog` | `org.apache.spark.sql.delta.catalog.DeltaCatalog` |\n\nInstall the component\n\nInstall the component when you create a Dataproc cluster using\nthe Google Cloud console, Google Cloud CLI, or the Dataproc API. \n\nConsole\n\n1. In the Google Cloud console, go to the Dataproc **Create a cluster** page.\n\n [Go to Create a cluster](https://console.cloud.google.com/dataproc/clustersAdd)\n\n The **Set up cluster** panel is selected.\n2. In the **Components** section, under **Optional components** , select **Delta Lake** and other optional components to install on your cluster.\n\ngcloud CLI\n\nTo create a Dataproc cluster that includes the Delta Lake component,\nuse the\n[gcloud dataproc clusters create](/sdk/gcloud/reference/dataproc/clusters/create)\ncommand with the `--optional-components` flag. \n\n```\ngcloud dataproc clusters create CLUSTER_NAME \\\n --optional-components=DELTA \\\n --region=REGION \\\n ... 
other flags\n```\n\nNotes:\n\n- \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: Specify the name of the cluster.\n- \u003cvar translate=\"no\"\u003eREGION\u003c/var\u003e: Specify a [Compute Engine region](/compute/docs/regions-zones#available) where the cluster will be located.\n\nREST API\n\nThe Delta Lake component can be specified through the Dataproc API using the\n[SoftwareConfig.Component](/dataproc/docs/reference/rest/v1/ClusterConfig#Component)\nas part of a\n[clusters.create](/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\nrequest.\n\nUsage examples\n\nThis section provides data read and write examples using Delta Lake tables. \n\nDelta Lake table\n\nWrite to a Delta Lake table\n\nYou can use the [Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html)\nto write data to a Delta Lake table. The following examples create a `DataFrame`\nwith sample data, create a `my_delta_table` Delta Lake table In\nCloud Storage, and then write the data to the Delta Lake table.\n\nPySpark \n\n # Create a DataFrame with sample data.\n data = spark.createDataFrame([(1, \"Alice\"), (2, \"Bob\")], [\"id\", \"name\"])\n\n # Create a Delta Lake table in Cloud Storage.\n spark.sql(\"\"\"CREATE TABLE IF NOT EXISTS my_delta_table (\n id integer,\n name string)\n USING delta\n LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'\"\"\")\n\n # Write the DataFrame to the Delta Lake table in Cloud Storage.\n data.writeTo(\"my_delta_table\").append()\n\nScala \n\n // Create a DataFrame with sample data.\n val data = Seq((1, \"Alice\"), (2, \"Bob\")).toDF(\"id\", \"name\")\n\n // Create a Delta Lake table in Cloud Storage.\n spark.sql(\"\"\"CREATE TABLE IF NOT EXISTS my_delta_table (\n id integer,\n name string)\n USING delta\n LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table'\"\"\")\n\n // Write the DataFrame to the Delta Lake table in Cloud Storage.\n data.write.format(\"delta\").mode(\"append\").saveAsTable(\"my_delta_table\")\n\nSpark SQL \n\n CREATE TABLE IF NOT EXISTS my_delta_table (\n id integer,\n name string)\n USING delta\n LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';\n\n INSERT INTO my_delta_table VALUES (\"1\", \"Alice\"), (\"2\", \"Bob\");\n\nRead from a Delta Lake table\n\nThe following examples read the `my_delta_table` and display its contents.\n\nPySpark \n\n # Read the Delta Lake table into a DataFrame.\n df = spark.table(\"my_delta_table\")\n\n # Display the data.\n df.show()\n\nScala \n\n // Read the Delta Lake table into a DataFrame.\n val df = spark.table(\"my_delta_table\")\n\n // Display the data.\n df.show()\n\nSpark SQL \n\n SELECT * FROM my_delta_table;\n\nHive with Delta Lake\n\nWrite to a Delta Table in Hive.\n\nThe Dataproc Delta Lake optional component is pre-configured to\nwork with Hive external tables.\n\nFor more information, see [Hive connector](https://github.com/delta-io/delta/tree/master/connectors/hive#hive-connector).\n\nRun the examples in a beeline client. \n\n beeline -u jdbc:hive2://\n\nCreate a Spark Delta Lake table.\n\nThe Delta Lake table must be created using Spark before a Hive external table\ncan reference it. \n\n CREATE TABLE IF NOT EXISTS my_delta_table (\n id integer,\n name string)\n USING delta\n LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';\n\n INSERT INTO my_delta_table VALUES (\"1\", \"Alice\"), (\"2\", \"Bob\");\n\nCreate a Hive external table. 
\n\n SET hive.input.format=io.delta.hive.HiveInputFormat;\n SET hive.tez.input.format=io.delta.hive.HiveInputFormat;\n\n CREATE EXTERNAL TABLE deltaTable(id INT, name STRING)\n STORED BY 'io.delta.hive.DeltaStorageHandler'\n LOCATION 'gs://delta-gcs-demo/example-prefix/default/my_delta_table';\n\nNotes:\n\n- The `io.delta.hive.DeltaStorageHandler` class implements the Hive data source APIs. It can load a Delta table and extract its metadata. If the table schema in the `CREATE TABLE` statement is not consistent with the underlying Delta Lake metadata, an error is thrown.\n\nRead from a Delta Lake table in Hive.\n\nTo read data from a Delta table, use a `SELECT` statement: \n\n SELECT * FROM deltaTable;\n\nDrop a Delta Lake table.\n\nTo drop a Delta table, use the `DROP TABLE` statement: \n\n DROP TABLE deltaTable;\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e"]]